# LIGN 167: Problem Set 2


Collaboration policy: You may collaborate with up to two other students on this problem
set. You must write up your own answers to the problems; do not just copy and paste from
your collaborators. You must also submit your work individually. If you do not submit a
copy of the problem set under your own name, you will not get credit. When you submit
your work, you must indicate who you worked with, and what each of your individual
contributions were.
Getting started: We will be uploading a file called pset2.py to Piazza. This file will contain
some starter code for the problem set (some functions that you should call in your answers).
In this problem set you will be implementing gradient descent for optimizing real-valued
functions. For the rest of the class, we will be most interested in applying gradient descent
for optimizing statistical models, including neural networks.
In class we defined logistic regression in the following manner. We have a dataset that consists of two parts: $X = \{\vec{x}_1, \ldots, \vec{x}_n\}$ and $Y = \{y_1, \ldots, y_n\}$. Each element $\vec{x}_i$ in $X$ is a $k$-dimensional vector: $\vec{x}_i = (x_{i,1}, \ldots, x_{i,k})$. Each element $y_i$ in $Y$ is a Boolean variable: $y_i \in \{0, 1\}$.

Our goal is to predict the value of each $y_i$ from the corresponding input value $\vec{x}_i$. We want to find parameter values $\vec{a}$ and $b$ that maximize the following probability:

$$P(Y \mid X, \vec{a}, b) = \prod_{i=1}^{n} P(y_i \mid \vec{x}_i, \vec{a}, b) \tag{1}$$

Here $P(y_i \mid \vec{x}_i, \vec{a}, b)$ is defined by the logistic distribution:

$$P(y_i \mid \vec{x}_i, \vec{a}, b) = \begin{cases} \dfrac{1}{1 + e^{-(\vec{a} \cdot \vec{x}_i + b)}}, & \text{if } y_i = 1 \\[2ex] 1 - \dfrac{1}{1 + e^{-(\vec{a} \cdot \vec{x}_i + b)}}, & \text{if } y_i = 0 \end{cases} \tag{2}$$

We can equivalently write this distribution using the sigmoid function $\sigma$:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

$$P(y_i \mid \vec{x}_i, \vec{a}, b) = \begin{cases} \sigma(\vec{a} \cdot \vec{x}_i + b), & \text{if } y_i = 1 \\ 1 - \sigma(\vec{a} \cdot \vec{x}_i + b), & \text{if } y_i = 0 \end{cases} \tag{4}$$

The parameter $b$ is a scalar, while the parameter $\vec{a}$ is a $k$-dimensional vector: $\vec{a} = (a_1, \ldots, a_k)$.
As discussed in class, the problem of maximizing $P(Y \mid X, \vec{a}, b)$ can be transformed into the problem of minimizing a particular loss function. We define the loss function $L$ as follows:

$$L(\vec{a}, b \mid X, Y) = -\log P(Y \mid X, \vec{a}, b) \tag{5}$$
$$= -\log \prod_{i=1}^{n} P(y_i \mid \vec{x}_i, \vec{a}, b) \tag{6}$$
$$= -\sum_{i=1}^{n} \log P(y_i \mid \vec{x}_i, \vec{a}, b) \tag{7}$$

We will use the following notation to denote the loss associated with datapoint $\vec{x}_i, y_i$:

$$\ell_i = -\log P(y_i \mid \vec{x}_i, \vec{a}, b) \tag{8}$$

In this notation, the total loss $L$ can be written:

$$L(\vec{a}, b \mid X, Y) = \sum_{i=1}^{n} \ell_i \tag{9}$$

The values of $\vec{a}, b$ that maximize $P(Y \mid X, \vec{a}, b)$ are the same values of $\vec{a}, b$ that minimize $L(\vec{a}, b \mid X, Y)$. Thus we will apply gradient descent to $L$ in order to find the (approximately) optimal values of $\vec{a}, b$.

We will assume throughout that $b = 0$. Thus we will only be trying to find the optimal value of $\vec{a}$.

We previously derived a formula for the partial derivatives $\partial \ell_i / \partial a_j$:

$$\frac{\partial \ell_i}{\partial a_j} = -(y_i - P(y_i = 1 \mid \vec{x}_i, \vec{a}))\, x_{i,j} \tag{10}$$
$$= -(y_i - \sigma(\vec{x}_i \cdot \vec{a}))\, x_{i,j} \tag{11}$$

The partial derivative of $L$ can therefore be written as:

$$\frac{\partial L}{\partial a_j} = \sum_{i=1}^{n} \frac{\partial \ell_i}{\partial a_j} \tag{12}$$
Problem 1. Write a function logistic_positive_prob that takes as input a vector x_i of
dimension k and a vector a of dimension k. These vectors are assumed to be NumPy arrays
of length k.
The function should return the probability $P(y_i = 1 \mid \vec{x}_i, \vec{a})$, as defined by Equation 4.
(Remember that we have set b = 0 throughout the problem set.) As starter code, we have
provided an implementation of the sigmoid function σ in pset2.py (posted on Piazza).
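A minimal sketch of one way this function might look, assuming $b = 0$ and a NumPy dot product (the sigmoid helper is reimplemented here so the snippet runs standalone; your submission should use the version provided in pset2.py):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, Equation 3. (pset2.py provides its own version.)
    return 1.0 / (1.0 + np.exp(-x))

def logistic_positive_prob(x_i, a):
    # P(y_i = 1 | x_i, a) with b = 0, per Equation 4.
    return sigmoid(np.dot(a, x_i))
```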
Problem 2. Write a function logistic_derivative_per_datapoint, which takes four arguments: y_i (either 0 or 1), x_i (a k-dimensional NumPy array), a (a k-dimensional NumPy array), and j (an integer between 0 and k − 1). It should return $\partial \ell_i / \partial a_j$, using Equation 10 and your answer to the previous problem.
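A self-contained sketch of this function, with the index argument written as j (for self-containment it inlines the sigmoid computation; your submission should call logistic_positive_prob as the problem intends):

```python
import numpy as np

def logistic_derivative_per_datapoint(y_i, x_i, a, j):
    # Equation 10 with b = 0: -(y_i - P(y_i = 1 | x_i, a)) * x_{i,j}.
    p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))  # inlined logistic_positive_prob
    return -(y_i - p) * x_i[j]
```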
Problem 3. Write a function logistic_partial_derivative. It should take four arguments: y (a list of n 0’s or 1’s), x (a list of n NumPy arrays, each of length k), a (a k-dimensional NumPy array), and j (an integer between 0 and k − 1). The variables y and x are each Python lists of length n; the i’th elements of the lists, x[i] and y[i], represent the i’th observation in our dataset, $\vec{x}_i$ and $y_i$.
The function should return the partial derivative of the loss function, $\partial L / \partial a_j$, as given in Equation 12. It should do this using the function logistic_derivative_per_datapoint.
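One possible shape for this function, sketched as a standalone snippet (it inlines the per-datapoint term so it runs on its own; your submission should call logistic_derivative_per_datapoint as the problem asks):

```python
import numpy as np

def logistic_partial_derivative(y, x, a, j):
    # Equation 12: sum the per-datapoint derivatives over the whole dataset.
    total = 0.0
    for y_i, x_i in zip(y, x):
        p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))  # inlined per-datapoint term
        total += -(y_i - p) * x_i[j]
    return total
```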
Problem 4. Write a function compute_logistic_gradient that takes three arguments: a,
y, and x. The arguments a, y, and x should have the same types as in Problem 3.
The function should return a NumPy array of length k (the same length as input a).
This NumPy array should contain the gradient of the loss function L, i.e. ∇L. It should
use the function logistic_partial_derivative from Problem 3 to do this.
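A sketch of what the gradient computation might look like. For self-containment this version inlines the per-datapoint formula and accumulates all k coordinates at once; in your submission, build the gradient from logistic_partial_derivative as the problem asks:

```python
import numpy as np

def compute_logistic_gradient(a, y, x):
    # Accumulate Equation 12 for every coordinate j into a length-k vector.
    grad = np.zeros(len(a))
    for y_i, x_i in zip(y, x):
        p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))
        grad += -(y_i - p) * x_i
    return grad
```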
Problem 5. Write a function gradient_update, which takes three arguments: a, lr, and
gradient. The variable a is a NumPy array of length k, and it represents our current guess
for the parameter vector ~a. The variable lr is the learning rate, which is a real number
greater than 0. The variable gradient is a NumPy array of length k, which represents the
gradient at the current time step.
The function should return the updated value for the parameter vector a, after applying
the gradient update rule with learning rate equal to lr.
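The update rule itself is a one-liner; a minimal sketch:

```python
import numpy as np

def gradient_update(a, lr, gradient):
    # One step of gradient descent: move against the gradient.
    return a - lr * gradient
```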
Problem 6. Write a function gradient_descent_logistic, which takes five arguments:
initial_a, lr, num_iterations, y, and x. The variables y and x should have the same
types as in previous problems. The variable initial_a is a NumPy array of length k, which
is our initial guess for the value of the parameters a. The variable lr is again the learning
rate. The variable num_iterations is an integer greater than 0, which is the number of
iterations that we will run gradient descent for.
The function should run the gradient descent algorithm for num_iterations iterations.
These iterations of gradient descent should optimize the logistic regression parameter vector
~a, given the input dataset y and x. It should return a NumPy array which represents the
final estimate of the parameter vector ~a.
(Use the gradient_update function that you wrote in the previous problem.)
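A sketch of the overall loop, assuming the conventions above (the gradient computation is inlined here so the snippet runs standalone; your submission should call compute_logistic_gradient and gradient_update):

```python
import numpy as np

def gradient_descent_logistic(initial_a, lr, num_iterations, y, x):
    # Repeatedly compute the full-dataset gradient and step against it.
    a = np.array(initial_a, dtype=float)
    for _ in range(num_iterations):
        grad = np.zeros(len(a))
        for y_i, x_i in zip(y, x):
            p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))
            grad += -(y_i - p) * x_i
        a = a - lr * grad  # the gradient_update rule from Problem 5
    return a
```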
Problem 7. In the starter code that you’ve been given (in pset2.py), there is a function
create_dataset. This function returns two lists, x and y. The variable x is a list of 100
2-dimensional NumPy arrays, and y is a list of Boolean values sampled from the logistic
distribution, given the elements of x as input. The output y is generated assuming that the
parameter vector ~a equals (10,10).
Call create_dataset once to create values of x and y. Using matplotlib.pyplot.scatter,
plot the elements of x, along with the labels y. The plot should be two-dimensional,
with color used to indicate the value of each element $y_i$ in y. See here for an example:
https://stackoverflow.com/questions/17682216/scatter-plot-and-color-mapping-in-python
Problem 8. Call create_dataset a number of times to create some datasets. On each of
these datasets, call gradient_descent_logistic to get an estimate for the parameter vector
~a. Use a learning rate lr = 0.01, initial_a equal to (-1,-1), and num_iterations = 1000.
What values of ~a are returned by your gradient descent algorithm? Does this suggest
that gradient descent is accurately solving the classification problem?
Important: For this question and the other free-response questions below,
Problem 9. Repeat Problem 8, but change initial_a to different values (for example (-1,0) or (0,-1)). What do you notice about the estimated value of ~a that is returned each
time? What does this tell you about the behavior of gradient descent on logistic regression?
(It is recommended that you call create_dataset a number of times for each value of
initial_a, in order to get a clear picture of what is going on.)
Problem 10. Repeat Problem 8, but change the learning rate lr to different values. For
example, try lr = 0.001 and lr = 0.1. What do you notice about the estimated value of ~a
that is returned each time? What does this tell you about the behavior of gradient descent
on logistic regression?
Problem 11. A common technique in statistical modeling, including neural networks, is
called regularization. Regularization involves adding an additional penalty term to the loss
function. L2-regularization (also known as ridge regression) involves the following modification to Equation 5 (note that we are dropping the parameter $b$, which we have assumed is set to 0):

$$L_{\text{ridge}}(\vec{a} \mid X, Y) = -\sum_{i=1}^{n} \log P(y_i \mid \vec{x}_i, \vec{a}) + \lambda \lVert \vec{a} \rVert^2 \tag{13}$$
$$= -\sum_{i=1}^{n} \log P(y_i \mid \vec{x}_i, \vec{a}) + \lambda \sum_{i=1}^{k} a_i^2 \tag{14}$$

Here we have used the definition of the norm $\lVert \vec{a} \rVert^2 = \sum_{i=1}^{k} a_i^2$. The term $\lambda > 0$ is a real number.
Write a function logistic_l2_partial_derivative, which takes four arguments: y, x, a,
and j. These four arguments should be of the same types as in Problem 3, where you wrote
logistic_partial_derivative. The function should return the partial derivative of this new
loss function, $\partial L_{\text{ridge}} / \partial a_j$.
Your function should use logistic_derivative_per_datapoint, which you defined in
Problem 2.
You need to choose a value of λ in order to compute the partial derivative. Set λ equal
to 0.1 for right now. You will be changing this value in later problems.
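Differentiating the penalty term in Equation 14 adds $2\lambda a_j$ to the unregularized partial derivative, since $\partial (\lambda \sum_i a_i^2) / \partial a_j = 2\lambda a_j$. A standalone sketch with λ fixed at 0.1 (the data term is inlined here; your submission should call logistic_derivative_per_datapoint):

```python
import numpy as np

LAMBDA = 0.1  # regularization strength; you will vary this in later problems

def logistic_l2_partial_derivative(y, x, a, j):
    # Unregularized partial (Equation 12) plus the derivative of the
    # penalty lambda * sum_i a_i^2 with respect to a_j: 2 * lambda * a_j.
    total = 0.0
    for y_i, x_i in zip(y, x):
        p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))
        total += -(y_i - p) * x_i[j]
    return total + 2.0 * LAMBDA * a[j]
```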
Problem 12. Write a function compute_logistic_l2_gradient, which takes three arguments: a, y, and x. The arguments a, y, and x should have the same types as in Problem 4.
The function should return a NumPy array of length k (the same length as input a).
This NumPy array should contain the gradient of the new regularized loss function Lridge,
i.e. ∇Lridge. It should use the function logistic_l2_partial_derivative from Problem 11
to do this.
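A sketch of the regularized gradient. For self-containment it inlines the math and takes λ as a keyword argument with a default of 0.1 (an illustrative addition; the problem fixes λ inside logistic_l2_partial_derivative instead, and your submission should call that function):

```python
import numpy as np

def compute_logistic_l2_gradient(a, y, x, lam=0.1):
    # Gradient of L_ridge: the data term plus 2 * lambda * a for the penalty.
    grad = np.zeros(len(a))
    for y_i, x_i in zip(y, x):
        p = 1.0 / (1.0 + np.exp(-np.dot(a, x_i)))
        grad += -(y_i - p) * x_i
    return grad + 2.0 * lam * a
```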
Problem 13. In this problem we will be writing a function which generalizes the gradient
descent function that you wrote in Problem 6. In Problem 6, you wrote out the gradient
descent algorithm, for the specific case of (ordinary) logistic regression. In this problem, you
will write a gradient descent algorithm which can work for optimizing arbitrary functions.
Define a function gradient_descent, which takes six arguments: initial_a, lr, num_iterations,
y, x, and gradient_fn. The first five variables should have the same types as in Problem
6.
The last argument, gradient_fn, should be a function which computes the gradient of a loss function, such as compute_logistic_gradient in Problem 4 and compute_logistic_l2_gradient in Problem 12.
You should assume that gradient_fn takes three arguments, a, y, and x, as in Problems
4 and 12. You can also assume that it returns a NumPy array with the same length as a.
The function should run the gradient descent algorithm for num_iterations iterations. These iterations of gradient descent should optimize the parameter vector ~a, given
the input dataset y and x. Each gradient update should use the function gradient_fn,
which has been given as an argument, in order to compute the gradient.
The function gradient_descent should return a NumPy array which represents the final
estimate of the parameter vector ~a.
(Note: this problem is giving you a small introduction to functional programming in
Python. If you are confused about how one function can take another function as an argument, see the first part of these lecture notes: https://courses.csail.mit.edu/6.01/spring08/handouts/week2/week2-notes.pdf)
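The generalization is small: the loop from Problem 6 stays the same, but the gradient computation is delegated to whatever function is passed in. A minimal sketch:

```python
import numpy as np

def gradient_descent(initial_a, lr, num_iterations, y, x, gradient_fn):
    # Generic gradient descent: gradient_fn(a, y, x) supplies the gradient,
    # so the same loop works for both the plain and the L2-regularized loss.
    a = np.array(initial_a, dtype=float)
    for _ in range(num_iterations):
        a = a - lr * gradient_fn(a, y, x)
    return a
```

Passing a different gradient_fn (for example, one for a hypothetical quadratic loss) changes which function is minimized without touching the loop.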
Problem 14. Call create_dataset a number of times to create some datasets. On each of
these datasets, use gradient descent to optimize the parameter vector ~a, given the Lridge
loss function defined in Equation 13.
More specifically, on each of your datasets, call gradient_descent (from Problem 13)
using the compute_logistic_l2_gradient function (from Problem 12). This will return an
estimate for the parameter vector ~a, given the Lridge loss function.
Use a learning rate lr = 0.01, initial_a equal to (-1,-1), and num_iterations = 1000.
Compare the estimated values of ~a that are returned here to the values that were returned
in Problem 8 (which used the same parameter settings). What is different about the values
of ~a that are returned here? Why does this difference exist?
Problem 15. Within Problem 11, when you were defining logistic_l2_partial_derivative,
you set the value λ equal to 0.1. In this problem, we want to explore the effect of changing
λ.
Decrease λ in logistic_l2_partial_derivative to 0.01. Repeat what you did in Problem
14, with this new value of λ. What do you notice about the estimated values of ~a, as
compared to Problem 14?
Now increase λ, first to 1 and then to 10. What changes about the value of the estimated
~a? What does λ seem to be controlling?
At the end of this process, please remember to change the value of λ back to 0.1
in your definition of logistic_l2_partial_derivative. We want to be able to grade you
fairly!