CSE 512: Homework

Extra practice problems, ungraded

1. Gradients. Compute the gradients of the following functions. Give the exact dimension of the output.

(a) Linear regression. f(x) = (1/40)‖Ax − b‖₂², A ∈ R^(20×10)

(b) Sigmoid. f(x) = σ(cᵀx), c ∈ R^5, σ(s) = 1/(1 + exp(−s)). Hint: Start by showing that σ′(s) = σ(s)(1 − σ(s)).
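The identity in the hint can be sanity-checked numerically before you prove it; a minimal sketch (the evaluation point and finite-difference step size are arbitrary choices):

```python
import numpy as np

def sigmoid(s):
    # logistic function sigma(s) = 1 / (1 + exp(-s))
    return 1.0 / (1.0 + np.exp(-s))

s = 0.7
h = 1e-6
# central finite-difference approximation of sigma'(s)
fd = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
# claimed closed form: sigma'(s) = sigma(s) * (1 - sigma(s))
closed_form = sigmoid(s) * (1 - sigmoid(s))
print(abs(fd - closed_form))  # close to zero
```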

2. Convex or not convex. Are the following sets convex or not convex? Justify your answer.

(a) S = range(A) := {x : Az = x for some z}

(b) S = {x : x ≤ −1} ∪ {x : x ≥ 1} (Read: either x ≤ −1 or x ≥ 1.)

3. Am I positive semidefinite? A symmetric matrix X is positive semidefinite if uᵀXu ≥ 0 for all u. For each of the following, either prove that the matrix is positive semidefinite, or find a counterexample.

(a) X =
    [ 1 1 1 ]
    [ 1 1 1 ]
    [ 1 1 1 ]

(b) X =
    [ −1 −1 −1 ]
    [ −1  1 −1 ]
    [ −1 −1  1 ]
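Before writing a proof or hunting for a counterexample, it can help to check the eigenvalues numerically (a symmetric matrix is PSD exactly when all its eigenvalues are ≥ 0); this sketch is a sanity check, not a substitute for the requested justification:

```python
import numpy as np

A = np.ones((3, 3))                 # matrix from part (a)
B = np.array([[-1., -1., -1.],
              [-1.,  1., -1.],
              [-1., -1.,  1.]])     # matrix from part (b)

# for symmetric matrices, eigvalsh is appropriate;
# PSD iff the smallest eigenvalue is >= 0 (up to roundoff)
print(np.linalg.eigvalsh(A))        # all-ones matrix: eigenvalues 0, 0, 3
print(np.linalg.eigvalsh(B))        # a negative eigenvalue means not PSD
```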

4. Convex or not convex. (1 pt, 0.125 points each.) From lecture, we know that there are three ways of checking whether a function is convex or not.

For any function, we can check if it satisfies the definition of convexity:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀x, y, ∀0 ≤ θ ≤ 1.

For any differentiable function, we can check the first-order condition

f(x) − f(y) ≥ ∇f(y)ᵀ(x − y).

For any twice-differentiable function, we can check the second-order condition

∇²f(x) is positive semidefinite, i.e. uᵀ∇²f(x)u ≥ 0 ∀u.

Use any of these ways to determine whether or not each function is convex. (You only need to use one of these rules per function. Pick the one you think gets you to the answer the fastest!)

(a) f(x) = (1/2)(x[1]² − 2x[1]x[2] + x[2]²)

(b) f(x) = |x|. Hint: Again, remember the triangle inequality.

(c) f(x) = log(exp(x[1]) + exp(x[2]))

(d) f(x) = cᵀx

Main assignment, graded

1. Gradient properties. (1 pt, 0.5 pts each.) Prove the following two properties of gradients:

(a) Linearity. If h(x) = αf(x) + βg(x), then ∇h(x) = α∇f(x) + β∇g(x).

(b) Chain rule. Show that if g(v) = f(Av), then ∇g(v) = Aᵀ∇f(Av).
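The chain-rule identity can be verified numerically with finite differences; a sketch, where the test function f and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
v = rng.standard_normal(3)

f = lambda u: 0.5 * np.dot(u, u)   # test function: f(u) = (1/2)||u||^2
grad_f = lambda u: u               # its gradient

# claimed identity: the gradient of g(v) = f(Av) is A^T grad_f(Av)
analytic = A.T @ grad_f(A @ v)

def g(v):
    return f(A @ v)

# central finite differences on g, one coordinate at a time
h = 1e-6
fd = np.array([(g(v + h * e) - g(v - h * e)) / (2 * h)
               for e in np.eye(3)])
print(np.max(np.abs(fd - analytic)))  # close to zero
```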

2. Gradients. (1 pt, 0.5 pts each.) Compute the gradients of the following functions. Give the exact dimension of the output.

(a) Quadratic function. f(x) = (1/2)xᵀQx + pᵀx + r, Q ∈ R^(12×12) and Q is symmetric (Q[i,j] = Q[j,i]).

(b) Softmax function. f(x) = (1/µ) log(Σ_{i=1}^{8} exp(µx[i])), x ∈ R^8, µ is a positive scalar.

3. Convex or not convex. (1pt) Are the following sets convex or not convex? Justify your answer.

(a) (0.3 pts) S = {x : Σᵢ xᵢ = 0}

(b) (0.2 pts) S = {(x, y) : x² + y² = 1}

(c) (0.2 pts) S = {x : |x| ≤ 1}. Hint: Remember the triangle inequality.

4. Am I positive semidefinite? (1 pt, 0.5 pts each) A symmetric matrix X is positive semidefinite if uᵀXu ≥ 0 for all u. For each of the following, either prove that the matrix is positive semidefinite, or find a counterexample.

(a) X =
    [ 1 2 3 ]
    [ 2 1 4 ]
    [ 3 4 1 ]

(b) X =
    [ 1 0 0 ]
    [ 0 2 0 ]
    [ 0 0 3 ]

5. Convex or not convex. (1pt, 0.25 points each.)

Use any of the three ways (definition, first-order condition, or second-order condition) to determine whether or not each function is convex. (You only need to use one of these rules per function. Pick the one you think gets you to the answer the fastest!)

(a) f(x) = 1/x, for x > 0

(b) f(x) = ‖x‖∞

(c) f(x) = x[3]³ + x[2]² + x[1]

(d) f(x) = ‖Ax − b‖₂²

6. Polyfit via linear regression. (3 pts)

Download weatherDewTmp.mat. Plot the data; it should look like the reference figure in the handout (dew readings plotted against week number).

We want to form a polynomial regression of this data. That is, given w = weeks and d = dew readings, we want to find θ₁, …, θₚ as the solution to

minimize_{θ ∈ R^p}  (1/2) Σ_{i=1}^{m} (θ₁ + θ₂wᵢ + θ₃wᵢ² + · · · + θₚwᵢ^(p−1) − dᵢ)².   (1)

Form X and y such that (1) is equivalent to the least squares problem

minimize_{θ ∈ R^p}  (1/2)‖Xθ − y‖₂².   (2)

That is, for w the vector containing the week numbers, and y containing the dew data, form

X = [ 1  w₁  w₁²  w₁³  · · ·  w₁^(p−1) ]
    [ ⋮   ⋮    ⋮    ⋮             ⋮     ]
    [ 1  wₘ  wₘ²  wₘ³  · · ·  wₘ^(p−1) ]

(a) Linear regression.

i. Write down the normal equations for problem (2).

ii. Fill in the code to solve the normal equations for θ, and use it to build a predictor. To verify your code is running correctly, the number after check number should be 1.341.

iii. Implement a polynomial fit of orders p = 1, 2, 3, 10, 100 for the weather data provided. Include a figure that plots the original signal, overlaid with each polynomial fit. Comment on the “goodness of fit” for each value of p.
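Part (a) might be sketched as below (the data-loading step is omitted, and the helper name is an assumption; the handout's notebook may organize this differently):

```python
import numpy as np

def polyfit_normal_eqs(w, y, p):
    """Fit a degree-(p-1) polynomial by solving the normal equations
    X^T X theta = X^T y, where X is the Vandermonde matrix of w."""
    X = np.vander(w, N=p, increasing=True)   # columns: 1, w, w^2, ..., w^(p-1)
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return X, theta

# toy data standing in for the weather series
w = np.linspace(0, 1, 20)
y = 3.0 - 2.0 * w + 0.5 * w**2               # an exact quadratic
X, theta = polyfit_normal_eqs(w, y, p=3)
print(theta)                                  # recovers about [3.0, -2.0, 0.5]
```

For large p the Vandermonde matrix is badly conditioned, which is exactly what part (c) explores.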

(b) Ridge regression. Oftentimes, it is helpful to add a regularization term to (2), to improve stability. In other words, we solve

minimize_{θ ∈ R^p}  (1/2)‖Xθ − y‖₂² + (ρ/2)‖θ‖₂²   (3)

for some ρ > 0.

i. Again, write down the normal equations for (3). Your equation should be of the form Aθ = b for some matrix A and vector b that you specify.

ii. Write the code for solving the ridge regression problem and run it. To verify your code is running correctly, the number after check number should be 1.206.

iii. Using ρ = 1.0, plot the weather data overlaid with polynomial fits from ridge regression. Provide these plots for p = 1, 2, 3, 10, 100. Comment on the “goodness of fit” and the stability of the fit, and also compare with the plots generated without the extra penalty term.
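A minimal sketch of the ridge solve, assuming the normal equations take the form (XᵀX + ρI)θ = Xᵀy (which you should derive yourself in part i; the helper name is illustrative):

```python
import numpy as np

def ridge_fit(X, y, rho):
    """Solve the ridge normal equations (X^T X + rho*I) theta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + rho * np.eye(p), X.T @ y)

# toy example: ridge shrinks the solution relative to plain least squares
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
theta = ridge_fit(X, y, rho=1.0)
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)   # unregularized solution
print(np.linalg.norm(theta), np.linalg.norm(theta_ls))
```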

(c) Conditioning.

i. An unconstrained quadratic problem is any problem that can be written as

minimize_θ  (1/2)θᵀQθ + cᵀθ + r   (4)

for some symmetric positive semidefinite matrix Q, some vector c, and some scalar r. Show that the ridge regression problem (3) is an unconstrained quadratic problem by writing down Q, c, and r in terms of X and y such that (4) is equivalent to (3). Show that the Q you picked is positive semidefinite.

ii. In your code, write a function that takes in X and y, constructs Q as specified in the previous problem, and returns the condition number of Q. Report the condition number κ(Q) for varying values of p and ρ, by filling in the following table. Here, m = 742 is the total number of data samples. Report at least 2 significant digits. Comment on how much ridge regression is needed to affect conditioning.

p    ρ = 0    ρ = m    ρ = 10m    ρ = 100m
1
2
5
10
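The condition-number helper might look like the following, assuming Q = XᵀX + ρI (your derived Q may differ by constant factors; the toy data is illustrative):

```python
import numpy as np

def ridge_condition_number(X, rho):
    """Condition number of Q = X^T X + rho * I."""
    Q = X.T @ X + rho * np.eye(X.shape[1])
    return np.linalg.cond(Q)

# toy illustration: larger rho improves (lowers) the condition number,
# since kappa = (lambda_max + rho) / (lambda_min + rho) decreases in rho
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
k0 = ridge_condition_number(X, 0.0)
k1 = ridge_condition_number(X, 100.0)
print(k0, k1)
```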

iii. Under the same experimental parameters as the previous question, run ridge regression for each choice of p and ρ, and fill in the table with the mean squared error of the fit:

mean squared error = (1/m) Σ_{i=1}^{m} (xᵢᵀθ − y[i])²

where xᵢ is the ith row of X. Comment on the tradeoff between using larger ρ to improve conditioning vs its effect on the final performance.

p    ρ = 0    ρ = m    ρ = 10m    ρ = 100m
1
2
5
10
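The mean-squared-error formula above translates directly to code; a sketch with assumed names:

```python
import numpy as np

def mean_squared_error(X, theta, y):
    """(1/m) * sum_i (x_i^T theta - y[i])^2"""
    r = X @ theta - y          # residual vector
    return np.dot(r, r) / len(y)

# quick check against a hand-computed value:
# residual is [2, 3], so MSE = (4 + 9) / 2 = 6.5
X = np.array([[1., 0.], [0., 1.]])
theta = np.array([2., 3.])
y = np.array([0., 0.])
print(mean_squared_error(X, theta, y))   # 6.5
```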

(d) Forecasting. Picking your favorite set of hyperparameters (p, ρ), forecast the next week’s dew point temperature. Plot the forecasted data over the current observations. Do you believe your forecast? Why?

7. (2 pts) Logistic regression for Binary MNIST

(a) (0.5 pt) What is the gradient of the logistic loss function

f(θ) = −(1/m) Σ_{i=1}^{m} log(σ(yᵢxᵢᵀθ)),   σ(s) = 1/(1 + e^(−s)),

where yᵢ ∈ {−1, 1}? (Hint: Check out problem 1 again.)

(b) Coding gradient descent. Do not use scikit-learn or other built-in tools for this exercise. Please only use the packages that are already imported in the notebook.¹

Open mnist_logreg.ipynb. We will use logistic regression to differentiate 4’s from 9’s, a notoriously tricky problem. Run the first box to see what the data looks like.

In the second box I have set up the problem for you by pulling out the train and test set, selecting out only the data samples related to 4’s and 9’s. I have not altered the data in any other way. While other normalization tricks will help you improve the accuracy, for the purposes of this exercise we will forgo them, so that it’s easy to compare everyone’s solutions.

Fill in the next box by providing code that will return the loss function value and the gradient. Make sure that everything is normalized, e.g. don’t forget the 1/m term in front of our loss function. Run the script. If done correctly, you should see

45.192, 12343.177
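One possible shape for the loss/gradient box (variable names are assumptions, and the check numbers above depend on the actual MNIST data, so they are not reproduced here):

```python
import numpy as np

def logistic_loss_and_grad(theta, X, y):
    """Loss f(theta) = -(1/m) sum_i log sigma(y_i x_i^T theta) and its gradient.
    Labels y are in {-1, +1}."""
    m = len(y)
    s = y * (X @ theta)
    # -log(sigma(s)) = log(1 + exp(-s)); logaddexp(0, -s) computes this stably
    loss = np.mean(np.logaddexp(0.0, -s))
    sigma = 1.0 / (1.0 + np.exp(-s))
    # gradient: -(1/m) sum_i (1 - sigma(s_i)) y_i x_i
    grad = -(X.T @ (y * (1.0 - sigma))) / m
    return loss, grad

# tiny smoke test on synthetic data: at theta = 0 the loss is log(2)
X = np.array([[1., 2.], [-1., 1.]])
y = np.array([1., -1.])
loss, grad = logistic_loss_and_grad(np.zeros(2), X, y)
print(loss)   # log(2) ≈ 0.6931
```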

Write a script that returns the classification accuracy given θ.

Use gradient descent to minimize the logistic loss for this classification problem. Use a step size of 10⁻⁶.

(1 pt) Run for 1500 iterations. In your report, give the plot of the train loss and train/test misclassification rate, plotted as a function of iterations. Report the final train and test accuracy values.
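The gradient-descent loop itself is short; a sketch with an illustrative test problem standing in for the logistic loss (plug in your own gradient function and logging):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, step=1e-6, iters=1500):
    """Plain gradient descent: theta <- theta - step * grad_fn(theta)."""
    theta = theta0.copy()
    for _ in range(iters):
        theta -= step * grad_fn(theta)
    return theta

# smoke test on f(t) = 0.5 * ||t||^2, whose gradient is t; minimizer is 0
theta = gradient_descent(lambda t: t, np.array([1.0, -2.0]),
                         step=0.1, iters=200)
print(theta)   # close to [0, 0]
```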

(c) Coding stochastic gradient descent. Do not use scikit-learn or other built-in tools for this exercise. Please only use the packages that are already imported in the notebook. Now, fill in the next box with a function that takes in θ and a minibatch B, as either a list of numbers or an np.array, and returns the minibatch gradient

∇_B f(θ) = (1/|B|) Σ_{i∈B} ∇fᵢ(θ)

where fᵢ(θ) is the contribution to the gradient from datapoint i:

fᵢ(θ) = −log(σ(yᵢxᵢᵀθ)).

Run the script. If done correctly, you should see the number 5803.5 printed out.
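Since ∇fᵢ(θ) = −(1 − σ(yᵢxᵢᵀθ))yᵢxᵢ, the minibatch gradient might be sketched as follows (names are assumptions; the printed check number above depends on the real data):

```python
import numpy as np

def minibatch_grad(theta, X, y, batch):
    """Average of grad f_i = -(1 - sigma(y_i x_i^T theta)) y_i x_i
    over the indices in batch."""
    batch = np.asarray(batch)
    Xb, yb = X[batch], y[batch]
    s = yb * (Xb @ theta)
    sigma = 1.0 / (1.0 + np.exp(-s))
    return -(Xb.T @ (yb * (1.0 - sigma))) / len(batch)

# smoke test: the full index set reproduces the full-batch gradient,
# which must equal the size-weighted average of two disjoint batches
X = np.array([[1., 2.], [-1., 1.], [0.5, -1.]])
y = np.array([1., -1., 1.])
theta = np.array([0.1, -0.2])
g_full = minibatch_grad(theta, X, y, np.arange(3))
print(g_full)
```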

¹This is to test your understanding of the basic machine learning concepts, like calculating accuracy and logistic loss; in the future you can use whatever tools you’d like.

(d) Write a script to run stochastic gradient descent over logistic regression. When coding up the minibatching, make sure you cycle through an entire training set once before moving on to the next epoch. Additionally, use time() to record the runtime, and compare the performance of gradient descent and stochastic gradient descent, using a minibatch size of 50 data samples and running for 50000 iterations. Return a plot that compares the objective loss, train accuracy, and test accuracy between the two optimization methods, as a function of runtime. Comment on the pros and cons of the two methods.

Important: Remember that calculating the loss function and train/test accuracy requires making full passes through the data. If you do this at each iteration, you will not see any runtime benefit of stochastic gradient descent over gradient descent. Therefore I recommend you log these values every 10 iterations for gradient descent, and every 100 iterations for stochastic gradient descent.
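The epoch-cycling minibatch loop described above can be sketched as follows (the gradient callback and toy problem are illustrative, and the logging hooks are omitted):

```python
import numpy as np

def sgd(grad_fn, theta0, n, batch_size=50, iters=1000, step=1e-6, seed=0):
    """Minibatch SGD that consumes a full shuffled epoch before reshuffling."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    order, pos = rng.permutation(n), 0
    for _ in range(iters):
        if pos + batch_size > n:              # epoch exhausted: reshuffle
            order, pos = rng.permutation(n), 0
        batch = order[pos:pos + batch_size]
        pos += batch_size
        theta -= step * grad_fn(theta, batch)
    return theta

# smoke test: minimize the mean of 0.5*(theta - a_i)^2; minimizer is mean(a) = 0
a = np.linspace(-1.0, 1.0, 200)
grad = lambda th, batch: np.mean(th - a[batch])
theta = sgd(grad, np.array(0.0), n=200, batch_size=50, iters=5000, step=0.05)
print(theta)   # wanders near 0
```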


Challenge!

1. (1 pt.) Gradient descent for ridge regression. Consider the problem

minimize_x  F(x) := (1/2)‖Ax − b‖₂² + (ρ/2)‖x‖₂²   (5)

where the first term is denoted f(x) := (1/2)‖Ax − b‖₂², A ∈ R^(m×n), and n > m. Justify all answers.

(a) Define C = AᵀA. Recall that an eigenvalue of a symmetric matrix C is λ where Cv = λv for some vector v. Show that if λ_max² is the largest eigenvalue of C and λ_min² the minimum eigenvalue of C, then for any vector u,

λ_min²‖u‖₂ ≤ ‖Cu‖₂ ≤ λ_max²‖u‖₂.

Hint: The eigenvalue decomposition of a symmetric matrix can be written as C = V ΛVᵀ, where V = [v₁ v₂ · · · vₙ] contains the eigenvectors vᵢ of C, and Λ = diag(λ₁, λ₂, · · · , λₙ) contains the corresponding eigenvalues λᵢ. (That is, Cvᵢ = λᵢvᵢ.) Under certain conditions, which we will just assume,² V is an orthonormal matrix, e.g. VᵀV = V Vᵀ = I. Then we can use this to form projections: pick any vector u; then V Vᵀu = VᵀV u = u.

(b) Show that since n > m, if C = AᵀA then λ_min(C) = 0. Hint: The nonzero eigenvalues of AAᵀ and of AᵀA are the same.
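The hint can be checked numerically before proving it: the nonzero eigenvalues of AAᵀ and AᵀA coincide, and AᵀA picks up n − m extra zeros (an experiment, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))          # m = 3 < n = 5, as in the problem

ev_small = np.linalg.eigvalsh(A @ A.T)   # 3 eigenvalues, generically nonzero
ev_big = np.linalg.eigvalsh(A.T @ A)     # 5 eigenvalues; at least n - m = 2 are zero

print(np.sort(ev_big))                   # smallest two are ~0
print(np.sort(ev_small))                 # match the top three of ev_big
```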

(c) We say a function f : Rⁿ → R is L-smooth if its gradient is L-Lipschitz, e.g. for some L > 0,

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y.

Is f(x) as defined in (5) L-smooth? What about F(x)?

(d) We say a function f : Rⁿ → R is µ-strongly convex if it is tangent to a quadratic function that is strictly under it; that is, for some µ > 0,

f(x) − f(y) ≥ ∇f(y)ᵀ(x − y) + (µ/2)‖x − y‖₂², ∀x, y.

If this is only true for µ = 0, we say f is convex, but not strongly convex.

Is the function f(x) in (5) µ-strongly convex? What about F(x)? Hint: This problem requires less work if you first answer for F(x), and then take ρ = 0 for f(x).

(e) Linear convergence. We will now show that gradient descent on minimizing F(x) converges linearly. First, recall that gradient descent iterates as

x^(t+1) = x^(t) − α∇F(x^(t))

for some step size α > 0.

i. Show that for any point x* where ∇F(x*) = 0,

‖x^(t+1) − x*‖₂ ≤ c‖x^(t) − x*‖₂

for some c < 1. What is c?

ii. Use this to argue that gradient descent on (5) converges with linear complexity, e.g. the error f(x^(t)) − f(x*) = O(cᵗ).³

²Eigenvalues must all have algebraic multiplicity = geometric multiplicity.

³It’s a weird convention, but we say O(cᵗ) is a linear rate because the semilog graph looks linear. I don’t make the rules, I just share them with you.