CSC413/2516

Homework 2 – Version 1.1

Submission: You must submit your solutions as a PDF file through MarkUs¹. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

See the syllabus on the course website² for detailed policies. You may ask questions about the assignment on Piazza³. Note that 10% of the homework mark may be removed for a lack of neatness.

The teaching assistants for this assignment are Ian Shi and Andrew Jung.

mailto:[email protected]

1 Optimization

This week, we will continue investigating the properties of optimization algorithms, focusing on

stochastic gradient descent and adaptive gradient descent methods. For a refresher on optimization,

please refer to: https://csc413-2020.github.io/assets/readings/L04.pdf.

We will continue using the linear regression model established in Homework 1. Given n pairs of input data with d features and scalar labels (x_i, t_i) ∈ R^d × R, we wish to find a linear model f(x) = ŵᵀx with ŵ ∈ R^d such that the squared error on the training data is minimized. Given a data matrix X ∈ R^{n×d} and corresponding labels t ∈ R^n, the objective function is defined as:

L = (1/n) ‖Xŵ − t‖²₂    (1)
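As a concrete reference point (not part of the assignment), the objective in Eq. (1) can be evaluated directly in NumPy; the toy data below is invented for illustration:

```python
import numpy as np

def squared_error_loss(X, w, t):
    """Mean squared-error loss L = (1/n) * ||X w - t||_2^2 from Eq. (1)."""
    n = X.shape[0]
    residual = X @ w - t
    return residual @ residual / n

# Tiny illustrative data (made up for this sketch): n = 2, d = 3.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
t = np.array([1.0, 2.0])
w = np.zeros(3)
print(squared_error_loss(X, w, t))  # at w = 0 this is ||t||^2 / n = 5/2 = 2.5
```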

1.1 Stochastic Gradient Descent (SGD)

SGD performs optimization by taking a stochastic estimate of the gradient from a single training example, iterating until convergence. Let x_i ∈ R^d, 1 ≤ i ≤ n, be a single training example taken from the data matrix X, and let L_i denote the loss with respect to x_i. The update for a single step of SGD at time t with scalar learning rate η is:

w_{t+1} ← w_t − η ∇_{w_t} L_i(x_i, w_t)    (2)

SGD iterates by randomly drawing training samples and updating the model weights using the above equation until convergence is reached.

1.1.1 Minimum Norm Solution [2pt]

Recall Question 3.4 from Homework 1. For an overparameterized linear model (d > n), gradient descent (GD) starting from zero initialization finds the unique minimum-norm solution w* such that Xw* = t. Let w_0 = 0 and d > n. Assume SGD also converges to a solution ŵ such that Xŵ = t. Show that the SGD solution is identical to the minimum-norm solution w* obtained by gradient descent, i.e., ŵ = w*.

Hint: Reuse the properties shown in Homework 1 Q3.4. Is x_i contained in the span of the rows of X? Do the update steps of SGD ever leave that span?
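As an optional numerical sanity check (not a substitute for the requested proof), the claim can be verified on invented toy data: starting from w_0 = 0, every SGD update adds a multiple of some row x_i, so the iterates never leave the row space of X:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized toy problem (d > n); data invented for this sketch.
n, d = 3, 8
X = rng.standard_normal((n, d))
t = rng.standard_normal(n)

# Minimum-norm interpolating solution: w* = X^T (X X^T)^{-1} t.
w_star = X.T @ np.linalg.solve(X @ X.T, t)

# SGD on the per-example loss L_i = (x_i^T w - t_i)^2, starting from w_0 = 0.
w = np.zeros(d)
eta = 0.01
for _ in range(100_000):
    i = rng.integers(n)
    grad = 2 * (X[i] @ w - t[i]) * X[i]  # gradient is a multiple of row x_i
    w -= eta * grad

print(np.allclose(w, w_star, atol=1e-3))  # True: SGD reached the min-norm solution
```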

¹ https://markus.teach.cs.toronto.edu/csc413-2020-01
² https://csc413-2020.github.io/assets/misc/syllabus.pdf
³ https://piazza.com/class/k58ktbdnt0h1wx?cid=1


CSC413/2516 Winter 2020 with Professor Jimmy Ba Homework 2

1.1.2 Mini-batch SGD [0pt]

Recall that mini-batch SGD performs stochastic gradient descent by considering the gradient of mini-batches of data B ∈ R^{b×d}, where 1 < b ≪ n and B is taken from the rows of X. Under the assumptions in Question 1.1.1, does mini-batch stochastic gradient descent obtain the minimum-norm solution on convergence?

1.2 Adaptive Methods

We will next consider the behavior of adaptive gradient descent methods. In particular, we will investigate the AdaGrad⁴ method. Let w_i denote the i-th parameter. A scalar learning rate η is used. At time t, the update step for parameter i under AdaGrad is:

w_{i,t+1} = w_{i,t} − (η / √(G_{i,t} + ε)) ∇_{w_{i,t}} L(w_{i,t})    (3)

G_{i,t} = G_{i,t−1} + (∇_{w_{i,t}} L(w_{i,t}))²    (4)

The term ε is a fixed small scalar used for numerical stability. Intuitively, AdaGrad can be thought of as adapting the learning rate in each dimension to move efficiently through poorly conditioned curvature (see lecture slides/notes).

1.2.1 Minimum Norm Solution [0pt]

Consider the overparameterized linear model (d > n) for the loss function defined in Section 1.

Assume the AdaGrad optimizer converges to a solution. Provide a proof or counterexample for

whether AdaGrad always obtains the minimum norm solution.

Hint: Compute the 2D case from HW1. Let x1 = [2, 1], w0 = [0, 0], t = [2].
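The hint's 2-D case can be explored numerically with a minimal AdaGrad sketch (η, ε, and the iteration count are our own choices). Because the two per-coordinate gradients stay exactly proportional here, AdaGrad's per-dimension normalization makes both coordinates move at the same speed:

```python
import numpy as np

# The hint's 2-D overparameterized example: one data point, two features.
X = np.array([[2.0, 1.0]])
t = np.array([2.0])

# Minimum-norm interpolating solution: w* = X^T (X X^T)^{-1} t = [0.8, 0.4].
w_star = X.T @ np.linalg.solve(X @ X.T, t)

# AdaGrad on L = ||X w - t||^2, starting from w_0 = [0, 0] (Eqs. (3)-(4)).
w = np.zeros(2)
G = np.zeros(2)
eta, eps = 0.5, 1e-8
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - t)      # per-coordinate gradients: always in ratio 2:1
    G += grad ** 2                    # Eq. (4): accumulate squared gradients
    w -= eta * grad / np.sqrt(G + eps)  # Eq. (3): per-dimension normalized step

print(w)       # ≈ [0.667, 0.667]: both coordinates moved at the same speed
print(w_star)  # [0.8, 0.4]: AdaGrad's solution is NOT the minimum-norm one
```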

1.2.2 General case [0pt]

Consider the result from the previous section. Does this result hold for other adaptive methods (RMSprop, Adam) in general? Why might making learning rates independent per dimension be desirable?

2 Gradient-based Hyper-parameter Optimization

In this problem, we will implement a simple toy example of gradient-based hyper-parameter optimization, introduced in Lecture 3 (slide 21).

Often in practice, hyper-parameters are chosen by trial and error based on a model evaluation criterion. Instead, gradient-based hyper-parameter optimization computes the gradient of the evaluation criterion w.r.t. the hyper-parameters and uses this gradient to directly optimize for the best set of hyper-parameters. For this problem, we will optimize for the learning rate of gradient descent in a linear regression problem, as in Homework 1.

Similar to Homework 1, a linear model will be used for this problem. Specifically, given n pairs of input data with d features and scalar labels (x_i, t_i) ∈ R^d × R, we wish to find a linear model f(x) = ŵᵀx with ŵ ∈ R^d that minimizes the squared error of prediction on the training samples.

⁴ http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf


Using the concise notation for the data matrix X ∈ R^{n×d} and the corresponding label vector t ∈ R^n, the squared error loss can be written as:

L = (1/n) ‖Xŵ − t‖²₂.

Starting with the initial weights w_0, gradient descent (GD) updates w_0 with a learning rate η for t iterations. Let's denote the weights after t GD iterations as w_t, the loss as L_t, and its gradient as ∇_{w_t}L_t. The goal is to find the optimal learning rate by following the gradient of L_t w.r.t. the learning rate η.
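To make the setup concrete, here is a minimal sketch (with invented toy data) that unrolls t GD steps as a function of η and estimates ∇_η L_t by central finite differences, a stand-in for backpropagating through the unrolled computation graph analyzed below:

```python
import numpy as np

def loss_after_t_steps(eta, X, t, w0, steps):
    """Unroll `steps` GD iterations with learning rate eta; return the final loss L_t."""
    n = X.shape[0]
    w = w0.copy()
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - t)  # gradient of L = (1/n)||Xw - t||^2
        w -= eta * grad
    r = X @ w - t
    return r @ r / n

# Invented toy data for the sketch.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
t = rng.standard_normal(5)
w0 = np.zeros(2)

# Central finite difference approximates the hyper-gradient dL_t/d(eta).
eta, h = 0.05, 1e-6
dL_deta = (loss_after_t_steps(eta + h, X, t, w0, 10)
           - loss_after_t_steps(eta - h, X, t, w0, 10)) / (2 * h)
print(f"estimated dL_10/d(eta) at eta={eta}: {dL_deta:.6f}")
```

Note that this forward re-evaluation is only a sanity check; the memory-cost questions below concern computing the same hyper-gradient with backpropagation.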

2.1 Computation Graph of Learning Rates [2pt]

2.1.1

Consider a case of 2 GD iterations. Draw the computation graph to obtain the final loss L_2 in terms of w_0, L_0, ∇_{w_0}L_0, w_1, L_1, ∇_{w_1}L_1, w_2, and η.

2.1.2

Then, consider a case of t GD updates. What is the memory complexity of forward-propagation to compute L_t, in terms of t? What is the memory complexity of using standard backpropagation to compute the gradient w.r.t. the learning rate, ∇_η L_t, in terms of t?

2.1.3

Explain one potential problem for applying gradient-based hyper-parameter optimization in more

realistic examples where models often take many iterations to converge.

2.2 Learning Learning Rates [2pt]

In this section, we will take a closer look at the gradient w.r.t. the learning rate. Let’s start with

the case with only one GD iteration, where GD updates the model weight from w0 to w1.

2.2.1

Write down the expression for w_1 in terms of w_0, η, t, and X. Then use this expression to derive the loss L_1 after a single GD iteration as a function of η.

Hint: if the expression gets too messy, introduce a constant vector a = Xw_0 − t.

2.2.2

Determine whether L_1 is convex w.r.t. the learning rate η.

Hint: a function is convex if its second-order derivative is non-negative.

2.2.3

Write down the derivative of L_1 w.r.t. η and use it to find the optimal learning rate η* that minimizes the loss after one GD iteration.


2.3 Multiple Inner-loop Iterations [0pt]

2.3.1

Derive the expression for the loss L_t after t gradient descent updates.

Hint: proof by induction and binomial coefficients can be useful.

2.3.2

Determine whether L_t is, in general, convex w.r.t. the learning rate η.

2.3.3

For the 2D over-parameterized case from Homework 1, describe the convexity of L_t w.r.t. η. What is the optimal η?

3 Convolutional Neural Networks

The last set of questions aims to build basic familiarity with convolutional neural networks. To refresh your knowledge, please refer to the readings on CNNs: https://csc413-2020.github.io/assets/readings/L05a.pdf and https://csc413-2020.github.io/assets/readings/L05b.pdf

3.1 Convolutional Filters [1pt]

Given the input matrix I and filter J shown below: 1) Write down the values of the resulting matrix (I ∗ J) (the convolution operation as defined in the Lecture 5 slides), assuming zero padding around the input. 2) What feature does this convolutional filter detect?

I =
0 1 1 0 0
0 1 1 1 0
1 1 1 1 0
0 1 1 1 0
0 0 1 0 0

J =
0 −1 0
−1 4 −1
0 −1 0

I ∗ J =
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
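For checking your hand computation, a generic zero-padded "same" convolution takes only a few lines of NumPy; the demo input below is deliberately not the assignment's I. (Note: some courses define I ∗ J without flipping the kernel, i.e. as cross-correlation; adjust the flip to match the Lecture 5 definition.)

```python
import numpy as np

def conv2d_same(I, J):
    """'Same'-size 2-D convolution with zero padding: flip the kernel, then slide it.
    (Drop the flip if your course defines (I * J) as cross-correlation.)"""
    kh, kw = J.shape
    K = J[::-1, ::-1]                  # 180-degree flip for true convolution
    ph, pw = kh // 2, kw // 2
    P = np.pad(I, ((ph, ph), (pw, pw)))  # zero padding around the input
    out = np.zeros_like(I, dtype=float)
    for r in range(I.shape[0]):
        for c in range(I.shape[1]):
            out[r, c] = np.sum(P[r:r + kh, c:c + kw] * K)
    return out

# Demo on a small made-up input (NOT the assignment's I, on purpose).
A = np.ones((4, 4))
J = np.array([[0, -1, 0],
              [-1, 4, -1],
              [0, -1, 0]])
print(conv2d_same(A, J))  # 0 in the interior, positive along the zero-padded border
```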

3.2 Size of ConvNets [1pt]

Consider a conv net with 4 conv layers, as in the diagram below. All 4 conv layers have a kernel size of 3 × 3. The number after the hyphen specifies the number of output channels or units of a layer (e.g. a Conv3-64 layer has 64 output channels and FC-1024 has 1024 output units). All max-pool layers in the diagram have size 2 × 2. Assume zero padding for the conv layers and stride 2 for max pooling.

The size of the RGB input image is 112 × 112 (3 channels).

Calculate the number of parameters for this conv net, including the bias units.
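Since the diagram is not reproduced in this text, we only sketch the generic counting rules: a k × k conv layer with c_in input channels and c_out filters has (k·k·c_in + 1)·c_out parameters (the +1 is the per-filter bias), a fully connected layer has (n_in + 1)·n_out, and pooling layers have none. A small helper (the layer sizes in the demo are just examples, not the full answer):

```python
def conv_params(k, c_in, c_out):
    """Parameters in a k x k conv layer: one k*k*c_in kernel plus a bias per filter."""
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Parameters in a fully connected layer, including one bias per output unit."""
    return (n_in + 1) * n_out

# e.g. a Conv3-64 layer applied to the 3-channel RGB input:
print(conv_params(3, 3, 64))  # (3*3*3 + 1) * 64 = 1792
# Max-pool layers contribute no parameters; spatial size only matters for the
# flattened input dimension of the first FC layer.
```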
