CSC413/2516

Homework 1 – Version 1.4

Submission: You must submit your solutions as a PDF file through MarkUs¹. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable. See the syllabus on the course website² for detailed policies. You may ask questions about the assignment on Piazza³. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

The teaching assistants for this assignment are Denny Wu and Jonathan Lorraine ([email protected]).

1 Hard-Coding Networks

The reading on multilayer perceptrons located at https://csc413-2020.github.io/assets/readings/L02a.pdf may be useful for this question.

1.1 Verify Sort [1pt]

In this problem, you need to find a set of weights and biases for a multilayer perceptron which

determines if a list of length 4 is in sorted order. More specifically, you receive four inputs x1, . . . , x4,

where xi ∈ R, and the network must output 1 if x1 ≤ x2 ≤ x3 ≤ x4, and 0 otherwise. You will use

the following architecture:

All of the hidden units and the output unit use a hard threshold activation function:

φ(z) = I(z ≥ 0), i.e. φ(z) = 1 if z ≥ 0, and φ(z) = 0 if z < 0.

Please give a set of weights and biases for the network which correctly implements this function

(including cases where some of the inputs are equal). Your answer should include:

• A 3 × 4 weight matrix W(1) for the hidden layer

• A 3-dimensional vector of biases b^{(1)} for the hidden layer

• A 3-dimensional weight vector w(2) for the output layer

¹ https://markus.teach.cs.toronto.edu/csc413-2020-01
² https://csc413-2020.github.io/assets/misc/syllabus.pdf
³ https://piazza.com/class/k58ktbdnt0h1wx?cid=1

CSC413/2516 Winter 2020 with Professor Jimmy Ba Homework 1

• A scalar bias b^{(2)} for the output layer

You do not need to show your work.
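A small NumPy harness (not required for submission) can sanity-check a candidate set of weights against the specification. The function names below are our own, and the weight values are deliberately left for you to supply:

```python
import numpy as np

def hard_threshold(z):
    # phi(z) = I(z >= 0), applied elementwise
    return (z >= 0).astype(float)

def verify_sort_net(x, W1, b1, w2, b2):
    # Forward pass of the 4-input, 3-hidden-unit, 1-output architecture
    # with hard-threshold activations everywhere
    h = hard_threshold(W1 @ x + b1)
    return float(hard_threshold(w2 @ h + b2))

def check_weights(W1, b1, w2, b2, trials=1000, seed=0):
    # Compare the network's output against the sortedness predicate
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.normal(size=4)
        want = 1.0 if np.all(np.diff(x) >= 0) else 0.0
        if verify_sort_net(x, W1, b1, w2, b2) != want:
            return False
    return True
```

Remember to also test inputs with ties (e.g. repeated entries), since random draws almost never produce them.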

1.2 Perform Sort [1pt]

Describe how to implement a sorting function f̂ : R⁴ → R⁴ where f̂(x1, x2, x3, x4) = (x̂1, x̂2, x̂3, x̂4), and (x̂1, x̂2, x̂3, x̂4) is (x1, x2, x3, x4) in sorted order. In other words, x̂1 ≤ x̂2 ≤ x̂3 ≤ x̂4, and each x̂i is a distinct xj. Implement f̂ using a feedforward or recurrent neural network with elementwise activations. Do not explicitly give the weight matrices or biases for the entire function. Instead, describe how to compose smaller, modular networks. You may multiply the outputs of two nodes if you wish – e.g., some hidden unit activation by the output of a thresholding function.

Hint: There are multiple solutions. You could brute-force the answer by using copies of Verify Sort on permutations of the input, or you could implement a more scalable sorting algorithm where each hidden layer i is the algorithm's state at step i.
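As a concrete illustration of the "compose modular networks" idea, here is a NumPy sketch (one possible approach, not the only acceptable answer) that realizes each compare-and-swap of a sorting network with linear layers plus elementwise ReLUs, using min(a, b) = a − ReLU(a − b) and max(a, b) = b + ReLU(a − b):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def compare_swap(a, b):
    # min(a, b) = a - relu(a - b); max(a, b) = b + relu(a - b)
    d = relu(a - b)
    return a - d, b + d

def sort4(x):
    # A 5-comparator sorting network for 4 inputs; each stage is a
    # linear layer followed by elementwise ReLUs.
    x1, x2, x3, x4 = x
    x1, x2 = compare_swap(x1, x2)
    x3, x4 = compare_swap(x3, x4)
    x1, x3 = compare_swap(x1, x3)
    x2, x4 = compare_swap(x2, x4)
    x2, x3 = compare_swap(x2, x3)
    return [x1, x2, x3, x4]
```

Each comparator is itself a small modular network, and the fixed comparator pattern means the whole thing unrolls into a feedforward network.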

1.3 Universal Approximation Theorem [1pt]

We are going to build an intuition behind a simple Universal Approximation theorem, which shows

that some class of function approximators can approximate a particular class of functions arbitrarily

well.

In the reading we saw that neural networks can be universal approximators on binary functions, because with fixed input dimension there is a finite number of potential inputs and a network can memorize a different output for each input. But what can we do if we have an uncountably infinite set of potential inputs like R? Here, our class of function approximators will be neural networks with a single hidden layer with a threshold function as the activation, and a fixed choice of some concave f.

Suppose f : I → R, where I = [a, b] ⊂ R with a ≤ b is a closed interval. Also, let f̂τ : I → R be some function approximator from our network, where τ is a description of our network's architecture and weights. Here, τ is a tuple (n, W0 ∈ R^{n×1}, b0 ∈ R^n, W1 ∈ R^{1×n}, b1 ∈ R), where n is the hidden layer size, W0 & b0 describe the input-to-hidden parameters, and W1 & b1 describe the hidden-to-output parameters. This is visualized in Figure 1.

The difference between our functions is defined as ‖f − f̂τ‖ = ∫_I |f(x) − f̂τ(x)| dx. Our activation is an indicator function a(y) = I(y ≥ 0), where I(s) is 1 when the boolean value s is true and 0 otherwise. The output is computed as f̂τ(x) = W1 a(W0 x + b0) + b1. Here, applying a to a vector means a(x) = [a(x1), a(x2), . . . , a(xn)].

We want to show that there exists a series of neural networks {τ_i}_{i=1}^∞ such that:

∀ε > 0, ∃M : ∀m > M, ‖f − f̂_{τ_m}‖ < ε    (1)

1.3.1

Consider a bump function g(h, a, b, x) = h·I(a ≤ x ≤ b), visualized in Figure 2. Given some (h, a, b), show ∃τ : f̂τ(x) = g(h, a, b, x). Your answer should be a specific choice of n, W0, b0, W1, and b1, which will be functions of the selected (h, a, b), where h ∈ R, a ∈ R, and b ∈ R.
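As a sanity check, a two-hidden-unit network of the required form can be evaluated numerically. The values below are our own illustration, not the required derivation; note this construction agrees with g everywhere except possibly at the endpoint x = b, which has measure zero and so does not affect ‖·‖:

```python
import numpy as np

def step(z):
    # a(y) = I(y >= 0), applied elementwise
    return (z >= 0).astype(float)

def bump_net(h, a, b):
    # n = 2 hidden units compute h * I(x >= a) - h * I(x >= b)
    W0 = np.array([[1.0], [1.0]])
    b0 = np.array([-a, -b])
    W1 = np.array([[h, -h]])
    b1 = 0.0
    def f_hat(x):
        return float(W1 @ step(W0 @ np.array([x]) + b0) + b1)
    return f_hat
```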

1.3.2

Given f(x) = −x² + 1 where I = [−1, 1] and some initial function f̂0(x) = 0 which is identically 0, construct a new function f̂1(x) = f̂0(x) + g(h1, a1, b1, x) such that ‖f − f̂1‖ < ‖f − f̂0‖, with the g


Figure 1: A neural network that has an input x ∈ R with a single hidden layer that has n units, and an output f̂τ(x) ∈ R. The network description is τ = (n, W0 ∈ R^{n×1}, b0 ∈ R^n, W1 ∈ R^{1×n}, b1 ∈ R).


Figure 2: A bump function g(h, a, b, x) is shown in red as a function of x, for some choice of (h, a, b).

defined in 1.3.1. Note that h1, a1, and b1 are going to depend on our choice of f, f̂0 and I. Plot f & f̂1, write down h1, a1, and b1, and justify why ‖f − f̂1‖ < ‖f − f̂0‖.

1.3.3

Describe a procedure which starts with f̂0(x) = 0 and a fixed N, then constructs a series {f̂_i}_{i=0}^N where f̂_{i+1}(x) = f̂_i(x) + g(h_{i+1}, a_{i+1}, b_{i+1}, x) which satisfies ‖f − f̂_{i+1}‖ < ‖f − f̂_i‖. Use the definition of g from 1.3.1 and the choice of f from 1.3.2. Plot f, f̂1, f̂2, & f̂3, write down how to generate h_{i+1}, a_{i+1}, b_{i+1}, and justify why ‖f − f̂_{i+1}‖ < ‖f − f̂_i‖.
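A numerical sketch of one possible scheme (the staircase choice of h_{i+1}, a_{i+1}, b_{i+1} below is our own; other choices also work) confirms that each added bump shrinks the L1 error for f(x) = −x² + 1 on I = [−1, 1]:

```python
import numpy as np

f = lambda x: -x**2 + 1.0
xs = np.linspace(-1.0, 1.0, 20001)
l1 = lambda vals: 2.0 * np.mean(np.abs(vals))  # Riemann estimate of the integral over I

def error_series(N):
    # Split I into N equal intervals; step i+1 adds one bump on interval
    # i+1 whose height is f at that interval's midpoint.
    edges = np.linspace(-1.0, 1.0, N + 1)
    f_hat = np.zeros_like(xs)
    errs = [l1(f(xs) - f_hat)]
    for a, b in zip(edges[:-1], edges[1:]):
        h = f((a + b) / 2.0)
        f_hat = f_hat + h * ((xs >= a) & (xs < b))  # g(h, a, b, x), half-open
        errs.append(l1(f(xs) - f_hat))
    return errs
```

Running `error_series(N)` gives a strictly decreasing error sequence for this f, which is the behavior your written justification should explain.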

1.3.4

Not for marks – do not submit. Describe how to recover τ_i from your f̂_i. Generalize your series to work for {f̂_i}_{i=0}^∞, and show that lim_{m→∞} ‖f − f̂_{τ_m}‖ = c. We need c = 0 for arbitrarily good approximations – does your solution for {f̂_i}_{i=0}^∞ force c = 0? Can you generalize your procedure to work for any concave f? What about an f which isn't unimodal, like a piecewise continuous f?


2 Backprop

The reading on backpropagation located at https://csc413-2020.github.io/assets/readings/L02b.pdf may be useful for this question.

2.1 Computational Graph [1pt]

Consider a neural network with N input units, N output units, and K hidden units. The activations

are computed as follows:

z = W^{(1)}x + b^{(1)}
h = ReLU(z)
y = x + W^{(2)}h + b^{(2)}
y′ = softmax(y),

where ReLU(z) = max(z, 0) denotes the ReLU activation function, applied elementwise, and softmax(y) = exp(y) / Σ_{i=1}^M exp(y_i).

The cost will involve both h and y′:

J = R − S
R = r⊤h
S = Σ_{k=1}^N I(t = k) y′_k

for a given vector r and class label t with input x.

2.1.1

Draw the computation graph relating x, t, z, h, y, y′, r, R, S, and J.

2.1.2

Derive the backprop equations for computing x̄ = ∂J/∂x. You may use softmax′ to denote the derivative of the softmax function (so you don't need to write it out explicitly).
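When checking your derived x̄, a finite-difference comparison is handy. The snippet below is our own sketch (assuming NumPy): it implements only the forward pass from 2.1 plus a central-difference estimate of ∂J/∂x that you can compare your hand-derived expression against:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, r, t):
    # Forward pass from 2.1; note the skip connection y = x + W2 h + b2
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)
    y = x + W2 @ h + b2
    p = np.exp(y - y.max())
    p = p / p.sum()            # y' = softmax(y), shifted for stability
    R = r @ h
    S = p[t]                   # sum_k I(t = k) y'_k
    return R - S               # J

def num_grad_x(x, *args, eps=1e-6):
    # Central-difference estimate of dJ/dx (inaccurate near ReLU kinks)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (forward(x + e, *args) - forward(x - e, *args)) / (2 * eps)
    return g
```

If your backprop equations and the numerical estimate disagree by more than roughly 1e-5 at a point where no z component is near 0, revisit the derivation.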

2.2 Vector-Jacobian Products (VJPs) [1pt]

Consider the function f : R^n → R^n where f(x) = vv⊤x, and v ∈ R^{n×1} and x ∈ R^{n×1}. Here, we will explore the relative costs of evaluating Jacobians and vector-Jacobian products. We denote the Jacobian of f with respect to x as J ∈ R^{n×n}.

2.2.1

Compute J as defined in 2.2 for n = 3 and v⊤ = [1, 2, 3] – i.e., write down the values in

J = [ j_{1,1}  j_{1,2}  ···  j_{1,n}
      j_{2,1}  j_{2,2}  ···  j_{2,n}
        ⋮        ⋮      ⋱      ⋮
      j_{n,1}  j_{n,2}  ···  j_{n,n} ].


2.2.2

What is the time and memory cost of evaluating the Jacobian of the function f in terms of n?

2.2.3

Describe how to evaluate J⊤y where y ∈ R^n with a time and memory cost that is linear in n, where J is defined as in 2.2. Then, compute z = J⊤y where v⊤ = [1, 2, 3] and y⊤ = [1, 1, 1] – i.e., write down the entries in z⊤ = [z1, . . . , zn].

3 Linear Regression

The reading on linear regression located at https://csc413-2020.github.io/assets/readings/L01a.pdf may be useful for this question.

Given n pairs of input data with d features and a scalar label, (x_i, t_i) ∈ R^d × R, we wish to find a linear model f(x) = ŵ⊤x with ŵ ∈ R^d that minimizes the squared error of prediction on the training samples defined below. This is known as an empirical risk minimizer. For concise notation, denote the data matrix X ∈ R^{n×d} and the corresponding label vector t ∈ R^n. The training objective is to minimize the following loss:

training objective is to minimize the following loss:

min_{ŵ} (1/n) Σ_{i=1}^n (ŵ⊤x_i − t_i)² = min_{ŵ} (1/n) ‖Xŵ − t‖²₂.

We assume X is full rank: X⊤X is invertible when n > d, and XX⊤ is invertible otherwise. Note that when d > n, the problem is underdetermined, i.e. there are fewer training samples than parameters to be learned. This is analogous to learning an overparameterized model, which is common when training deep neural networks.

3.1 Deriving the Gradient [1pt]

Write down the gradient of the loss w.r.t. the learned parameter vector wˆ .
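A directional finite-difference check (our own harness, assuming NumPy) lets you validate whatever gradient expression you derive; since the loss is quadratic in ŵ, the central difference is exact up to floating-point error:

```python
import numpy as np

def loss(w, X, t):
    # (1/n) ||X w - t||_2^2
    r = X @ w - t
    return (r @ r) / len(t)

def grad_check(grad_fn, X, t, seed=0, eps=1e-6):
    # Compare grad_fn(w, X, t) against a central-difference estimate of
    # the directional derivative along a random direction u
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    u = rng.normal(size=X.shape[1])
    fd = (loss(w + eps * u, X, t) - loss(w - eps * u, X, t)) / (2 * eps)
    return abs(fd - grad_fn(w, X, t) @ u)
```

A correct gradient makes `grad_check` return a value near machine precision.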

3.2 Underparameterized Model [1pt]

3.2.1

First consider the underparameterized d < n case. Write down the solution obtained by gradient

descent assuming training converges. Show your work. Is the solution unique?

Hint: Set the gradient derived in part 3.1 to 0 and manipulate the equality.

3.2.2

Assume that the ground truth labels are generated by a linear target: t_i = w*⊤x_i. Show that the solution in part 3.2.1 achieves perfect generalization when d < n, i.e. ∀x ∈ R^d, (w*⊤x − ŵ⊤x)² = 0.
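A quick numerical illustration of the d < n setting (our own, assuming NumPy): with noiseless linear labels, solving the normal equations recovers w* exactly. Compare this against your own derivation in 3.2.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                      # underparameterized: n > d
X = rng.normal(size=(n, d))       # full rank with probability 1
w_star = rng.normal(size=d)
t = X @ w_star                    # noiseless linear labels t_i = w*^T x_i

# One way to obtain the unique minimizer: solve X^T X w = X^T t
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
```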

3.3 Overparameterized Model: 2D Example [1pt]

3.3.1

Now consider the overparameterized d > n case. We first illustrate that there exist multiple empirical risk minimizers. For simplicity we let n = 1 and d = 2. Choose x1 = [2, 1] and t1 = 2, i.e. the one data point and all possible ŵ lie on a 2D plane. Show that there exist infinitely many ŵ satisfying ŵ⊤x1 = t1 on a real line. Write down the equation of the line.

3.3.2

We know that multiple empirical risk minimizers exist in overparameterized linear regression, while there is only one true solution. Thus, it seems unlikely that gradient descent will generalize if it returns an arbitrary minimizer. However, we will show that gradient descent tends to find a particular solution with good properties. This phenomenon, known as implicit regularization, helps explain the success of gradient-based methods in training overparameterized models, like deep neural networks.

First consider the 2-dimensional example in the previous part: starting from zero initialization, i.e. ŵ^{(0)} = 0, what is the direction of the gradient? You should write down a unit-norm vector. Does the direction change along the trajectory? Based on this geometric intuition, which solution – along the line of solutions – does gradient descent find? Provide a pictorial sketch or a short description of your reasoning.
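A few steps of gradient descent on the 2D example (a numerical sanity check, not a substitute for the geometric argument) show where the iterates go:

```python
import numpy as np

x1 = np.array([2.0, 1.0])
t1 = 2.0
w = np.zeros(2)                       # zero initialization
for _ in range(200):
    grad = 2.0 * (w @ x1 - t1) * x1   # gradient of (w^T x1 - t1)^2
    w = w - 0.05 * grad
    # grad is always a scalar multiple of x1
```

Print the iterates and compare them with your pictorial sketch of the line of solutions.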

3.3.3

Give a geometric argument that among all the solutions on the line, the gradient descent solution

from the previous part has the smallest Euclidean norm.

Hint: Use the Pythagorean Theorem.

3.4 Overparameterized Model: General Case [1pt]

3.4.1

Now we generalize the geometric insight developed in the previous part to general d > n. Show that gradient descent from zero initialization, i.e. ŵ^{(0)} = 0, finds a unique minimizer if it converges. Write down the solution and show your work.

Hint: Recall the result on the direction of the gradient in the previous 2D example, which suggests that the gradient vector is always spanned by the rows of X. What does this tell you about the gradient descent solution?

3.4.2

Given the gradient descent solution from the previous part ŵ and another zero-loss solution ŵ1, evaluate (ŵ − ŵ1)⊤ŵ. Use this quantity to show that among all the empirical risk minimizers for d > n, the gradient descent solution has the smallest Euclidean norm.
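The quantity in question can be probed numerically (our own sketch, assuming NumPy; ŵ is computed directly here rather than by running gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                                    # overparameterized: d > n
X = rng.normal(size=(n, d))
t = rng.normal(size=n)

# Minimum-norm interpolator, lying in the row space of X
w_gd = X.T @ np.linalg.solve(X @ X.T, t)

# Another zero-loss solution: perturb by a null-space direction of X
u = rng.normal(size=d)
u = u - X.T @ np.linalg.solve(X @ X.T, X @ u)   # project out the row space
w_1 = w_gd + u
```

Numerically one finds X w_1 = t, (w_gd − w_1)⊤ w_gd = 0, and ‖w_1‖ ≥ ‖w_gd‖, consistent with the Pythagorean argument you are asked to make.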

3.5 Benefit of Overparameterization

3.5.1

Visualize and compare underparameterized with overparameterized polynomial regression: https://colab.research.google.com/drive/1NCSHp-gkh1kGhcySImbJo6AqeKoz-HcR. Include your code snippets for the fit_poly function in the write-up. Does overparameterization (a higher degree polynomial) always lead to overfitting, i.e. larger test error?
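For reference, a generic minimum-norm polynomial fit can be written in a few lines. This is our own sketch, not the notebook's fit_poly, whose signature may differ:

```python
import numpy as np

def poly_fit(x, t, degree):
    # Least-squares polynomial fit; np.linalg.pinv returns the
    # minimum-norm coefficients when the system is underdetermined.
    Phi = np.vander(x, degree + 1, increasing=True)   # features 1, x, ..., x^degree
    return np.linalg.pinv(Phi) @ t

def poly_eval(coef, x):
    return np.vander(x, len(coef), increasing=True) @ coef
```

Try both small and large degrees on held-out points to see how test error behaves as the degree grows past the number of training samples.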


3.5.2

Not for marks. What are some potential benefits of the minimum-norm solution?

Hint: readings on SVM might be helpful: https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/15_svm.pdf.
