CSC413/2516

Homework 4 – Version 1.1

Submission: You must submit your solutions as a PDF file through MarkUs1. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable. See the syllabus on the course website2 for detailed policies. You may ask questions about the assignment on Piazza3. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

The teaching assistants for this assignment are Sajad Norouzi and John Chen.


Clarifications

All the modifications to the original assignment are marked in red in the main body of the

assignment. A complete list of the modifications can be found below:

• In Question 1.2.1, the upper bound in $0 \le \sigma_{\max}\left(\frac{\partial x_n}{\partial x_1}\right) \le \left(\frac{1}{2}\right)^2$ is changed to $0 \le \sigma_{\max}\left(\frac{\partial x_n}{\partial x_1}\right) \le \left(\frac{1}{2}\right)^n$.

1 https://markus.teach.cs.toronto.edu/csc413-2020-01
2 https://csc413-2020.github.io/assets/misc/syllabus.pdf
3 https://piazza.com/class/k58ktbdnt0h1wx?cid=1


1 Architectural Choice vs. Vanishing / Exploding Gradients

For any successful deep learning system, choosing the right network architecture is as important

as choosing a good learning algorithm. In this question, we will explore how various architectural

choices can have a significant impact on learning. We will analyze the learning performance from

the perspective of vanishing/exploding gradients as they are backpropagated from the final layer

to the first.

1.1 Warmup: A Single Neuron RNN

Consider an n-layer fully connected network that has scalar inputs and outputs. For now, assume that all the hidden layers have a single unit, and that the weight matrices are set to 1 (because each hidden layer has a single unit, the weight matrices have dimensionality $\mathbb{R}^{1 \times 1}$).

1.1.1 Effect of Activation – Sigmoid [1pt]

Let's say we're using the sigmoid activation. Let x be the input to the network and let $f : \mathbb{R}^1 \to \mathbb{R}^1$ be the function the network is computing. Do the gradients necessarily have to vanish or explode as they are backpropagated? Answer this by showing that $0 \le \left|\frac{\partial f(x)}{\partial x}\right| \le \left(\frac{1}{4}\right)^n$, where n is the number of layers in the network.
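As an optional numerical sanity check (not required for the answer), the NumPy sketch below composes n single-unit sigmoid layers with weight 1, accumulates the chain-rule gradient, and compares it against the $(1/4)^n$ bound. The function names are our own; the same structure, with the derivative replaced by $1 - \tanh^2$, can be reused for the tanh case in the next part.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(x, n):
    """Gradient of n stacked single-unit layers h <- sigmoid(1 * h), evaluated at input x."""
    h, grad = x, 1.0
    for _ in range(n):
        h = sigmoid(h)          # forward through one layer (weight = 1)
        grad *= h * (1.0 - h)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return grad

n = 10
for x in [-3.0, 0.0, 3.0]:
    g = chain_gradient(x, n)
    print(f"x={x:+.1f}  |df/dx|={abs(g):.3e}  bound=(1/4)^n={0.25 ** n:.3e}")
```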

1.1.2 Effect of Activation – Tanh [1pt]

Instead of the sigmoid, now let's say we're using the tanh activation (otherwise the same setup as in 1.1.1). Do the gradients necessarily have to vanish or explode this time? Answer this by deriving a bound similar to the one in Sec. 1.1.1 for the magnitude of the gradient.

1.2 Matrices and RNN

We will now analyze the recurrent weight matrices under the Singular Value Decomposition. SVD is one of the most important results in all of linear algebra. It says that any real matrix $M \in \mathbb{R}^{m \times n}$ can be written as $M = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are square orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix with nonnegative entries on the diagonal (i.e. $\Sigma_{ii} \ge 0$ for $i \in \{1, \ldots, \min(m, n)\}$ and 0 otherwise). Geometrically, this means any linear transformation can be decomposed into a rotation/flip, followed by scaling along orthogonal directions, followed by another rotation/flip.
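As an aside (not needed for the assignment), the NumPy sketch below illustrates the decomposition: it factors a random matrix and checks that $M = U \Sigma V^\top$ with orthogonal U, V and a nonnegative rectangular diagonal $\Sigma$.

```python
import numpy as np

M = np.random.randn(4, 3)                     # arbitrary real m x n matrix
U, s, Vt = np.linalg.svd(M)                   # s holds the singular values, in descending order
Sigma = np.zeros_like(M)
Sigma[:len(s), :len(s)] = np.diag(s)          # embed into a rectangular diagonal matrix

print(np.allclose(M, U @ Sigma @ Vt))         # True: M = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)),        # U and V are orthogonal
      np.allclose(Vt @ Vt.T, np.eye(3)))
print(s.min() >= 0)                           # singular values are nonnegative
```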

1.2.1 Gradient through RNN [1pt]

Let's say we have a very simple RNN-like architecture that computes $x_{t+1} = \tanh(W x_t)$. You can view this architecture as a deep fully connected network that uses the same weight matrix at each layer. Suppose the largest singular value of the weight matrix is $\sigma_{\max}(W) = \frac{1}{2}$. Show that the largest singular value of the input-output Jacobian has the following bound: $0 \le \sigma_{\max}\left(\frac{\partial x_n}{\partial x_1}\right) \le \left(\frac{1}{2}\right)^n$.

(Hint: if $C = AB$, then $\sigma_{\max}(C) \le \sigma_{\max}(A)\,\sigma_{\max}(B)$. Also, the input-output Jacobian is the product of the layerwise Jacobians.)
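The bound can also be checked empirically. The illustrative sketch below (our own, optional) rescales a random W so that $\sigma_{\max}(W) = \frac{1}{2}$, accumulates the product of layerwise Jacobians $\mathrm{diag}(\tanh'(W x_t))\, W$, and confirms that the largest singular value of the product stays below $(1/2)$ raised to the number of layers applied.

```python
import numpy as np

d, layers = 5, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.svd(W, compute_uv=False)[0]     # rescale so that sigma_max(W) = 1/2

x = rng.standard_normal(d)
J = np.eye(d)                                        # running product of layerwise Jacobians
for _ in range(layers):
    pre = W @ x
    x = np.tanh(pre)
    J = np.diag(1.0 - np.tanh(pre) ** 2) @ W @ J     # d tanh(Wx)/dx = diag(tanh'(Wx)) W

sigma_max = np.linalg.svd(J, compute_uv=False)[0]
print(sigma_max, "<=", 0.5 ** layers, sigma_max <= 0.5 ** layers)
```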


1.2.2 Gradient through Residual / GRU / LSTM Layers [0pt]

Suppose every layer of a neural network you are training performs the operation $z_{t+1} = z_t + f_t(z_t)$, where $z_t$ and $z_{t+1}$ are the activations in consecutive layers and $f_t$ is any continuous, differentiable function. This is true for ResNets, GRUs, and LSTMs.

In this part, we are dealing with a ResNet-like model that computes $z_{t+1} = z_t + f_t(z_t)$ at each layer. Let $\{\sigma_1, \ldots, \sigma_n\}$ be a list of all the singular values (in ascending order) of the Jacobian of the nonlinear part of the model, $\frac{\partial f_t(z_t)}{\partial z_t}$. For simplicity, assume that the Jacobian matrix has the following properties: the first half of the singular values are bounded above by $\sigma_{\text{small}}$, which is much smaller than 1 (i.e. $\forall i \in \{1, \ldots, \frac{n}{2}\}$, $\sigma_i\!\left(\frac{\partial f_t(z_t)}{\partial z_t}\right) \le \sigma_{\text{small}} \ll 1$), and the latter half of the singular values are bounded below by $\sigma_{\text{big}}$, which is much larger than 2 (i.e. $\forall i \in \{\frac{n}{2}, \ldots, n\}$, $\sigma_i\!\left(\frac{\partial f_t(z_t)}{\partial z_t}\right) \ge \sigma_{\text{big}} \gg 2$).

Show that $\sigma_{\min}\!\left(\frac{\partial z_{t+1}}{\partial z_t}\right) \ge 1 - \sigma_{\text{small}}$ and $\sigma_{\max}\!\left(\frac{\partial z_{t+1}}{\partial z_t}\right) \ge \sigma_{\text{big}} - 1$.

(Hint: $\sigma_i(A + B) \ge |\sigma_i(A) - \sigma_{\max}(B)|$.)

1.2.3 Benefits of Residual Connections [1pt]

From the results of 1.2.2, do residual connections/GRU/LSTM layers solve the vanishing gradients

problem? Do they also solve the exploding gradients problem?

1.3 Batch Normalization

Batch normalization is one of the most commonly used techniques to improve training of deep neural

networks. The method is also quite simple: at every layer, normalize the input by subtracting the

mean and dividing by the standard deviation of the mini-batch. The motivation is that data

normalization usually helps machine learning models, so it may help at every hidden layer.
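For concreteness, here is a minimal NumPy sketch (our own) of the batch-norm forward pass over a mini-batch; the learnable scale and shift (γ, β) of the standard layer are included as optional arguments.

```python
import numpy as np

def batchnorm_forward(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-norm forward pass for a mini-batch X of shape (m, d).

    Normalizes each feature by the mini-batch mean and standard deviation,
    then applies the (optional) learnable scale and shift.
    """
    mu_B = X.mean(axis=0)                      # mini-batch mean, shape (d,)
    var_B = X.var(axis=0)                      # mini-batch variance, shape (d,)
    x_hat = (X - mu_B) / np.sqrt(var_B + eps)  # normalized inputs
    return gamma * x_hat + beta                # outputs y_i

X = np.random.randn(8, 3) * 4.0 + 2.0
Y = batchnorm_forward(X)
print(Y.mean(axis=0), Y.std(axis=0))           # roughly 0 and 1 per feature
```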

1.3.1 Gradients of Batch Norm [0pt]

Figure 1: The forward pass computation of batch normalization. $x_i$ is the $i$th input vector in a mini-batch of size $m$; $y_i$ is the output vector for $x_i$.

Recall the forward computation of the batch normalization layer in Fig. 1. Derive the Jacobian of the batch normalization layer for the $i$th training example, $\frac{\partial y_i}{\partial x_i}$. You should simplify your answer using $\mu_B$ and $\sigma_B$.
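While working out the Jacobian by hand, a finite-difference check can be a useful sanity test. The sketch below (our own, optional, without scale/shift parameters) perturbs one example in the mini-batch, recomputes the normalization for the whole batch so that the dependence of $\mu_B$ and $\sigma_B$ on $x_i$ is accounted for, and estimates $\frac{\partial y_i}{\partial x_i}$ numerically for comparison against your derived expression.

```python
import numpy as np

def batchnorm(X):
    """Plain batch-norm forward pass (no scale/shift) over a mini-batch X of shape (m, d)."""
    mu_B = X.mean(axis=0)
    sigma_B = X.std(axis=0)
    return (X - mu_B) / sigma_B

def numerical_jacobian(X, i, h=1e-5):
    """Finite-difference estimate of dy_i / dx_i, including x_i's effect on mu_B and sigma_B."""
    d = X.shape[1]
    J = np.zeros((d, d))
    for k in range(d):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, k] += h
        Xm[i, k] -= h
        J[:, k] = (batchnorm(Xp)[i] - batchnorm(Xm)[i]) / (2 * h)
    return J

X = np.random.randn(6, 3)
print(numerical_jacobian(X, i=0))   # compare against your hand-derived Jacobian
```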


1.3.2 Batch Normalization and ResNet [1pt]

We now apply batch normalization to neural networks with residual connections. Consider the

following two ResNet architectures for the placement of the batch norm and the residual connection

in Fig. 2. A “weight” layer can be either a matrix multiplication or a convolution.

Which architecture is easier to learn, in terms of exploding/vanishing gradients? Provide a brief

justification for your answer.

Figure 2: Two different forms of ResNet blocks

2 Auto-regressive Models

In sequence modeling, we aim to learn a distribution of a large amount of data in order to generate

new samples, or evaluate the likelihood of any given data point. One approach to achieve this goal

is by using auto-regressive models. Let’s say we’re working with image data. In an auto-regressive

model, the joint distribution of pixels over an $H \times W$ image $x$ is defined as the following product of conditional distributions, where $x_i$ is a single pixel:

$$p(x) = \prod_{i=1}^{H \times W} p(x_i \mid x_1, \ldots, x_{i-1}). \qquad (2.1)$$

For simplicity, we can assume each conditional distribution is a Gaussian with a fixed variance, so the model predicts the mean of the next pixel given all the previous pixels. When designing an autoregressive model, we also need to define an ordering for the pixels, which is usually a raster scan order for images and the natural sequential ordering for text or audio.
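To make the factorization in Eq. 2.1 concrete, the sketch below evaluates the log-likelihood of a raster-scan-flattened image under fixed-variance Gaussian conditionals. The `predict_mean` argument is a stand-in for the autoregressive model (here a trivial toy that copies the previous pixel), not an actual architecture from this assignment.

```python
import numpy as np

def log_likelihood(x_flat, predict_mean, sigma=1.0):
    """log p(x) = sum_i log N(x_i | predict_mean(x_{<i}), sigma^2).

    x_flat: image flattened in raster-scan order, shape (H*W,)
    predict_mean: stand-in for the autoregressive model; maps the prefix
                  x_{<i} to the predicted mean of pixel i.
    """
    total = 0.0
    for i in range(len(x_flat)):
        mu_i = predict_mean(x_flat[:i])                  # conditions only on earlier pixels
        total += -0.5 * ((x_flat[i] - mu_i) / sigma) ** 2 \
                 - 0.5 * np.log(2 * np.pi * sigma ** 2)
    return total

# toy usage with a trivial "model" that predicts the previous pixel's value
x = np.random.rand(4, 4).ravel()                         # raster-scan flattening
print(log_likelihood(x, lambda prefix: prefix[-1] if len(prefix) else 0.0))
```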

In this question, we will look at several autoregressive architectures and analyze their computational complexity.

2.1 WaveNet

WaveNet, Oord et al. [2016], is an architecture proposed to model the distribution of raw audio

waveforms. WaveNet consists of several 1-D convolutional layers where the last layer of the network


has the same dimensionality as the input, corresponding to the mean of each conditional distribution

in Eq. 2.1.

WaveNet has two key innovations over classical convolutional models. First, to make sure that at each time step the model only sees values from previous time steps, a causal convolution is used. A causal convolution can be implemented similarly to a conventional convolution, but with zeros at the positions after the middle of the filter (see Fig. 3b).

In a standard causal convolution, the receptive field (context window) of neurons increases linearly with the number of layers. This is problematic when working with waveforms because the model needs to be very deep to capture any long-term context from the input sequence. To make the receptive field grow exponentially, dilated convolution is used, which introduces zeros in between kernel values. For example, in Fig. 3 the second layer has a dilation factor of three. The WaveNet architecture starts with a dilation factor of 1 and increases it by 1 for each subsequent layer. (For dilated convolution, see Lecture 9.)

Figure 3: Different variants of the convolutional layer, shown in 1-D. (a) Standard 1-D convolution; (b) WaveNet causal convolution; (c) causal dilated convolution.
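As an illustration of causal and dilated convolutions (our own sketch, single channel, using the two effective taps of a size-3 causal filter), the code below shifts the input by the dilation factor so that the output at time t depends only on inputs at time t and earlier.

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Causal 1-D convolution with a 2-tap filter w = (w_prev, w_curr).

    The output at time t depends only on x[t] and x[t - dilation]; positions
    before the start of the sequence are zero-padded, so the output has the
    same length as the input.
    """
    w_prev, w_curr = w
    x_shifted = np.concatenate([np.zeros(dilation), x[:-dilation]])  # x[t - dilation]
    return w_prev * x_shifted + w_curr * x

x = np.arange(8, dtype=float)
print(causal_conv1d(x, (1.0, 1.0), dilation=1))   # x[t] + x[t-1]
print(causal_conv1d(x, (1.0, 1.0), dilation=3))   # x[t] + x[t-3]: a wider receptive field
```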

For the following questions, assume 1-pixel zero padding and a stride of 1. We consider a convolution filter size of 3, i.e. there are two incoming weights for each causal convolution neuron in the single-channel case. Denote both the number of input channels and the number of filters at each layer by k. The input waveform has length t. For simplicity, we assume there are no bias parameters.

2.1.1 Connections [0pt]

Consider a d-layer WaveNet model. What is the total number of connections for a WaveNet without dilated convolutions? What is the total number of connections when using dilated convolutions? Give your answers in terms of d, k, t. You only need to give big-O, not an exact count.

2.1.2 Parallelism [0pt]

Suppose that in each step we compute as many matrix-vector multiplications in parallel as we can. What is the minimum number of sequential operations needed to compute the output of a WaveNet with and without dilation? Give your answers in terms of d, k, t. You only need to give big-O, not an exact count.

2.1.3 Discussion [0pt]

What is the benefit of using WaveNet over RNNs and Transformers? Discuss the pros and cons of these three models in terms of their computational and memory complexity, parallelization potential, and the size of their context windows.


2.2 PixelCNN

When it comes to 2-D data, PixelCNN, Van den Oord et al. [2016], is a seminal work for modeling the distribution of images with a convolutional architecture in an autoregressive manner. Similar to WaveNet, PixelCNN masks each convolutional filter so that it sees only the pixels that appear before the current pixel in raster-scan order. A visualization of the 2-D causal convolutional filters is shown in Fig. 4. Several convolutional layers with masked filters are stacked, and the last layer is assumed to give the mean of the conditional distribution for each pixel.

For the following questions, assume zero padding and a stride of 1. We consider a 2-D convolution filter of size 3, as in Fig. 4. Denote both the number of input channels and the number of filters at each layer by k. The input image size is H × W. For simplicity, we assume there are no bias parameters.

Figure 4: 2-D masked (causal) convolutional filters
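To illustrate the masking in Fig. 4, the sketch below (our own, single channel) builds a 3 × 3 raster-scan causal mask and applies it to a filter; the `include_center` flag distinguishes a first-layer mask, which must exclude the current pixel, from masks in later layers, following the usual mask A / mask B convention of the PixelCNN papers.

```python
import numpy as np

def raster_causal_mask(k=3, include_center=False):
    """k x k mask keeping only weights on pixels earlier in raster-scan order."""
    mask = np.zeros((k, k))
    c = k // 2
    mask[:c, :] = 1.0            # all rows strictly above the center
    mask[c, :c] = 1.0            # pixels to the left of the center in the same row
    if include_center:           # "mask B" style: later layers may also see the current pixel
        mask[c, c] = 1.0
    return mask

filt = np.random.randn(3, 3)
masked_filt = filt * raster_causal_mask(3, include_center=False)
print(raster_causal_mask(3, include_center=False))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```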

2.2.1 Connections [1pt]

Consider a d-layer PixelCNN model. What is the total number of connections? Give your answer in terms of d, k, H, W. You only need to give big-O, not an exact count.

2.2.2 Parallelism [1pt]

Suppose that in each step we compute as many matrix-vector multiplications in parallel as we can. What is the minimum number of sequential operations needed to compute the output of a PixelCNN, in terms of d, k, H, W? You only need to give big-O, not an exact count.

2.3 Multidimensional RNN

One of the predecessors to the PixelCNN architecture was the multidimensional RNN (MDRNN).

This is like the RNNs we discussed in lecture, except that instead of a 1-D sequence, we have

a 2-D grid structure. Analogously to how ordinary RNNs have an input vector and a hidden

vector for every time step, MDRNNs have an input vector and hidden vector for every grid square.

Each hidden unit receives bottom-up connections from the corresponding input square, as well as

recurrent connections from its north and west neighbors, as shown in Fig. 5.

The activations are computed as follows:

$$h^{(i,j)} = \phi\left(W_{\text{in}}^\top x^{(i,j)} + W_W^\top h^{(i-1,j)} + W_N^\top h^{(i,j-1)}\right).$$


Figure 5: A single-layer MDRNN

Denote both the number of input channels and the number of recurrent neurons at each layer by k. The input image size is H × W. For simplicity, we assume there are no bias parameters.
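As a reference for the recurrence above, here is a minimal NumPy sketch (our own) of a single-layer MDRNN forward pass: hidden states are filled in raster-scan order, and neighbors that fall outside the grid are treated as zero vectors.

```python
import numpy as np

def mdrnn_forward(X, W_in, W_W, W_N, phi=np.tanh):
    """Single-layer MDRNN over an H x W grid of k-dimensional inputs X[i, j].

    Implements h[i, j] = phi(W_in^T x[i, j] + W_W^T h[i-1, j] + W_N^T h[i, j-1]),
    with zero vectors standing in for neighbors outside the grid.
    """
    H, W, k = X.shape
    h = np.zeros((H, W, k))
    for i in range(H):                        # the grid must be filled in raster-scan order:
        for j in range(W):                    # each cell needs both of its neighbors first
            h_im1j = h[i - 1, j] if i > 0 else np.zeros(k)   # h^{(i-1, j)}
            h_ijm1 = h[i, j - 1] if j > 0 else np.zeros(k)   # h^{(i, j-1)}
            h[i, j] = phi(W_in.T @ X[i, j] + W_W.T @ h_im1j + W_N.T @ h_ijm1)
    return h

# toy usage: a 4 x 5 grid with k = 3 channels and square input/recurrent weights
H, W, k = 4, 5, 3
X = np.random.randn(H, W, k)
weights = [np.random.randn(k, k) for _ in range(3)]
print(mdrnn_forward(X, *weights).shape)       # (4, 5, 3)
```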

2.3.1 Connections [1pt]

Consider a d-layer MDRNN model. What is the total number of connections? Give your answer in terms of d, k, H, W. You only need to give big-O, not an exact count.

2.3.2 Parallelism [0pt]

Suppose that in each step we compute as many matrix-vector multiplications in parallel as we can. What is the minimum number of sequential operations needed to compute the output, in terms of d, k, H, W? You only need to give big-O, not an exact count.

2.3.3 Discussion [1pt]

What is the benefit of using PixelCNN over MDRNN? Discuss the pros and cons of the two models in terms of their computational and memory complexity, parallelization potential, and the size of their context windows.

References

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
