EE239AS Homework #3

Neural Networks & Deep Learning

100 points total.

For this HW, you should not be performing any tensor-matrix multiplies or calculating 4D tensor derivatives. You will need to know the following backpropagated derivatives for matrix-matrix multiplies, which you may use without proof:

Consider Z = XY. If we have an upstream derivative, ∂L/∂Z, then it is backpropagated through this operation as:

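$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} Y^T, \qquad \frac{\partial L}{\partial Y} = X^T \frac{\partial L}{\partial Z}$$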
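The two backpropagation rules above can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `matmul_backward` and `numerical_grad`, the toy loss, and the matrix sizes are illustrative choices and are not part of the assignment.

```python
import numpy as np

def matmul_backward(X, Y, dZ):
    """Backpropagate an upstream gradient dZ = dL/dZ through Z = X @ Y."""
    dX = dZ @ Y.T   # dL/dX = (dL/dZ) Y^T, same shape as X
    dY = X.T @ dZ   # dL/dY = X^T (dL/dZ), same shape as Y
    return dX, dY

def numerical_grad(f, A, eps=1e-6):
    """Central-difference estimate of df/dA, where f() reads A in place."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old          # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))   # hypothetical sizes
Y = rng.standard_normal((4, 2))
G = rng.standard_normal((3, 2))   # fixed weights, so that dL/dZ = G

loss = lambda: np.sum(G * (X @ Y))  # toy scalar loss L = sum(G * Z)
dX, dY = matmul_backward(X, Y, G)

print(np.allclose(dX, numerical_grad(loss, X), atol=1e-6))
print(np.allclose(dY, numerical_grad(loss, Y), atol=1e-6))
```

Note that each gradient has the same shape as the matrix it is taken with respect to, which is a quick way to recall on which side the transposed factor goes.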

1. (15 points) Backpropagation for autoencoders. In an autoencoder, we seek to reconstruct the original data after some operation that reduces the data’s dimensionality. We may

be interested in reducing the data’s dimensionality to gain a more compact representation of

the data.

For example, consider $x \in \mathbb{R}^n$. Further, consider $W \in \mathbb{R}^{m \times n}$ where $m < n$. Then $Wx$ is of lower dimensionality than $x$. One way to design $W$ so that $Wx$ still contains key features of $x$ is to minimize the following expression

$$L = \frac{1}{2} \left\| W^T W x - x \right\|^2$$

with respect to $W$. (To be complete, autoencoders also have a nonlinearity in each layer, i.e., the loss is $\frac{1}{2} \left\| f(W^T f(Wx)) - x \right\|^2$. However, we'll work with the linear example.)

(a) (3 points) In words, describe why this minimization finds a W that ought to preserve

information about x.

(b) (3 points) Draw the computational graph for L.

(c) (3 points) In the computational graph, there should be two paths to $W$. How do we account for these two paths when calculating $\nabla_W L$? Your answer should include a mathematical argument.

(d) (6 points) Calculate the gradient: $\nabla_W L$.
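A hand-derived gradient for the linear autoencoder loss can be checked against finite differences. The sketch below, assuming NumPy, evaluates the loss and estimates the gradient numerically; it deliberately does not contain the analytic answer. The function names and the sizes `n` and `m` are hypothetical.

```python
import numpy as np

def autoencoder_loss(W, x):
    """Linear autoencoder loss L = 0.5 * ||W^T W x - x||^2."""
    h = W @ x            # encode: m-dimensional code
    x_hat = W.T @ h      # decode: n-dimensional reconstruction
    r = x_hat - x        # reconstruction residual
    return 0.5 * float(r @ r)

def numerical_grad_W(W, x, eps=1e-6):
    """Central-difference estimate of grad_W L, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (autoencoder_loss(W_plus, x)
                     - autoencoder_loss(W_minus, x)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n, m = 5, 2                      # hypothetical sizes with m < n
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

grad_num = numerical_grad_W(W, x)
print(autoencoder_loss(W, x), grad_num.shape)
```

Comparing an analytic gradient to `grad_num` with `np.allclose` (using a tolerance around 1e-5) is one way to catch a missing term from either path to $W$.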

2. (20 points) Backpropagation for Gaussian-process latent variable model. An important component of unsupervised learning is visualizing high-dimensional data in low-dimensional spaces. One nonlinear algorithm for doing so, from Lawrence, NIPS 2004, is called GP-LVM. GP-LVM maximizes the likelihood of a probabilistic model. We won't get into the details here, but rather skip to the bottom line: in this paper, a log-likelihood has to be differentiated with respect to a matrix to derive the optimal parameters.

To do so, we will apply the chain rule for multivariate derivatives via backpropagation.

The log-likelihood is:

$$L = -c - \frac{D}{2} \log |K| - \frac{1}{2} \mathrm{tr}\left(K^{-1} Y Y^T\right)$$

where $K = \alpha X X^T + \beta^{-1} I$ and $c$ is a constant. To solve this, we'll take the derivatives of the two terms that depend on $X$:

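$$L_1 = -\frac{D}{2} \log \left| \alpha X X^T + \beta^{-1} I \right|$$

$$L_2 = -\frac{1}{2} \mathrm{tr}\left( \left(\alpha X X^T + \beta^{-1} I\right)^{-1} Y Y^T \right)$$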
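For reference when checking derivatives numerically, the two terms can be evaluated directly. This is a minimal sketch assuming NumPy; the function name `gplvm_terms`, the matrix sizes, and the values of alpha and beta are hypothetical, and D is passed explicitly because it is not defined in the problem statement.

```python
import numpy as np

def gplvm_terms(X, Y, D, alpha, beta):
    """Return (L1, L2) for K = alpha * X X^T + (1/beta) * I."""
    N = X.shape[0]
    K = alpha * (X @ X.T) + np.eye(N) / beta
    # slogdet returns (sign, log|det K|); it avoids overflow in det(K).
    _, logdet = np.linalg.slogdet(K)
    L1 = -0.5 * D * logdet
    # solve(K, Y Y^T) computes K^{-1} Y Y^T without forming K^{-1} explicitly.
    L2 = -0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))
    return L1, L2

rng = np.random.default_rng(0)
N, q, D = 6, 2, 3                 # hypothetical sizes
X = rng.standard_normal((N, q))
Y = rng.standard_normal((N, D))

L1, L2 = gplvm_terms(X, Y, D, alpha=1.0, beta=2.0)
print(L1, L2)
```

Because K is a positive multiple of the identity plus a positive semidefinite matrix (for positive alpha and beta), its determinant is positive and the logarithm is well defined.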

Hint: To receive full credit, you will be required to show all work. You may use the following

matrix derivatives without proof:

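$$\frac{\partial\, \mathrm{tr}\, X}{\partial X} = I$$

$$\frac{\partial \log |K|}{\partial K} = K^{-T}$$

$$\frac{\partial L}{\partial K} = -K^{-T} \frac{\partial L}{\partial K^{-1}} K^{-T}$$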

That is, the first equation tells you how to backpropagate through tr(·), the second equation

tells you how to backpropagate through log | · |, and the third equation tells you how to

backpropagate through a matrix inversion. log | · | refers to the log of a determinant.
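The three identities can be verified with finite differences before they are used in a derivation. The following is a sketch assuming NumPy; the test matrix, the toy loss used for the inversion rule, and the helper name `numerical_grad` are illustrative assumptions. The matrix K here is deliberately non-symmetric so that the transposes in the identities matter.

```python
import numpy as np

def numerical_grad(f, A, eps=1e-6):
    """Central-difference estimate of df/dA for a scalar-valued f(A)."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
# Hypothetical test matrix: a small perturbation of 4*I, so it is invertible
# with positive determinant but not symmetric.
K = 4.0 * np.eye(4) + 0.3 * rng.standard_normal((4, 4))
G = rng.standard_normal((4, 4))       # plays the role of dL/dK^{-1}
K_inv_T = np.linalg.inv(K).T

# 1) d tr(X) / dX = I
ok_trace = np.allclose(numerical_grad(np.trace, K), np.eye(4), atol=1e-6)

# 2) d log|K| / dK = K^{-T}
logdet = lambda A: np.linalg.slogdet(A)[1]
ok_logdet = np.allclose(numerical_grad(logdet, K), K_inv_T, atol=1e-6)

# 3) Toy loss L = sum(G * K^{-1}), so dL/dK^{-1} = G and
#    dL/dK = -K^{-T} G K^{-T}
inv_loss = lambda A: np.sum(G * np.linalg.inv(A))
ok_inverse = np.allclose(numerical_grad(inv_loss, K),
                         -K_inv_T @ G @ K_inv_T, atol=1e-6)

print(ok_trace, ok_logdet, ok_inverse)
```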

(a) (3 points) Draw a computational graph for $L_1$.

(b) (6 points) Compute $\frac{\partial L_1}{\partial X}$.

(c) (3 points) Draw a computational graph for $L_2$.

(d) (6 points) Compute $\frac{\partial L_2}{\partial X}$.

(e) (2 points) Compute $\frac{\partial L}{\partial X}$.

3. (40 points) 2-layer neural network. Complete the two-layer neural network Jupyter notebook. Print out the entire workbook and relevant code and submit it as a PDF to Gradescope.

Download the CIFAR-10 dataset, as you did in HW #2.

4. (25 points) General FC neural network. Complete the FC Net Jupyter notebook. Print out the entire workbook and relevant code and submit it as a PDF to Gradescope.