LIGN 167: Problem Set 3

Collaboration policy: You may collaborate with up to two other students on this problem set. You must write up your own answers to the problems; do not just copy and paste from your collaborators. You must also submit your work individually. If you do not submit a copy of the problem set under your own name, you will not get credit. When you submit your work, you must indicate who you worked with, and what each of your individual contributions were.

Getting started: We will be uploading a file called pset3.py to Piazza. This file will contain some starter code for the problem set (some functions that you should call in your answers), as well as function signatures for your answers. Please use these function signatures for creating your functions.

In this problem set we will be implementing backpropagation for a multi-layer perceptron. This network is illustrated in Figure 1, and has the following mathematical definition. The vector $\vec{r}^0$ is defined in terms of the input $x$, which is a scalar, and the weight matrix $W^0$:

$$\vec{r}^0 = \begin{pmatrix} r^0_0 \\ r^0_1 \\ r^0_2 \end{pmatrix} = \begin{pmatrix} w^0_0 \cdot x \\ w^0_1 \cdot x \\ w^0_2 \cdot x \end{pmatrix} = W^0 \cdot x \qquad (1)$$

Here we are using the following definition of $W^0$:

$$W^0 = \begin{pmatrix} w^0_0 \\ w^0_1 \\ w^0_2 \end{pmatrix} \qquad (2)$$

The first hidden layer $\vec{h}^0$ is defined by applying a non-linearity (ReLU) to $\vec{r}^0$:

$$\vec{h}^0 = \begin{pmatrix} h^0_0 \\ h^0_1 \\ h^0_2 \end{pmatrix} = \begin{pmatrix} \mathrm{ReLU}(r^0_0) \\ \mathrm{ReLU}(r^0_1) \\ \mathrm{ReLU}(r^0_2) \end{pmatrix} = \mathrm{ReLU}(\vec{r}^0) \qquad (3)$$

The next layer, $\vec{r}^1$, is defined as follows:

$$\vec{r}^1 = \begin{pmatrix} r^1_0 \\ r^1_1 \\ r^1_2 \end{pmatrix} = \begin{pmatrix} w^1_{0,0} \cdot h^0_0 + w^1_{0,1} \cdot h^0_1 + w^1_{0,2} \cdot h^0_2 \\ w^1_{1,0} \cdot h^0_0 + w^1_{1,1} \cdot h^0_1 + w^1_{1,2} \cdot h^0_2 \\ w^1_{2,0} \cdot h^0_0 + w^1_{2,1} \cdot h^0_1 + w^1_{2,2} \cdot h^0_2 \end{pmatrix} = W^1 \cdot \vec{h}^0 \qquad (4)$$

Figure 1: Our multi-layer perceptron.

The matrix in this equation is defined by:

$$W^1 = \begin{pmatrix} w^1_{0,0} & w^1_{0,1} & w^1_{0,2} \\ w^1_{1,0} & w^1_{1,1} & w^1_{1,2} \\ w^1_{2,0} & w^1_{2,1} & w^1_{2,2} \end{pmatrix} \qquad (5)$$

The second hidden layer $\vec{h}^1$ is defined by applying a ReLU to $\vec{r}^1$:

$$\vec{h}^1 = \begin{pmatrix} h^1_0 \\ h^1_1 \\ h^1_2 \end{pmatrix} = \begin{pmatrix} \mathrm{ReLU}(r^1_0) \\ \mathrm{ReLU}(r^1_1) \\ \mathrm{ReLU}(r^1_2) \end{pmatrix} = \mathrm{ReLU}(\vec{r}^1) \qquad (6)$$

Finally, the output $y_{pred}$, which is a scalar value, is defined by:

$$y_{pred} = w^2_0 \cdot h^1_0 + w^2_1 \cdot h^1_1 + w^2_2 \cdot h^1_2 = W^2 \cdot \vec{h}^1 \qquad (7)$$

We have a dataset that consists of two parts: $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_n\}$. Each $x_i$ and $y_i$ is a scalar. The loss associated with a datapoint $x_i, y_i$ is defined by:

$$\ell_i = (y_{pred,i} - y_i)^2 \qquad (8)$$

Here we are writing $y_{pred,i}$ for the neural network's prediction given input $x_i$. The total loss $L$ can be written:

$$L(\theta \mid X, Y) = \sum_{i=1}^{n} \ell_i \qquad (9)$$

The parameter term $\theta$ captures all of the model parameters that are being learned, in this case: $W^0$, $W^1$, and $W^2$.

In the starter code that we have provided, we have given you an implementation of the forward direction of the neural network. That is, the provided function mlp will compute the output $y_{pred}$ of the network given a particular input $x$. In the problems, you will be implementing the *backwards* direction for the network, calculating the partial derivatives of the loss function with respect to the weight parameters.

The function mlp in the starter code returns a Python dictionary called variable_dict. The dictionary contains the value of all of the nodes in the network, after giving the network a particular input value $x_i$. We will be using this variable_dict throughout the rest of the problem set. You should spend some time reading through the code for mlp, to understand how it is constructed.
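To make the structure concrete, here is an illustrative sketch of a forward pass matching Equations (1)–(7). This is not the provided mlp function; the dictionary keys ('x', 'r0', 'h0', 'r1', 'h1', 'y_predicted') are assumptions about how variable_dict might be laid out, so check the starter code for the actual names.

```python
import numpy as np

def mlp_sketch(x, W0, W1, W2):
    """Hypothetical forward pass for the network in Figure 1.

    W0 and W2 are length-3 arrays; W1 is a 3x3 matrix.
    Returns a dictionary of node values, in the spirit of variable_dict.
    """
    r0 = W0 * x                     # Equation (1)
    h0 = np.maximum(r0, 0.0)        # Equation (3): ReLU
    r1 = W1 @ h0                    # Equation (4)
    h1 = np.maximum(r1, 0.0)        # Equation (6): ReLU
    y_predicted = W2 @ h1           # Equation (7): scalar output
    return {'x': x, 'r0': r0, 'h0': h0, 'r1': r1, 'h1': h1,
            'y_predicted': y_predicted}
```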

Problem 1. In this problem we will begin implementing the backpropagation algorithm, starting from the top of the network. You should write a function d_loss_d_ypredicted, which calculates the partial derivative $\frac{\partial \ell_i}{\partial y_{pred}}$. The loss $\ell_i$ is defined by Equation 8.

The function should take two arguments: variable_dict and y_observed. variable_dict is a dictionary containing the values of all of the nodes of the network, for a particular input value $x_i$ (as discussed above). y_observed is a real number, which equals the value $y_i$ observed for the input $x_i$.

Hint: retrieve the network's predicted value $y_{pred}$ by calling variable_dict['y_predicted'].
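One possible sketch of this function, assuming the prediction is stored under the key 'y_predicted':

```python
def d_loss_d_ypredicted(variable_dict, y_observed):
    # From Equation (8): d/dy_pred of (y_pred - y_i)^2 is 2 * (y_pred - y_i).
    return 2.0 * (variable_dict['y_predicted'] - y_observed)
```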


Problem 2. Write a function d_loss_d_W2 which takes two arguments, variable_dict and y_observed. variable_dict is a dictionary of network node values, and y_observed is a real number, as in the previous problem.

The function should compute the partial derivative $\frac{\partial \ell_i}{\partial W^2}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial W^2} = \begin{pmatrix} \frac{\partial \ell_i}{\partial w^2_0} & \frac{\partial \ell_i}{\partial w^2_1} & \frac{\partial \ell_i}{\partial w^2_2} \end{pmatrix} \qquad (10)$$

These three partial derivatives should be returned as a $1 \times 3$ NumPy array, in the same order as shown in the equation above.

Hint: call d_loss_d_ypredicted from the previous problem, and retrieve the network's value for the layer $\vec{h}^1$ from variable_dict. Then take partial derivatives of Equation 7.
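A minimal sketch of this computation, assuming variable_dict uses the (hypothetical) keys 'y_predicted' and 'h1'; the chain-rule step from Problem 1 is inlined here so the block stands alone:

```python
def d_loss_d_W2(variable_dict, y_observed):
    # Chain rule through Equation (7): since y_pred = sum_j w2_j * h1_j,
    # dl/dw2_j = (dl/dy_pred) * h1_j.
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)  # Problem 1
    return d_y * variable_dict['h1']
```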

Problem 3. Write a function d_loss_d_h1, which takes three arguments: variable_dict, W2, and y_observed. The arguments variable_dict and y_observed are the same as in previous problems. The argument W2 is a $1 \times 3$ NumPy array, which represents the weight matrix $W^2$ from Equation 7.

The function should compute the partial derivative $\frac{\partial \ell_i}{\partial \vec{h}^1}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial \vec{h}^1} = \begin{pmatrix} \frac{\partial \ell_i}{\partial h^1_0} & \frac{\partial \ell_i}{\partial h^1_1} & \frac{\partial \ell_i}{\partial h^1_2} \end{pmatrix} \qquad (11)$$

These three partial derivatives should be returned as a $1 \times 3$ NumPy array, in the same order as the equation above. (For the remainder of the problems, when a NumPy array is being returned, it should be in the same order as the corresponding equation.)
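One possible sketch, assuming the 'y_predicted' key from earlier problems; the reshape hedges against W2 arriving as either a length-3 array or a 1 × 3 array:

```python
import numpy as np

def d_loss_d_h1(variable_dict, W2, y_observed):
    # From Equation (7), dy_pred/dh1_j = w2_j, so by the chain rule
    # dl/dh1_j = (dl/dy_pred) * w2_j.
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    return d_y * np.asarray(W2).reshape(-1)
```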

Problem 4. Write a function relu_derivative, which takes a single argument x. The value x is a real number.

It should return the derivative $\frac{d\,\mathrm{ReLU}}{dx}(x)$, where the ReLU function is defined by:

$$\mathrm{ReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$
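The derivative follows directly from Equation (12): slope 1 on the positive side, slope 0 elsewhere. ReLU is not differentiable at exactly $x = 0$; the sketch below follows the common convention of returning 0 there.

```python
def relu_derivative(x):
    # Equation (12): slope 1 where x > 0, slope 0 otherwise
    # (the derivative at x = 0 is taken to be 0 by convention).
    return 1.0 if x > 0 else 0.0
```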

Problem 5. Write a function d_loss_d_r1, which takes three arguments: variable_dict, W2, and y_observed. These arguments should be the same as in Problem 3. The function should compute the partial derivative $\frac{\partial \ell_i}{\partial \vec{r}^1}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial \vec{r}^1} = \begin{pmatrix} \frac{\partial \ell_i}{\partial r^1_0} & \frac{\partial \ell_i}{\partial r^1_1} & \frac{\partial \ell_i}{\partial r^1_2} \end{pmatrix} \qquad (13)$$

These values should be returned as a $1 \times 3$ NumPy array.

Hint: Take partial derivatives in Equation 6, and use the function relu_derivative that you defined in Problem 4.
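A sketch of this step, assuming the hypothetical keys 'y_predicted' and 'r1' in variable_dict; the vectorized comparison (r1 > 0) plays the role of applying relu_derivative elementwise:

```python
import numpy as np

def d_loss_d_r1(variable_dict, W2, y_observed):
    # Chain through Equation (6): dl/dr1_j = dl/dh1_j * ReLU'(r1_j).
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    d_h1 = d_y * np.asarray(W2).reshape(-1)          # Problem 3
    return d_h1 * (variable_dict['r1'] > 0)          # elementwise ReLU derivative
```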

Problem 6. Write a function d_loss_d_W1, which takes three arguments: variable_dict, W2, and y_observed. These arguments should be the same as in Problem 3. The function should compute a matrix of partial derivatives $\frac{\partial \ell_i}{\partial W^1}$:

$$\frac{\partial \ell_i}{\partial W^1} = \begin{pmatrix} \frac{\partial \ell_i}{\partial w^1_{0,0}} & \frac{\partial \ell_i}{\partial w^1_{0,1}} & \frac{\partial \ell_i}{\partial w^1_{0,2}} \\ \frac{\partial \ell_i}{\partial w^1_{1,0}} & \frac{\partial \ell_i}{\partial w^1_{1,1}} & \frac{\partial \ell_i}{\partial w^1_{1,2}} \\ \frac{\partial \ell_i}{\partial w^1_{2,0}} & \frac{\partial \ell_i}{\partial w^1_{2,1}} & \frac{\partial \ell_i}{\partial w^1_{2,2}} \end{pmatrix} \qquad (14)$$

These partial derivatives should be returned as a NumPy array of dimension $3 \times 3$. To do this you should take partial derivatives in Equation 4.

Hint: This is not necessary, but it may be convenient to use the NumPy function np.outer, which computes the outer product of two (one-dimensional) arrays.
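A sketch using the np.outer hint, with the chain from Problem 5 inlined and the same assumed variable_dict keys ('y_predicted', 'r1', 'h0'):

```python
import numpy as np

def d_loss_d_W1(variable_dict, W2, y_observed):
    # From Equation (4), dr1_j/dw1_{j,k} = h0_k, so the gradient matrix
    # is the outer product of dl/dr1 with h0.
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    d_r1 = d_y * np.asarray(W2).reshape(-1) * (variable_dict['r1'] > 0)
    return np.outer(d_r1, variable_dict['h0'])
```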

Problem 7. Write a function d_loss_d_h0, which takes four arguments: variable_dict, W1, W2, and y_observed. The arguments variable_dict, W2, and y_observed should be the same as in previous problems. The argument W1 is a $3 \times 3$ matrix which represents the weight matrix $W^1$.

The function should compute the partial derivative $\frac{\partial \ell_i}{\partial \vec{h}^0}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial \vec{h}^0} = \begin{pmatrix} \frac{\partial \ell_i}{\partial h^0_0} & \frac{\partial \ell_i}{\partial h^0_1} & \frac{\partial \ell_i}{\partial h^0_2} \end{pmatrix} \qquad (15)$$

These partial derivatives should be returned as a $1 \times 3$ NumPy array.

Do this by taking partial derivatives in Equation 4.
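Each $h^0_k$ feeds all three components of $\vec{r}^1$ in Equation (4), so the derivative sums over those paths. A sketch under the same assumed variable_dict keys as before:

```python
import numpy as np

def d_loss_d_h0(variable_dict, W1, W2, y_observed):
    # dl/dh0_k = sum_j dl/dr1_j * w1_{j,k}, i.e. W1 transposed applied
    # to dl/dr1; for a 1-D array this is d_r1 @ W1.
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    d_r1 = d_y * np.asarray(W2).reshape(-1) * (variable_dict['r1'] > 0)
    return d_r1 @ W1
```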

Problem 8. Write a function d_loss_d_r0, which takes four arguments: variable_dict, W1, W2, and y_observed. These four arguments should be the same as in Problem 7. The function should compute the partial derivative $\frac{\partial \ell_i}{\partial \vec{r}^0}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial \vec{r}^0} = \begin{pmatrix} \frac{\partial \ell_i}{\partial r^0_0} & \frac{\partial \ell_i}{\partial r^0_1} & \frac{\partial \ell_i}{\partial r^0_2} \end{pmatrix} \qquad (16)$$

These partial derivatives should be returned as a $1 \times 3$ NumPy array.

Do this by taking partial derivatives in Equation 3.
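This step mirrors Problem 5 one layer down: pass the gradient through the ReLU of Equation (3). A sketch, again assuming the hypothetical keys 'y_predicted', 'r1', and 'r0':

```python
import numpy as np

def d_loss_d_r0(variable_dict, W1, W2, y_observed):
    # dl/dr0_k = dl/dh0_k * ReLU'(r0_k), chaining through Equation (3).
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    d_r1 = d_y * np.asarray(W2).reshape(-1) * (variable_dict['r1'] > 0)
    d_h0 = d_r1 @ W1                                  # Problem 7
    return d_h0 * (variable_dict['r0'] > 0)           # elementwise ReLU derivative
```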

Problem 9. Write a function d_loss_d_W0, which takes four arguments: variable_dict, W1, W2, and y_observed. These four arguments should be the same as in Problems 6 and 8.

The function should compute the partial derivative $\frac{\partial \ell_i}{\partial W^0}$, which is defined as follows:

$$\frac{\partial \ell_i}{\partial W^0} = \begin{pmatrix} \frac{\partial \ell_i}{\partial w^0_0} & \frac{\partial \ell_i}{\partial w^0_1} & \frac{\partial \ell_i}{\partial w^0_2} \end{pmatrix} \qquad (17)$$

These three partial derivatives should be returned as a $1 \times 3$ NumPy array.

Do this by taking partial derivatives in Equation 1.
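Since Equation (1) gives $r^0_k = w^0_k \cdot x$, the last step simply scales the Problem 8 gradient by the input. A sketch, with the input assumed to live under the hypothetical key 'x':

```python
import numpy as np

def d_loss_d_W0(variable_dict, W1, W2, y_observed):
    # dr0_k/dw0_k = x from Equation (1), so dl/dw0_k = dl/dr0_k * x.
    d_y = 2.0 * (variable_dict['y_predicted'] - y_observed)
    d_r1 = d_y * np.asarray(W2).reshape(-1) * (variable_dict['r1'] > 0)
    d_r0 = (d_r1 @ W1) * (variable_dict['r0'] > 0)    # Problem 8
    return d_r0 * variable_dict['x']
```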


Comments on the problems: You have now computed the partial derivatives $\frac{\partial \ell_i}{\partial W^0}$, $\frac{\partial \ell_i}{\partial W^1}$, and $\frac{\partial \ell_i}{\partial W^2}$. This is all that you need in order to perform gradient descent and optimize the weight parameters.
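For intuition, a single gradient-descent update using those three gradients could be sketched as below. This helper and its learning rate are illustrative only; they are not part of the problem set.

```python
def sgd_step(weights, grads, lr=0.01):
    # One gradient-descent update: move each weight array a small step
    # against its gradient. `weights` and `grads` are matching lists of
    # NumPy arrays (here: W0, W1, W2 and their loss gradients).
    return [W - lr * g for W, g in zip(weights, grads)]
```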

We have also included PyTorch code for the model in the starter code. Entirely optional: by slightly extending the starter code, you can compute gradients and verify your solutions to the problems.
