Submission: Submit both your report (a single PDF file) and all code on Quercus.
The purpose of this assignment is to investigate the classification performance of neural networks.
You will be implementing a neural network model using Numpy, followed by an implementation
in Tensorflow. You are encouraged to look up TensorFlow APIs for useful utility functions, at:
• Full points are given for complete solutions, including justifying the choices or assumptions
you made to solve each question. A written report should be included in the final submission.
• Programming assignments are to be solved and submitted individually. You are encouraged
to discuss the assignment with other students, but you must solve it on your own.
• Please ask all questions related to this assignment on Piazza, using the tag assignment2.
The dataset that we will use in this assignment is a permuted version of notMNIST, which contains
28-by-28 images of 10 letters (A to J) in different fonts. This dataset has 18720 instances, which
can be divided into different sets for training, validation and testing. The provided file is in .npz
format, NumPy's archive format for Python. You can load this file as follows.
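import numpy as np

def loadData():  # function name assumed; the original listing omits the def line
    with np.load("notMNIST.npz") as data:
        Data, Target = data["images"], data["labels"]
    # randIndx keeps the stored order of the examples
    randIndx = np.arange(len(Data))
    Data = Data[randIndx] / 255.0  # scale pixel values by 1/255
    Target = Target[randIndx]
    # 10000 training, 6000 validation, and 2720 test examples
    trainData, trainTarget = Data[:10000], Target[:10000]
    validData, validTarget = Data[10000:16000], Target[10000:16000]
    testData, testTarget = Data[16000:], Target[16000:]
    return trainData, validData, testData, trainTarget, validTarget, testTarget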
Since you will be investigating multi-class classification, you will need to convert the data into a
one-hot encoding format. The code snippet below will help you with that.
def convertOneHot(trainTarget, validTarget, testTarget):
    # One row per example, one column per class (10 letters).
    newtrain = np.zeros((trainTarget.shape[0], 10))
    newvalid = np.zeros((validTarget.shape[0], 10))
    newtest = np.zeros((testTarget.shape[0], 10))
    # Set the column given by each integer label to 1.
    for item in range(0, trainTarget.shape[0]):
        newtrain[item][trainTarget[item]] = 1
    for item in range(0, validTarget.shape[0]):
        newvalid[item][validTarget[item]] = 1
    for item in range(0, testTarget.shape[0]):
        newtest[item][testTarget[item]] = 1
    return newtrain, newvalid, newtest
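The same encoding can also be produced without explicit loops. The following is a minimal sketch, assuming the labels are integers in the range 0 to 9; the helper name one_hot is hypothetical and not part of the provided code.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return an (N, num_classes) one-hot matrix for N integer labels."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]

example = one_hot([0, 3, 9])
```

Indexing the identity matrix with the label vector selects one row per example, so the result has one 1 per row, in the column of the label.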
1 Neural Networks using Numpy [20 pts.]
In this part, you will be tasked with implementing and training a neural network to classify the letters using Numpy and Gradient Descent with Momentum. The network you will be implementing
has the following structure:
• 3 layers – 1 input, 1 hidden with ReLU activation and 1 output with Softmax:
– Input Layer: x (F units, here: F = 784)
– Hidden Layer: h = ReLU(Wh x + bh) (H units)
– Output Layer: p = softmax(o), where o = Wo h + bo (K units, here K = 10)
• Cross Entropy Loss: $L = -\sum_{k=1}^{K} y_k \log(p_k)$,
where $y = [y_1, y_2, \cdots, y_K]^\top$ is the one-hot coded vector of the label.
During training, it may be beneficial to periodically save the weights to a file; the function
numpy.savetxt may be useful. As an estimate of the running time, training
the Numpy implementation should not take longer than an hour (tested on an Intel i7 3770K at
3.40 GHz and 16 GB of RAM). For this part only, Tensorflow implementations will not be accepted.
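As an illustration of saving and restoring a weight matrix, here is a minimal sketch using numpy.savetxt and numpy.loadtxt. The matrix size, the file name, and the use of the system temporary directory are assumptions chosen for the example.

```python
import os
import tempfile
import numpy as np

# A small stand-in for a weight matrix (the real ones are much larger).
W = np.random.randn(4, 3)

# Hypothetical file name, written to the system temporary directory.
path = os.path.join(tempfile.gettempdir(), "W_o_example.txt")
np.savetxt(path, W)          # writes one matrix row per text line
W_loaded = np.loadtxt(path)  # reads the matrix back as float64
```

numpy.savetxt handles 1-D and 2-D arrays only, which is sufficient for the weight matrices and bias vectors used in this part.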
1.1 Helper Functions [6 pt.]
To implement the neural network described earlier, you will need to implement the following
vectorized (i.e. no for loops, they must rely on matrix/vector operations) helper functions. Include
the snippets of your Python code in the report.
1. ReLU(): This function will accept one argument and return a Numpy array with the ReLU
activation applied elementwise, as given by the equation below. [0.5 pt]
ReLU(x) = max(x, 0)
2. softmax(): This function will accept one argument and return a Numpy array with the
softmax activations of each of the inputs and the equation is shown below. [0.5 pt]
$\sigma(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$, j = 1, · · · , K for K classes.
Important Hint: In order to prevent overflow while computing exponentials, you should first
subtract the maximum value of z from all its elements.
3. compute(): This function will accept 3 arguments: a weight matrix, an input vector, and a
bias vector and return the product between the weights and input, plus the biases (i.e. a
prediction for a given layer). [0.5 pt]
4. averageCE(): This function will accept two arguments, the targets (e.g. labels) and predictions,
both matrices of the same size. It will return a single number, the average cross entropy
loss for the dataset (i.e. training, validation, or test). For K classes, the formula is shown
below. [0.5 pt]
$\text{Average CE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log(p_k^{(n)})$
where $y_k^{(n)}$ is the true one-hot label for sample n, $p_k^{(n)}$ is the predicted class probability (i.e.
softmax output for the k-th class) of sample n, and N is the number of examples.
5. gradCE(): This function will accept two arguments, the targets (i.e. labels y) and the input
to the softmax function (i.e. o). It will return the gradient of the cross entropy loss with
respect to the inputs to the softmax function: ∂L/∂o. Show the analytical expression
in your report. [2 pt.]
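The five helpers above can be sketched as follows. This is one possible vectorized layout under stated assumptions, not a reference solution: inputs are taken to be stored one example per row, compute() multiplies the input on the left (X @ W) to match the shapes listed in section 1.2, the snake_case names are hypothetical, and the small constant inside the logarithm is an added guard against log(0). The gradient returned by grad_ce is the standard per-example result softmax(o) − y; the derivation still belongs in the report.

```python
import numpy as np

def relu(x):
    # Elementwise max(x, 0).
    return np.maximum(x, 0)

def softmax(z):
    # Subtract the row-wise maximum first to avoid overflow in exp.
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def compute(X, W, b):
    # Affine map for a whole batch: (N, in) @ (in, out) + (1, out).
    return X @ W + b

def average_ce(target, prediction):
    # Mean over examples of -sum_k y_k log(p_k); eps is an added safeguard.
    eps = 1e-12
    return -np.mean(np.sum(target * np.log(prediction + eps), axis=1))

def grad_ce(target, o):
    # Per-example gradient of the loss w.r.t. the softmax inputs o.
    return softmax(o) - target

# Large logits would overflow a naive softmax; the shifted version stays finite.
z = np.array([[1000.0, 1001.0, 1002.0]])
y = np.array([[0.0, 0.0, 1.0]])
p = softmax(z)
```

If the averaged loss is used for training, the per-example gradient from grad_ce is divided by the number of examples N.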
1.2 Backpropagation Derivation [8 pts.]
To train the neural network, you will need to implement the backpropagation algorithm. For the
neural network architecture outlined in the assignment description, derive the following analytical
expressions and include them in your report:
1. $\partial L / \partial W_o$, the gradient of the loss with respect to the output layer weights. [2 pt.]
• Shape: (H × 10), with H units
2. $\partial L / \partial b_o$, the gradient of the loss with respect to the output layer biases. [2 pt.]
• Shape: (1 × 10)
3. $\partial L / \partial W_h$, the gradient of the loss with respect to the hidden layer weights. [2 pt.]
• Shape: (F × H), with F features, H units
4. $\partial L / \partial b_h$, the gradient of the loss with respect to the hidden layer biases. [2 pt.]
• Shape: (1 × H), with H units.
Hints: The labels y have been one hot encoded. You will also need the derivative of the ReLU()
function in order to backpropagate the gradient through the activation.
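A derived gradient can be sanity-checked numerically before it is used for training. The sketch below assumes the row-vector layout implied by the shapes above (X is N × F, W_h is F × H, W_o is H × K) and the averaged loss; the gradient expressions in backward() are the standard chain-rule results, shown only to illustrate the check, and all function names and the tiny network sizes are hypothetical.

```python
import numpy as np

def forward(X, W_h, b_h, W_o, b_o):
    z_h = X @ W_h + b_h                           # hidden pre-activation, (N, H)
    h = np.maximum(z_h, 0)                        # ReLU
    o = h @ W_o + b_o                             # logits, (N, K)
    e = np.exp(o - o.max(axis=1, keepdims=True))  # stabilised exponentials
    p = e / e.sum(axis=1, keepdims=True)          # softmax
    return z_h, h, p

def average_loss(X, Y, W_h, b_h, W_o, b_o):
    _, _, p = forward(X, W_h, b_h, W_o, b_o)
    return -np.mean(np.sum(Y * np.log(p), axis=1))

def backward(X, Y, W_h, b_h, W_o, b_o):
    N = X.shape[0]
    z_h, h, p = forward(X, W_h, b_h, W_o, b_o)
    delta_o = (p - Y) / N                          # dL/do for the averaged loss
    grad_W_o = h.T @ delta_o                       # (H, K)
    grad_b_o = delta_o.sum(axis=0, keepdims=True)  # (1, K)
    delta_h = (delta_o @ W_o.T) * (z_h > 0)        # ReLU derivative is 0 or 1
    grad_W_h = X.T @ delta_h                       # (F, H)
    grad_b_h = delta_h.sum(axis=0, keepdims=True)  # (1, H)
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o

# Finite-difference check on a tiny random network (sizes are arbitrary).
rng = np.random.default_rng(0)
N, F, H, K = 5, 4, 3, 2
X = rng.normal(size=(N, F))
Y = np.eye(K)[rng.integers(0, K, size=N)]
W_h, b_h = rng.normal(size=(F, H)), np.zeros((1, H))
W_o, b_o = rng.normal(size=(H, K)), np.zeros((1, K))

grad_W_h, grad_b_h, grad_W_o, grad_b_o = backward(X, Y, W_h, b_h, W_o, b_o)

# Central difference on a single entry of W_o.
eps = 1e-6
W_plus, W_minus = W_o.copy(), W_o.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (average_loss(X, Y, W_h, b_h, W_plus, b_o)
           - average_loss(X, Y, W_h, b_h, W_minus, b_o)) / (2 * eps)
```

If the analytical and numerical values disagree by more than a small tolerance, the derivation or its implementation likely contains an error.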
1.3 Learning [6 pts.]
Construct the neural network and train it for 200 epochs with a hidden unit size of H = 1000. First,
initialize your weight matrices following the Xavier initialization scheme (zero-mean Gaussians
with variance $\frac{2}{\text{units in} + \text{units out}}$) and your bias vectors to zero, each with the shapes as outlined in
section 1.2. Using these matrices, compute a forward pass of the training data and then, using
the gradients derived in section 1.2, implement the backpropagation algorithm to update all of the
network’s weights and biases. The optimization technique to be used for backpropagation will be
Gradient Descent with momentum and the equation is shown below.
$\nu_{\text{new}} \leftarrow \gamma \nu_{\text{old}} + \alpha \frac{\partial L}{\partial W}$
$W \leftarrow W - \nu_{\text{new}}$
For the ν matrices, initialize them to the same size as the hidden and output layer weight matrix
sizes, with a very small value (e.g. $10^{-5}$). Additionally, initialize your γ values to values slightly
less than 1 (e.g. 0.9 or 0.99) and set α = 0.1 for the average loss. (Note that you need to scale the
learning rate if you are using the total loss, e.g. by dividing α by the number of training examples.)
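The initialization and the update rule can be sketched as follows. This is a minimal illustration, assuming weight matrices shaped (units in × units out) as in section 1.2; the helper names xavier_init and momentum_step, the random seed, and the toy values in the single update are hypothetical.

```python
import numpy as np

def xavier_init(units_in, units_out, rng):
    # Zero-mean Gaussian with variance 2 / (units_in + units_out).
    std = np.sqrt(2.0 / (units_in + units_out))
    return rng.normal(0.0, std, size=(units_in, units_out))

def momentum_step(W, v, grad, alpha=0.1, gamma=0.9):
    # v_new <- gamma * v_old + alpha * dL/dW, then W <- W - v_new.
    v_new = gamma * v + alpha * grad
    return W - v_new, v_new

rng = np.random.default_rng(0)
W_h = xavier_init(784, 1000, rng)   # hidden layer weights, (F, H)
v_h = np.full_like(W_h, 1e-5)       # momentum buffer, same shape as W_h

# One toy update on a 1-element "matrix": gradient 2.0, zero initial velocity.
W_toy, v_toy = momentum_step(np.array([1.0]), np.array([0.0]), np.array([2.0]))
```

The same momentum_step call applies to the bias vectors if they are given their own ν buffers, which the text above leaves open.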
Plot the training and validation loss in one figure, and the training and validation accuracy curves
in a second figure and include them in your report. For the accuracy metric, the np.argmax()
function will be helpful.
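For example, accuracy can be computed by comparing the index of the largest predicted probability with the index of the 1 in each one-hot target. The sketch below assumes both arrays have shape (N, K); the function name accuracy and the toy values are hypothetical.

```python
import numpy as np

def accuracy(prediction, target):
    # Fraction of rows where the predicted class equals the true class.
    predicted_class = np.argmax(prediction, axis=1)
    true_class = np.argmax(target, axis=1)
    return np.mean(predicted_class == true_class)

p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.5, 0.4, 0.1]])
y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
acc = accuracy(p, y)  # the first two rows are classified correctly
```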
2 Neural Networks in Tensorflow [optional]
In this part, you will be implementing a Convolutional Neural Network, the most popular technique
for image recognition, using Tensorflow. It is recommended that you train the neural network
using a GPU (although this is not required.) The neural network architecture that you will be
implementing is as follows:
1. Input Layer
2. A 3 × 3 convolutional layer, with 32 filters, using vertical and horizontal strides of 1.
3. ReLU activation
4. A batch normalization layer
5. A 2 × 2 max pooling layer
6. Flatten layer
7. Fully connected layer (with 784 output units, i.e. corresponding to each pixel)
8. ReLU activation
9. Fully connected layer (with 10 output units, i.e. corresponding to each class)
10. Softmax output
11. Cross Entropy loss
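It can help to work out the tensor shapes before writing the model. The arithmetic below is a sketch under assumptions that the list above does not state: a single input channel (grayscale 28 × 28 images), SAME padding for both the convolution and the pooling layer, a pooling stride of 2, and one bias per filter or output unit.

```python
def same_output_size(size, stride):
    # With SAME padding the output length is ceil(size / stride).
    return -(-size // stride)

image_size = 28
in_channels = 1      # assumed: grayscale input
filters = 32
pool_stride = 2      # assumed: not stated in the architecture list

conv_size = same_output_size(image_size, 1)            # after the 3 x 3 conv
pool_size = same_output_size(conv_size, pool_stride)   # after 2 x 2 max pooling
flat_units = pool_size * pool_size * filters           # input to the first FC layer

conv_params = 3 * 3 * in_channels * filters + filters  # filter weights + biases
fc1_params = flat_units * 784 + 784
fc2_params = 784 * 10 + 10
```

Under these assumptions the first fully connected layer holds almost all of the model's parameters, which is relevant when choosing where to apply regularization.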
2.1 Model Implementation
Implement the neural network architecture described at the beginning of this section.
You will find the tf.nn.conv2d, tf.nn.relu, tf.nn.batch_normalization and tf.nn.max_pool utility functions useful. For the convolutional layer, initialize each filter with the Xavier scheme.
Initialize your weights and biases for the other layers as you did in Part 1 (but with Tensorflow
tensors). Set the padding parameter to SAME. For the batch normalization layer,
the tf.nn.moments function will be useful for obtaining the mean and variance.
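To make the batch normalization step concrete, here is a NumPy sketch of the computation that tf.nn.moments followed by tf.nn.batch_normalization performs. It is an illustration, not the Tensorflow implementation: it assumes activations in (batch, height, width, channels) layout, moments taken over the first three axes, and an epsilon of 1e-5; the function name batch_norm and the test tensor are hypothetical.

```python
import numpy as np

def batch_norm(x, scale, offset, eps=1e-5):
    # Per-channel mean and variance over batch, height and width.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    # Normalise, then apply the learnable scale and offset.
    return scale * (x - mean) / np.sqrt(var + eps) + offset

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(8, 4, 4, 2))   # toy activations, 2 channels
out = batch_norm(x, scale=1.0, offset=0.0)
```

After normalization each channel has mean close to 0 and variance close to 1, regardless of the statistics of the incoming activations.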
You are allowed to use the built-in cross-entropy loss function. Include your Python code snippets
in your report.
2.2 Model Training
Train your implemented model with mini-batch SGD using a batch size of 32 for 50 epochs and the Adam
optimizer with a learning rate of $\alpha = 1 \times 10^{-4}$, making sure to shuffle your training data after each
epoch. Your objective function will be to minimize the cross entropy loss. Plot the training and
validation loss/accuracy curves and include them in your report.
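The per-epoch shuffling can be sketched independently of Tensorflow. The generator below is a minimal illustration, assuming the data and targets are NumPy arrays with one example per row; the name minibatches and the toy arrays are hypothetical. With 10000 training examples and a batch size of 32, each epoch yields 312 full batches and one final batch of 16.

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    # Draw a fresh permutation on every call, i.e. once per epoch.
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1)   # toy inputs
Y = np.arange(100)                   # toy targets aligned with X
batches = list(minibatches(X, Y, 32, rng))
```

Using the same index array for X and Y keeps every example paired with its own label after shuffling.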
2.3 Dropout
A popular method to control overfitting in very deep neural networks is to apply dropout to certain
layers in the model. Add a dropout layer after step 7 of the architecture described in section 2, and
test it with keep probabilities (1 − dropout rate) p = [0.9, 0.75, 0.5] while holding all other
parameters constant as in section 2.2, with no regularization. Plot the training and validation
accuracy/loss for 50 epochs and include them in your report.
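As an illustration of what a dropout layer does with a given keep probability, here is a NumPy sketch of the commonly used inverted dropout variant. It is not the Tensorflow layer itself: the function name dropout, the training flag, and the toy activations are assumptions made for the example.

```python
import numpy as np

def dropout(h, keep_prob, rng, training=True):
    # At evaluation time the activations pass through unchanged.
    if not training:
        return h
    # Keep each unit with probability keep_prob, then rescale the survivors
    # so that the expected activation stays the same.
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 784))             # toy activations of the 784-unit layer
out = dropout(h, keep_prob=0.5, rng=rng)
```

Because of the rescaling, no extra correction is needed at validation or test time; dropout is simply switched off there.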