CSC311

Homework 3

Submission: You will need to submit three files:

• Your answers to all of the questions, as a PDF file titled hw3_writeup.pdf. You can produce

the file however you like (e.g. LATEX, Microsoft Word, scanner), as long as it is readable. If

you need to split your writeup into multiple files, that’s OK, as long as we can figure out

what you did.

• The completed Python files naive_bayes.py and q4.py.

Neatness Point: One point will be given for neatness. You will receive this point as long as we

don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3

days. After that, no submissions will be accepted.

Homeworks are individual work. See the Course Information handout1

for detailed policies.

1. [5pts] Backprop

In this question, you will derive the backprop updates for a particular neural net architecture.

The network is similar to the multilayer perceptron architecture from lecture, and has one

linear hidden layer. However, there are two architectural differences:

• In addition to the usual vector-valued input x, there is a vector-valued “context” input

η. (The particular meaning of η isn’t important for your derivation, but think of it as

containing additional task information, such as whether to focus on the left or the right

half of the image.) The hidden layer activations are modulated based on η; this means

they are multiplied by a value which depends on η.

• The network has a skip connection which sends information directly from the input to

the output of the network.

The loss function is squared error. The forward pass equations and network architecture are

as follows (the symbol represents elementwise multiplication, and σ denotes the logistic

function):

z = Wx

s = Uη

h = z σ(s)

y = v

>h + r

>x

L =

1

2

(y − t)

2

x

h

y

W

v

r

⌘

U

The model parameters are matrices W and U, and vectors v and r. Note that there is only

one output unit, i.e. y and t are scalars.

1

http://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/syllabus.pdf

1

CSC311 Fall 2021 Homework 3

(a) [1pt] Draw the computation graph relating x, z, η, s, h, and the model parameters.

(b) [4pts] Derive the backprop formulas to compute the error signals for all of the model

parameters, as well as x and η. Also include the backprop formulas for all intermediate

quantities needed as part of the computation. You may leave the derivative of the logistic

function as σ

0

rather than expanding it out explicitly.

2. [13pts] Fitting a Na¨ıve Bayes Model

In this question, we’ll fit a Na¨ıve Bayes model to the MNIST digits using maximum likelihood. In addition to the mathematical derivations, you will complete the implementation

in naive_bayes.py. The starter code will download the dataset and parse it for you: Each

training sample (t

(i)

, x

(i)

) is composed of a vectorized binary image x

(i) ∈ {0, 1}

784, and

1-of-10 encoded class label t

(i)

, i.e., t

(i)

c = 1 means image i belongs to class c.

Given parameters π and θ, Na¨ıve Bayes defines the joint probability of the each data point

x and its class label c as follows:

p(x, c | θ,π) = p(c | θ,π)p(x | c, θ,π) = p(c |π)

Y

784

j=1

p(xj | c, θjc).

where p(c |π) = πc and p(xj = 1 | c, θ,π) = θjc. Here, θ is a matrix of probabilities for each

pixel and each class, so its dimensions are 784 × 10, and π is a vector with one entry for

each class. (Note that in the lecture, we simplified notation and didn’t write the probabilities

conditioned on the parameters, i.e. p(c|π) is written as p(c) in lecture slides).

For binary data (xj ∈ {0, 1}), we can write the Bernoulli likelihood as

p(xj | c, θjc) = θ

xj

jc (1 − θjc)

(1−xj )

, (1)

which is just a way of expressing p(xj = 1|c, θjc) = θjc and p(xj = 0|c, θjc) = 1 − θjc in

a compact form. For the prior p(t |π), we use a categorical distribution (generalization of

Bernoulli distribution to multi-class case),

p(tc = 1 |π) = p(c |π) = πc or equivalently p(t |π) = Π9

j=0π

tj

j where X

9

i=0

πi = 1,

where p(c |π) and p(t |π) can be used interchangeably. You will fit the parameters θ and π

using MLE and MAP techniques. In both cases, your fitting procedure can be written as a

few simple matrix multiplication operations.

(a) [3pts] First, derive the maximum likelihood estimator (MLE) for the class-conditional

pixel probabilities θ and the prior π. Derivations should be rigorous.

Hint 1: We saw in lecture that MLE can be thought of as ‘ratio of counts’ for the data,

so what should ˆθjc be counting?

Hint 2: Similar to the binary case, when calculating the MLE for πj for j = 0, 1, …, 8,

write p(t

(i)

|π) = Π9

j=0π

t

(i)

j

j

and in the log-likelihood replace π9 = 1 − Σ

8

j=0πj , and then

take derivatives w.r.t. πj . This will give you the ratio ˆπj/πˆ9 for j = 0, 1, .., 8. You know

that ˆπj ’s sum up to 1.

(b) [1pt] Derive the log-likelihood log p(t|x, θ,π) for a single training image.

2

CSC311 Fall 2021 Homework 3

(c) [3pt] Fit the parameters θ and π using the training set with MLE, and try to report

the average log-likelihood per data point 1

N Σ

N

i=1 log p(t

(i)

|x

(i)

, θˆ,πˆ), using Equation (1).

What goes wrong? (it’s okay if you can’t compute the average log-likelihood here).

(d) [1pt] Plot the MLE estimator θˆ as 10 separate greyscale images, one for each class.

(e) [2pt] Derive the Maximum A posteriori Probability (MAP) estimator for the classconditional pixel probabilities θ, using a Beta(3, 3) prior on each θjc. Hint: it has

a simple final form, and you can ignore the Beta normalizing constant.

(f) [2pt] Fit the parameters θ and π using the training set with MAP estimators from previous part, and report both the average log-likelihood per data point, 1

N Σ

N

i=1 log p(t

(i)

|x

(i)

, θˆ,πˆ),

and the accuracy on both the training and test set. The accuracy is defined as the fraction of examples where the true class is correctly predicted using ˆc = argmaxc

log p(tc =

1|x, θˆ,πˆ).

(g) [1pt] Plot the MAP estimator θˆ as 10 separate greyscale images, one for each class.

3. [7pts] Categorial Distribution. In this problem you will consider a Bayesian approach to

modelling categorical outcomes. Let’s consider fitting the categorical distribution, which is

a discrete distribution over K outcomes, which we’ll number 1 through K. The probability

of each category is explicitly represented with parameter θk. For it to be a valid probability

distribution, we clearly need θk ≥ 0 and P

k

θk = 1. We’ll represent each observation x as

a 1-of-K encoding, i.e, a vector where one of the entries is 1 and the rest are 0. Under this

model, the probability of an observation can be written in the following form:

p(x|θ) = Y

K

k=1

θ

xk

k

.

Suppose you observe a dataset,

D = {x

(i)

}

N

i=1.

Denote the count for outcome k as Nk =

Pn

i=1 x

(i)

k

. Recall that each data point is in the

1-of-K encoding, i.e., x

(i)

k = 1 if the ith datapoint represents an outcome k and x

(i)

k = 0

otherwise. In the previous assignment, you showed that the maximum likelihood estimate for

the counts was:

ˆθk =

Nk

N

.

(a) [2pts] For the prior, we’ll use the Dirichlet distribution, which is defined over the set of

probability vectors (i.e. vectors that are nonnegative and whose entries sum to 1). Its

PDF is as follows:

p(θ) ∝ θ

a1−1

1

· · · θ

ak−1

K .

Determine the posterior distribution p(θ | D). Based on your answer, is the Dirichlet

distribution a conjugate prior for the categorial distribution?

(b) [3pts] Still assuming the Dirichlet prior distribution, determine the MAP estimate of

the parameter vector θ. For this question, you may assume each ak > 1.

Hint: Remember that you need to enforce the constraint that P

k

θk = 1. You can do

this using the same parameterization trick you used in Question 2. Alternatively, you

could use Lagrange multipliers, if you’re familiar with those.

3

CSC311 Fall 2021 Homework 3

(c) [2pts] Now, suppose that your friend said that they had a hidden N + 1st outcome,

x

(N+1), drawn from the same distribution as the previous N outcomes. Your friend does

not want to reveal the value of x

(N+1) to you. So, you want to use your Bayesian model

to predict what you think x

(N+1) is likely to be. The “proper” Bayesian predictor is the

so-called posterior predictive distribution:

p(x

(N+1)|D) = Z

p(x

(N+1)|θ)p(θ|D) dθ

What is the probability that the N +1 outcome was k, i.e., the probability that x

(N+1)

k =

1, under your posterior predictive distribution? Hint: A useful fact is that if θ ∼

Dirichlet(a1, . . . , aK), then

E[θk] = P

ak

k

0 ak

0

.

Report your answers to the above questions.

4. [5pts] Gaussian Discriminant Analysis. For this question you will build classifiers to

label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector

of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale

and the pixel values are between 0 and 1. The labels y are 0, 1, 2, . . . , 9 corresponding to

which character was written in the image. There are 700 training cases and 400 test cases for

each digit; they can be found in the data directory in the starter code.

A skeleton (q4.py) is is provided for each question that you should use to structure your code.

Starter code to help you load the data is provided (data.py). Note: the get digits by label

function in data.py returns the subset of digits that belong to a given class.

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full

covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by,

p(x | y = k, µ, Σk) = (2π)

−d/2

|Σk|

−1/2

exp

−

1

2

(x − µk

)

T Σ

−1

k

(x − µk

)

(2)

You should take p(y = k) = 1

10

. You will compute parameters µkj and Σk for k ∈ (0…9), j ∈

(1…64). You should implement the covariance computation yourself (i.e. without the aid of

np.cov). Hint: To ensure numerical stability you may have to add a small multiple of the

identity to each covariance matrix. For this assignment you should add 0.01I to each matrix.

(a) [3pts] Using the parameters you fit on the training set and Bayes rule, compute the

average conditional log-likelihood, i.e. 1

N

PN

i=1 log(p(y

(i)

| x

(i)

, θ)) on both the train and

test set and report it. Hint: for numerical stability, you will want to use np.logaddexp,

as discussed in Lecture 4.

(b) [1pt] Select the most likely posterior class for each training and test data point as your

prediction, and report your accuracy on the train and test set.

(c) [1pt] Compute the leading eigenvectors (largest eigenvalue) for each class covariance

matrix (can use np.linalg.eig) and plot them side by side as 8 by 8 images.

Report your answers to the above questions, and submit your completed Python code for

q4.py.

4