CSC311

Homework 4

Submission: You need to submit the following files through MarkUs¹:

• Your solutions to Questions 1, 2, 3 as a PDF file, hw4_writeup.pdf.

• Your completed Python code for Question 2, as q2.py.

Neatness Point: One point will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Computing: To install Python and required libraries, see the instructions on the course web page.

Homeworks are individual work. See the Course Information handout² for detailed policies.

1. [2pts] Marginalization and Log-Sum-Exp. In this question you will learn how to compute log marginal probabilities in a numerically stable way. Suppose that you have a generative model p(x, i) for labeled data (x, i), where i is a label that can be one of 0, 1, 2, …, k.

Recall that the marginal probability p(x) can be computed using the following formula:

p(x) = \sum_{i=0}^{k} p(x, i) \qquad (0.1)

When x is a high-dimensional data point, it is typical for the marginal p(x) and the joint p(x, i) to be extremely small numbers that cannot be represented in floating point. For this reason, we usually report and compute with log probabilities.

If we want to compute log(p(x)) given access only to log(p(x, i)), then we can use what is called a log-sum-exp:

\log \Big( \sum_{i=0}^{k} \exp(a_i) \Big) \qquad (0.2)

where a_i \in \mathbb{R} are real numbers. In our example, if a_i = log(p(x, i)), then expression (0.2) corresponds to the log marginal probability log(p(x)) of x.

Unfortunately, computing log-sum-exp naively can lead to numerical instabilities, caused by problems that arise when computing exponentials in floating point. Two things can go wrong:

(a) Underflow. If a[i] is very small, then np.exp(a[i]) will evaluate to 0.

(b) Overflow. If a[i] is very large, then np.exp(a[i]) will evaluate to inf.

The cause of underflow and overflow is that floating point numbers cannot represent numbers arbitrarily close to 0, nor arbitrarily large numbers.
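Both failure modes, and the max-subtraction trick used by the stable version, can be sketched as follows (the function names and test vectors here are illustrative, not the exact contents of q1.py):

```python
import numpy as np

def logsumexp_naive(a):
    # Direct translation of log(sum_i exp(a_i)); underflows/overflows easily.
    return np.log(np.sum(np.exp(a)))

def logsumexp_stable(a):
    # Subtract the max before exponentiating, then add it back outside the log.
    # The largest exponent is then exp(0) = 1, so overflow cannot occur, and at
    # least one summand is 1, so the sum never rounds to 0.
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([-1000.0, -1000.0])   # every exp(a_i) underflows to 0
b = np.array([1000.0, 1000.0])     # every exp(b_i) overflows to inf

with np.errstate(over="ignore", divide="ignore"):
    print(logsumexp_naive(a))      # -inf: log of an underflowed sum of 0s
    print(logsumexp_naive(b))      # inf: the sum itself overflows
print(logsumexp_stable(a))         # ≈ -999.3069 (= -1000 + log 2)
print(logsumexp_stable(b))         # ≈ 1000.6931 (= 1000 + log 2)
```

Note that the stable answers are exactly the true values of log(exp(-1000) + exp(-1000)) and log(exp(1000) + exp(1000)), which the naive version cannot produce.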

¹ https://markus.teach.cs.toronto.edu/csc311-2020-09
² http://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/syllabus.pdf


CSC311 Fall 2020 Homework 4

(a) [1pt] We have provided code in q1.py with two implementations of log-sum-exp: a naive, numerically unstable implementation and a numerically stable one. Modify the elements of a so that logsumexp_unstable returns -inf, and modify the elements of b so that logsumexp_unstable returns inf. Report the two vectors, a and b, in your write-up.

(b) [1pt] Prove that our numerically stable implementation is correct by proving that

\log \Big( \sum_{i=0}^{k} \exp(a_i) \Big) = \log \Big( \sum_{i=0}^{k} \exp\big( a_i - \max_{j=0}^{k} \{a_j\} \big) \Big) + \max_{j=0}^{k} \{a_j\} \qquad (0.3)

Briefly explain why the numerically stable version is robust to underflow and overflow.

Report your answers to the above questions.

2. [3pts] Gaussian Discriminant Analysis. For this question you will build classifiers to label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale and the pixel values are between 0 and 1. The labels y are 0, 1, 2, …, 9, corresponding to which character was written in the image. There are 700 training cases and 400 test cases for each digit; they can be found in a4digits.zip.

A skeleton (q2.py) is provided that you should use to structure your code. Starter code to help you load the data is provided (data.py). Note: the get_digits_by_label function in data.py returns the subset of digits that belong to a given class.

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by

p(x \mid y = k, \mu_k, \Sigma_k) = (2\pi)^{-d/2} \, |\Sigma_k|^{-1/2} \exp\Big( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big) \qquad (0.4)

You should take p(y = k) = 1/10. You will compute parameters \mu_{kj} and \Sigma_k for k ∈ {0, …, 9}, j ∈ {1, …, 64}. You should implement the covariance computation yourself (i.e., without the aid of np.cov). Hint: To ensure numerical stability you may have to add a small multiple of the identity to each covariance matrix. For this assignment you should add 0.01I to each matrix.
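A minimal sketch of this fitting step might look like the following. The array shapes and names are assumptions for illustration; in your solution the data would come from the loaders in data.py:

```python
import numpy as np

def fit_class_gaussians(train_data, train_labels, n_classes=10, reg=0.01):
    """Fit per-class means and full covariances by maximum likelihood.

    train_data: (N, 64) array, train_labels: (N,) array of digits 0..9.
    (Illustrative interface, not the required q2.py structure.)
    """
    d = train_data.shape[1]
    means = np.zeros((n_classes, d))
    covs = np.zeros((n_classes, d, d))
    for k in range(n_classes):
        X = train_data[train_labels == k]      # all examples of class k
        mu = X.mean(axis=0)
        centered = X - mu
        # MLE covariance computed by hand (no np.cov):
        # (1/N_k) * sum_i (x_i - mu)(x_i - mu)^T
        sigma = centered.T @ centered / X.shape[0]
        means[k] = mu
        covs[k] = sigma + reg * np.eye(d)      # add 0.01 I for stability
    return means, covs
```

Note the normalization by N_k (the MLE), not N_k − 1 (the unbiased estimator that np.cov uses by default).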

(a) [1pt] Using the parameters you fit on the training set and Bayes rule, compute the average conditional log-likelihood, i.e.,

\frac{1}{N} \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}, \theta)

on both the train and test set and report it. Hint: you will want to use the log-sum-exp discussed in Question 1 in your code.

(b) [1pt] Select the most likely posterior class for each training and test data point as your prediction, and report your accuracy on the train and test set.

(c) [1pt] Compute the leading eigenvector (the one with the largest eigenvalue) for each class covariance matrix (you can use np.linalg.eig) and plot them side by side as 8 by 8 images.
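Under the same assumed array shapes as above, the computations behind parts (a)–(c) could be sketched roughly as follows. This is one possible arrangement of the pieces, not the required implementation; names like `class_log_likelihoods` are made up for the sketch:

```python
import numpy as np

def class_log_likelihoods(X, means, covs):
    """log p(x | y = k) under density (0.4), for every example and class.

    X: (N, 64), means: (10, 64), covs: (10, 64, 64) -- assumed shapes.
    """
    N, d = X.shape
    K = means.shape[0]
    out = np.zeros((N, K))
    for k in range(K):
        _, logdet = np.linalg.slogdet(covs[k])          # stable log|Sigma_k|
        centered = X - means[k]
        solved = np.linalg.solve(covs[k], centered.T).T  # Sigma_k^{-1}(x - mu_k)
        quad = np.sum(centered * solved, axis=1)
        out[:, k] = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    return out

def avg_conditional_log_likelihood(X, y, means, covs):
    # log p(y|x) = log p(x|y) + log(1/10) - log p(x), where log p(x) is a
    # stable log-sum-exp over the 10 classes, as in Question 1.
    joint = class_log_likelihoods(X, means, covs) + np.log(1.0 / 10)
    m = joint.max(axis=1, keepdims=True)
    log_px = (m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True))).squeeze(1)
    return np.mean(joint[np.arange(len(y)), y] - log_px)

def predict(X, means, covs):
    # With a uniform prior, the posterior argmax equals the likelihood argmax.
    return np.argmax(class_log_likelihoods(X, means, covs), axis=1)

def leading_eigenvector(cov):
    # eigh suits symmetric matrices; it returns eigenvalues in ascending
    # order, so the last column is the leading eigenvector (reshape to 8x8
    # for plotting).
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]
```

Using `np.linalg.solve` and `slogdet` avoids explicitly forming the inverse and the raw determinant, which is gentler numerically for 64-by-64 covariances.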

Report your answers to the above questions, and submit your completed Python code for q2.py.


3. [3pts] Categorical Distribution. In this problem you will consider a Bayesian approach to modelling categorical outcomes. Let’s consider fitting the categorical distribution, which is a discrete distribution over K outcomes, numbered 1 through K. The probability of each category is explicitly represented with parameter \theta_k. For it to be a valid probability distribution, we clearly need \theta_k ≥ 0 and \sum_k \theta_k = 1. We’ll represent each observation x as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this model, the probability of an observation can be written in the following form:

p(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}.

Suppose you observe a dataset,

D = \{x^{(i)}\}_{i=1}^{N}.

Denote the count for outcome k as

N_k = \sum_{i=1}^{N} x_k^{(i)}.

Recall that each data point is in the 1-of-K encoding, i.e., x_k^{(i)} = 1 if the ith datapoint represents outcome k and x_k^{(i)} = 0 otherwise. In the previous assignment, you showed that the maximum likelihood estimate is:

\hat{\theta}_k = \frac{N_k}{N}.

(a) [1pt] For the prior, we’ll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e., vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:

p(\theta) \propto \theta_1^{a_1 - 1} \cdots \theta_K^{a_K - 1}.

What is the posterior distribution p(\theta \mid D)?

(b) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector \theta. For this question, you may assume each a_k > 1.

(c) [1pt] Now, suppose that your friend said that they had a hidden (N+1)st outcome, x^{(N+1)}, drawn from the same distribution as the previous N outcomes. Your friend does not want to reveal the value of x^{(N+1)} to you. So, you want to use your Bayesian model to predict what you think x^{(N+1)} is likely to be. The “proper” Bayesian predictor is the so-called posterior predictive distribution:

p(x^{(N+1)} \mid D) = \int p(x^{(N+1)} \mid \theta) \, p(\theta \mid D) \, d\theta

What is the probability that the (N+1)st outcome was k, i.e., the probability that x_k^{(N+1)} = 1, under your posterior predictive distribution? Hint: A useful fact is that if \theta \sim \mathrm{Dirichlet}(a_1, \ldots, a_K), then

\mathbb{E}[\theta_k] = \frac{a_k}{\sum_{k'} a_{k'}}.

Report your answers to the above questions.
