CSC311

Homework 3

Submission: You need to submit the following files through MarkUs¹:

• Your answers to Questions 1, 2, and 3, and outputs requested for Questions 2 and 3, as a PDF file titled hw3_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

• Python file naive_bayes.py completed for Questions 2 and 3.

Neatness Point: One point will be given for neatness. You will receive this point as long as we don't have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Computing: To install Python and required libraries, see the instructions on the course web page.

Homeworks are individual work. See the Course Information handout² for detailed policies.

1. [7pts] AdaBoost.

(a) [5pts] The goal of this question is to show that the AdaBoost algorithm changes the weights in order to force the weak learner to focus on difficult data points. Here we consider the case where the target labels are from the set $\{-1, +1\}$ and the weak learner also returns a classifier whose outputs belong to $\{-1, +1\}$ (instead of $\{0, 1\}$). Consider the $t$-th iteration of AdaBoost, where the weak learner is

$$h_t \leftarrow \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{N} w_i \, \mathbb{I}\{h(x^{(i)}) \neq t^{(i)}\},$$

the $w$-weighted classification error is

$$\mathrm{err}_t = \frac{\sum_{i=1}^{N} w_i \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w_i},$$

and the classifier coefficient is $\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}$. (Here, $\log$ denotes the natural logarithm.) AdaBoost changes the weight of each sample depending on whether the weak learner $h_t$ classifies it correctly or incorrectly. The updated weight for sample $i$ is denoted by $w_i'$ and is

$$w_i' \leftarrow w_i \exp\left(-\alpha_t \, t^{(i)} h_t(x^{(i)})\right).$$

Show that the error w.r.t. $(w_1', \ldots, w_N')$ is exactly $\frac{1}{2}$. That is, show that

$$\mathrm{err}_t' = \frac{\sum_{i=1}^{N} w_i' \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w_i'} = \frac{1}{2}.$$

¹ https://markus.teach.cs.toronto.edu/csc311-2020-09
² http://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/syllabus.pdf


CSC311 Fall 2020 Homework 3

Note that here we use the weak learner of iteration $t$ and evaluate it according to the new weights, which will be used to learn the $(t+1)$-st weak learner. What is the interpretation of this result?

Hints:

i. Start from $\mathrm{err}_t'$ and split the summation into the two sets $E = \{i : h_t(x^{(i)}) \neq t^{(i)}\}$ and its complement $E^c = \{i : h_t(x^{(i)}) = t^{(i)}\}$.

ii. Note that $\frac{\sum_{i \in E} w_i}{\sum_{i=1}^{N} w_i} = \mathrm{err}_t$.
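The claim in this part can also be checked numerically. The sketch below uses small made-up labels, predictions, and weights (all values are illustrative, not part of the assignment) and confirms that the reweighted error comes out to 1/2:

```python
import numpy as np

# Made-up toy data: labels and predictions in {-1, +1}, arbitrary positive weights.
t = np.array([1, -1, 1, 1, -1, 1, -1, -1])
h = np.array([1, -1, -1, 1, -1, -1, -1, 1])      # h_t misclassifies i = 2, 5, 7
w = np.array([1.5, 1.0, 0.5, 2.25, 1.5, 0.75, 1.5, 1.0])

miss = (h != t).astype(float)
err_t = np.sum(w * miss) / np.sum(w)             # w-weighted error of h_t
alpha_t = 0.5 * np.log((1 - err_t) / err_t)      # classifier coefficient

w_new = w * np.exp(-alpha_t * t * h)             # the update from part (a)
err_new = np.sum(w_new * miss) / np.sum(w_new)   # error w.r.t. the new weights
print(err_new)                                   # 0.5, as the question claims
```

Changing the toy data does not change the final value, which is exactly the point of the exercise.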

(b) [2pts] Recall from lecture that we can rewrite the 0–1 loss as $\mathbb{I}[h(x^{(n)}) \neq t^{(n)}] = \frac{1}{2}\left(1 - h(x^{(n)}) \cdot t^{(n)}\right)$. Use this identity to prove that the weight update from part (a),

$$w_i' \leftarrow w_i \exp\left(-\alpha_t \, t^{(i)} h_t(x^{(i)})\right),$$

is proportional, up to a constant factor, to the weight update

$$w_i' \leftarrow w_i \exp\left(2\alpha_t \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}\right).$$

What is the constant factor? Since we normalize the weights when computing $\mathrm{err}_t$, these two updates are equivalent.
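The 0–1 loss identity, and the proportionality it implies, can be sanity-checked by brute force over the four possible $(h, t)$ pairs; the coefficient value below is an arbitrary stand-in:

```python
import numpy as np

# Check the identity I[h != t] = (1 - h*t)/2 for all h, t in {-1, +1}.
pairs = [(h, t) for h in (-1, 1) for t in (-1, 1)]
for h, t in pairs:
    assert float(h != t) == (1 - h * t) / 2

# The two weight updates then differ by a per-sample ratio that is the
# same for every (h, t) pair, i.e. a factor independent of the sample i.
alpha_t = 0.7                                   # arbitrary illustrative value
ratios = [np.exp(-alpha_t * t * h) / np.exp(2 * alpha_t * float(h != t))
          for h, t in pairs]
print(ratios)                                   # all four entries coincide
```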

2. [13pts] Fitting a Naïve Bayes Model

In this question, we'll fit a Naïve Bayes model to the MNIST digits using maximum likelihood. The starter code will download the dataset and parse it for you: each training sample $(t^{(i)}, x^{(i)})$ is composed of a vectorized binary image $x^{(i)} \in \{0, 1\}^{784}$ and a 1-of-10 encoded class label $t^{(i)}$, i.e., $t_c^{(i)} = 1$ means image $i$ belongs to class $c$.

Given parameters $\pi$ and $\theta$, Naïve Bayes defines the joint probability of each data point $x$ and its class label $c$ as follows:

$$p(x, c \mid \theta, \pi) = p(c \mid \theta, \pi) \, p(x \mid c, \theta, \pi) = p(c \mid \pi) \prod_{j=1}^{784} p(x_j \mid c, \theta_{jc}),$$

where $p(c \mid \pi) = \pi_c$ and $p(x_j = 1 \mid c, \theta, \pi) = \theta_{jc}$. Here, $\theta$ is a matrix of probabilities for each pixel and each class, so its dimensions are $784 \times 10$, and $\pi$ is a vector with one entry for each class. (Note that in lecture, we simplified notation and didn't write the probabilities conditioned on the parameters, i.e., $p(c \mid \pi)$ is written as $p(c)$ in the lecture slides.)

For binary data ($x_j \in \{0, 1\}$), we can write the Bernoulli likelihood as

$$p(x_j \mid c, \theta_{jc}) = \theta_{jc}^{x_j} (1 - \theta_{jc})^{(1 - x_j)}, \qquad (0.1)$$

which is just a way of expressing $p(x_j = 1 \mid c, \theta_{jc}) = \theta_{jc}$ and $p(x_j = 0 \mid c, \theta_{jc}) = 1 - \theta_{jc}$ in a compact form. For the prior $p(t \mid \pi)$, we use a categorical distribution (the generalization of the Bernoulli distribution to the multi-class case),

$$p(t_c = 1 \mid \pi) = p(c \mid \pi) = \pi_c \quad \text{or equivalently} \quad p(t \mid \pi) = \prod_{j=0}^{9} \pi_j^{t_j}, \quad \text{where} \quad \sum_{j=0}^{9} \pi_j = 1,$$


where $p(c \mid \pi)$ and $p(t \mid \pi)$ can be used interchangeably. You will fit the parameters $\theta$ and $\pi$ using MLE and MAP techniques. In both cases, your fitting procedure can be written as a few simple matrix multiplication operations.
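To make the model definition concrete, the sketch below evaluates the joint $p(x, c \mid \theta, \pi)$ via the Bernoulli form (0.1) on a tiny made-up model; the dimensions here (4 pixels, 3 classes) and all parameter values are illustrative stand-ins for the real 784 and 10:

```python
import numpy as np

# Tiny made-up model: D = 4 pixels, C = 3 classes (stand-ins for 784 and 10).
theta = np.array([[0.9, 0.1, 0.5],
                  [0.8, 0.2, 0.5],
                  [0.1, 0.9, 0.5],
                  [0.2, 0.8, 0.5]])    # theta[j, c] = p(x_j = 1 | c)
pi = np.array([0.5, 0.3, 0.2])         # class prior, sums to 1

x = np.array([1, 1, 0, 0])             # one binary "image"
c = 0

# Bernoulli likelihood (0.1) for every pixel, then the product over j times pi_c.
bern = theta[:, c] ** x * (1 - theta[:, c]) ** (1 - x)
joint = pi[c] * np.prod(bern)
print(joint)                           # 0.5 * 0.9 * 0.8 * 0.9 * 0.8 = 0.2592
```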

(a) [3pts] First, derive the maximum likelihood estimator (MLE) for the class-conditional pixel probabilities $\theta$ and the prior $\pi$. Derivations should be rigorous.

Hint 1: We saw in lecture that MLE can be thought of as a 'ratio of counts' for the data, so what should $\hat{\theta}_{jc}$ be counting?

Hint 2: Similar to the binary case, when calculating the MLE for $\pi_j$ for $j = 0, 1, \ldots, 8$, write $p(t^{(i)} \mid \pi) = \prod_{j=0}^{9} \pi_j^{t_j^{(i)}}$, and in the log-likelihood replace $\pi_9 = 1 - \sum_{j=0}^{8} \pi_j$, and then take derivatives w.r.t. $\pi_j$. This will give you the ratio $\hat{\pi}_j / \hat{\pi}_9$ for $j = 0, 1, \ldots, 8$. You know that the $\hat{\pi}_j$'s sum up to 1.
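As one way to see the "few simple matrix multiplication operations" remark: with the images stacked into an $N \times 784$ binary matrix and the one-hot labels into an $N \times 10$ matrix, the counts the hints refer to drop out of single matrix products. The shapes below are toy stand-ins, and the sketch deliberately stops at the counts, leaving the estimators themselves to parts (a) and (e):

```python
import numpy as np

# Toy stand-ins for the real shapes (N images, D pixels, C classes).
N, D, C = 6, 4, 3
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(N, D))    # binary "images", shape (N, D)
labels = rng.integers(0, C, size=N)
T = np.eye(C)[labels]                  # one-hot labels, shape (N, C)

# pixel_counts[j, c] = number of class-c images with pixel j on;
# class_counts[c]    = number of class-c images.
pixel_counts = X.T @ T                 # shape (D, C)
class_counts = T.sum(axis=0)           # shape (C,)
print(pixel_counts.shape, class_counts.shape)
```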

(b) [1pt] Derive the log-likelihood $\log p(t \mid x, \theta, \pi)$ for a single training image.

(c) [3pt] Fit the parameters $\theta$ and $\pi$ using the training set with MLE, and try to report the average log-likelihood per data point, $\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid x^{(i)}, \hat{\theta}, \hat{\pi})$, using Equation (0.1). What goes wrong? (It's okay if you can't compute the average log-likelihood here.)

(d) [1pt] Plot the MLE estimator $\hat{\theta}$ as 10 separate greyscale images, one for each class.

(e) [2pt] Derive the Maximum A Posteriori (MAP) estimator for the class-conditional pixel probabilities $\theta$, using a Beta(3, 3) prior on each $\theta_{jc}$. Hint: it has a simple final form, and you can ignore the Beta normalizing constant.

(f) [2pt] Fit the parameters $\theta$ and $\pi$ using the training set with the MAP estimators from the previous part, and report both the average log-likelihood per data point, $\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid x^{(i)}, \hat{\theta}, \hat{\pi})$, and the accuracy on both the training and test sets. The accuracy is defined as the fraction of examples where the true class is correctly predicted using $\hat{c} = \operatorname{argmax}_c \log p(t_c = 1 \mid x, \hat{\theta}, \hat{\pi})$.

(g) [1pt] Plot the MAP estimator $\hat{\theta}$ as 10 separate greyscale images, one for each class.

3. [4pts] Generating from a Naïve Bayes Model

Defining a joint probability distribution over the data lets us generate new data, and also lets us answer all sorts of queries about the data. This is why these models are called Generative Models. We will use the Naïve Bayes model trained in the previous question to generate data.

(a) [1pt] True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where $i \neq j$ are independent given $c$.

(b) [1pt] True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where $i \neq j$ are independent after marginalizing over $c$.

(c) [2pts] Using the parameters fit with MAP in Question 2, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution $p(x \mid \hat{\theta}, \hat{\pi})$. Hint: To sample from $p(x \mid \hat{\theta}, \hat{\pi})$, first sample the random variable $c$ from $p(c \mid \hat{\pi})$ using np.random.choice, then, depending on the value of $c$, sample $x_j$ from $p(x_j \mid c, \hat{\theta}_{jc})$ for $j = 1, \ldots, 784$ using np.random.binomial(1,..). These functions can take matrix probabilities as input, so your solution to this part should be a few lines of code.
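The two steps in the hint can be sketched as follows, with small made-up stand-ins for the fitted $\hat{\theta}$ and $\hat{\pi}$ (the actual solution would plug in the MAP estimates and repeat this 10 times):

```python
import numpy as np

np.random.seed(0)

# Made-up stand-ins for the MAP estimates (D = 4 pixels, C = 3 classes).
theta_hat = np.array([[0.9, 0.1, 0.5],
                      [0.8, 0.2, 0.5],
                      [0.1, 0.9, 0.5],
                      [0.2, 0.8, 0.5]])        # theta_hat[j, c] = p(x_j = 1 | c)
pi_hat = np.array([0.2, 0.5, 0.3])             # class prior

c = np.random.choice(3, p=pi_hat)              # step 1: sample the class label
x = np.random.binomial(1, theta_hat[:, c])     # step 2: sample all pixels at once
print(c, x)                                    # one sampled class and binary image
```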


(d) [Optional] One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model. Derive $p(x_{\text{bottom}} \mid x_{\text{top}}, \theta, \pi)$, the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters. Hint: you'll have to marginalize over $c$.

(e) [Optional] For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half, i.e., the bottom half of the image should use grayscale to represent the marginal probability of each pixel being 1 (darker for values close to 1).
