CSC311

Homework 2

Submission: You need to submit five files through MarkUs [1]:

• Your answers to Questions 1, 2, 3, and 4, as a PDF file titled hw2_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

• Python files run_knn.py, logistic.py, and run_logistic_regression.py completed for Question 3.

• Python file q4.py completed for Question 4.

Neatness Point: One point will be given for neatness. You will receive this point as long as we don't have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Computing: To install Python and required libraries, see the instructions on the course web page.

Homeworks are individual work. See the Course Information handout [2] for detailed policies.

1. [9pts] Expected Loss and Bayes Optimality. You are running an email service, and one of your key features is a spam filter. Every email is either spam or non-spam, which we represent with the target t ∈ {Spam, NonSpam}. You need to decide whether to keep it in the inbox or remove it to the spam folder. We represent this with the decision variable y ∈ {Keep, Remove}. We'd like to remove spam emails and keep non-spam ones, but the customers will be much more unhappy if we remove a non-spam email than if we keep a spam email. We can represent this with the following loss function L(y, t):

             NonSpam   Spam
    Keep        0        1
    Remove    100        0

Your studies indicate that 10% of the emails are spam, i.e. Pr(t = Spam) = 0.1.

(a) [2pts] Evaluate the expected loss E[L(y, t)] for the policy that keeps every email (y = Keep), and for the policy that removes every email (y = Remove).
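The expected loss of a constant policy follows directly from the loss table and the prior. The sketch below is a numerical sanity check, not a substitute for the written derivation the question asks for:

```python
# Expected loss E[L(y, t)] of a constant policy, from the loss table
# and the prior Pr(t = Spam) = 0.1 given in the question.
L = {
    ("Keep", "NonSpam"): 0,     ("Keep", "Spam"): 1,
    ("Remove", "NonSpam"): 100, ("Remove", "Spam"): 0,
}
p_spam = 0.1

def expected_loss(y):
    """E[L(y, t)] = sum over t of Pr(t) * L(y, t)."""
    return (1 - p_spam) * L[(y, "NonSpam")] + p_spam * L[(y, "Spam")]
```

Evaluating expected_loss at each constant policy reproduces the two numbers part (a) asks for.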

(b) [2pts] Now suppose you get to observe a feature vector x for each email, and using your knowledge of the joint distribution p(x, t), you infer p(t | x). What is the Bayes optimal decision rule? I.e., say how to determine the Bayes optimal decision y∗ from the conditional probability Pr(t = Spam | x).

[1] https://markus.teach.cs.toronto.edu/csc311-2021-09
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/syllabus.pdf


CSC311 Fall 2021 Homework 2

(c) [4pts] After some analysis, you've found two words that are indicative of an email being spam: "Viagra" and "Nigeria". You define two input features: x1, which takes the value 1 if the email contains the word "Viagra" and 0 otherwise, and x2, which is the same for "Nigeria". For a spam email, these features have the following joint distribution:

             x2 = 0   x2 = 1
    x1 = 0     0.4      0.3
    x1 = 1     0.2      0.1

For a non-spam email, these features have the following joint distribution:

             x2 = 0   x2 = 1
    x1 = 0    0.998    0.001
    x1 = 1    0.001    0

Determine the Bayes optimal decision rule, i.e. determine the Bayes optimal decision y∗ for each value of x. Justify your answer.

(d) [1pt] What is the expected loss E[L(y∗, t)]?

2. [4pts] Feature Maps. Suppose we have the following 1-D dataset for binary classification:

     x    t
    -1    1
     1    0
     3    1

(a) [2pts] Argue briefly (at most a few sentences) that this dataset is not linearly separable. (Your argument should resemble the one we used in lecture to prove XOR is not linearly separable.)

(b) [2pts] Now suppose we apply the feature map

\[
\psi(x) = \begin{pmatrix} \psi_1(x) \\ \psi_2(x) \end{pmatrix} = \begin{pmatrix} x \\ x^2 \end{pmatrix}.
\]

Assume we have no bias term, so that the parameters are w1 and w2. Write down the constraint on w1 and w2 corresponding to each training example, and then find a pair of values (w1, w2) that correctly classify all the examples. Remember that there is no bias term.
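Once you have a candidate (w1, w2), the sign constraints can be checked mechanically. The helper below is purely illustrative and not part of the starter code:

```python
def psi(x):
    """Feature map: psi(x) = (x, x**2)."""
    return (x, x * x)

def predicts_correctly(w1, w2, x, t):
    """With no bias term, the classifier predicts t = 1 iff
    z = w1 * psi1(x) + w2 * psi2(x) > 0."""
    z = w1 * x + w2 * x * x
    return (z > 0) == (t == 1)

# The 1-D dataset from the question, as (x, t) pairs.
dataset = [(-1, 1), (1, 0), (3, 1)]

def all_correct(w1, w2):
    """Each example imposes one sign constraint on z; check all three."""
    return all(predicts_correctly(w1, w2, x, t) for x, t in dataset)
```

Plugging your candidate weights into all_correct verifies all three constraints at once.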

3. [15pts] kNN vs. Logistic Regression. In this problem, you will compare the performance and characteristics of different classifiers, namely k-Nearest Neighbors and Logistic Regression. You will complete the provided code in q3/ and experiment with the completed code. You should understand the code instead of using it as a black box.

The data you will be working with is a subset of MNIST hand-written digits, 4s and 9s, represented as 28×28 pixel arrays. We show example digits in Figure 1. There are two training sets: mnist_train, which contains 80 examples of each class, and mnist_train_small, which contains 5 examples of each class. There is also a validation set mnist_valid that you should use for model selection, and a test set mnist_test that you should use for reporting the final performance. Optionally, the code for visualizing the datasets is located in plot_digits.py.


Figure 1: Example digits. Top and bottom show digits of 4s and 9s, respectively.

3.1. k-Nearest Neighbors. Use the supplied kNN implementation to predict labels for mnist_valid, using the training set mnist_train.

(a) [2pts] Implement a function run_knn in run_knn.py that runs kNN for different values of k ∈ {1, 3, 5, 7, 9} and plots the classification rate on the validation set (number of correctly predicted cases, divided by total number of data points) as a function of k. Report the plot in the write-up.
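The shape of run_knn can be sketched as follows. The knn helper below is a minimal stand-in for the supplied kNN implementation (whose actual name and signature may differ), included only so the sketch is self-contained:

```python
import numpy as np

def knn(k, train_inputs, train_targets, valid_inputs):
    """Stand-in for the supplied kNN routine: majority vote among the
    k nearest training points under squared Euclidean distance."""
    dists = np.sum((valid_inputs[:, None, :] - train_inputs[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_targets[nearest]
    return (votes.mean(axis=1) >= 0.5).astype(int)

def run_knn(train_inputs, train_targets, valid_inputs, valid_targets):
    """Classification rate on the validation set for each k."""
    ks = [1, 3, 5, 7, 9]
    rates = []
    for k in ks:
        preds = knn(k, train_inputs, train_targets, valid_inputs)
        # classification rate = correctly predicted / total data points
        rates.append(np.mean(preds == valid_targets))
    return ks, rates  # plot rates against ks (e.g. with matplotlib)
```

The returned (ks, rates) pairs are exactly what the requested plot displays.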

(b) [2pts] Comment on the performance of the classifier and argue which value of k you would choose. What is the classification rate for k∗, your chosen value of k? Also report the classification rate for k∗ + 2 and k∗ − 2. How does the test performance of these values of k correspond to the validation performance? [3]

3.2. Logistic Regression. Read the provided code in run_logistic_regression.py and logistic.py. You need to implement the logistic regression model, where the cost is defined as:

\[
\mathcal{J} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\text{CE}}\big(y^{(i)}, t^{(i)}\big)
= \frac{1}{N} \sum_{i=1}^{N} \Big[ -t^{(i)} \log y^{(i)} - \big(1 - t^{(i)}\big) \log\big(1 - y^{(i)}\big) \Big],
\]

where N is the total number of data points.
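As a reference point, here is a minimal sketch of logistic_predict and evaluate consistent with this cost. The convention of storing the bias as the last entry of the weight vector is an assumption for this sketch, not necessarily what the starter code uses:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(weights, data):
    """y^(i) = sigmoid(w^T x^(i) + b). The bias b is assumed to be the
    last entry of `weights`, so a column of ones is appended to data."""
    ones = np.ones((data.shape[0], 1))
    return sigmoid(np.hstack([data, ones]) @ weights)

def evaluate(targets, y):
    """Return (average cross entropy J, classification error)."""
    ce = np.mean(-targets * np.log(y) - (1 - targets) * np.log(1 - y))
    frac_correct = np.mean((y >= 0.5).astype(int) == targets)
    return ce, 1.0 - frac_correct
```

With zero weights every prediction is 0.5 and the average cross entropy is log 2, a useful baseline to check against.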

(a) [4pts] Implement the functions logistic_predict, evaluate, and logistic located in logistic.py.

(b) [5pts] Complete the missing parts in the function run_logistic_regression located in run_logistic_regression.py. You may use the implemented functions from part (a). Run the code on both mnist_train and mnist_train_small. Check whether the value returned by run_check_grad is small to make sure your implementation in part (a) is correct. Experiment with the hyperparameters for the learning rate, the number of iterations (if you have a smaller learning rate, your model will take longer to converge), and the way in which you initialize the weights. If you get NaN/Inf errors, you may try to reduce your learning rate or initialize with smaller weights. For each dataset, report which hyperparameter settings you found worked the best and the final cross entropy and classification error on the training, validation, and test sets. Note that you should only compute the test error once you have selected your best hyperparameter settings using the validation set.

[3] In general, you shouldn't peek at the test set multiple times, but we do this for this question as an illustrative exercise.

(c) [2pts] Examine how the cross entropy changes as training progresses. Generate and report 2 plots, one for each of mnist_train and mnist_train_small. In each plot, you need to show two curves: one for the training set and one for the validation set. Run your code several times and observe if the results change. If they do, how would you choose the best parameter settings?

4. [8pts] Locally Weighted Regression.

(a) [2pts] Given \(\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}\) and positive weights \(a^{(1)}, \ldots, a^{(N)}\), show that the solution to the weighted least squares problem

\[
w^* = \arg\min_w \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \big(y^{(i)} - w^T x^{(i)}\big)^2 + \frac{\lambda}{2} \|w\|^2 \tag{1}
\]

is given by the formula

\[
w^* = \big(X^T A X + \lambda I\big)^{-1} X^T A y \tag{2}
\]

where X is the design matrix (defined in class) and A is a diagonal matrix with \(A_{ii} = a^{(i)}\).
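Formula (2) can be sanity-checked numerically on random data. The sketch below uses a linear solver rather than an explicit inverse (previewing the advice in part (b)) and verifies that the gradient of the weighted objective vanishes at w*:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 3, 0.5
X = rng.standard_normal((N, D))      # design matrix
y = rng.standard_normal(N)
a = rng.uniform(0.1, 1.0, size=N)    # positive weights a^(i)
A = np.diag(a)

# Solve (X^T A X + lambda I) w = X^T A y rather than forming an inverse.
w_star = np.linalg.solve(X.T @ A @ X + lam * np.eye(D), X.T @ A @ y)

# Gradient of (1) at w: -X^T A (y - X w) + lambda w; it should vanish at w*.
grad = -X.T @ A @ (y - X @ w_star) + lam * w_star
print(np.allclose(grad, 0))  # True
```

A vanishing gradient at the candidate solution is exactly the first-order condition your written proof should derive.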

(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression. For each new test example x, we compute distance-based weights for each training example

\[
a^{(i)} = \frac{\exp\big(-\|x - x^{(i)}\|^2 / 2\tau^2\big)}{\sum_j \exp\big(-\|x - x^{(j)}\|^2 / 2\tau^2\big)},
\]

compute

\[
w^* = \arg\min_w \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \big(y^{(i)} - w^T x^{(i)}\big)^2 + \frac{\lambda}{2} \|w\|^2,
\]

and predict \(\hat{y} = x^T w^*\). Complete the implementation of locally reweighted least squares by providing the missing parts of q4.py.

Important things to notice while implementing: First, do not invert any matrix; use a linear solver (numpy.linalg.solve is one example). Second, notice that

\[
\frac{\exp(A_i)}{\sum_j \exp(A_j)} = \frac{\exp(A_i - B)}{\sum_j \exp(A_j - B)},
\]

but if we use \(B = \max_j A_j\) it is much more numerically stable, as \(\exp(A_i) / \sum_j \exp(A_j)\) overflows/underflows easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function [4].
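A quick pure-NumPy illustration of why the shift by B = max_j A_j matters:

```python
import numpy as np

A = np.array([1000.0, 1001.0])

# Naive softmax: exp(1000) overflows to inf, so the ratio becomes nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(A) / np.sum(np.exp(A))

# Shifted softmax: subtracting B = max(A) leaves the ratio unchanged
# but keeps every exponent <= 0, so nothing overflows.
B = A.max()
stable = np.exp(A - B) / np.sum(np.exp(A - B))

print(naive)   # [nan nan]
print(stable)  # finite probabilities summing to 1
```

The same shift is what a log-sum-exp helper applies internally when computing the normalizing denominator in log space.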

(c) [2pt] Randomly hold out 30% of the dataset as a validation set. Compute the average loss for different values of τ in the range [10, 1000] on both the training set and the validation set. Plot the training and validation losses as a function of τ (using a log scale for τ).
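A minimal sketch of the 30% hold-out split; the function name and fixed seed are illustrative choices, not requirements, and q4.py's own data loading supplies the actual x and y arrays:

```python
import numpy as np

def train_valid_split(x, y, frac_valid=0.3, seed=0):
    """Randomly hold out a fraction of the dataset for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_valid = int(len(x) * frac_valid)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return x[train_idx], y[train_idx], x[valid_idx], y[valid_idx]

# Values of tau spaced evenly on a log scale over [10, 1000]:
taus = np.logspace(1, 3, 20)
```

Sweeping taus and recording the average loss on each split gives the two curves the question asks you to plot.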

(d) [2pt] How would you expect this algorithm to behave as τ → ∞? When τ → 0? Is this what actually happened?

[4] https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.misc.logsumexp.html