Machine Learning Foundations

Homework #2

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions

(without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source

code to designated places. You are encouraged to (but not required to) include a README to help the

TAs check your source code. Any programming language/platform is allowed.

Discussions on course materials and homework solutions are encouraged. But you should write the final

solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but

not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework

solutions and/or source codes to your classmates at any time.

You should write your solutions in English or Chinese with the common math notations introduced in

class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 200 points and 20 bonus points. In general, every homework set would come with a full credit of 200 points, with some possible bonus points.

1. (80 points) Go register for the Coursera version of the first part of the class ( https://www.

coursera.org/learn/ntumlone-mathematicalfoundations/ ) and solve its homework 2. The

registration should be totally free. Then, record the highest score that you get within up to 3

trials. Please print out a snapshot of your score as an evidence. (Hint: The problems below are

simple extensions of the Coursera problems.)

2. (20 points) Consider the “negative thick line in R

2” hypothesis set, which includes hypothesis that

returns +1 when |w0 + w1x1 + w2x2| ≤ θ and −1 elsewhere. Show that the VC-Dimension of the

hypothesis set is no less than 4. Please provide proof of your answer.

3. (20 points) Consider the “triangle waves” hypothesis set on R, , which is given by

H = {hα | hα(x) = sign(|αx mod 4 − 2| − 1), α ∈ R}

Here (x mod 4) is a number x − 4k for some integer k such that x − 4k ∈ [0, 4). For instance,

(11.26 mod 4) is 3.26, and (−11.26 mod 4) is 0.74. What the VC-Dimension of such an H? Please

prove your answer.

4. (20 points) For any two hypothesis sets H1 and H2 that come with non-empty intersection, prove

that dvc(H1 ∩ H2) ≤ dvc(H2). (Hint: This may partially help you solve Questions 14 and 15 on

Coursera.)

5. (20 points) Consider H1 as the positive-ray hypothesis set (as discussed in class), and H2 as the

negative-ray hypothesis set (which contains the negation of each hypothesis in H1). We showed

that mH1

(N) = N + 1 in class. Write down mH1∪H2

(N) and use that to calculate dvc(H1 ∪ H2).

(Hint: This may partially help you solve Question 15 on Coursera.)

1 of 2

Machine Learning Foundations (NTU, Fall 2018) instructor: Hsuan-Tien Lin

6. (20 points) For Problems 7–8, you will play with the decision stump algorithm. In class, we

taught about the learning model of “positive and negative rays” (which is simply one-dimensional

perceptron) for one-dimensional data. The model contains hypotheses of the form:

hs,θ(x) = s · sign(x − θ).

The model is frequently named the “decision stump” model and is one of the simplest learning

models. As shown in class, for one-dimensional data, the VC dimension of the decision stump

model is 2.

In fact, the decision stump model is one of the few models that we could easily minimize Ein

efficiently by enumerating all possible thresholds. In particular, for N examples, there are at most

2N dichotomies (see page 22 of lecture 5 slides), and thus at most 2N different Ein values. We

can then easily choose the dichotomy that leads to the lowest Ein, where ties an be broken by

randomly choosing among the lowest Ein ones. The chosen dichotomy stands for a combination

of some ”spot” (range of θ) and s, and commonly the median of the range is chosen as the θ that

realizes the dichotomy.

In the next problem, you are asked to implement such and algorithm and run your program on an

artificial data set. We shall start by generating a one-dimensional data by the procedure below:

(a) Generate x by a uniform distribution in [−1, 1].

(b) Generate y by f(x) = ˜s(x) + noise where ˜s(x) = sign(x) and the noise flips the result with

20% probability.

For any decision stump hs,θ with θ ∈ [−1, 1], express Eout(hs,θ) as a function of θ and s. Write

down your derivation steps.

7. (20 points, *) Generate a data set of size 20 by the procedure above and run the one-dimensional

decision stump algorithm on the data set. Record Ein and compute Eout with the formula above.

Repeat the experiment 1000 times and plot a histogram of Ein − Eout. Describe your findings.

Bonus: More about the Bounding Function

8. (Bonus 20 points) In class, we have shown that

B(N, k) ≤

k

X−1

i=0

N

i

Show that in fact the equality holds.

2 of 2