CS 446 / ECE 449 — Homework 3

Version 1.0

Instructions.

• Everyone must submit individually on Gradescope under hw3 and hw3code.

• The “written” submission at hw3 must be typed, and submitted in any format Gradescope accepts (to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like; but it must be typed!

• When submitting at hw3, Gradescope will ask you to mark out boxes around each of your answers; please do this precisely!

• When submitting to hw3code, only upload hw3.py and hw3_utils.py. Additional files will be ignored.

Version History.

1. Initial version.


1. Ensemble Methods

In this question, you will implement several ensemble methods, including Bagging and AdaBoost, on a simple dataset. The methods will learn a binary classification of 2D datapoints in [−1, 1]^2.

We have provided a few helper functions in hw3_utils.py.

• visualization() visualizes a dataset and the ensemble’s predictions in 2D space.

• get_dataset_fixed() returns a simple dataset with pre-defined datapoints. Please use this dataset for the plot.

• get_dataset_random() returns a simple dataset by random construction. You may play with it and test your algorithm.

You will need to implement the functions and classes defined in hw3.py. When uploading to Gradescope, please pack the two files hw3_utils.py and hw3.py (without a folder) together into one zip file.

(a) Weak learner

To begin with, you will implement a weak learner to do the binary classification.

A decision stump is a one-level decision tree. It looks at a single feature, and then makes a classification by thresholding on this feature. Given a dataset with positive weights assigned to each datapoint, we can find a stump that minimizes the weighted error:

L = \sum_{i=1}^{n} w^{(i)} \cdot 1\left(y^{(i)} \neq \hat{y}^{(i)}\right)

Here w^{(i)} is the weight of the i-th datapoint, and the prediction \hat{y}^{(i)} is given by thresholding on the k-th feature of datapoint x^{(i)}:

\hat{y}^{(i)} = \begin{cases} s, & \text{if } x^{(i)}_k \geq t \\ -s, & \text{otherwise} \end{cases}

For the 2D dataset we have, the parameters of this stump are the sign s ∈ {+1, −1}, the feature dimension k ∈ {1, 2}, and the threshold t ∈ [−1, 1]. In this question, your task is to find the best stump given the dataset and weights.

Learning a decision stump requires learning a threshold in each dimension and then picking the best one. To learn a threshold in a dimension, you may simply sort the data in the chosen dimension and calculate the loss at each candidate threshold. Candidates are the midpoints between one point and the next, as well as the boundaries of our range of inputs.

Please implement the Stump class in hw3.py. You may define your own functions inside the class, but do not change the interfaces of __init__() and predict(). Please read the template file for further instructions.
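The search described above (sort each dimension, try the boundary and midpoint candidates, keep the lowest weighted loss) can be sketched as follows. This is an illustrative standalone class, not the required Stump interface: the real constructor arguments and attribute names are specified in the template, and the class name StumpSketch here is made up for the example.

```python
import numpy as np

class StumpSketch:
    """Illustrative decision stump; the required Stump interface in hw3.py may differ."""

    def __init__(self, X, y, w):
        # X: (n, d) features in [-1, 1], y: (n,) labels in {+1, -1}, w: (n,) positive weights.
        n, d = X.shape
        best = (np.inf, 1, 0, 0.0)  # (loss, s, k, t)
        for k in range(d):
            xs = np.sort(X[:, k])
            # Candidates: the range boundaries plus midpoints between consecutive sorted points.
            candidates = np.concatenate(([-1.0], (xs[:-1] + xs[1:]) / 2, [1.0]))
            for t in candidates:
                for s in (+1, -1):
                    pred = np.where(X[:, k] >= t, s, -s)
                    loss = np.sum(w * (pred != y))  # weighted 0-1 error
                    if loss < best[0]:
                        best = (loss, s, k, t)
        self.loss, self.s, self.k, self.t = best

    def predict(self, X):
        # Threshold on the chosen feature, with the chosen sign.
        return np.where(X[:, self.k] >= self.t, self.s, -self.s)
```

On a linearly separable toy set this recovers a zero-error stump; with nonuniform weights it trades mistakes on light points for accuracy on heavy ones.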

(b) Weak learner’s predictions

Now test your implementation of Stump on the dataset given by get_dataset_fixed(). Suppose all the datapoints are equally weighted. Please answer the following questions in your written submission:

• What is your decision function?

• How many datapoints are mis-classified?

• Using the helper function visualization(), include a visualization of your stump’s predictions.

(c) Bagging

As we have learned in class, we can use ensemble methods to create a strong learner from the weak learners of part (a). Please complete bagging() in hw3.py. This function should take the whole dataset as input, and sample a subset from it in each step, to build a list of different weak learners.

Please do not change the random sampling of sample indices, and use the default random seed 0, so that your code behaves consistently in the autograder.
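The bagging loop can be sketched as below. This is only an illustration of the idea, not the required bagging() interface: the template fixes its own sampling calls and seed (which must not be changed), and fit_weak_learner here is a hypothetical callable (X, y, w) -> predictor standing in for the Stump class.

```python
import numpy as np

def bagging_sketch(X, y, fit_weak_learner, n_learners=20, n_samples=15, seed=0):
    """Illustrative bagging loop; the required bagging() in hw3.py may differ."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    learners = []
    for _ in range(n_learners):
        # Draw a random subset of the data, with replacement.
        idx = rng.choice(n, size=n_samples, replace=True)
        # Within the subset, every point is weighted equally.
        w = np.full(n_samples, 1.0 / n_samples)
        learners.append(fit_weak_learner(X[idx], y[idx], w))
    return learners

def majority_vote(learners, X):
    # The bagged ensemble predicts by unweighted majority vote over the weak learners.
    votes = sum(learner(X) for learner in learners)
    return np.where(votes >= 0, 1, -1)
```

Because each learner sees a different bootstrap subset, the vote averages out the variance of the individual stumps.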


(d) AdaBoost

Now please implement the AdaBoost algorithm. As we have learned in class, in each step of AdaBoost, it

• Finds the optimal weak learner according to current data weights

• Acquires the weak learner’s predictions

• Calculates the weight for this weak learner

• Updates the weights for datapoints

Complete adaboost() in hw3.py. It should return a series of weak learners and their weights.
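The four steps above can be sketched as follows. Again this is illustrative, not the required adaboost() interface: fit_weak_learner is a hypothetical callable (X, y, w) -> predictor, and the early-stopping conventions for a perfect or chance-level learner are one common choice, not a specification from the assignment.

```python
import numpy as np

def adaboost_sketch(X, y, fit_weak_learner, n_rounds=20):
    """Illustrative AdaBoost loop; the required adaboost() in hw3.py may differ."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                      # start from uniform data weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = fit_weak_learner(X, y, w)            # optimal weak learner for current weights
        pred = h(X)                              # the weak learner's predictions
        err = float(np.sum(w * (pred != y)))     # weighted training error
        if err == 0.0:
            learners.append(h)                   # perfect weak learner: keep it and stop
            alphas.append(1.0)
            break
        if err >= 0.5:
            break                                # no better than chance: stop
        alpha = 0.5 * np.log((1.0 - err) / err)  # this learner's vote weight
        w = w * np.exp(-alpha * y * pred)        # upweight mistakes, downweight hits
        w = w / np.sum(w)                        # renormalize the data weights
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```

The final ensemble predicts sign(Σ_t α_t h_t(x)); note how a point misclassified in one round carries more weight in the next, forcing later stumps to focus on it.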

(e) Visualization

Run your Bagging and AdaBoost algorithms on the fixed dataset given by get_dataset_fixed(). Set the number of weak classifiers to 20, and for Bagging, set the number of samples to 15 for learning each classifier. Please answer the following questions in your written submission:

• Do they perform better than the individual weak learner in (b)?

• Include visualizations for both algorithms in your written submission.

Solution.


2. Learning Theory.

(a) VC Dimensions. In this problem, we’ll explore VC dimensions! First, a few definitions that we will use in this problem. For a feature space X, let F be a set of binary classifiers of the form f : X → {0, 1}. F is said to shatter a set of k distinct points {x^{(i)}}_{i=1}^{k} ⊂ X if for each set of label assignments (y^{(i)})_{i=1}^{k} ∈ {0, 1}^k to these points, there is an f ∈ F which makes no mistakes when classifying these points.

The VC dimension of F is the largest non-negative integer k such that there is a set of k points that F can shatter. Even more formally, let VC(F) denote the VC dimension of F. It can be defined as:

VC(F) = \max_{k} \; k \quad \text{s.t.} \quad \exists \{x^{(i)}\}_{i=1}^{k} \subset X, \; \forall (y^{(i)})_{i=1}^{k} \in \{0,1\}^{k}, \; \exists f \in F, \; \forall i : f(x^{(i)}) = y^{(i)}

The intuition here is that VC dimension captures some kind of complexity or capacity of a set of functions F.

Note: The straightforward proof strategy to show that the VC dimension of a set of classifiers is k is to first exhibit a set of k points that is shattered by the set of classifiers. Then, show that no set of k + 1 points can be shattered. You can do that by finding an assignment of labels which cannot be correctly classified using F.

Notation: We denote by 1_{condition}(·) : X → {0, 1} the indicator function, i.e., 1_{condition}(x) = 1 if the condition is true for x, and 1_{condition}(x) = 0 otherwise.

We will now find the VC dimension of some basic classifiers.

i. 1D Affine Classifier

Let’s start with a fairly simple problem. Consider X = R and Y = {0, 1}. Affine classifiers are of the form:

F_{affine} = \{1_{wx+b \geq 0}(\cdot) : X \to \{0, 1\} \mid w, b \in R\}

Determine VC(F_{affine}) and prove your result.

Hint: Try fewer than a handful of points.

ii. General Affine Classifier

We will now go one step further. Consider X = R^k for some dimensionality k ≥ 1, and Y = {0, 1}. Affine classifiers in k dimensions are of the form

F^k_{affine} = \{1_{w^\top x + b \geq 0}(\cdot) : X \to \{0, 1\} \mid w \in R^k, b \in R\}

Determine VC(F^k_{affine}) and prove your result.

Hint: Note that w^\top x + b can be written as [x^\top \; 1] \begin{bmatrix} w \\ b \end{bmatrix}. Moreover, consider putting all data points into a matrix, e.g.,

X = \begin{bmatrix} (x^{(1)})^\top & 1 \\ (x^{(2)})^\top & 1 \\ \vdots & \vdots \end{bmatrix}

iii. Decision Trees

Consider X = R^k for some dimensionality k ≥ 1, and Y = {0, 1}. Show that the VC dimension of axis-aligned (coordinate-split) decision trees is infinite.

Hint: Consider an arbitrary dataset, and show that a decision tree can be constructed to exactly fit that dataset.

(b) Rademacher Complexity. Recall from class that the generalization error bound scales with the complexity of the function class F, which, in turn, can be measured via Rademacher complexity. In this question we will compute the Rademacher complexity of linear functions step by step. Let’s consider a dataset \{x^{(i)}\}_{i=1}^{n} \subset R^k with norms bounded by \|x^{(i)}\|_2 \leq R, and the set of linear classifiers F = \{x \mapsto w^\top x \mid w \in R^k, \|w\|_2 \leq W\}.


i. For a fixed sign vector \epsilon = (\epsilon_1, \ldots, \epsilon_n) \in \{\pm 1\}^n, show that:

\max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x^{(i)}) \leq W \|\bar{x}\|_2

where \bar{x} is defined as \bar{x} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i x^{(i)}.

Hint: Cauchy–Schwarz inequality.

ii. Assume \epsilon_i is distributed i.i.d. according to \Pr[\epsilon_i = +1] = \Pr[\epsilon_i = -1] = 1/2. Show that

E\left[\|\bar{x}\|_2^2\right] \leq \frac{R^2}{n}

iii. Assume \epsilon_i follows the same distribution as in the previous problem. Recall the definition of Rademacher complexity:

Rad(F) = E\left[\max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x^{(i)})\right]

Show that the Rademacher complexity of the set of linear classifiers satisfies:

Rad(F) \leq \frac{RW}{\sqrt{n}}

Hint: Jensen’s inequality.

Solution.
