CS 446 / ECE 449 — Homework 4

Version 1.0

Instructions.


• Everyone must submit individually on Gradescope under hw4 and hw4code.

• The “written” submission at hw4 must be typed and submitted in any format Gradescope accepts (to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like; but it must be typed!

• When submitting at hw4, Gradescope will ask you to mark out boxes around each of your answers; please do this precisely!

• Please make sure your NetID is clear and large on the first page of the homework.

• Your solution must be written in your own words.

• When submitting to hw4code, only upload hw4.py and hw4_utils.py. Additional files will be ignored.


1. Principal Component Analysis

(a) In the lecture, we mainly discussed the case where the data is centered at the origin (see slide 14/21). If the data is not centered at the origin, the PCA algorithm first subtracts the mean of all the data, then calculates the projection w, and finally adds the mean back (see slide 15/21). In this question, we will prove that subtracting the mean is reasonable for PCA. However, since proving the general theorem is beyond the scope of this course, we will focus on the 2-dimensional case.

For a line on the xy-plane, we denote it as the set of roots/solutions of f(p) = 0, where f(p) = v^⊤p + b. Given N points on the xy-plane, P = [p^{(1)}, …, p^{(N)}] with p^{(i)} = (x^{(i)}, y^{(i)}), we want to find the optimal line \tilde{f} = 0, with parameters \tilde{v} and \tilde{b}, that minimizes the sum of the squared distances between the points and the line, as in the equation below.

\tilde{v}, \tilde{b} = \arg\min_{v, b} \sum_{i=1}^{N} \left( \frac{v^\top p^{(i)} + b}{\|v\|_2} \right)^2 \qquad (1)

Prove that if the optimal line exists, then the mean of the N points is on the line.

Hint: Without loss of generality, consider the case where v is a unit vector. Note that v is the normal vector of the line.

(b) For each of the following statements, specify whether the statement is true or false. If you think a statement is false, explain in 1 to 2 sentences why it is false.

• True or False: As shown in the figure below, PCA seeks a line such that the sum of the vertical distances from the points to the line is minimized.

• True or False: PCA seeks a projection that best represents the data in a least-squares sense.

• True or False: The principal components are not orthogonal to each other.

• True or False: Solving PCA using SVD might result in a solution which corresponds to a local minimum.

(c) In PCA, assume we have

X^⊤X = \begin{pmatrix} 12 & 0 & 0 & 0 \\ 0 & 6 & 0 & 0 \\ 0 & 0 & 20 & 0 \\ 0 & 0 & 0 & 10 \end{pmatrix}.

Compute the largest eigenvalue of X^⊤X and its corresponding eigenvector.
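A quick way to sanity-check the hand computation (not part of the required written answer) is NumPy's symmetric eigendecomposition:

```python
import numpy as np

# The diagonal matrix X^T X given in part (c)
XtX = np.diag([12.0, 6.0, 20.0, 10.0])

# eigh is for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(XtX)

largest_eigval = eigvals[-1]   # largest eigenvalue of X^T X
top_eigvec = eigvecs[:, -1]    # corresponding unit-norm eigenvector (up to sign)
```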

Solution.


2. K-Means

We are given a dataset D = {x^{(i)}}_{i=1}^{N} of d-dimensional points x^{(i)} ∈ R^d, which we are interested in partitioning into K clusters, each having a cluster center µ_k ∈ R^d (k ∈ [K]), via the K-Means algorithm. Note that we denote [K] = {1, …, K} and similarly for [N]. Recall (from our lecture notes) that the algorithm optimizes the following cost function:

\min_{\mu_k \in \mathbb{R}^d,\, k \in [K]} \; \min_{r_{ik} \in \{0,1\},\, i \in [N],\, k \in [K]} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{1}{2} r_{ik} \|x^{(i)} - \mu_k\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} r_{ik} = 1 \;\; \forall i \in [N]. \qquad (2)

(a) Given fixed cluster centers µ_k, ∀k ∈ [K], what are the optimal assignments r_{ik} for the optimization in (2)? (Assume there are no ties, i.e., ∥x^{(i)} − µ_p∥_2^2 ≠ ∥x^{(i)} − µ_q∥_2^2 for all p ≠ q.)

(b) Given fixed assignments r_{ik}, ∀i ∈ [N], k ∈ [K], we assume each cluster center has at least one data point assigned to it, i.e., \sum_{i=1}^{N} r_{ik} > 0. What are the optimal cluster centers µ_k for the optimization in (2)?

(c) Implement the K-Means algorithm (with K = 2) for a 2-dimensional dataset and submit your code to hw4code on Gradescope. For the given dataset, visualize the clusters at the first three steps and attach the plots in your written (typed) report. Please see hw4.py for details.
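The graded interface is defined in hw4.py; as a rough reference only, here is a minimal self-contained sketch of Lloyd's algorithm (the function name, synthetic data, and step count are illustrative assumptions, not the hw4code interface):

```python
import numpy as np

def kmeans(X, K=2, n_steps=10, seed=0):
    """Minimal Lloyd's algorithm: X is (N, d); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize centers as K distinct data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_steps):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign

# Illustrative 2-D data: two well-separated blobs (not the course dataset)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
centers, assign = kmeans(X, K=2)
```

For the required plots, you would scatter X colored by `assign` after each of the first three steps.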

Solution.


3. Gaussian Mixture Models

Consider a one-dimensional Gaussian mixture model with K components (k ∈ {1, …, K}), each having mean µ_k, variance σ_k^2, and mixture weight π_k. Further, we are given a dataset D = {x^{(i)}}_{i=1}^{N}, where x^{(i)} ∈ R.

(a) What is the log-likelihood of the data, i.e., log p(D | µ_k, σ_k, π_k, 1 ≤ k ≤ K), according to the Gaussian mixture model, assuming the data samples x^{(i)} are i.i.d.?

(b) Recall the Expectation-Maximization procedure for Gaussian mixture models from our lecture, where r_{ik} = p_{\psi^{(t)}}(z^{(i)} = e_k \mid x^{(i)}). Prove that \sum_{k=1}^{K} r_{ik} = 1.
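As a quick numerical illustration of (b) (not a substitute for the proof), responsibilities computed via Bayes' rule normalize to 1 by construction; the mixture parameters below are arbitrary choices for illustration only:

```python
import math

# Arbitrary illustrative parameters for a K = 3 one-dimensional mixture
pis = [0.2, 0.5, 0.3]
mus = [-1.0, 0.0, 2.0]
sigmas = [0.5, 1.0, 1.5]

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def responsibilities(x):
    # r_k = pi_k N(x | mu_k, sigma_k^2) / sum_j pi_j N(x | mu_j, sigma_j^2)
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

# Sums to 1 because the denominator is exactly the sum of the numerators
r = responsibilities(0.7)
```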

(c) Recall the Expectation-Maximization procedure for Gaussian mixture models from our lecture, where the Lagrangian is

L = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \left[ \frac{1}{2\sigma_k^2} \left( x^{(i)} - \mu_k \right)^2 + \frac{1}{2} \log\left( 2\pi\sigma_k^2 \right) - \log \pi_k \right] + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right).

Show how to derive \pi_k = \frac{1}{N} \sum_{i=1}^{N} r_{ik} from setting \frac{\partial L}{\partial \lambda} = 0 and \frac{\partial L}{\partial \pi_k} = 0 (1 ≤ k ≤ K).

(d) A nice property of Gaussian mixtures is the universal approximation property. Loosely speaking, given any distribution, there exists a Gaussian mixture that can approximate it up to arbitrary accuracy. Let us verify this powerful property on discrete distributions. Consider n points z_1, z_2, …, z_n ∈ R, and a discrete distribution characterized by a probability mass function q on R:

q(x) = \begin{cases} q_i & \text{if } x = z_i, \\ 0 & \text{otherwise}, \end{cases}

where q_i > 0 and \sum_{i=1}^{n} q_i = 1.

Construct a Gaussian mixture on R that approximates the distribution characterized by the probability mass function q. It is sufficient to describe the construction and show intuitively why it is a good approximation. You don't need to rigorously prove your results.
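As an informal sanity check of one natural construction (a sketch of the intuition, not the required written answer): place a Gaussian of small shared width σ at each z_i, weighted by q_i. The support points, masses, and width below are illustrative assumptions:

```python
import math

# Illustrative discrete distribution: support points z_i with masses q_i
zs = [-1.0, 0.5, 2.0]
qs = [0.2, 0.3, 0.5]
sigma = 1e-3  # small shared width; shrinking it tightens the approximation

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_mass(a, b):
    # Probability that the mixture sum_i q_i N(z_i, sigma^2) assigns to [a, b]
    return sum(q * (norm_cdf(b, z, sigma) - norm_cdf(a, z, sigma)) for z, q in zip(zs, qs))

# The mass in a small window around each z_i is almost exactly q_i
masses = [mixture_mass(z - 0.1, z + 0.1) for z in zs]
```

As σ → 0, each component concentrates all of its mass at its own z_i, so the mixture's mass near z_i tends to q_i.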

(e) We have seen the Gaussian mixture with finitely many components (K components), where π_k (k ∈ {1, …, K}) are the mixture weights. Noting that \sum_{k=1}^{K} \pi_k = 1 and π_k ≥ 0, we can see that the mixture weights represent a probability mass function. Therefore, we can generalize the Gaussian mixture to combine infinitely many components by using a probability density function.

Formally, let µ : R → R and σ : R → R_+ be two functions representing the means and standard deviations, respectively. Let π : R → R_+ be a probability density function on R representing the mixture weights. Denote the density of the generalized Gaussian mixture as

p(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma(t)^2}} \exp\left( -\frac{(x - \mu(t))^2}{2\sigma(t)^2} \right) \pi(t) \, dt. \qquad (3)

Prove that p is a valid density function, i.e., ∀x ∈ R : p(x) ≥ 0 and \int_{-\infty}^{\infty} p(x)\,dx = 1 (assuming p(x) and \int_{-\infty}^{\infty} p(x)\,dx exist).

Hint: See Fubini's Theorem, but you don't need to worry about the mathematical details.
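The claim can also be sanity-checked numerically for one concrete choice of µ, σ, and π (these choices are illustrative assumptions, not part of the problem): take π(t) to be the standard normal density, µ(t) = t, and σ(t) = 1, and approximate both integrals by quadrature:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def p(x, n=401, lim=10.0):
    # Trapezoidal quadrature of (3) over t in [-lim, lim];
    # the Gaussian tails beyond |t| = 10 are negligible here
    h = 2 * lim / (n - 1)
    total = 0.0
    for i in range(n):
        t = -lim + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * normal_pdf(x, t, 1.0) * normal_pdf(t, 0.0, 1.0)
    return total * h

# Integrating p over x (the swapped order that Fubini justifies) gives ~1
h = 0.05
mass = sum(p(-10.0 + i * h) for i in range(401)) * h
```

With this choice, p is the convolution of two standard normals, i.e., a N(0, 2) density, so it integrates to 1 as the quadrature confirms.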

Solution.
