CS 446 / ECE 449 — Homework 1

Instructions.

• Everyone must submit individually at gradescope under hw1 and hw1code.

• The “written” submission at hw1 must be typed, and submitted in any format gradescope accepts

(to be safe, submit a PDF). You may use LATEX, markdown, google docs, MS word, whatever you like;

but it must be typed!

• When submitting at hw1, gradescope will ask you to mark out boxes around each of your answers;

please do this precisely!

• When submitting to hw1code, only upload hw1.py and hw1 utils.py. Additional files will be ignored.

1

1. Linear Regression.

(a) Consider a linear regression problem with a dataset containing N data points {(x

(i)

, y(i)

)}

N

i=1, where

x

(i) ∈ R

d

. The accumulated loss function is given by:

LOLS(w) = 1

2

||Xw − y||2

2

where y ∈ R

N , X ∈ R

N×(d+1) and w ∈ R

d+1

.

i. Find the Hessian matrix of LOLS(w).

Hint: You may want to use the fact that ||Xw − y||2

2 = (Xw − y)

>(Xw − y)

ii. Recall that a twice-continuously differential function f(x) is strictly convex i.f.f. its Hessian is

positive definite for all x. Prove that if N is less than the input dimension d, LOLS(w) can not

be strictly convex.

iii. No matter what X is, prove that for ∀w1, w2 ∈ arg minw∈Rd+1 LOLS(w), we have Xw1 = Xw2.

Note that w1, w2 can be different.

Hint: Use the convexity of the loss function and the convex combinations of w1 and w2.

(b) Consider the same dataset with an L2-norm regularization added to the OLS loss function. Linear

regression with L2 regularization is also called Ridge regression. Recall the composite loss function of

ridge regression:

Lridge(w) = 1

2

||Xw − y||2

2 +

λ

2

||w||2

2

i. One advantage of ridge regression is that for a positive regularization constant (λ > 0), the

matrix (X>X + λI) is always invertible. Prove that the matrix (X>X + λI) is invertible by

showing that it’s positive definite.

ii. Knowing that Lridge(w) is a convex function, show that the estimator for ridge regression is:

wˆ = (X>X + λI)

−1X>y

Solution.

2

2. Programming – Linear Regression.

Recall that the empirical risk in the linear regression method is defined as Rb(w) :=

1

2N

PN

i=1(w>x

(i)−y

(i)

)

2

,

where x

(i) ∈ R

d

is a data point and y

(i)

is an associated label.

(a) Implement linear regression using gradient descent in the linear gd(X, Y, lrate, num iter)

function of hw1.py.

The arguments for this function are: X as the training features, a tensor with shape N × d; Y as the

training labels, an N × 1 tensor; lrate as learning rate (step size), a float number; and num iter

as the number of iterations for gradient descent to run. The objective of this function is to find

parameters w that minimize the empirical risk Rb(w) using gradient descent (only gradient descent).

To keep consistent with the standard program and get correctly scored, prepend a column of ones

to X in order to accommodate the bias term in w, thus w should has d + 1 entries. Then use w = 0

as the initial parameters, and return

Hint. If your are new to machine learning or programming with pytorch, we offer some kind

suggestions. First, try using the vector/matrix operations provided in pytorch and avoid using

for-loops. This will improve both the efficiency and style of your program. Second, create your

own test cases for debugging before submission. With very few samples in your own test case, it is

convenient to compare the program output with your manual calculation. Third, to avoid matrix

computation error, remember to check the shapes of tensors regularly.

Library routines: torch.matmul (@), torch.tensor.shape, torch.tensor.t, torch.cat,

torch.ones, torch.zeros, torch.reshape.

(b) Implement linear regression by using the pseudo inverse to solve for w in the

linear normal(X,Y) function of hw1.py.

The arguments for this function are: X as the training features, a tensor with shape N × d tensor;

Y as the training labels, an N × 1 tensor. To keep consistent with the standard program and get

correctly scored, prepend a column of ones to X in order to accommodate the bias term in w, thus

w should has d + 1 entries.

Library routines: torch.matmul (@), torch.cat, torch.ones, torch.pinverse.

(c) Implement the plot linear() function in hw1.py. Follow the steps below.

Use the provided function hw1 utils.load reg data() to generate a training set X and training

labels Y. Then use linear normal() to calculate the regression results w. Eventually plot the points

of dataset and regressed curve. Return the plot as output. Note that plot linear() should return

the figure object and you should include the visualization in your written submission.

Hint. If your are new to plotting machine learning visualizations, we offer some kind suggestions.

matplotlib.pyplot is an “extremely” useful tool in machine learning, and we commonly refer to it

as plt. Please first get to know the most basic usages by examples from its official website (such

as scatter plots, line plots, etc.). As for our programming question specifically, you may divide and

conquer it by first plotting the points in the dataset, then plotting the linear regression curve.

Library routines: torch.matmul (@), torch.cat, torch.ones, plt.plot, plt.scatter,

plt.show, plt.gcf where plt refers to the matplotlib.pyplot library.

Solution.

3

3. Programming – Logistic Regression.

Recall the empirical risk Rb for logistic regression (as presented in lecture 3):

Rblog(w) = 1

N

X

N

i=1

log(1 + exp(−y

(i)w>x

(i)

)).

Here you will minimize this risk using gradient descent.

(a) In your written submission, derive the gradient descent update rule for this empirical risk by

taking the gradient. Write your answer in terms of the learning rate (step size) η, previous parameters

w, new parameters w0

, number of examples N, and training examples x

(i)

. Show all of your steps.

(b) Implement the logistic() function in hw1.py. You are given as input a training set X, training

labels Y, a learning rate (step size) lrate, and number of gradient updates num iter. Implement

gradient descent to find parameters w that minimize the empirical risk Rblog(w). Perform gradient

descent for num iter updates with a learning rate (step size) of lrate. Same as previous questions,

initialize w = 0, return w as output, and prepend X with a column of ones.

Library routines: torch.matmul (@), torch.tensor.t, torch.exp.

(c) Implement the logistic vs ols() function in hw1.py. Use hw1 utils.load logistic data() to

generate a training set X and training labels Y. Run logistic(X,Y) from part (b) taking X and Y as

input to obtain parameters w (use the defaults for num iter and lrate). Also run linear gd(X,Y)

from Problem 2 to obtain parameters w. Plot the decision boundaries for your logistic regression

and least squares models along with the data X. [Note: As we learned in the class that the decision

rule of Least Squares and Logistic Regression for predicting the class label is sign(wˆ

>x), the decision

boundary can be obtained from wˆ

>x = 0, i.e., for d = 2, we have x2 = −(wˆ0 + wˆ1 × x1)/wˆ2.] Include

the visualizations in your written submission. Which model appears to classify the data better?

Explain in the written submission that why you believe it is better for this problem.

Library routines: torch.linspace, plt.scatter, plt.plot, plt.show, plt.gcf.

Solution.

4

4. Convexity, Lipschitz Continuity, and Smoothness

(a) Convexity

i. Show that if a function f : R

n → R is convex, then for any matrix A ∈ R

n×m and vector b ∈ R

n,

the function g : R

m → R defined by g(x) = f(Ax + b) is convex, where x ∈ R

m.

ii. Prove that if the differentiable function f is λ-strongly convex and the differentiable function g

is convex then f + g is λ-strongly convex.

iii. Given m convex functions {fi

: R

n → R}

m

i=1, denote

f(x) = max

i∈[m]

fi(x),

where [m] = {1, 2, . . . , m}. Prove that f is convex.

(b) Lipschitzness and Smoothness

We say a function f : R

n → R

d

is ρ-Lipschitz if ∀x1, x2 ∈ R

n, it holds that kf(x1) − f(x2)k2 ≤

ρkx1 − x2k2.

i. Prove that if f : R

n → R

m and g : R

m → R

d are ρ-Lipschitz functions, then the composite

g ◦ f : R

n → R

d defined by (g ◦ f)(x) = g(f(x)) is ρ

2

-Lipschitz.

ii. Given a differentiable function f : R

n → R whose gradient is β-Lipschitz, prove that for

∀x, y ∈ R

n we have

f(y) − f(x) ≤ ∇f(x)

>(y − x) + β

2

ky − xk

2

2

.

Hint: You are not required to follow the hints, but please consider them if you have no idea

for proof. (1) Define a tool function φ(t) = f((1 − t)x + ty), thus f(y) − f(x) = φ(1) − φ(0) =

R 1

0

(figure it out); (2) If you get stuck at the final steps, taking a look at the Cauchy–Schwarz

inequality may be helpful.

Solution.

5