ECE 4710J
Homework #4
Properties of Simple Linear Regression
1. (10 points) In lecture, we spent a great deal of time talking about simple linear regression. To briefly summarize, the simple linear regression model assumes that given a
single observation $x$, our predicted response for this observation is $\hat{y} = \theta_0 + \theta_1 x$. (Note:
In this problem we write $(\theta_0, \theta_1)$ instead of $(a, b)$ to more closely mirror the multiple linear
regression model notation.)
We saw that the $\theta_0 = \hat{\theta}_0$ and $\theta_1 = \hat{\theta}_1$ that minimize the average L2 loss for the simple
linear regression model are:
\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x} \]
\[ \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x} \]
Or, rearranging terms, our predictions $\hat{y}$ are:
\[ \hat{y} = \bar{y} + r \sigma_y \frac{x - \bar{x}}{\sigma_x} \]
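As a numerical illustration of the formulas above, here is a minimal Python sketch. The data values and the function names (`summary_stats`, `fit_slr`, `predict_rearranged`) are hypothetical and not from lecture. The sketch assumes $\sigma_x$ and $\sigma_y$ are population standard deviations (dividing by $n$) and that $r$ is the correlation coefficient computed with the same convention.

```python
import math

def summary_stats(xs, ys):
    """Return (x_bar, y_bar, sigma_x, sigma_y, r) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Assumption: population standard deviations (divide by n).
    sigma_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    # Correlation coefficient: average product of standardized values.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n * sigma_x * sigma_y)
    return x_bar, y_bar, sigma_x, sigma_y, r

def fit_slr(xs, ys):
    """Return (theta0_hat, theta1_hat) using theta1_hat = r * sigma_y / sigma_x."""
    x_bar, y_bar, sigma_x, sigma_y, r = summary_stats(xs, ys)
    theta1 = r * sigma_y / sigma_x
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

def predict_rearranged(x, xs, ys):
    """Prediction in the rearranged form y_bar + r * sigma_y * (x - x_bar) / sigma_x."""
    x_bar, y_bar, sigma_x, sigma_y, r = summary_stats(xs, ys)
    return y_bar + r * sigma_y * (x - x_bar) / sigma_x

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
theta0, theta1 = fit_slr(xs, ys)
for x in xs:
    print(x, theta0 + theta1 * x, predict_rearranged(x, xs, ys))
```

Both printed prediction columns should agree up to floating-point error, since the second form is only an algebraic rearrangement of the first.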
(a) (4 points) As we saw in lecture, a residual $e_i$ is defined to be the difference between
a true response $y_i$ and predicted response $\hat{y}_i$. Specifically, $e_i = y_i - \hat{y}_i$. Note that
there are $n$ data points, and each data point is denoted by $(x_i, y_i)$.
Prove, using the equation for $\hat{y}$ above, that $\sum_{i=1}^{n} e_i = 0$ (meaning the sum of the
residuals is zero).
Answer.
(b) (3 points) Using your result from part (a), prove that $\bar{y} = \bar{\hat{y}}$.
Answer.
(c) (3 points) Prove that $(\bar{x}, \bar{y})$ is on the simple linear regression line.
Answer.
Geometric Perspective of Least Squares
2. (10 points) We also viewed both the simple linear regression model and the multiple
linear regression model through linear algebra. The key geometric insight was that if
we train a model on some design matrix $X$ and true response vector $Y$, our predicted
response $\hat{Y} = X\hat{\theta}$ is the vector in $\text{span}(X)$ that is closest to $Y$ ($\hat{Y}$ is the orthogonal
projection of $Y$ onto $\text{span}(X)$).
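In the simple linear regression case, our optimal vector $\theta$ is $\hat{\theta} = [\hat{\theta}_0, \hat{\theta}_1]^T$, and our design
matrix is
\[ X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \vec{x} \end{bmatrix} \]
This means we can write our predicted response vector as
\[ \hat{Y} = X \begin{bmatrix} \hat{\theta}_0 \\ \hat{\theta}_1 \end{bmatrix} = \hat{\theta}_0 \mathbf{1} + \hat{\theta}_1 \vec{x}. \]
Note, in this problem, $\vec{x}$ refers to the $n$-length vector $[x_1, x_2, \ldots, x_n]^T$. In other words, it
is a feature, not an observation.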
For this problem, assume we are working with the simple linear regression model,
though the properties we establish here hold for any linear regression model that contains
an intercept term.
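To make the column view of $\hat{Y} = X\hat{\theta}$ concrete, here is a small Python sketch. The feature values and the parameter vector `theta_hat` are hypothetical placeholders (not fitted to any data), and the helper names are illustrative only. It computes $\hat{Y}$ once row by row and once as a combination of the two columns of $X$.

```python
def design_matrix(xs):
    """Rows [1, x_i]: a column of ones followed by the feature column."""
    return [[1.0, x] for x in xs]

def matvec(X, theta):
    """Row-by-row matrix-vector product X @ theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in X]

def column_combination(xs, theta):
    """theta0 * (vector of ones) + theta1 * (feature vector)."""
    theta0, theta1 = theta
    ones = [1.0] * len(xs)
    return [theta0 * o + theta1 * x for o, x in zip(ones, xs)]

# Hypothetical values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
theta_hat = [0.5, 2.0]

X = design_matrix(xs)
Y_hat_rows = matvec(X, theta_hat)
Y_hat_cols = column_combination(xs, theta_hat)
print(Y_hat_rows)
print(Y_hat_cols)
```

The two results coincide, which is the sense in which $\hat{Y}$ lies in the span of the columns of $X$.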
(a) (4 points) Using the geometric properties from lecture, prove that $\sum_{i=1}^{n} e_i = 0$.
Hint: Recall, we define the residual vector as $e = Y - \hat{Y}$, and $e = [e_1, e_2, \ldots, e_n]^T$.
Answer.
(b) (3 points) Explain why the vector $\vec{x}$ (as defined in the problem) and the residual
vector $e$ are orthogonal. Hint: Two vectors are orthogonal if their dot product is 0.
Answer.
(c) (3 points) Explain why the predicted response vector $\hat{Y}$ and the residual vector $e$
are orthogonal.
Answer.
Properties of a Linear Model With No Constant Term
Suppose that we don't include an intercept term in our model. That is, our model is now
\[ \hat{y} = \gamma x, \]
where $\gamma$ is the single parameter for our model that we need to optimize. (In this equation,
$x$ is a scalar, corresponding to a single observation.)
As usual, we are looking to find the value $\hat{\gamma}$ that minimizes the average L2 loss (mean squared
error) across our observed data $\{(x_i, y_i)\}$, $i = 1, \ldots, n$:
\[ R(\gamma) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \gamma x_i)^2 \]
The normal equations derived in lecture no longer hold. In this problem, we’ll derive a
solution to this simpler model. We’ll see that the least squares estimate of the slope in this
model differs from the simple linear regression model, and will also explore whether or not
our properties from the previous problem still hold.
3. (5 points) Use calculus to find the minimizing $\hat{\gamma}$. That is, prove that
\[ \hat{\gamma} = \frac{\sum x_i y_i}{\sum x_i^2} \]
Note: This is the slope of our regression line, analogous to $\hat{\theta}_1$ from our simple linear
regression model.
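Before writing the proof, it can help to sanity-check the claimed formula numerically. The sketch below is not a proof. The data and the names `mse` and `gamma_hat_formula` are hypothetical. It evaluates $R(\gamma)$ at the claimed $\hat{\gamma}$ and at a few nearby values.

```python
def mse(gamma, xs, ys):
    """Average squared error R(gamma) for the no-intercept model y_hat = gamma * x."""
    n = len(xs)
    return sum((y - gamma * x) ** 2 for x, y in zip(xs, ys)) / n

def gamma_hat_formula(xs, ys):
    """Claimed minimizer: sum(x_i * y_i) / sum(x_i ** 2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]

gamma_hat = gamma_hat_formula(xs, ys)
for delta in (-0.1, -0.01, 0.0, 0.01, 0.1):
    print(gamma_hat + delta, mse(gamma_hat + delta, xs, ys))
```

On this data the risk printed for `delta = 0.0` should be the smallest of the five, which is consistent with (but does not establish) the formula.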
Answer.
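4. (10 points) For our new simplified model, our design matrix $X$ is:
\[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \vec{x} \end{bmatrix}. \]
Therefore our predicted response vector $\hat{Y}$ can be expressed as $\hat{Y} = \hat{\gamma}\vec{x}$. ($\vec{x}$ here is defined
the same way it was in Question 2.)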
Earlier in this homework, we established several properties that held true for the simple
linear regression model that contained an intercept term. For each of the following four
properties, state whether or not they still hold true even when there isn’t an intercept
term. Be sure to justify your answer.
(a) (2 points) $\sum_{i=1}^{n} e_i = 0$.
Answer.
(b) (3 points) The column vector $\vec{x}$ and the residual vector $e$ are orthogonal.
Answer.
(c) (3 points) The predicted response vector $\hat{Y}$ and the residual vector $e$ are orthogonal.
Answer.
(d) (2 points) $(\bar{x}, \bar{y})$ is on the regression line.
Answer.
MSE “Minimizer”
5. (15 points) Recall from calculus that given some function $g(x)$, the $x$ you get from
solving $\frac{dg(x)}{dx} = 0$ is called a critical point of $g$; this means it could be a minimizer or a
maximizer for $g$. In this question, we will explore some basic properties and build some
intuition on why, for certain loss functions such as squared L2 loss, the critical point of
the empirical risk function (defined as average loss on the observed data) will always be
the minimizer.
Given some linear model $f(x) = \gamma x$ for some real scalar $\gamma$, we can write the empirical
risk of the model $f$ given the observed data $\{x_i, y_i\}$, $i = 1, \ldots, n$ as the average L2 loss,
also known as mean squared error (MSE):
\[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \gamma x_i)^2. \]
(a) (2 points) Let's break the function above into individual terms. Complete the
following sentence by filling in the blanks using one of the options in the parentheses
following each of the blanks:
The mean squared error can be viewed as a sum of $n$ ______ (linear/quadratic/logarithmic/exponential) terms, each of which can be treated as a function of
______ ($x_i$/$y_i$/$\gamma$).
Answer.
(b) (4 points) Let's investigate one of the $n$ functions in the summation in the MSE.
Define $g_i(\gamma) = \frac{1}{n}(y_i - \gamma x_i)^2$ for $i = 1, \ldots, n$. Recall from calculus that we can use
the 2nd derivative of a function to describe its curvature about a certain point (whether
it is concave up, concave down, or possibly at a point of inflection). You can take the
following as a fact: A function is convex if and only if the function's 2nd derivative
is nonnegative on its domain. Based on this property, verify that $g_i$ is a convex
function.
Answer.
(c) (3 points) Briefly explain in words why given a convex function $g(x)$, the critical
point we get by solving $\frac{dg(x)}{dx} = 0$ minimizes $g$. You can assume that $\frac{dg(x)}{dx}$ is a
function of $x$ (and not a constant).
Answer.
(d) (4 points) Now that we have shown that each term in the summation of the MSE
is a convex function, one might wonder if the entire summation is convex given that
it is a sum of convex functions.
Let's look at the formal definition of a convex function. Algebraically speaking,
a function $g(x)$ is convex if for any two points $(x_1, g(x_1))$ and $(x_2, g(x_2))$ on the
function,
\[ g(cx_1 + (1 - c)x_2) \le c\,g(x_1) + (1 - c)\,g(x_2) \]
for any real constant $0 \le c \le 1$.
The above definition says that, given the plot of a convex function $g(x)$, if you
connect any two points on the function, the line segment will always lie
on or above $g(x)$ (try this with the graph of $y = x^2$).
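The suggestion to try this with $y = x^2$ can also be done numerically. The following Python sketch is only a spot check on a few hypothetical point pairs (a finite check is not a proof of convexity), and the helper name `chord_gap` is illustrative. It computes how far the chord lies above the function at several values of $c$.

```python
def g(x):
    """Example convex function y = x^2."""
    return x ** 2

def chord_gap(g, x1, x2, c):
    """Chord height minus function height at the point c*x1 + (1 - c)*x2.

    The convexity inequality holds at (x1, x2, c) exactly when this is >= 0.
    """
    return c * g(x1) + (1 - c) * g(x2) - g(c * x1 + (1 - c) * x2)

# Hypothetical point pairs and weights c in {0.0, 0.1, ..., 1.0}.
points = [(-3.0, 2.0), (0.5, 4.0), (-1.0, -0.25)]
weights = [i / 10 for i in range(11)]

gaps = [chord_gap(g, x1, x2, c) for x1, x2 in points for c in weights]
# Small tolerance for floating-point rounding at the endpoints.
all_on_or_above = all(gap >= -1e-12 for gap in gaps)
print(all_on_or_above)
```

Swapping in a non-convex function for `g` (for example, `lambda x: -x ** 2`) makes some gaps negative, which shows what a violation of the definition looks like.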
i. (2 points) Using the definition above, show that if g(x) and h(x) are both
convex functions, their sum g(x) + h(x) will also be a convex function.
Answer.
ii. (2 points) Based on what you have shown in the previous part, explain intuitively why the sum of n convex functions is still a convex function when
n > 2.
Answer.
(e) (2 points) Finally, using the previous parts, explain why, in our case, when we
solve for the critical point of the MSE by taking the derivative with respect to the
parameter and setting the expression to 0, it is guaranteed that the solution we find
will minimize the MSE.
Answer.
Congratulations! You have finished Homework 4!