Assignment 1 SENG 474 / CSC 578D Data Mining


5/5 - (2 votes)

Assignment 1
SENG 474 / CSC 578D
Data Mining

After that point, no submissions will be
Show your work for all questions.
Different marking schemes will be used for undergrad (SEng 474) and grad (CSc 578D)
students. Undergrad students do not have to answer the grad questions.
All code questions use the python/numpy/pytorch/matplotlib. You may install these
libraries on your own computer, or you can work in the lab.
For each question you should hand the attached code skeleton, including your
implementations in it. Include graphs and written answers in your pdf.
1: Neural Networks
(SEng 474; CSc 578D: 20 points)
Use the attached code skeleton to complete this question.
(a) Using PyTorch, build a network that can properly classify a point in 2D space as
being inside or outside of the unit circle. See skeleton code provided. Plot the
percent correct on training data against iteration number. [10 pts]
(b) Generate a new test batch and run your final model on it. Plot the points in 2D
and color them based on their predicted label. We suggest using the provided
plot_decision_boundary function. [5 pts]
(c) The simple neural network architecture provided in the skeleton code cannot
properly predict for this task. Why? Plotting the predicted labels (as in b) for the
simple architecture may help you to answer this question. [5 pts]
2. Gradient Descent: Linear Regression
(SENG 474: 30 points, CSC578D: 40 points. SEng 474; CSc 587D: 5 Bonus points)
In this part of the assignment we will implement a linear regression model, using
the Boston houses dataset. We will implement two techniques for minimizing the cost
function: batch gradient descent, and stochastic gradient descent.
You are provided with a skeleton python program and you have to
implement the missing parts.
First let’s recall some theory and mathematical notation1
. We express our dataset as
a matrix X ∈ R
n×m where each row is a training sample and each column represents
a feature, and a vector y ∈ R
n of target values. In order to make the following notation
easier, we defined a training sample as vector as x = [1, x1, x2, . . . , xm]. Basically
we add a column of ones to X. The purpose of linear regression is to estimate the
vector of parameters w = [w0, w1, . . . , wm] such that the hyper-plane x · wT fits, in
an optimal way, our training samples. Once we have obtained w we can predict the
value of testing sample, xtest, simply by plugging it into the equation yˆ = xtest · wT
We estimate w by minimizing a cost function over the entire dataset, which is defined
as follows
E(w) = 1

yi − xi
· wT
we apply gradient descent techniques to find the parameters vector w that minimizes
the cost function E(w). This is achieved by an iterative algorithm that starts with an
initial guess for w and repeatedly performs the update
wj := wj − κ

E(w) j = 0, 1, . . . , m
where κ is the learning rate parameter.
In this question you are asked to implement some gradient descent techniques. We
are going to set a maximum number of iteration, max_iter in the code, and analyze
what happens to the cost function at each iteration by plotting the error function at each
SEng 474; CSc 587D: 30 pts
(a) Implement the batch gradient descent algorithm. Plot the cost function for each
iteration. Test it with a few different learning rates. E.g. κ = 0.001, 0.01, 0.1.
Explain what is happening. [15 pts]
(b) Implement the stochastic gradient descent algorithm. Plot the cost function with
different learning rate. Compare it with the batch gradient descent. [15 pts]
1Matrices are represented by a capital letter, vectors with a lower case bold letter
CSc 587D only, 10 pts
(c) For the code above, we used a fixed learning rate. For this question, you should
research alternative methods for setting the learning rate. Describe in detail one
method that uses a learning rate schedule, and one method that is adaptive. Explain not only the update rule, but the intuition behind why the approach works.
[5 pts]
SEng 474; CSc 587D: Bonus Question, 5 pts
(d) Now we introduce L2 regularization. Explain why it is used. Implement it and
repeat (b) with different values of λ i.e 10, 100, 1000. Remember that with
regularization the cost function becomes
E(w) = 1


(i) − x
· wT
+ λ

Note that by convention we don’t regularize w0. [5 pts]
3. Gradient Descent: Logistic Regression
(SENG 474; CSC578D: 15 points)
For this question, we will implement logistic regression, and test it using the breast
cancer dataset and the code skeleton in For this question, you only need
to implement batch gradient descent. This time, plot the log likelihood function for
each iteration:
L(w) = X
yi(w0xi) −
ln(1 + e
) (1)
Test it with a few different learning rates. E.g. κ = 0.001, 0.01, 0.1. Explain what
is happening. [15 pts]

PlaceholderAssignment 1 SENG 474 / CSC 578D Data Mining
Open chat
Need help?
Can we help?