COMP 4211: Machine Learning

Programming Assignment 1

1 Objective

The objective of this programming assignment is twofold:

1. To acquire a better understanding of supervised learning methods by using a public-domain software package called scikit-learn.

2. To evaluate the performance of several supervised learning methods by conducting an empirical study on three data sets.

2 Major Tasks

The assignment consists of the following tasks:

1. To learn to use the linear regression model for regression.

2. To learn to use the logistic regression model for classification.

3. To learn to use the single-hidden-layer neural network model for classification.

4. To conduct an empirical study using different supervised learning methods.

5. To write up a report.

Each of these tasks will be elaborated in the following subsections.

2.1 Regression Method

Linear regression is a basic model for regression which is expressed in the form $f(x; w) = w_0 + w_1 x_1 + \cdots + w_d x_d$, where $w$ denotes the parameters to be learned from data. Note that this basic model has no hyperparameter to set.
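As a minimal sketch of the model above, the following fits scikit-learn's LinearRegression to synthetic data. The data, the true weights, and all variable names are assumptions made for illustration and are not part of the assignment data sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 0.5 + 2*x1 - 1*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w0, true_w = 0.5, np.array([2.0, -1.0])
y = true_w0 + X @ true_w + 0.01 * rng.normal(size=200)

# Ordinary least squares; there is no hyperparameter to tune.
model = LinearRegression()
model.fit(X, y)

w0 = model.intercept_      # estimate of w_0
w = model.coef_            # estimates of w_1, ..., w_d
y_hat = model.predict(X)   # f(x; w) evaluated on every row of X
print("w0 =", w0, "w =", w)
```

The fitted intercept_ and coef_ attributes play the roles of $w_0$ and $w_1, \ldots, w_d$ in the formula.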


2.2 Classification Methods

2.2.1 Logistic Regression

Learning of the logistic regression model should use a gradient-descent algorithm that minimizes the cross-entropy loss [1]. This requires the step size parameter η to be specified. Try out a few values (<1) and choose one that leads to stable convergence. You may also decrease η gradually during the learning process to improve convergence. A common stopping criterion is to stop when the improvement between iterations no longer exceeds a small threshold or when the number of iterations reaches a prespecified maximum.

2.2.2 Single-hidden-layer Neural Networks

Neural network classifiers generalize logistic regression by introducing one or more hidden layers.

The learning algorithm for them is similar to that for logistic regression as described above.

For the single-hidden-layer neural network model, the number of hidden units H should be determined using cross validation. The generalization performance of the model is estimated for each candidate value of H ∈ {1, 2, ..., 10}. This is done by randomly sampling 80% of the training instances to train a classifier and then validating it on the remaining 20%. Five such random data splits are performed and the average over these five trials is used to estimate the generalization performance. The value H∗ that gives the best performance among the 10 choices of H can then be found. Subsequently, a neural network classifier with H∗ hidden units in a single layer is trained from scratch using all the training instances available. In addition, if you wish, you may learn to use the more powerful model_selection submodule in scikit-learn to perform grid search for hyperparameter tuning. Since the solution found may depend on the randomly chosen initial weight values, you may repeat each setting multiple times and report the average classification accuracy.
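The procedure above can be sketched with ShuffleSplit and cross_val_score from the model_selection submodule. This is a sketch under stated assumptions, not a reference solution: the synthetic data, the solver settings (learning_rate_init, max_iter), and the names mean_accuracy and best_H are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic training data (an assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)

# Five random 80%/20% train/validation splits, identical for every H.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)


def make_net(n_hidden):
    """Single-hidden-layer classifier trained by stochastic gradient descent."""
    return MLPClassifier(hidden_layer_sizes=(n_hidden,), solver="sgd",
                         learning_rate_init=0.1, max_iter=300, random_state=0)


# Average validation accuracy over the five splits for each candidate H.
mean_accuracy = {}
for H in range(1, 11):
    scores = cross_val_score(make_net(H), X_train, y_train, cv=splitter)
    mean_accuracy[H] = scores.mean()

best_H = max(mean_accuracy, key=mean_accuracy.get)

# Retrain from scratch on all training instances with the selected H.
final_net = make_net(best_H).fit(X_train, y_train)
print("mean validation accuracy per H:", mean_accuracy)
print("selected H:", best_H)
```

GridSearchCV with cv=splitter would carry out the same search and the final refit in a single object; the explicit loop is used here only to keep each step visible.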

2.3 Empirical Study

You will use three data sets, each of which supports both a binary classification task and a regression task. They are available as a ZIP file (datasets.zip). The following table shows the number of features, number of training examples, and number of test examples for each data set.

Data set   #features   #train   #test
fifa              36    13191    4397
finance           26     2754     918
orbits            12     9642    3215

When you load each .npz data file, you will find six NumPy arrays:

train_X    classification_train_Y    regression_train_Y
test_X     classification_test_Y     regression_test_Y

[1] For simplicity, you are not required to add regularization terms to the loss functions, though you may do so if you wish.


Each row of X stores the features of one example, and the corresponding row of Y stores its label: a class label (0 or 1) for classification or a regression target (between 0 and 1) for regression. As always, the test-set labels should not be used for training but only for measuring the accuracy on the test data.
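Reading such a file with NumPy might look like the sketch below. So that it runs without datasets.zip, it first writes a tiny stand-in file; the file name toy.npz and the array contents are invented, and the underscore-separated array names are an assumption about how the arrays are keyed. Printing data.files on a real file shows the actual names.

```python
import os
import tempfile
import numpy as np

# Tiny stand-in arrays (assumed names and shapes, for illustration only).
rng = np.random.default_rng(0)
toy_arrays = {
    "train_X": rng.normal(size=(8, 3)),
    "classification_train_Y": rng.integers(0, 2, size=8),
    "regression_train_Y": rng.uniform(size=8),
    "test_X": rng.normal(size=(4, 3)),
    "classification_test_Y": rng.integers(0, 2, size=4),
    "regression_test_Y": rng.uniform(size=4),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.npz")
    np.savez(path, **toy_arrays)

    # np.load on an .npz file returns a dict-like object keyed by array name.
    with np.load(path) as data:
        array_names = sorted(data.files)
        train_X = data["train_X"]
        classification_train_Y = data["classification_train_Y"]
        regression_train_Y = data["regression_train_Y"]

print(array_names)
print(train_X.shape, classification_train_Y.shape, regression_train_Y.shape)
```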

For each of the three data sets, you will evaluate the following methods with respect to the regression and classification accuracy on the training set and the test set separately [2]:

• Linear regression

• Logistic regression

• Neural network with H∗ hidden units (H∗ determined by cross validation)

You are also expected to report the time required by each method to complete the task, excluding the time needed for loading the data files. For the linear regression model, you are required to compute the squared error $(f(x; w) - y)^2$ for each data point in the test set and then plot the distribution of the squared error values as a histogram. For the logistic regression model, you are required to visualize the classification results to depict the performance on both the training and test sets. For the neural network model, you should report the performance of each value of H ∈ {1, 2, ..., 10} in the cross validation procedure for determining the best value H∗. Furthermore, remember to report the best (i.e., lowest) loss of the neural network model on both the training and test sets before the model overfits. When reporting the results of the neural network model, you are required to visualize not only the classification results on the training set and test set after training, but also the change in performance on the training set and validation set during training.
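One way to record the change in performance during training is to advance the network one epoch at a time with partial_fit and evaluate the cross-entropy loss after each epoch. The sketch below assumes synthetic data, a hypothetical hidden-layer size of 5, and an arbitrary budget of 100 epochs; the epoch with the lowest validation loss is taken as a simple marker of where overfitting begins.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data with an 80%/20% train/validation split (assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(5,), solver="sgd",
                    learning_rate_init=0.05, random_state=0)

train_loss, val_loss = [], []
for epoch in range(100):
    # Each partial_fit call makes one pass over the training data.
    net.partial_fit(X_fit, y_fit, classes=[0, 1])
    train_loss.append(log_loss(y_fit, net.predict_proba(X_fit), labels=[0, 1]))
    val_loss.append(log_loss(y_val, net.predict_proba(X_val), labels=[0, 1]))

# Epoch (0-based) at which the validation loss was lowest.
best_epoch = int(np.argmin(val_loss))
print("best epoch:", best_epoch,
      "train loss:", train_loss[best_epoch],
      "validation loss:", val_loss[best_epoch])
```

The two lists can be passed directly to a plotting library to draw the training and validation curves against the epoch index.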

Your programs should be written in such a way that the TA can run them easily to verify the results you report.
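Wrapping each measurement in a small function with a fixed random seed is one way to make reported numbers easy to re-run. The sketch below times a linear regression fit (data loading excluded) and bins the per-point squared errors on a test split. The function name run_linear_regression, the synthetic data, and the choice of 20 bins are hypothetical.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression


def run_linear_regression(train_X, train_Y, test_X, test_Y, n_bins=20):
    """Fit the model, time the fit, and bin the test-set squared errors."""
    start = time.perf_counter()
    model = LinearRegression().fit(train_X, train_Y)
    elapsed_seconds = time.perf_counter() - start

    # Squared error (f(x; w) - y)^2 for every test point.
    squared_errors = (model.predict(test_X) - test_Y) ** 2
    counts, bin_edges = np.histogram(squared_errors, bins=n_bins)
    return elapsed_seconds, squared_errors, counts, bin_edges


# Synthetic stand-in data with a fixed seed (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 0.5 + X @ np.array([0.2, -0.1, 0.05]) + 0.05 * rng.normal(size=500)

elapsed, sq_err, counts, edges = run_linear_regression(
    X[:400], y[:400], X[400:], y[400:])
print(f"fit time: {elapsed:.4f} s")
print("histogram counts:", counts)
```

The counts and bin edges describe the same histogram that a plotting call on squared_errors would draw.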

2.4 Report Writing

In your report, you are expected to present the parameter settings and the experiment results. Besides reporting the accuracy (for both training and test data) in numbers, graphical aids should also be used to analyze the performance of different methods visually. Note that you may not score high if you fail to provide analysis and visualization of your experiment results. Some utilities in scikit-learn such as auc and confusion_matrix are recommended for reporting the experiment results. Report the time required by each method to complete the task in seconds.
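A minimal sketch of the two utilities mentioned above, using a handful of made-up labels and scores (the values and the 0.5 decision threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])
y_pred = (y_score >= 0.5).astype(int)   # hard labels at threshold 0.5

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# auc integrates the ROC curve given its x (fpr) and y (tpr) coordinates.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(cm)
print("AUC:", roc_auc)
```

Note that auc takes curve coordinates rather than labels and scores, which is why roc_curve is called first.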

3 Some Programming Tips

As always, good programming practices should be applied when coding your program. Below are some common ones, though the list is by no means complete:

• Using functions to structure the program clearly

• Using meaningful variable and function names to improve readability

• Using indentation

• Using consistent styles

• Including concise but informative comments

[2] You may also try to use single-hidden-layer neural networks for the regression tasks, but this is not required for this assignment. Please note that the squared loss should be used for regression tasks.

For scikit-learn in particular, you are recommended to take full advantage of the built-in classes, which can keep your program both short and efficient. Proper use of implementation tricks often leads to speedup by orders of magnitude. Please be careful to choose the built-in models that are suitable for your tasks, e.g., sklearn.linear_model.LogisticRegression is not a correct choice for your second task since it does not use gradient descent.

4 Assignment Submission

Assignment submission should only be done electronically using the Course Assignment Submission System (CASS):

https://cssystem.cse.ust.hk/UGuides/cass/student.html

There should be two files in your submission, named according to the following convention:

1. Report (with filename report): preferably in PDF format.

2. Source code and a README file (with filename code): all necessary code for running your program as well as a brief user guide for the TA to run the programs easily to verify your results, all compressed into a single ZIP or RAR file. To keep the file size small, the data should not be submitted.

When multiple versions with the same filename are submitted, only the latest version according to the timestamp will be used for grading. Files not adhering to the naming convention above will be ignored.

5 Grading Scheme

This programming assignment counts towards 12% of your final course grade. Note that the plus sign (+) in the table below indicates that reporting without providing the corresponding code will receive zero points. The maximum scores for different tasks are as follows:


Grading scheme                                                        Code (60)   Report (+40)

Empirical study on linear regression
– Build the linear regression model                                   2
– Compute the $R^2$ score of the linear regression model on
  both the training and test sets                                     3           +2
– Depict a histogram of the squared errors of the data points
  in the test set of the linear regression model                      10          +3

Empirical study on logistic regression
– Build the logistic regression model by adopting the gradient
  descent optimization algorithm, and present the model settings      5           +2
– Compute the accuracy of the logistic regression model on
  both the training and test sets                                     5           +3
– Record and visualize the experiment results of the logistic
  regression model on both the training and test sets                 10          +3

Empirical study on neural network model
– Build the neural network model by adopting the gradient
  descent optimization algorithm, and present the model settings      5           +2
– Report the parameter tuning results of the neural network
  model using cross validation                                        5           +4
– Compute the best (i.e., lowest) loss of the neural network model
  on both the training and test sets before the model is overfitted   5           +3
– Record and visualize the experiment results of the neural
  network model, including performance change over time               10          +3

Writing report
– Present the computing environment for this assignment                           +2
– Present the time required by each method to complete the task                   +3
– Compare and analyze the performance of all the regression
  and classification methods involved                                             +10

Late submissions will be accepted but with a penalty. The late penalty is a deduction of one point (out of a maximum of 100 points) for every minute late after 11:59pm. Being late for a fraction of a minute is considered a full minute. For example, two points will be deducted if the submission time is 00:00:34.

6 Academic Integrity

Please read carefully the relevant web pages linked from the course website.

While you may discuss general ideas about the assignment with your classmates, your submission should be based on your own independent effort. If you seek help from any person or reference source, you should state it clearly in your submission. Failure to do so is considered plagiarism, which will lead to appropriate disciplinary action.
