1 Introduction

Assignment in this course can consist of two parts.

1. Programming: The goal of the programming homeworks in this course is to learn about how machine learning algorithms work through hands on experience. In each homework assignment you will both implement an algorithm and run experiments with it. You will also experiment with these algorithms in a Python Notebook to learn more about how they work.

2. Analytical questions: These questions will ask you to consider questions related to the topics covered by the assignment. You will be able to answer these questions without relying on your programming.

Assignments are worth various points. Typical assignments are worth 100 each, with point totals indicated in the assignment. Each assignment will contain a version number at the top. While we try to ensure every homework is perfect when we release it, errors do happen. When we correct these, we’ll update the version number, post a new PDF and announce the change. Each homework starts at version 1.0.

2 Data

The ﬁrst part of the semester will focus on supervised classiﬁcation. We consider several real world binary classiﬁcation datasets taken from a range of applications. Each dataset is in the same format (described below) and contains a train, development and test ﬁle. You will train your algorithm on the train ﬁle and use the development set to test that your algorithm works. The test ﬁle contains unlabeled examples that we will use to test your algorithm. It is a very good idea to run on the test data just to make sure your code doesn’t crash. You’d be surprised how often this happens.

2.1 Biology

Biological research produces large amounts of data to analyze. Applications of machine learning to biology include ﬁnding regions of DNA that encode for proteins, classiﬁcation of gene expression data and inferring regulatory networks from mRNA and proteomic data.

Fall 2018 CS 475 Machine Learning: Homework 1 2

Our biology task of characterizing gene splice junction sequences comes from molecular biology, a ﬁeld interested in the relationships of DNA, RNA, and proteins. Splice junctions are points on a sequence at which “superﬂuous” RNA is removed before the process of protein creation in higher organisms. Exons are nucleotide sequences that are retained after splicing while introns are spliced out. The goal of this prediction task is to recognize DNA sequences that contain boundaries between exons and introns. Sequences contain exon/intron (EI) boundaries, intron/exon (IE) boundaries, or do not contain splice examples. For a binary task, you will classify sequences as either EI boundaries (label 1) or non-splice sequences (label 0). Each learning instance contains a 60 base pair sequence (ex. ACGT), with some ambiguous slots. Features encode which base pair occurs at each position of the sequence.

2.2 Finance

Finance is a data rich ﬁeld that employs numerous statistical methods for modeling and prediction, including the modeling of ﬁnancial systems and portfolios.1 Our ﬁnancial task is to predict which Australian credit card applications should be accepted (label 1) or rejected (label 0). Each example represents a credit card application, where all values and attributes have been anonymized for conﬁdentiality. Features are a mix of continuous and discrete attributes and discrete attributes have been binarized.

2.3 NLP

Natural language processing studies the processing and understanding of human languages. Machine learning is widely used in NLP tasks, including document understanding, information extraction, machine translation and document classiﬁcation. Our NLP task is sentiment classiﬁcation. Each example is a product review taken from Amazon kitchen appliance reviews. The review is either positive (label 1) or negative (label 0) towards the product. Reviews are represented as uni-gram and bi-grams; each one and two word phrase is extracted as a feature.

2.4 Speech

Statistical speech processing has its roots in the 1980s and has been the focus of machine learning research for decades. The area deals with all aspects of processing speech signals, including speech transcription, speaker identiﬁcation and speech information retrieval. Our speech task is spoken letter identiﬁcation. Each example comes from a speaker saying one of the twenty-six letters of English alphabet. Our goal is to predict which letter was spoken. The data was collected by asking 150 subjects to speak each letter of the alphabet twice. Each spoken utterance is represented as a collection of 617 real valued attributes scaled to be between -1.0 and 1.0. Features include spectral coeﬃcients; contour features, sonorant features, pre-sonorant features, and post-sonorant features. The binary task is to distinguish between the letter M (label 0) and N (label 1).

1For an overview of such applications, see the proceedings of the 2005 NIPS workshop on machine learning in ﬁnance. http://www.icsi.berkeley.edu/~moody/MLFinance2005.htm

Fall 2018 CS 475 Machine Learning: Homework 1 3

2.5 Vision

Computer vision processes and analyzes images and videos and it is one of the fundamental areas of robotics. Machine learning applications include identifying objects in images, segmenting video and understanding scenes in ﬁlm. Our vision task is image segmentation. In image segmentation, an image is divided into segments are labeled according to content. The images in our data have been divided into 3×3 regions. Each example is a region and features include the centroids of parts of the image, pixels in a region, contrast, intensity, color, saturation and hue. The goal is to identify the primary element in the image as either a brickface, sky, foliage, cement, window, path or grass. In the binary task, you will distinguish segments of foliage (label 0) from grass (label 1).

2.6 Synthetic Data

When developing algorithms it is often helpful to consider data with known properties. We typically create synthetic data for this purpose. To help test your algorithms, we are providing two synthetic datasets. These data are to help development.

2.6.1 Easy

The easy data is labeled using a trivial classiﬁcation function. Any reasonable learning algorithm should achieve near ﬂawless accuracy. Each example is a 10 dimensional instance drawn from a multi-variate Gaussian distribution with 0 mean and a diagonal identity covariance matrix. Each example is labeled according to the presence one of 6 features; the remaining features are noise.

2.6.2 Hard

Examples in this data are randomly labeled. Since there is no pattern, no learning algorithm should achieve accuracy signiﬁcantly diﬀerent from random guessing (50%). Data is generated in an identical manner as Easy except there are 94 noisy features.

3 Programming (50 points)

In this assignment you will implement two classiﬁers: perceptron and logistic regression. We have provided Python code to serve as a testbed for your algorithms. We will reuse this testbed in future assignments as well. You will ﬁll in the details. Search for comments that begin with TODO; these sections need to be written. You may change the internal code as you see ﬁt but do not change the names of any of the ﬁles or command-line arguments that have already been provided. Other than the given ﬁlenames and command-line arguments, you are free to change what you wish: you can modify internal code, add internal code, add ﬁles, and/or add new command-line arguments (in which case you should include appropriate defaults as we won’t add these arguments when we run your code). In addition to this, we have provided a Jupyter notebook template for some follow-up analysis, which also looks at linear regression and why it is not appropriate for binary classiﬁcation. The results of this will be turned in as a PDF on Gradescope. See details at the end of the assignment for how to submit your work.

Fall 2018 CS 475 Machine Learning: Homework 1 4

3.1 Python Libraries

We will be using Python 3.6.5. We are not using Python 2, and will not accept assignments written for this version. We recommend using a recent release of Python 3.6.x, but anything in this line (e.g. 3.x) should be ﬁne. For each assignment, we will tell you which Python libraries you may use. We will do this by providing a requirements.txt ﬁle. We strongly recommend using a virtual environment to ensure compliance with the permitted libraries. By strongly, we mean that unless you have a very good reason not to, and you really know what you are doing, you should use a virtual environment. You can also use anaconda environments, which achieve the same goal. The point is that you should ensure you are only using libraries available in the basic Python library or the requirements.txt ﬁle.

3.2 Virtual Environments Virtual environments are easy to set up and let you work within an isolated Python environment. In short, you can create a directory that corresponds to a speciﬁc Python version with speciﬁc packages, and once you activate that environment, you are shielded from the various Python / package versions that may already reside elsewhere on your system. Here is an example:

# Create a new virtual environment. python3 -m venv python3-hw1 # Activate the virtual environment. source python3-hw1/bin/activate # Install packages as specified in requirements.txt. pip3 install -r requirements.txt # Optional: Deactivate the virtual environment, returning to your system’s setup. deactivate When we run your code, we will use a virtual environment with only the libraries in requirements.txt. If you use a library not included in this ﬁle, your code will fail when we run it, leading to a large loss of points. By setting up a virtual environment, you can ensure that you do not mistakenly include other libraries, which will in turn help ensure that your code runs on our systems. Make sure you are using the correct requirements.txt ﬁle from the current assignment. We may add new libraries to the ﬁle in subsequent assignments, or even remove a library (less likely). If you are using the wrong assignments requirements.txt ﬁle, it may not run when we grade it. For this reason, we suggest creating a new virtual environment for each assignment. It may happen that you ﬁnd yourself in need of a library not included in the requirements.txt for the assignment. You may request that a library be added by posting to Piazza. This may be useful when there is some helpful functionality in another library that we omitted. However, we are unlikely to include a library if it either solves a major part of the assignment, or includes functionality that isn’t really necessary. In this and future assignments we will allow you to use numpy and scipy. Your Jupyter notebook may have diﬀerent libraries allowed than code you write in the provided framework. We will let you know when we allow additional libraries to be included in the Jupyter notebook.

3.3 How to Run the Provided Framework

The framework operates in two modes: train and test. Both stages are in the main method of classify.py.

Fall 2018 CS 475 Machine Learning: Homework 1 5

3.3.1 Train Mode The usage for train mode is

python3 classify.py –mode train –algorithm algorithm_name –model-file model_file –data train_file

The mode option indicates which mode to run (train or test). The algorithm option indicates which training algorithm to use. Each assignment will specify the string argument for an algorithm. The data option indicates the data ﬁle to load. Finally, the model-file option speciﬁes where to save the trained model.

3.3.2 Test Mode The test mode is run in a similar manner:

python3 classify.py –mode test –model-file model_file –data test_file –predictions-file predictions_file

The model file is loaded and run on the data. Results are saved to the predictions file.

3.3.3 Examples As an example, the following trains a perceptron classiﬁer on the speech training data:

python3 classify.py –mode train –algorithm perceptron –model-file speech.perceptron.model \ –data speech.train

To run the trained model on development data:

python3 classify.py –mode test –model-file speech.perceptron.model –data speech.dev \ –predictions-file speech.dev.predictions

As we add new algorithms we will also add command line ﬂags using the argparse library to specify algorithmic parameters. These will be speciﬁed in each assignment.

3.4 Data Formats The data are provided in what is known as SVM-light format. Each line contains a single example:

0 1:-0.2970 2:0.2092 5:0.3348 9:0.3892 25:0.7532 78:0.7280

The ﬁrst entry on the line is the label. The label can be an integer (0/1 for binary classiﬁcation) or a real valued number (for regression.) The classiﬁcation label of −1 indicates unlabeled. Subsequent entries on the line are features. The entry 25:0.7532 means that feature 25 has value 0.7532. Features are 1-indexed. Model predictions are saved as one predicted label per line in the same order as the input data. The code that generates these predictions is provided in the library. The script compute accuracy.py can be used to evaluate the accuracy of your predictions for classiﬁcation:

python3 compute_accuracy.py data_file predictions_file

We provide this script since it is exactly how we will evaluate your output. Make sure that your algorithm is outputting labels as they appear in the input ﬁles. If you use a diﬀerent internal representation of your labels, make sure the output matches what’s in the data ﬁles. The above script will do this for you, as you’ll get low accuracy if you write the wrong labels.

Fall 2018 CS 475 Machine Learning: Homework 1 6

3.5 Existing Components

The foundations of the learning framework have been provided for you. You will need to complete this library by ﬁlling in code where you see a TODO comment. You are free to make changes to the code as needed provided you do not change the behavior of the command lines described above. We emphasize this point: do not change the existing command line ﬂags, existing ﬁlenames, or algorithm names. We use these command lines to test your code. If you change their behavior, we cannot test your code. The code we have provided is fairly compact, and you should spend some time to familiarize yourself with it. Here is a short summary to get you started: • data.py – This contains the load data function, which parses a given data ﬁle and returns features and labels. The features are stored as a sparse matrix of ﬂoats (and in particular as a scipy.sparse.csr matrix of ﬂoats), which has num examples rows and num features columns. The labels are stored as a dense 1-D array of integers with num examples elements. • classify.py – This is the main testbed to be run from the command line. It takes care of parsing command line arguments, entering train/test mode, saving models/predictions, etc. Once again, do not change the names of existing command-line arguments. • models.py – This contains a Model class which you should extend. Models have (in the very least) a fit method, for ﬁtting the model to data, and a predict method, which computes predictions from features. You are free to add other methods as necessary. Note that all predictions from your model must be 0 or 1; if you use other intermediate values for internal computations, then they must be converted before they are returned. • compute accuracy.py – This is a script which simply compares the true labels from a data ﬁle (e.g., bio.dev) to the predictions that were saved by running classify.py (e.g., bio.dev.perceptron.predictions). • run on all datasets.py – This is not necessarily needed, but is included simply to make it easier for you to test your algorithms on all datasets, and to make sure that your algorithms even run on the test sets. Inside this script you can specify the main data directory, which should contain all of the *.train, *.dev, *.test ﬁles, along with an output directory (for models and predictions). The script loops through the datasets to train and test on each. Feel free to modify this script as needed; by default, it assumes the data directory is ./datasets and that the desired output directory is ./output.

We have also included a toy model, which we call Useless, for your reference. This is a binary classiﬁer which forms predictions in a very simple and naive way. At training time, it simply memorizes the ﬁrst example it sees (both its features and label). At prediction time, it forms the dot product between all of the examples and the memorized training example, and forms a prediction based on this dot product. (If the dot product is greater than 0, then it returns the same label as the memorized example; otherwise, the opposite label is returned.) Do not think too much about this toy model – it’s simply included as a reference for how the fit and predict methods can be deﬁned. In fact you can go

Fall 2018 CS 475 Machine Learning: Homework 1 7

through the whole training/testing process with this toy model on the provided datasets, and we recommend that you do so.

3.6 Perceptron (15 points)

Here you will implement the Perceptron algorithm for binary classiﬁcation. The perceptron is a mistake-driven online learning algorithm. It takes as input a vector of real-valued inputs x and makes a prediction ˆ y ∈{−1,+1} (for this assignment we consider only bi-nary labels). This simpliﬁes computations internally, but we again emphasize that your model’s predict method must return labels that are 0 or 1. Predictions are made using a linear classiﬁer: ˆ y = sign(w·x). The term w·x is the dot productof w and x computed asPi xiwi. Updates to w are made only when a prediction is incorrect: ˆ y 6= y. The new weight vector w0 is a function of the current weight vector wand example x, y. The weight vector is updated so as to improve the prediction on the current example. Note that Perceptron naturally handles continuous and binary features, so no special processing is needed. The basic structure of the algorithm is:

1. Initialize w to 0, set learning rate η and number of iterations I

2. For each training iteration k = 1…I:

(a) For each example i = 1…N: i. Receive an example xi

ii. Predict the label ˆ yi = sign(w·xi) =?1 if w·xi ≥ 0 −1 otherwise iii. If ˆ yi 6= yi, make an update to w: w0 = w + ηyixi Note that there is no bias term in this version and you should not include one in your solution. Also observe the deﬁnition of “sign” to account for 0 values. Once again, while sign returns −1 and 1, the outputs from your predict method must be the actual labels, which are 0 or 1.

3.7 Logistic Regression (20 points)

The logistic regression model is used to model binary classiﬁcation data. Logistic regression is a special case of generalized linear regression where the labels Y are modeled as a linear combination of the data X, but in a transformed space speciﬁed by g, sometimes called the “link function”: E[y | x] = g(wx + ?) (1) where ? is a noise term, usually taken to be Gaussian.

This “link function” allows you to model inherently non-linear data with a linear model. In the case of logistic regression, the link function is the logistic function:

g(z) =

1 1 + e−z

(2)

Fall 2018 CS 475 Machine Learning: Homework 1 8

3.7.1 Gradient Descent

In this assignment, we will solve for the parameters w in our logistic regression model using gradient descent to ﬁnd the maximum likelihood estimate. Gradient descent (GD) is an optimization technique that is both very simple and powerful. It works by taking the gradient of the objective function and taking steps in directions where the gradient is negative, which decreases the objective function2.

3.7.2 Maximum Conditional Likelihood

Since we seek to maximize the objective, we will use gradient ascent. We begin by writing the conditional likelihood: P(Y | w,X) = n Y i=1 p(yi | w,xi) (3) Since yi ∈{0,1}, we can write the conditional probability inside the product as: P(Y | w,X) = n Y i=1 p(yi = 1 | w,xi)yi ×(p(yi = 0 | w,xi))1−yi (4) Note that one of these terms in the product will have an exponent of 0, and will evaluate to 1.

For ease of math and computation, we will take the log:

`(Y,X,w) = logP(Y | w,X) =

n X i=1

yi log(p(yi = 1 | w,xi)) + (1−yi)log(p(yi = 0 | w,xi)) (5) Plug in our logistic function for the probability that y is 1:

`(Y,X,w) =

n X i=1

yi log(g(wxi)) + (1−yi) log(1−g(wxi))) (6)

Recall that the link function, g, is the logistic function. It has the nice property 1−g(z) = g(−z).

`(Y,X,w) =

n X i=1

yi log(g(wxi)) + (1−yi) log(g(−wxi)) (7)

We can now use the chain rule to take the gradient with respect to w:

∇`(Y,X,w) =

n X i=1

yi

1 g(wxi)∇g(wxi) + (1−yi)

1 g(−wxi)∇g(−wxi) (8)

Since ∂ ∂zg(z) = g(z)(1−g(z)): ∇`(Y,X,w) = n X i=1 yi

1 g(wxi)

g(wxi)(1−g(wxi))∇wxi (9)

+(1−yi)

1 g(−wxi)

g(−wxi)(1−g(−wxi))∇(−wxi) (10)

2Gradient descent decreases the objective function if the gradient (ﬁrst order approximation) is locally accurate, see Taylor expansion.

Fall 2018 CS 475 Machine Learning: Homework 1 9

Simplify again using 1−g(z) = g(−z) and cancel terms ∇`(Y,X,w) = n X i=1 yig(−wxi)∇wxi + (1−yi)g(wxi)∇(−wxi) (11) You can now get the partial derivatives (components of the gradient) out of this gradient function by: ∂

∂wj

`(Y,X,w) =

n X i=1

yig(−wxi) xij + (1−yi)g(wxi)(−xij) (12)

This is the equation that you will use to calculate your updates. (If you’d like to, you can optimize this further, but only if you desire. Here the main computational gain comes from using this vectorized form; it is easily exploited using NumPy and will be much faster than manually looping through the elements of w to be updated. For example, see Python for Data and StackExchange.)

3.7.3 Oﬀset Feature

None of the math above mentions an oﬀset feature (bias feature), a w0, that corresponds to a xi,0 that is always 1. It turns out that we don’t need this if our data is centered. By centered we mean that E[y] = 0. While this may or may not be true, for this assignment you should assume that your data is centered. Do not include another feature that is always 1 (x0) or weight (w0) for it.

3.8 Jupyter Notebook – Linear Regression (15 points)

In addition to this, use the python Jupyter notebook template provided to do some additional analysis. You’ll need to additionally make sure Jupyter is installed in your virtual environment. You can install Jupyter using pip3. Jupyter lets you keep notebooks which contain code and outputs from code, including plots. See http://jupyter.readthedocs.io/en/latest/running.html. In short, you’ll run the notebook by running python jupyter notebook or jupyter notebook, and this will create a local web server on your machine. To use Jupyter, you can use a web browser and navigate to http://localhost:8888. Then, run python jupyter notebook.ipynb to launch the notebook, and follow the instructions there. This will provide some additional insight into the relationship between logistic regression and linear regression, as well as ensure your logistic regression is functioning properly.

3.8.1 Completing the Notebook

The notebook contains detailed directions for how to complete it. You will need to ﬁrst complete the logistic regression model before you can complete the notebook. You will be allowed to use sklearn in your Python notebook. See http://scikit-learn. org/stable/ for more details. We have noted in the Notebook which cells need to be completed to receive credit. We have added Markdown cells with the title “Question N: …”. The cell immediately following these produce plots that are the answer to each question. Highlight these plots when you submit your notebook on gradescope.

Fall 2018 CS 475 Machine Learning: Homework 1 10

3.9 Deliverables

You need to implement the Perceptron and Logistic Regression algorithms as well as complete the Jupyter notebook provided. To do this, see the TODO comments in the code, and feel free to modify / add code as necessary. However, once again, do not modify the ﬁlenames or command-line arguments that have already been provided. Your predictors will be selected by passing the string perceptron or logistic as the argument for the algorithm parameter.

3.10 Learning rate In both algorithms, you will need to specify a learning rate η, where 0 < η ≤ 1. Your default value for η should be 1. You must add a command line argument to allow this value to be adjusted via the command line. Add this command line option by adding the following code to the get args function in classify.py.

parser.add_argument(“–online-learning-rate”, type=float, help=”The learning rate for gradient based updates”, default=1.0)

Be sure to add the option name exactly as it appears above. You can then use the value read from the command line in your main method by referencing it as args.online learning rate. Note that the dashes have been replaced by underscores.

3.11 Number of training iterations

Usually we iterate multiple times over the data. This can improve performance by increasing the number of updates made. We will deﬁne the number of times each algorithm iterates over all of the data using the parameter online training iterations. You must deﬁne a new command line option for this parameter. Use a default value of 5 for this parameter. You can add this option by adding the following code to the get args function of classify.py.

parser.add_argument(“–online-training-iterations”, type=int, help=”The number of training iterations for online methods.”, default=5)

You can then use the value read from the command line in your main method by referencing it as args.online training iterations. During training, you should not change the order of examples. You must iterate over examples exactly as they appear in the data ﬁle, i.e. as provided by the data loader.

3.12 Grading Programming

The programming section of your assignment will be graded using an automated grading program. Your code will be run using the provided command line options, as well as other variations on these options (diﬀerent parameters, data sets, etc.) The grader will consider the following aspects of your code.

1. Exceptions: Does your code run without crashing?

2. Output: Some assignments will ask you to write some data to the console. Make sure you follow the provided output instructions exactly.

Fall 2018 CS 475 Machine Learning: Homework 1 11

3. Accuracy: If your code works correctly, then it should achieve a certain accuracy on each data set. While there are small diﬀerence that can arise, a correctly working implementation will get the right answer.

4. Speed/Memory: As far as grading is concerned, eﬃciency largely doesn’t matter, except where lack of eﬃciency severely slows your code (so slow that we assume it is broken) or the lack of eﬃciency demonstrates a lack of understanding of the algorithm. For example, if your code runs in two minutes and everyone else runs in 2 seconds, you’ll lose points. Alternatively, if you require 2 gigs of memory, and everyone else needs 10 MB, you’ll lose points. In general, this happens not because you did not optimize your code, but when you’ve implemented something incorrectly.

The Python notebook will be graded manually based on including the requested outputs. See the notebook for details.

3.13 Code Readability and Style

In general, you will not be graded for code style. However, your code should be readable, which means minimal comments and clear naming / organization. If your code works perfectly then you will get full credit. However, if it does not we will look at your code to determine how to allocate partial credit. If we cannot read your code or understand it, then it is very diﬃcult to assign partial credit. Therefore, it is in your own interests to make sure that your code is reasonably readable and clear.

3.14 Code Structure

Your code must support the command line options and the example commands listed in the assignment. Aside from this, you are free to change the internal structure of the code, write new classes, change methods, add exception handling, etc. However, once again, do not change the names of the ﬁles or command-line arguments that have been provided. We suggest you remember the need for clarity in your code organization.

3.15 Knowing Your Code Works

How do you know your code really works? That is a very diﬃcult problem to solve. Here are a few tips:

1. Check results on easy and hard. The majority classiﬁer is simple; you can count labels and then check its outputs to make sure it did the right thing. In the case of the perceptron, you should get close to 100% on easy and close to 50% on hard.

2. Use Piazza. While you cannot share code, you can share results. We encourage you to post your results on dev data for your diﬀerent algorithms. A common result will quickly emerge that you can measure against.

3. Output intermediate steps. Looking at ﬁnal predictions that are wrong tells you little. Instead, print output as you go and check it to make sure it looks right. This can also be helpful when sharing information on Piazza.

4. Debug. Find a Python debugger that you like and use it. This can be very helpful.

Fall 2018 CS 475 Machine Learning: Homework 1 12

3.16 Debugging

The most common question we receive is “how do I debug my code?” The truth is that machine learning algorithms are very hard to debug because the behavior of the algorithm is unknown. In these assignments, you won’t know ahead of time what accuracy is expected for your algorithm on a dataset. This is the reality of machine learning development, though in this class you have the advantage of your classmates, who may post the output of their code to the bulletin board. While debugging machine learning code is therefore harder, the same principles of debugging apply. Write tests for diﬀerent parts of your code to make sure it works as expected. Test it out on the easy datasets to verify it works and, when it doesn’t, debug those datasets carefully. Work out on paper the correct answer and make sure your code matches. Don’t be afraid of writing your own data for speciﬁc algorithms as needed to test out diﬀerent methods. This process is part of learning machine learning algorithms and a reality of developing machine learning software.

4 Analytical (50 Points)

In addition to completing the analytical questions, your assignment for this homework is to learn Latex. All homework writeups must be PDFs compiled from Latex. Why learn latex?

1. It is incredibly useful for writing mathematical expressions.

2. It makes references simple.

3. Many academic papers are written in latex.

The list goes on. Additionally, it makes your assignments much easier to read than if you try to scan them in or complete them in Word. We realize learning latex can be daunting. Fear not. There are many tutorials on the Web to help you learn. We recommend using pdﬂatex. It’s available for nearly every operating system. Additionally, we have provided you with the tex source for this PDF, which means you can start your writeup by erasing much of the content of this writeup and ﬁlling in your answers. You can even copy and paste the few mathematical expressions in this assignment for your convenience. As the semester progresses, you’ll no doubt become more familiar with latex, and even begin to appreciate using it. Be sure to check out this cool latex tool for ﬁnding symbols. It uses machine learning! http://detexify.kirelabs.org/classify.html For each homework assignment we will provide you with a Latex template. You must use the template. The template contains detailed directions about how to use it.

1) Unsupervised and Supervised Learning (9 points) Suppose you’re building a system to categorize users of a music service by their favorite music genre. Your dataset consists of over a million user proﬁles (including what music they listened to), but of that, only 1,000 user proﬁles have been labeled with the user’s favorite music genre.

1. What are the advantages and disadvantages of using an Unsupervised approach to this problem? How do you plan on measuring how well your approach works?

Fall 2018 CS 475 Machine Learning: Homework 1 13

2. What are the advantages and disadvantages of using an Supervised approach to this problem? How do you plan on measuring how well your approach works?

3. How might a semi-supervised approach provide a balance between the Supervised and Unsupervised approaches?

2) Comparing Algorithms (12 points) We have covered three main algorithms so far: Linear Regression, Logistic Regression, and Perceptron. Consider each of the following statements. For each, consider if the statement is true for one or more of the three algorithms listed. If true, check the box corresponding to the algorithm for which the statement is true. Alternatively, check the box corresponding to “None” if none of the three algorithms applies. For example, 1. This algorithm begins with the letter “L”. 4Linear Regression 4Logistic Regression 2Perceptron 2None2. This algorithm begins with the letter “Q” 2Linear Regression 2Logistic Regression 2Perceptron 4None To repeat, each statement may apply to multiple algorithms (please mark all that the statement applies to) OR none of the algorithms. Please answer each of the following statements for each of the three algorithms. 1. Has no closed-form solution for calculating parameters w.

2. Guaranteed to ﬁnd the optimal solution from the hypothesis class given the input.

3. The maximum likelihood estimator (MLE) solution is equal to minimizing the error.

4. Changes are only made to parameters after the algorithm makes an incorrect prediction.

3) Regularization (9 points) In class, we’ve discussed adding a regularization term to our objective in the following form;

ED(w) = λ

1 2

M X j=1

|wj|q

where the regularization setting is speciﬁed by q. In class, we’ve discussed q = 2 and q = 1. Now, also consider alternative settings – speciﬁcally q = ∞, q = 0, For each of the following questions, answer True or False (in box a). If True, justify your answer (in box b). If False, list the regularization setting for which the statement is correct (also in box b).

Fall 2018 CS 475 Machine Learning: Homework 1 14

1. The regularizer setting q = 0 penalizes the total number of non-zero entries in the weight vector.

2. The regularizer setting q = 1 produces a sparse set of weights.

3. The regularizer setting q = 2 is a penalty on the largest entry in the weight vector.

4) Learning Rates for Logistic Regression (10 points) In the Gradient Descent method for Logistic Regression as described in lectures, a learning rate parameter γ is supplied. The selection of γ is an important component of this learning algorithm. For each of the following strategies, note whether it is a “good” strategy or not (in box a) for selecting γ in terms of achieving both convergence and speed, and justify your answer (box b).

1. Start with a large γ (e.g. γ = 0.1), reducing it towards a smaller value (e.g. γ = 0.01) after several iterations.

2. Start with a small γ, and maintain the rate through all iterations.

3. Instead of using a manually-speciﬁed γ, we use Newton’s method. This uses the second order Taylor approximation to ﬁnd the stationary point of the objective function. The update rule is instead wt+1 = wt −H−1l0(w) where H is the Hessian matrix: H = l00(wt) = ∂2l ∂w∂w

5) Linear Regression (10 points) Suppose you observe n data points (x1,y1),…,(xn,yn), where all x and y are scalars. For each of the following pairs of models and error functions, decide whether (a) it is a valid model-error function pair and (b) a closed-form solution exists for w. If it does, derive it and place the ﬁnal equation in the box c. Place the remainder of your derivation in box d. If it does not have a closed-form solution, justify your answer in box e.

1. Your model is ˆ y = wx and you want to minimize the error functionPi|yi − ˆ yi| 2. Your model is ˆ y = wx and you want to minimize the error functionPi(yi − ˆ yi)2k for some integer k. 3. Your model is ˆ y = wx and you want to minimize the error functionPip(yi − ˆ yi). 5 What to Submit

In each assignment you will submit three things.

1. Submit your code (.py ﬁles) to cs475.org as a zip ﬁle. Your code must be uploaded as code.zip with your code in the root directory. By ‘in the root directory,’ we mean that the zip should contain *.py at the root (./*.py) and not in any sort of substructure (for example hw1/*.py). One simple way to achieve this is to zip using the command line, where you include ﬁles directly (e.g., *.py) rather than specifying a folder (e.g., hw1):

Fall 2018 CS 475 Machine Learning: Homework 1 15

zip code.zip *.py

A common mistake is to use a program that automatically places your code in a subfolder. It is your job to make sure you have zipped your code correctly. We will run your code using the exact command lines described earlier, so make sure it works ahead of time, and make sure that it doesn’t crash when you run it on the test data. A common mistake is to change the command line ﬂags. If you do this, your code will not run. Remember to submit all of the source code, including what we have provided to you. We will include requirements.txt but nothing else.

2. Submit your writeup to gradescope.com. Your writeup must be compiled from latex and uploaded as a PDF. The writeup should contain all of the answers to the analytical questions asked in the assignment. Make sure to include your name in the writeup PDF and to use the provided latex template for your answers following the distributed template. You will submit this to the assignment called “Homework 1: Supervised Classiﬁers 1: Written”.

3. Submit your Jupyter notebook to gradescope.com. Create a PDF version of your notebook (File → Download As → PDF via LaTeX (.pdf)). Be sure you have run the entire notebook so we can see all of your output. You will submit this to the assignment called “Homework 1: Supervised Classiﬁers 1: Notebook”. When you submit on gradescope, you will indicate which output box corresponds to which questions.

Sale!