Homework 2 CSCE 633

Instructions for homework submission

a) Please write a brief report and include your code.

b) Create a single pdf and submit it on eCampus. Please do not submit .zip files or colab

notebooks.

c) Please start early.

d) The maximum grade for this homework, excluding bonus questions, is 10 points (out of 100

total for the class). There is 1 bonus point.

Question: Machine learning with Pokemon GO

Recent studies have found that novel mobile games can lead to increased physical activity. A

notable example is Pokemon Go, a mobile game that uses augmented reality to blend the Pokemon world with the real world and requires players to physically move around. Specifically,

in the following study, researchers have found that Pokemon Go leads to increased levels of

physical activity for the most engaged players!

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5174727/

In this problem, our goal is to predict the combat points of each pokemon in the 2017 Pokemon

Go mobile game. Each pokemon has its own unique attributes that can help predict its

combat points. These include:

1. Stamina

2. Attack value

3. Defense value

4. Capture rate

5. Flee rate

6. Spawn chance

7. Primary strength


Inside the "Homework 2" folder on Piazza you can find the data file (named "hw2 data.csv") that

will be used for our experiments. The rows of this file refer to the data samples (i.e., pokemon

samples), while the columns denote the name of the pokemon (column 1), its attributes (columns

2-8), and the combat point outcome (column 9). You can ignore column 1 for the rest of this

problem.

(i) (1 point) Data exploration: Which are categorical and which are numerical attributes

(columns 2-8) of this dataset?

(ii) (1 point) Data exploration: Plot 2-D scatter plots and compute Pearson's correlation coefficient between the numerical attributes and the outcome of interest. Which attributes

would be the most predictive of the outcome of combat points?

Note: Pearson's correlation coefficient is a measure of linear association between two variables. It ranges between -1 and 1, with values closer to 1 in absolute value indicating a higher degree of association

between a feature and the outcome. For more details, see this link: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient. You can use any available library to compute

this metric.
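
As a minimal sketch of the computation described in the note, the coefficient can be written directly from its definition and cross-checked against `numpy.corrcoef`. The array names and toy values below are assumptions made for illustration; they are not taken from the data file.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays, computed from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical toy values standing in for one attribute and the outcome.
attack = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
combat_points = np.array([105.0, 190.0, 310.0, 405.0, 490.0])

r = pearson_r(attack, combat_points)
r_numpy = np.corrcoef(attack, combat_points)[0, 1]  # library cross-check
print(round(r, 4), round(r_numpy, 4))
```

Applying the same function to every pair of numerical columns gives the feature-to-feature correlations as well.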

(iii) (1 point) Data exploration: Plot 2-D scatter plots and compute Pearson's correlation coefficient between the numerical attributes themselves. Which variables are the most

correlated to each other?

(iv) (1 point) Pre-processing of categorical variables: Categorical variables require special attention because usually they cannot be the input of regression models as they are. A

potential way to treat categorical variables is to simply convert each value of the variable to

a separate number. However, this might introduce non-existent ordinal relationships between the

values, which might not always be representative of the data (e.g., if we assign "1" to the value

"green" and "2" to the value "red", the regression algorithm will assume that "red" is greater

than "green," which is not necessarily the case). For this reason, we can use a "one hot encoding" to represent categorical variables. According to this, we will create a binary column for

each category of the categorical variable, which will take a value of 1 if the sample belongs to

that category, and 0 otherwise. For each categorical variable of the problem, count the number

of different values and implement the one hot encoding. For the remainder of the problem,

you will be working with the one hot encoding of the categorical variables.
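
The encoding described above can be sketched in a few lines of NumPy. This is only one possible approach, and the label values used here are hypothetical; the actual categories come from the data file.

```python
import numpy as np

def one_hot(values):
    """Return (categories, binary matrix) for a 1-D sequence of category labels."""
    # np.unique gives the sorted distinct labels and, for each sample,
    # the index of its label within that sorted list.
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    encoded = np.eye(len(categories), dtype=int)[codes]
    return categories, encoded

# Hypothetical labels used only for illustration.
colors = ["green", "red", "green", "blue"]
categories, encoded = one_hot(colors)
print(len(categories))  # number of different values
print(categories)       # sorted distinct labels
print(encoded)          # one row per sample, one column per category
```

Each row of `encoded` contains exactly one 1, in the column of the category that the sample belongs to.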

Note: You can find more information on different types of pre-processing categorical variables

in the following links:

https://pbpython.com/categorical-encoding.html

Coding Systems for Categorical Variables in Regression Analysis

(v) (2 points) Predicting combat points: The goal of this question is to predict the

combat points using the numerical attributes, as well as the categorical attributes that were

pre-processed with the one hot encoding process. Implement a linear regression model using

the ordinary least squares (OLS) solution. How many parameters does the model have? To

test your model, randomly split the data into 5 folds and use a 5-fold cross-validation. For each

fold compute the square root of the residual sum of squares error (RSS) between the actual and

predicted outcome variable. Also compute the average square root of the RSS over all folds.

Hint: You will build the data matrix $X \in \mathbb{R}^{N_{train} \times D}$, whose rows correspond to the training samples $x_1, \ldots, x_{N_{train}} \in \mathbb{R}^{D \times 1}$ and columns to the D features (including the constant 1 for the intercept):

$$X = \begin{bmatrix} 1, x_1^T \\ \vdots \\ 1, x_{N_{train}}^T \end{bmatrix} \in \mathbb{R}^{N_{train} \times D}.$$

Then use the ordinary least squares solution that we learned in class: $w^* = (X^T X)^{-1} X^T y$.
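
A minimal sketch of the closed-form solution from the hint, run on synthetic data (the data and the function names are assumptions for illustration). It uses `numpy.linalg.pinv` in place of an explicit inverse: the two coincide when $X^T X$ is invertible, and the pseudo-inverse still returns a least-squares solution when it is not, which can happen when an intercept column is combined with a full set of one hot columns.

```python
import numpy as np

def add_intercept(features):
    """Prepend the constant-1 column to an (N, d) feature array."""
    return np.column_stack([np.ones(len(features)), features])

def fit_ols(features, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y."""
    X = add_intercept(features)
    # pinv equals the inverse whenever X^T X is invertible.
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def predict(features, w):
    """Predicted outcome for each row of the feature array."""
    return add_intercept(features) @ w

# Synthetic data (illustrative only): y = 3 + 2*x1 - x2, no noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1]

w = fit_ols(features, y)
print(np.round(w, 6))  # intercept first, then one weight per feature
print(len(w))          # number of model parameters
```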

Note: You can use libraries for matrix operations and random sampling, but please implement

the linear regression algorithm, the 5-fold cross-validation process, and the RSS error computation.
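
The cross-validation loop and the error metric could be organized along the following lines. This is a sketch under assumptions: the data are synthetic, the helper names are hypothetical, and the design matrix is assumed to already contain the intercept column.

```python
import numpy as np

def sqrt_rss(y_true, y_pred):
    """Square root of the residual sum of squares."""
    return float(np.sqrt(np.sum((y_true - y_pred) ** 2)))

def make_folds(n_samples, n_folds=5, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def cross_validate(X, y, folds):
    """Fit OLS on all folds but one and score on the held-out fold."""
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        X_tr, y_tr = X[train_idx], y[train_idx]
        w = np.linalg.pinv(X_tr.T @ X_tr) @ X_tr.T @ y_tr
        scores.append(sqrt_rss(y[test_idx], X[test_idx] @ w))
    return scores

# Synthetic design matrix with an intercept column (illustrative only).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

folds = make_folds(len(y))
scores = cross_validate(X, y, folds)
print(scores, np.mean(scores))
```

Keeping `folds` as an explicit list of index arrays makes it easy to reuse exactly the same split in later experiments.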

(vi) (2 points) Predicting combat points: Repeat the same experiment as in question (v),

but instead of linear regression, implement linear regression with regularization. Experiment

and report your results with different values of the regularization term λ.

Note: You can use libraries for matrix operations and random sampling, but please implement

the regularized linear regression algorithm, the 5-fold cross-validation process, and the RSS

error computation.

Note: Use the same sample split as in question (v) for better comparison between regularized

and non-regularized regression.
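
As a sketch only, and assuming that "regularization" here means an L2 (ridge) penalty, the closed-form solution becomes $w^* = (X^T X + \lambda I)^{-1} X^T y$. Leaving the intercept unpenalized, as done below, is a common design choice rather than something the assignment specifies; the data and function name are likewise assumptions.

```python
import numpy as np

def fit_ridge(X, y, lam, penalize_intercept=False):
    """Ridge solution w = (X^T X + lam * I)^{-1} X^T y.

    Assumes the first column of X is the constant 1 for the intercept.
    """
    penalty = np.eye(X.shape[1])
    if not penalize_intercept:
        penalty[0, 0] = 0.0  # leave the intercept unregularized (assumption)
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

# Synthetic data with an intercept column (illustrative only).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = fit_ridge(X, y, lam)
    # Larger lam shrinks the non-intercept weights toward zero.
    print(lam, np.round(w, 3), round(float(np.linalg.norm(w[1:])), 3))
```

With `lam = 0` the function reduces to ordinary least squares, which gives a quick sanity check against the unregularized model.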

(vii) (Bonus, 1 point) Based on your findings from questions (ii) and (iii), use linear regression

and experiment with different feature combinations. Report your results.

(viii) (1 point) Use the mean sample value of the outcome to binarize the data. Run a logistic

regression model to classify between low and high combat points. To evaluate the model,

randomly split 80% of data into training and 20% into testing. Report the accuracy of the

classifier on the test data.

Note: You can use a built-in function for the logistic regression from the available libraries.
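
One way this could look with scikit-learn is sketched below on synthetic data; the feature matrix, variable names, and random seeds are assumptions for illustration. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default (`C=1.0`), so a very large `C` is needed to approximate an unregularized model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and combat points (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
combat_points = 500 + 100 * X[:, 0] - 50 * X[:, 1] + rng.normal(scale=20, size=200)

# Binarize with the sample mean: 1 = high combat points, 0 = low.
labels = (combat_points > combat_points.mean()).astype(int)

# Random 80% / 20% split into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000)  # default settings include an L2 penalty
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)     # fraction of correct test predictions
print(accuracy)
```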

(ix) (1 point) Run a logistic regression model with regularization to classify between low

and high combat points. Use the same training and testing split as in question (viii). Find

the optimal regularization term using a 5-fold cross-validation on the training data. Use the

regularization term that provided the best results from the cross-validation, and evaluate the

regularized logistic regression on the test data. Report the final accuracy on the test data, as

well as the best hyperparameter.

Note: You can use a built-in function for the logistic regression from the available libraries.
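
The hyperparameter search could be sketched as follows, again on synthetic data with hypothetical variable names and an arbitrary candidate grid. In scikit-learn the parameter `C` is the inverse of the regularization strength, so smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
score = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
labels = (score > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training data only, one run per candidate.
candidate_C = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_accuracy = []
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    cv_accuracy.append(cross_val_score(model, X_train, y_train, cv=5).mean())

# Refit on the full training set with the best candidate, then test once.
best_C = candidate_C[int(np.argmax(cv_accuracy))]
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_C, test_accuracy)
```

The test data are touched only once, after the regularization term has been chosen, so the reported accuracy is not biased by the search.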
