Homework 2 CSCE 633

Instructions for homework submission
a) Please write a brief report and include your code.
b) Create a single PDF and submit it on eCampus. Please do not submit .zip files or Colab
notebooks.
c) Please start early 🙂
d) The maximum grade for this homework, excluding bonus questions, is 10 points (out of 100
total for the class). There is 1 bonus point.
Question: Machine learning with Pokemon GO
Recent studies have found that novel mobile games can lead to increased physical activity. A
notable example is Pokemon Go, a mobile game that combines the Pokemon world with the real world through augmented reality and requires players to physically move around. Specifically,
in the following study, researchers found that Pokemon Go leads to increased levels of
physical activity for the most engaged players:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5174727/
In this problem, our goal is to predict the combat points of each pokemon in the 2017 Pokemon
Go mobile game. Each pokemon has its own unique attributes that can help predict its
combat points. These include:
1. Stamina
2. Attack value
3. Defense value
4. Capture rate
5. Flee rate
6. Spawn chance
7. Primary strength
Inside the “Homework 2” folder on Piazza you can find the data file (named “hw2 data.csv”) that
will be used for our experiments. The rows of this file refer to the data samples (i.e., pokemon
samples), while the columns denote the name of the pokemon (column 1), its attributes (columns
2-8), and the combat point outcome (column 9). You can ignore column 1 for the rest of this
problem.
(i) (1 point) Data exploration: Which are categorical and which are numerical attributes
(columns 2-8) of this dataset?
(ii) (1 point) Data exploration: Plot 2-D scatter plots and compute the Pearson’s correlation coefficient between the numerical attributes and the outcome of interest. Which attributes
would be the most predictive of the outcome of combat points?
Note: The Pearson’s correlation coefficient is a measure of linear association between two variables. It ranges between -1 and 1, with values closer to 1 in absolute value indicating a higher degree of linear association
between a feature and the outcome (the sign gives the direction of the association). For more details, see this link:
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient. You can use any available library to compute
this metric.
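As an illustration of the definition above, the coefficient can be sketched in plain Python as the covariance of the two variables divided by the product of their standard deviations. The function name `pearson_r` and the numbers below are made up for this example and are not taken from the data file.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (not from the homework data).
attack = [10.0, 20.0, 30.0, 40.0, 50.0]
combat_points = [110.0, 205.0, 290.0, 410.0, 495.0]
r = pearson_r(attack, combat_points)
```

A value of `r` near 1 or -1 suggests a strong linear relationship; a library routine such as `numpy.corrcoef` should give the same number.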
(iii) (1 point) Data exploration: Plot 2-D scatter plots and compute the Pearson’s correlation coefficient between the numerical attributes themselves. Which variables are the most
correlated to each other?
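One possible way to compare the numerical attributes with each other is a full correlation matrix. The sketch below assumes NumPy is available and uses synthetic columns; the names `stamina`, `attack`, and `flee_rate` are stand-ins, and the relationship between them is built in artificially.

```python
import numpy as np

# Synthetic stand-ins for numerical attribute columns (not the real data).
rng = np.random.default_rng(0)
stamina = rng.normal(100, 20, size=200)
attack = 0.8 * stamina + rng.normal(0, 5, size=200)  # constructed to track stamina
flee_rate = rng.uniform(0, 1, size=200)              # independent of the others

names = ["stamina", "attack", "flee_rate"]
data = np.column_stack([stamina, attack, flee_rate])

# rowvar=False tells NumPy that each column (not each row) is one variable.
corr = np.corrcoef(data, rowvar=False)

# Locate the most strongly correlated pair of distinct attributes.
abs_corr = np.abs(corr)
np.fill_diagonal(abs_corr, 0.0)  # ignore each variable's correlation with itself
i, j = np.unravel_index(np.argmax(abs_corr), abs_corr.shape)
most_correlated = (names[i], names[j])
```

Strongly correlated attribute pairs carry overlapping information, which is worth keeping in mind when choosing feature combinations later.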
(iv) (1 point) Pre-processing of categorical variables: Categorical variables require special attention because they usually cannot be used as inputs to regression models as they are. A
potential way to treat categorical variables is to simply convert each value of the variable to
a separate number. However, this might introduce non-existent ordinal relationships between the
values, which might not be representative of the data (e.g., if we assign “1” to the value
“green” and “2” to the value “red”, the regression algorithm will assume that “red” is greater
than “green,” which is not necessarily the case). For this reason, we can use a “one hot encoding” to represent categorical variables. With this approach, we create a binary column for
each category of the categorical variable, which takes a value of 1 if the sample belongs to
that category, and 0 otherwise. For each categorical variable of the problem, count the number
of different values and implement the one hot encoding. For the remainder of the problem,
you will be working with the one hot encoding of the categorical variables.
Note: You can find more information on different types of pre-processing categorical variables
in the following links:
https://pbpython.com/categorical-encoding.html

Coding Systems for Categorical Variables in Regression Analysis 
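The one hot encoding described in (iv) can be sketched as follows. This is a minimal plain-Python illustration; the helper name `one_hot_encode` and the category labels are hypothetical and do not come from the data file.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a 0/1 list with one entry per category."""
    categories = sorted(set(values))  # distinct values in a fixed order
    index = {cat: k for k, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the single category this sample belongs to
        rows.append(row)
    return categories, rows

# Hypothetical category labels for illustration only.
primary_strength = ["Grass", "Fire", "Water", "Fire", "Grass"]
categories, encoded = one_hot_encode(primary_strength)
```

Here `len(categories)` is the number of different values, and every encoded row contains exactly one 1.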


(v) (2 points) Predicting combat points: The goal of this question is to predict the
combat points using the numerical attributes, as well as the categorical attributes that were
pre-processed with the one hot encoding process. Implement a linear regression model using
the ordinary least squares (OLS) solution. How many parameters does the model have? To
test your model, randomly split the data into 5 folds and use 5-fold cross-validation. For each
fold, compute the square root of the residual sum of squares (RSS) error between the actual and
predicted outcome variable. Also compute the average square root of the RSS over all folds.
Hint: You will build the data matrix $X \in \mathbb{R}^{N_{train} \times D}$, whose rows correspond to the training
samples $x_1, \ldots, x_{N_{train}} \in \mathbb{R}^{D \times 1}$ and columns to the $D$ features (including the constant 1 for
the intercept):
$$X = \begin{bmatrix} 1, x_1^T \\ \vdots \\ 1, x_{N_{train}}^T \end{bmatrix} \in \mathbb{R}^{N_{train} \times D}.$$
Then use the ordinary least squares solution that we learned in class: $w^* = (X^T X)^{-1} X^T y$.
Note: You can use libraries for matrix operations and random sampling, but please implement
the linear regression algorithm, the 5-fold cross-validation process, and the RSS error computation.
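A minimal sketch of the procedure in (v), assuming NumPy is used only for matrix operations and random sampling. The data here is synthetic, and the helper names (`fit_ols`, `sqrt_rss`, `k_fold_indices`, `cross_validate`) are made up for this example.

```python
import numpy as np

def fit_ols(X, y):
    """OLS weights w* = (X^T X)^{-1} X^T y, solved without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def sqrt_rss(X, y, w):
    """Square root of the residual sum of squares."""
    residuals = y - X @ w
    return float(np.sqrt(np.sum(residuals ** 2)))

def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and cut them into k nearly equal folds."""
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, folds):
    """Train on all folds but one, score on the held-out fold, for each fold."""
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = fit_ols(X[train_idx], y[train_idx])
        scores.append(sqrt_rss(X[test_idx], y[test_idx], w))
    return scores

# Synthetic data for illustration only (not the homework data).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
true_w = np.array([2.0, 1.0, -3.0, 0.5])       # intercept first
X = np.column_stack([np.ones(100), features])  # prepend the constant 1 column
y = X @ true_w + rng.normal(scale=0.1, size=100)

folds = k_fold_indices(len(y), 5, rng)
fold_scores = cross_validate(X, y, folds)
mean_score = float(np.mean(fold_scores))
```

One caveat: if every category of a one hot encoded variable is kept alongside the intercept column, the one hot columns sum to the intercept column and $X^T X$ becomes singular. Dropping one category per variable or using `np.linalg.pinv` are two common ways around this.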
(vi) (2 points) Predicting combat points: Repeat the same experiment as in question (v),
but instead of linear regression, implement linear regression with regularization. Experiment
and report your results with different values of the regularization term λ.
Note: You can use libraries for matrix operations and random sampling, but please implement
the regularized linear regression algorithm, the 5-fold cross-validation process, and the RSS
error computation.
Note: Use the same sample split as in question (v) for better comparison between regularized
and non-regularized regression.
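The assignment does not say which regularizer to use. The sketch below assumes an L2 (ridge) penalty, whose closed-form solution is $w = (X^T X + \lambda I)^{-1} X^T y$, and as a further assumption leaves the intercept unpenalized. The data is synthetic and the helper names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """L2-regularized least squares: w = (X^T X + lam * I)^{-1} X^T y."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # assumption: column 0 is the constant 1 and is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def sqrt_rss(X, y, w):
    """Square root of the residual sum of squares."""
    return float(np.sqrt(np.sum((y - X @ w) ** 2)))

# Synthetic data for illustration only (not the homework data).
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=n)

# One fixed set of folds, reused for every lambda so the results are comparable.
folds = np.array_split(rng.permutation(n), 5)

results = {}
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = fit_ridge(X[train_idx], y[train_idx], lam)
        scores.append(sqrt_rss(X[test_idx], y[test_idx], w))
    results[lam] = float(np.mean(scores))  # average sqrt(RSS) over the 5 folds
```

Setting `lam = 0.0` recovers the unregularized solution, which gives a direct baseline when the same folds are used.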
(vii) (Bonus, 1 point) Based on your findings from questions (ii) and (iii), use linear regression
and experiment with different feature combinations. Report your results.
(viii) (1 point) Use the sample mean of the outcome to binarize the data. Run a logistic
regression model to classify between low and high combat points. To evaluate the model,
randomly split the data into 80% for training and 20% for testing. Report the accuracy of the
classifier on the test data.
Note: You can use a built-in function for the logistic regression from the available libraries.
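The data preparation for (viii) might look like the sketch below. It covers only the binarization, the random 80/20 split, and an accuracy helper; fitting the classifier is left to whichever library is used. The data is synthetic, and treating values exactly equal to the mean as "low" is an assumption.

```python
import numpy as np

# Synthetic outcome and features for illustration only (not the homework data).
rng = np.random.default_rng(0)
combat_points = rng.gamma(shape=2.0, scale=500.0, size=150)
X = rng.normal(size=(150, 4))

# Label 1 ("high") if strictly above the sample mean, else 0 ("low").
labels = (combat_points > combat_points.mean()).astype(int)

# Random 80/20 split via a single shuffled index array.
perm = rng.permutation(len(labels))
n_train = int(0.8 * len(labels))
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = X[train_idx], labels[train_idx]
X_test, y_test = X[test_idx], labels[test_idx]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

Keeping `train_idx` and `test_idx` around makes it straightforward to reuse exactly the same split in a later experiment.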
(ix) (1 point) Run a logistic regression model with regularization to classify between low
and high combat points. Use the same training and testing split as in question (viii). Find
the optimal regularization term using a 5-fold cross-validation on the training data. Use the
regularization term that provided the best results from the cross-validation, and evaluate the
regularized logistic regression on the test data. Report the final accuracy on the test data, as
well as the best hyperparameter.
Note: You can use a built-in function for the logistic regression from the available libraries.
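One way to carry out the selection in (ix) is sketched below, assuming scikit-learn is the library of choice. The data is synthetic, the candidate grid is arbitrary, and the fixed `random_state` stands in for reusing the split from (viii). Note that scikit-learn's `C` is the inverse of the regularization strength, so smaller `C` means a stronger penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
scores = X @ np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y = (scores + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training data only, one run per candidate C.
candidate_C = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_accuracy = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)  # L2 penalty by default
    cv_accuracy[C] = cross_val_score(
        model, X_train, y_train, cv=5, scoring="accuracy"
    ).mean()

# Refit on the full training set with the best C, then score once on the test set.
best_C = max(cv_accuracy, key=cv_accuracy.get)
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
```

The test set is touched only once, after `best_C` has been chosen, so the reported accuracy is not biased by the hyperparameter search.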