Data Analytics and Machine Learning
Problem Set 1
Question 1 : On ggplot2 and regression planes
Use the imports-85.csv dataset available at CCLE (Week 1). The data is taken from the UCI
Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php.
1. Use ggplot2 to visualize the relationship between price and horsepower and body
style. Price is the dependent variable. Consider both the “log()” and “^2”
transformations of price as dependent variables. Does the body style variable appear
to be relevant for car prices , above and beyond horsepower?
2. Run a regression of your preferred specification. Perform residual diagnostics as you
learned in Econometrics. What do you conclude from your regression diagnostic plots
of residuals vs. fitted and residuals vs. horsepower?
3. Now use ggplot2 to visualize the relationship between fuel efficiency (city-mpg) and
horsepower. Now regress city.mpg on horsepower. Is the regression result consistent
with the conclusion you would draw based on the plot? More on this next week.
note: Make sure that your continuous variables are in numeric form (use the function
as.numeric as necessary).
Use the StockRetAcct_insample.dta Stata dataset available at CCLE (Week 1) for the next
Question 2 : Nonlinear relations
A common concern is that the relationship between a predictive variable (X) and the
outcome we are trying to predict (Y) is nonlinear. On the surface, this seems to invalidate
linear regressions, such as the Fama-MacBeth regression. However, this is not generally the
case. For instance, if Y = f(X) + noise, where f(.) is not linear in X, simply define a
transformation of X as, generally, Z = a + bf(X). Now, it is clear that Y = a1 + b1*Z, for
constants a, a1, b, and b1. In other words, one could include squared values of X in the
regression, perhaps max(0,X), etc.
We will see this in action for the case of Issuance (lnIssue). This is the average amount of
stock issuance in the last 36 months, normalized by market equity. Generally, firms that
issue a lot of equity have low returns going forward.
a. Construct decile sorts (10 portfolios) as in the class notes, but now based on the
issuance variable lnIssue. Give the average return to each decile portfolio, valueweighting stocks within each portfolio each year, equal-weighting across years.
b. Plot the average return to these 10 portfolios, similar to what we did in the Topic 1(ef) notes. Discuss whether the pattern seems linear or not.
c. Since most of the ‘action’ is in the extreme portfolios, consider a model where
expected returns to stocks is linear in a transformed issuance-characteristic that takes
three values: -1 if the stock’s issuance is in Decile 1, 1 if the stock’s issuance is in decile
10, and 0 otherwise.
Create this transformed issuance variable and run a Fama-MacBeth regression with it.
Report the results. What is the nature of the portfolio implied by the Fama-MacBeth
regression? That is, what stocks do you go long, short, no position?
Question 3 : Double‐sorts and functional forms
In the lecture notes we saw that the value spread is much larger for small stocks. Using this
fact, I proposed a model where expected returns are linear in the book-to-market ratio as
well as the interaction between book-to-market and size. In other words, holding size
constant there is a linear relation between expected stock returns and book-to-market.
In this question, we will dig deeper into whether this is a reasonable assumption or not
based on visual analysis.
a. Create independent quintile sorts based on book-to-market (lnBM) and size (lnME).
That is create a quintile variable by year for book-to-market and then create a quintile
variable by year for size.
b. For each size quintile, plot the average returns to the five book-to-market quintile
portfolios. So, for size quintile 1, and book-to-market quintile 3, the stocks in this
portfolio all have size quintile equal to 1 and book-to-market quintile equal to 3. Thus,
I’m looking for five plots here, one for each size quintile.
Does the assumption of conditional linearity seem ok, or would you suggest a different
Data Analytics and Machine Learning Problem Set 1
Data Analytics and Machine Learning