Data Analytics and Machine Learning

Problem Set 4

Using the StockRetAcct_sample.dta Stata dataset available at CCLE (Week 1), we will in this

exercise code up a “machine learning” version of characteristics-based investment. That is,

we will program an automated learning routine that finds the model specification — which

variables to include and the functional form these variables take — and evaluate the out-ofsample performance of this model.

As described in the lecture notes for Topic 4, we will use the Elastic Net procedure, which

includes Ridge Regressions and LASSO as special cases. The setup for this homework is

given in slides 46 – 51 in Topic 4. The core idea is to develop a model where we estimate

the portfolio weights of the MVE portfolio directly (as opposed to separately estimate the

conditional mean returns and covariance matrix).

Question 1: Automatic Stock Picking Algorithm

a. Download the data. The firm-level characteristics you will use are lnIssue, lnProf,

lnInv, and lnME. For each of these four characteristics, create new, additional

characteristics as the squared value of the original characteristic. Name the new

characteristics the same as the orignal, but with a “2” at the end. For instance, for

lnProf, the squared value should be lnProf2. Further, create additional characteristics

by multiplying each characteristic with lnME (except for lnME itself, which you already

have squared). To name these, add _ME at the end. Thus, lnProf interacted with lnME is

named lnProf_ME. You should have now gone from 4 to 11 characteristics.

(i) For each year in the sample, cross-sectionally demean each of the 11 characteristics.

That is, for each characteristic and each year subtract the average value of that

characteristic across stocks. Then add as final characteristic a column of 1’s to the

dataset. This effectively inserts an intercept in the relation between the MVE portfolio

weight and the characteristics.

Next calculate the factor portfolio returns corresponding to each of these 12

characteristics, as explained on slide 48 in the Topic 4 note (F is implicitly defined in

the equation there). Note that the factor corresponding to the constant is simply an

equal-weighted portfolio of all stocks (the “market”). The overall idea is that with this

approach you have a market factor and long-short characteristics factors.

Calculate and report the factor sample means and sample variance-covariance matrix

for these 12 factorsâ€™ returns, as well as the factors’ sample Sharpe ratios.

(ii) Next, you are to use the Elastic Net procedure (alpha = 0.5 in glmnet) to estimate the b

coefficients. We will use the 25 year sample from 1980-2004 for the cross-validation

exercise, and then we will use the 2005-2014 period for the out of sample testing.

Here, we cannot use the pre-programmed cross-validation procedure in cv.glmnet.

The reason is that the right- and left-hand side variables depend on the sample in a

way not accounted for in the canned glmnet procedure.

You are to run a cross-sectional regression of average returns to the factors on the

covariances of each factor with itself and the other factors (as in the bottom equation

on slide 49 in Topic 4). A 5-fold cross-validation would tell you to first find the sample

factor averages and sample factor covariance matrix in a 20-year subperiod, and then

see how well the estimated b coefficients do in the 5-year out of sample period. In the

out of sample period, the average returns are the 5-year average factor returns for this

period and the covariances are the 5-year covariances in this period. Thus, due to the

combination of time series info (average returns and sample covariance matrix) and

the cross-sectional regression, our setting is a little more complicated than the

standard cv.glmnet code.

So, to be clear: first, find sample average factor returns and covariance matrix from

1980-1999. Estimate the Elastic Net using the glmnet procedure (use family =

‘Gaussian’, alpha = 0.5). This gives you a matrix of b coefficients as a function of

lambda. For each of these sets of b coefficients (each vector of b’s corresponds to a

particular lambda), calculate the mean squared error in the cross-validation out-ofsample period 2000-2004. Now you have MSE as a function of lambda for one 5-year

fold. Then repeat using as in-sample data the 1980-1984 and 1990-2004 period. The

out of sample data is then the 1985-1989 period. Get the MSEs as a function of lambda

and save. Repeat until you have done all 5 folds. Then take the average MSE for each

value of lambda. Pick the lambda that gives the smallest average MSE. Finally, estimate

the elastic net on the full 1980-2004 sample period. Pick the b-coefficient that

corresponds to the value of lambda you have chosen.

(iii) With the final b-vector in hand, calculate the out-of-sample average return, standard

deviation, and Sharpe ratio for the corresponding estimated “ex ante” MVE portfolio

with return b’F_t in the period 2005-2014.

(iv) Plot the cumulative return on this portfolio relative to that on the market (get market

return using the value-weights in the sample, MEwt) over the 2005-2014 period,

where you normalize the “MVE” portfolio’s standard deviation to be the same as the

market over this period. Compare. Note that one should really redo the estimation

each year to get proper out of sample results that would mimic what you would do in

the real world. Also, you could experiment in the in-sample cross-validation with

different values for alpha to see what works best.

Sale!