This is an excerpt from some notes I took for Yale’s S&DS 365 Class (Intermediate Machine Learning)
For low dimensional prediction, we can use least squares. For high dimensional linear regression, there is a bias-variance tradeoff because no closed solution exists: omitting too many variables leads to high bias, and selecting too many leads to high variance. To mitigate this, we need to select a good subset of variables. The lasso is a fast way to select variables.
Sparse linear regression
- Ridge regression doesn’t take advantage of sparsity. Maybe only a small number of covariates are good predictors. How do we find them?
- Lasso:
- L1 measures sparsity and keeps the optimization convex.
Definition. For a particular , we call the lasso estimator.
The selected set of variables are where nonzero
How can we select ?
In general, we can approximate risk by approx. leave-one-out cross-validation and select lambda that minimizes risk.
LASSO
- Find and for each .
- Compute for each using LOOCV.
- Choose to minimize estimated risk
- Let
- Linear regression on chosen variables
Algorithm for the LASSO: derived in steps
- One dimension, one data point
However, we can’t differentiate the norm at . We have to use a sub-differential, sub-gradient hyperplanes that don’t intersect? So sub-differential of the absolute value is anything slope. So we write the derivative
If
If , then . If , then so .
- Now, with , the derivative becomes
This is identical to earlier, but substituting in and , so we get
- Now, with many data points
The derivative is
We get
- When we have variables there is no closed-form solution, so we use coordinate descent. We start by choosing one variable, and freeze all the rest and take the closed form solution. Then we move on and iterate until convergence.
- Then, use least squares on the selected subset.
Nonparametric Regression
Assume only that , where is a smooth function of .
The most popular methods are kernel methods
- Smoothing kernels
- Penalization (Mercer) kernels
Smoothing kernels implements a sort of averaging among data points.