This is an excerpt from some notes I took for Yale’s S&DS 365 Class (Intermediate Machine Learning)

For low dimensional prediction, we can use least squares. For high dimensional linear regression, there is a bias-variance tradeoff because no closed solution exists: omitting too many variables leads to high bias, and selecting too many leads to high variance. To mitigate this, we need to select a good subset of variables. The lasso is a fast way to select variables.


Sparse linear regression

  • Ridge regression doesn’t take advantage of sparsity. Maybe only a small number of covariates are good predictors. How do we find them?
  • Lasso:
  • L1 measures sparsity and keeps the optimization convex.

Definition. For a particular , we call the lasso estimator.

The selected set of variables are where nonzero

How can we select ?

In general, we can approximate risk by approx. leave-one-out cross-validation and select lambda that minimizes risk.

LASSO

  1. Find and for each .
  2. Compute for each using LOOCV.
  3. Choose to minimize estimated risk
  4. Let
  5. Linear regression on chosen variables

Algorithm for the LASSO: derived in steps

  • One dimension, one data point

However, we can’t differentiate the norm at . We have to use a sub-differential, sub-gradient hyperplanes that don’t intersect? So sub-differential of the absolute value is anything slope. So we write the derivative

If

If , then . If , then so .

  • Now, with , the derivative becomes

This is identical to earlier, but substituting in and , so we get

  • Now, with many data points

The derivative is

We get

  • When we have variables there is no closed-form solution, so we use coordinate descent. We start by choosing one variable, and freeze all the rest and take the closed form solution. Then we move on and iterate until convergence.
  • Then, use least squares on the selected subset.

Nonparametric Regression

Assume only that , where is a smooth function of .

The most popular methods are kernel methods

  1. Smoothing kernels
  2. Penalization (Mercer) kernels

Smoothing kernels implements a sort of averaging among data points.