This is an excerpt from some notes I took for Yale’s S&DS 365 Class (Intermediate Machine Learning)

For low-dimensional prediction, we can use least squares. In high-dimensional linear regression, there is a bias-variance tradeoff: omitting too many variables leads to high bias, and selecting too many leads to high variance. To manage this tradeoff, we need to select a good subset of variables. The lasso is a fast way to select variables.

Sparse linear regression

Ridge regression doesn’t take advantage of sparsity. Maybe only a small number of covariates are good predictors. How do we find them?
Lasso:

\hat{β} = \arg\min_{β} \left( \frac{1}{2n} \sum_{i=1}^{n} (Y_{i} - β^{⊤} X_{i})^{2} + λ ∥β∥_{1} \right)

The $\ell_{1}$ penalty encourages sparsity while keeping the optimization convex.

Definition. For a particular $λ$, we call $\hat{β}(λ)$ the lasso estimator.

The selected set of variables is $\hat{S}(λ) = \{ j : \hat{β}_{j}(λ) \neq 0 \}$, the indices whose coefficients are nonzero.

How can we select $λ$?

In general, we can estimate the risk by (approximate) leave-one-out cross-validation (LOOCV) and select the $λ$ that minimizes the estimated risk.

LASSO

Find $\hat{β}(λ)$ and $\hat{S}(λ)$ for each $λ$.
Compute $\hat{R}(λ)$ for each $λ$ using LOOCV.
Choose $\hat{λ}$ to minimize the estimated risk.
Let $\hat{S}(\hat{λ})$ be the selected variables.
Run linear regression (least squares) on the chosen variables.
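
The procedure above can be sketched in plain Python. This is only an illustration under several assumptions that are not in the notes: the function names (`lasso_fit`, `loocv_risk`, `lasso_select_and_refit`), the grid of $λ$ values, and the synthetic data are all made up; the risk is estimated by exact leave-one-out (refitting $n$ times) rather than an approximation; and the lasso fit uses coordinate descent of the kind derived later in these notes. The final least-squares step reuses the same solver with $λ = 0$, which reduces to least squares when the selected columns are linearly independent.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X, y, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for (1/2n) * sum_i (y_i - b.x_i)^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    for _ in range(n_sweeps):
        biggest = 0.0
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(c * c for c in col)
            if sxx == 0.0:
                continue
            # Inner product of column j with the residual that ignores variable j.
            sxr = sum(c * (r + c * beta[j]) for c, r in zip(col, resid))
            new = soft_threshold(sxr / sxx, lam / (sxx / n))
            step = new - beta[j]
            resid = [r - c * step for c, r in zip(col, resid)]
            beta[j] = new
            biggest = max(biggest, abs(step))
        if biggest < tol:
            break
    return beta


def loocv_risk(X, y, lam):
    """Exact leave-one-out estimate of the prediction risk for one lambda."""
    n = len(X)
    total = 0.0
    for i in range(n):
        beta = lasso_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        pred = sum(b * x for b, x in zip(beta, X[i]))
        total += (y[i] - pred) ** 2
    return total / n


def lasso_select_and_refit(X, y, lambdas):
    """Pick lambda by LOOCV, read off the selected set, refit by least squares."""
    risks = {lam: loocv_risk(X, y, lam) for lam in lambdas}
    lam_hat = min(risks, key=risks.get)
    beta_hat = lasso_fit(X, y, lam_hat)
    selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
    if not selected:
        return lam_hat, selected, []
    X_sel = [[row[j] for j in selected] for row in X]
    refit = lasso_fit(X_sel, y, 0.0)  # lam = 0: unpenalized least squares
    return lam_hat, selected, refit


# Synthetic example (assumed): only variable 0 carries signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]
grid = [0.01, 0.05, 0.1, 0.5, 1.0, 3.0]
lam_hat, selected, refit = lasso_select_and_refit(X, y, grid)
print(lam_hat, selected, refit)
```

Refitting on the selected variables removes the shrinkage that the $\ell_{1}$ penalty applies to the nonzero coefficients, which is why the last step is a plain regression rather than reuse of $\hat{β}(\hat{λ})$.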

Algorithm for the LASSO: derived in steps

One dimension, one data point

f (β) = \frac{1}{2} (y - β)^{2} + λ ∣ β ∣

However, the absolute value is not differentiable at $0$. Instead we use the subdifferential: the set of slopes of lines that touch the function at a point and lie on or below it everywhere. For the absolute value at $0$, this is any slope $v \in [-1, 1]$. Setting the subgradient of $f$ to zero gives

β - y + λ v = 0

where $v = 1$ if $β > 0$, $v = -1$ if $β < 0$, and $v \in [-1, 1]$ if $β = 0$. Solving for $β$,

β = y - λ v

If $∣y∣ > λ$, then $v = sgn(y)$ and $β = y - λ \, sgn(y)$. If $∣y∣ \leq λ$, then $v = \frac{y}{λ}$, so $β = 0$. Both cases are captured by the soft-thresholding operator:

\hat{β} = soft_{λ} (y) = sgn (y) \cdot max (∣ y ∣ - λ, 0)
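
As a quick sanity check of the soft-thresholding formula, here is a minimal Python sketch. The function names `soft_threshold` and `f`, and the values $y = 3$, $λ = 1$, are illustrative assumptions; the brute-force grid search is only there to compare against the closed form.

```python
def soft_threshold(y: float, lam: float) -> float:
    """Return sgn(y) * max(|y| - lam, 0)."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0


def f(beta: float, y: float, lam: float) -> float:
    """One-dimensional, one-data-point lasso objective."""
    return 0.5 * (y - beta) ** 2 + lam * abs(beta)


# Compare the closed form with a brute-force minimum over a fine grid.
y, lam = 3.0, 1.0
grid = [k / 1000 for k in range(-5000, 5001)]
best_on_grid = min(grid, key=lambda b: f(b, y, lam))
print(soft_threshold(y, lam), best_on_grid)
```

The two printed values should agree up to the grid spacing. Inputs with $∣y∣ \leq λ$ are mapped to exactly zero, which is the mechanism behind variable selection.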

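Now, with a general covariate $x \neq 0$ (no longer fixed at $1$), the objective is $f(β) = \frac{1}{2} (y - β x)^{2} + λ ∣β∣$ and the derivative condition becomes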

- y x + β x^{2} + λ v = 0

which, after dividing by $x^{2}$, is

- \frac{y}{x} + β + \frac{λ}{x^{2}} v = 0

This is identical to the earlier problem with $y$ replaced by $y / x$ and $λ$ replaced by $λ / x^{2}$, so we get

\hat{β} = soft_{λ / x^{2}} (y / x) = sgn (\frac{y}{x}) \cdot max (∣ \frac{y}{x} ∣ - \frac{λ}{x^{2}}, 0)

Now, with many data points

f (β) = \frac{1}{2n} \sum_{i=1}^{n} (y_{i} - β x_{i})^{2} + λ ∣β∣

The derivative is

f'(β) = \frac{1}{n} \sum_{i=1}^{n} (β x_{i}^{2} - y_{i} x_{i}) + λ v = - \frac{1}{n} \sum_{i=1}^{n} y_{i} x_{i} + \frac{1}{n} \sum_{i=1}^{n} β x_{i}^{2} + λ v = 0

We get

\hat{β} = soft_{λ / \frac{1}{n} \sum_{i} x_{i}^{2}} (\frac{\sum _{i} y _{i} x _{i}}{\sum _{i} x _{i}^{2}})
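
The closed form for one variable and $n$ data points translates directly into code. The sketch below is illustrative only: `lasso_1d` is a hypothetical name, and returning $0$ when every $x_{i}$ is zero is an assumption made here to avoid dividing by zero.

```python
def soft_threshold(z: float, t: float) -> float:
    """Return sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, lam):
    """Minimize (1/2n) * sum_i (y_i - b * x_i)^2 + lam * |b| over scalar b."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    if sxx == 0:
        return 0.0  # assumed convention: an all-zero covariate gets coefficient 0
    # Threshold is lam / ((1/n) * sum x_i^2); argument is the least-squares slope.
    return soft_threshold(sxy / sxx, lam / (sxx / n))


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for lam in (0.0, 1.0, 100.0):
    print(lam, lasso_1d(x, y, lam))
```

With $λ = 0$ this returns the ordinary least-squares slope; as $λ$ grows the estimate shrinks toward zero and eventually becomes exactly zero.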

When we have $p$ variables there is no closed-form solution, so we use coordinate descent. We start by choosing one variable, freeze all the rest, and apply the one-dimensional closed-form solution to the residual left over by the frozen variables. Then we move on to the next variable and iterate until convergence.
Then, use least squares on the selected subset.
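
The coordinate descent loop described above can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the name `lasso_coordinate_descent`, the cyclic update order, the stopping rule (largest coefficient change below `tol`), and the toy data are all assumptions.

```python
def soft_threshold(z, t):
    """Return sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=1000, tol=1e-10):
    """Minimize (1/2n) * sum_i (y_i - b.x_i)^2 + lam * ||b||_1.

    X is a list of n rows, each a list of p numbers; y is a list of n numbers.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X beta, with beta = 0 initially
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(c * c for c in col)
            if sxx == 0.0:
                continue  # an all-zero column keeps coefficient 0
            # Partial residual: add variable j's current contribution back in.
            sxr = sum(c * (r + c * beta[j]) for c, r in zip(col, residual))
            # One-dimensional closed form with all other coefficients frozen.
            new_bj = soft_threshold(sxr / sxx, lam / (sxx / n))
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_bj
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Toy data (assumed): two orthogonal columns, only the first has a large effect.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1, 3.0, 0.1]
beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print(beta_hat, selected)
```

Keeping a running residual means each coordinate update costs one pass over the data instead of recomputing all $p$ contributions.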

Nonparametric Regression

Assume only that $Y_{i} = m (X_{i}) + ϵ_{i}$, where $m (x)$ is a smooth function of $x$.

The most popular methods are kernel methods:

Smoothing kernels
Penalization (Mercer) kernels

Smoothing kernels implement a sort of local averaging among data points.
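
To make the averaging idea concrete, here is a small sketch of one common smoothing-kernel estimator, a kernel-weighted average (often called the Nadaraya-Watson estimator). The notes do not name a specific estimator, so the choice of estimator, the Gaussian kernel, the bandwidth parameter `h`, and the function names are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def kernel_smoother(x0, xs, ys, h):
    """Estimate m(x0) as a weighted average of ys, weighting points near x0 more."""
    weights = [gaussian_kernel((x0 - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights underflowed; x0 is too far from the data for this h")
    return sum(w * yi for w, yi in zip(weights, ys)) / total


# Toy data (assumed): noiseless samples of m(x) = x^2 on a grid.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
for h in (0.1, 0.5, 2.0):
    print(h, kernel_smoother(0.0, xs, ys, h))
```

A small bandwidth `h` averages over only the nearest points (low bias, high variance), while a large `h` averages over nearly all of them (high bias, low variance).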

