kernels and neural networks

This is an excerpt from some notes I took for Yale’s S&DS 365 class (Intermediate Machine Learning).

Mercer kernels

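Definition. A Mercer kernel is a symmetric function $K (x, x')$ such that, for any finite set of points $x_{1}, \dots, x_{n}$, the matrix $K = [K (x_{i}, x_{j})]_{n \times n}$ is positive semi-definite.

Example. The Gaussian kernel $K (x, x') = \exp(- ∥ x - x' ∥^{2} / (2 h^{2}))$, with bandwidth $h > 0$, is a Mercer kernel.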
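As a quick numerical check of the definition above, here is a minimal sketch that builds the Gram matrix of a Gaussian kernel on random points and checks that its eigenvalues are non-negative. The helper names (`gaussian_kernel`, `gram_matrix`), the bandwidth of 1, and the random points are illustrative assumptions, not part of the course notes.

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 h^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2))

def gram_matrix(points, kernel):
    """Build the n x n matrix K = [kernel(x_i, x_j)]."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))  # 20 made-up points in R^3
K = gram_matrix(points, gaussian_kernel)

# A symmetric matrix is PSD iff all of its eigenvalues are >= 0.
# The small negative tolerance allows for floating-point rounding.
eigenvalues = np.linalg.eigvalsh(K)
is_psd = bool(np.allclose(K, K.T) and eigenvalues.min() > -1e-10)
print(is_psd)
```

A check like this on one set of points cannot prove that a kernel is Mercer (the definition quantifies over all finite point sets), but a clearly negative eigenvalue would show that a candidate kernel is not.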

Remark. Instead of using local smoothing, we can optimize the fit to the data subject to a roughness penalty (i.e., adding regularization). We want to find, from a class of functions $H$

$$\hat{m} = \arg\min_{m \in H} \sum_{i = 1}^{n} (Y_{i} - m (X_{i}))^{2} + λ \, \text{penalty}(m)$$

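Remark. We can create a set of basis functions based on $K$. Fix a point $z \in R^{p}$ and define $K_{z} (x) = K (z, x)$. Then, drawing $z_{i}$ from the space of possible data $R^{p}$, we can define the space $H_{0}$ of finite linear combinations of these basis functions. Its completion under the inner product defined below is the Reproducing Kernel Hilbert Space (RKHS) $H$: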

$$H_{0} = \left\{ f \;\middle|\; f = \sum_{i = 1}^{k} α_{i} K_{z_{i}} (\cdot),\ α_{i} \in R,\ z_{i} \in R^{p} \right\}$$

Definition. Given two functions $f = \sum_{i} α_{i} K_{x_{i}}$ and $g = \sum_{j} β_{j} K_{x_{j}}$ in $H$, we define the inner product as

$$⟨ f, g ⟩_{K} = \sum_{i} \sum_{j} α_{i} β_{j} K (x_{i}, x_{j}) = α^{⊤} K β$$

and the norm as

$$∥ f ∥_{K}^{2} = ⟨ f, f ⟩_{K} = α^{⊤} K α$$

This norm allows us to penalize functions for being too complex.
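To see where the word "reproducing" comes from, here is a short worked check (not part of the original notes) that this inner product reproduces function evaluation. Take $f = \sum_{i} α_{i} K_{z_{i}}$ and apply the definition of the inner product with $g = K_{x}$, the single basis function centred at $x$ with coefficient 1:

$$⟨ f, K_{x} ⟩_{K} = \sum_{i} α_{i} K (z_{i}, x) = f (x)$$

In particular, $⟨ K_{z}, K_{x} ⟩_{K} = K (z, x)$, so taking the inner product with $K_{x}$ is the same as evaluating a function at $x$.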

Theorem. Representer Theorem. Let $\hat{m}$ minimize $\sum_{i = 1}^{n} (Y_{i} - m (X_{i}))^{2} + λ ∥ m ∥_{K}^{2}$ over $m \in H$. Then $\hat{m} (x) = \sum_{i = 1}^{n} α_{i} K (X_{i}, x)$ for some $α \in R^{n}$.

Remark. This allows us to optimize over only $α$: plugging the above form of $\hat{m}$ into the objective yields $J (α) = ∥ Y - K α ∥^{2} + λ α^{⊤} K α$, and we can now find the $α$ that minimizes $J$. Since $J$ is quadratic and convex in $α$, setting its gradient to zero gives the closed-form solution $\hat{α} = (K + λ I)^{- 1} Y$.

Remark. Here, again, $λ$ creates a bias-variance trade-off: a larger $λ$ gives a smoother fit (more bias, less variance).

Remark. Alternatively, we can solve the optimization problem using gradient descent. The update to $α$ is

$$α \leftarrow α + η (K (Y - K α) - λ K α)$$

where $η$ is a step size hyperparameter (the constant factor of 2 from the gradient of $J$ is absorbed into $η$).
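The closed-form solution and the gradient descent update can be compared on a toy problem. The following is a minimal sketch, assuming a Gaussian kernel with bandwidth 1, made-up 1-D data, $λ = 0.1$, and a step size chosen from the largest eigenvalue of $K$; the helper names `gaussian_gram` and `objective` are hypothetical and not from the notes.

```python
import numpy as np

def gaussian_gram(X, Z, bandwidth=1.0):
    """Gaussian kernel values K(x_i, z_j) for the rows of X and Z."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def objective(alpha, K, Y, lam):
    """J(alpha) = ||Y - K alpha||^2 + lam * alpha^T K alpha."""
    residual = Y - K @ alpha
    return residual @ residual + lam * alpha @ K @ alpha

# Toy 1-D regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = gaussian_gram(X, X)

# Closed form: solve (K + lam I) alpha = Y rather than forming the inverse.
alpha_closed = np.linalg.solve(K + lam * np.eye(len(Y)), Y)

# Gradient descent with the update alpha <- alpha + eta (K (Y - K alpha) - lam K alpha).
# A step of 1 / (largest eigenvalue of K^2 + lam K) ensures J never increases.
mu_max = np.linalg.eigvalsh(K).max()
eta = 1.0 / (mu_max ** 2 + lam * mu_max)
alpha_gd = np.zeros_like(Y)
for _ in range(5000):
    alpha_gd = alpha_gd + eta * (K @ (Y - K @ alpha_gd) - lam * (K @ alpha_gd))

# Predict at new points with m_hat(x) = sum_i alpha_i K(X_i, x).
X_new = np.linspace(-3.0, 3.0, 5)[:, None]
predictions = gaussian_gram(X_new, X) @ alpha_closed

print(objective(alpha_closed, K, Y, lam), objective(alpha_gd, K, Y, lam))
```

It is safer to compare the two solutions through their objective values than through the coefficients themselves: $K$ typically has some eigenvalues very close to zero, and gradient descent moves very slowly along those directions, so `alpha_gd` can remain far from `alpha_closed` even when the two fits are nearly identical.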

Remark. If $x \mapsto ϕ (x) \in R^{d}$ (where $d ≫ p$) is a feature mapping, we can define a Mercer kernel by $K (x, x') = ϕ (x)^{⊤} ϕ (x')$. Conversely, from any Mercer kernel we can derive a corresponding feature map (via the spectral decomposition of the kernel, i.e., Mercer's theorem), although that map may be infinite-dimensional.
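Both directions can be illustrated with a small sketch. The degree-2 polynomial kernel $K (x, x') = (x^{⊤} x')^{2}$ on $R^{2}$ is an example chosen here for illustration (it does not appear in the notes); its explicit feature map is $ϕ (x) = (x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2})$. For the converse, the code uses the eigendecomposition of a finite Gram matrix, which is only a finite-sample analogue of the spectral argument.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel K(x, z) = (x^T z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed without building features."""
    return float(np.dot(x, z)) ** 2

# Direction 1: the inner product of features equals the kernel value.
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
kernel_value = poly_kernel(x, z)
feature_value = float(phi(x) @ phi(z))

# Direction 2: recover feature vectors from a Gram matrix alone.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
K = (points @ points.T) ** 2

# K = V diag(w) V^T with w >= 0, so rows of V * sqrt(w) act as features.
w, V = np.linalg.eigh(K)
features = V * np.sqrt(np.clip(w, 0.0, None))  # clip tiny negative rounding
reconstructed = features @ features.T

print(kernel_value, feature_value)
```

The recovered `features` are not the same vectors as `phi(points[i])`, since a feature map is only determined up to rotation, but they give the same inner products and therefore the same kernel.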

Neural Networks

Remark. MLPs. Yay! Yippee!! We can define a parametric classification model with one hidden layer as

$$P (y \mid x) = \text{softmax} (W_{2}^{⊤} ϕ (W_{1} x))$$

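where $ϕ$ is a component-wise nonlinear function and $W_{1}, W_{2}$ are weight matrices.

Remark. A neural network is NOTHING MORE than a parametric linear model with a non-linearity: the output layer is a linear (softmax) classifier on the hidden features $ϕ (W_{1} x)$, which are themselves a linear map of $x$ passed through $ϕ$.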
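The forward pass of this one-hidden-layer model can be sketched in a few lines. The layer sizes, the random weights, and the choice of `tanh` for $ϕ$ are assumptions made for illustration; like the formula above, the sketch has no bias terms.

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def mlp_predict_proba(x, W1, W2, phi=np.tanh):
    """P(y | x) = softmax(W2^T phi(W1 x)) for a single input vector x."""
    hidden = phi(W1 @ x)           # hidden features, shape (h,)
    return softmax(W2.T @ hidden)  # class probabilities, shape (c,)

rng = np.random.default_rng(0)
p, h, c = 4, 8, 3                  # input dim, hidden units, classes (made up)
W1 = rng.normal(size=(h, p))
W2 = rng.normal(size=(h, c))
x = rng.normal(size=p)

probs = mlp_predict_proba(x, W1, W2)
print(probs)
```

With $W_{1}$ held fixed, this is exactly softmax regression on the features `hidden`; what makes it a neural network is that $W_{1}$ is learned jointly with $W_{2}$.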

Backpropagation

Remark. Yay!!! $c :$