This is an excerpt from some notes I took for Yale’s S&DS 365 Class (Intermediate Machine Learning)

Mercer kernels

Definition. A Mercer kernel has the property that the matrix over a set of points is positive semi-definite. Example. The gaussian kernel is Mercer


Remark. Instead of using local smoothing, we can optimize the fit to the data subject to a roughness penalty (i.e., adding regularization). We want to find, from a class of functions


Remark. We can create a set of basis functions based on . Fix a point and define . Then, drawing from the space of possible data , we can define the Reproducing Kernel Hilbert Space:

Definition. Given two different functions , we define the inner product as

and the norm as

This norm allows us to penalize functions for being too complex.


Theorem. Representer Theorem. Let minimize . Then, . Remark. This allows us to optimize over only and plug in the above formulation of to yield , and now we can find to minimize Since this is linear and convex, we can find the closed form solution . Remark. Here, again, creates a bias-variance trade-off. Remark. Alternatively, we can solve the optimization problem using gradient descent. The update to is

where is a step size hyperparameter.


Remark. If (where ) is a feature mapping, we can define a Mercer kernel by . Conversely, from any Mercer kernel, we can derive the corresponding feature map (from the spectral theorem).


Neural Networks

Remark. MLPs. Yay! Yippee!! We can define the parametric classification model (one hidden layer) where

where is a component-wise nonlinear function and are weight matrices. Remark. A neural network is NOTHING MORE than a parametric linear model with a non-linearity.


Backpropagation

Remark. Yay!!!