L2-Regularization
The standard regularization strategy is L2-regularization, also called ridge regression:
$f(w) = \frac{1}{2}||Xw-y||^2 + \frac{\lambda}{2}||w||^2$
Intuition: large slopes $w$ tend to lead to overfitting, and a larger $\lambda$ puts a heavier penalty on large weights.
L2-Regularization Normal Equations
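Setting the gradient $X^T(Xw-y) + \lambda w$ to zero gives the closed-form (normal equations) solution:
$w = (X^TX + \lambda I)^{-1}X^Ty$
For $\lambda > 0$, the matrix $X^TX + \lambda I$ is always invertible, so this solution is unique (unlike ordinary least squares).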
L2-Regularization Gradient Descent
$f(w) = \frac{1}{2}||Xw-y||^2 + \frac{\lambda}{2}||w||^2$
$f'(w) = X^T (Xw-y) + \lambda w$
$w^{t+1} = w^t - \alpha f'(w^t)$
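A minimal numpy sketch of these updates (the step size `alpha` and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=1e-3, num_iters=1000):
    """Minimize f(w) = 0.5*||Xw - y||^2 + (lam/2)*||w||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y) + lam * w  # f'(w) from above
        w = w - alpha * grad                # w^{t+1} = w^t - alpha * f'(w^t)
    return w
```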
Standardizing Features
Features may have different scales.
- it doesn't matter for decision trees or naive Bayes, because they look at only one feature at a time
- it doesn't matter for least squares: rescaling a feature just rescales the corresponding $w_j$, leaving predictions unchanged
- it matters for KNN: distances will be dominated by the large-scale features
- it matters for regularized least squares: penalizing $w_j^2$ means different things if the features $j$ are on different scales
Steps:
For each feature (see the sketch below):
- compute its mean and standard deviation
- subtract the mean and divide by the standard deviation
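A minimal numpy sketch, assuming a training matrix `X` and a test matrix `Xtest` (illustrative names); the training mean and standard deviation are reused on the test data:

```python
import numpy as np

def standardize(X, Xtest):
    """Standardize each column using the training mean and standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, (Xtest - mu) / sigma
```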
Standardizing Targets
In regression, we sometimes also need to standardize $y$ (subtract its mean, divide by its standard deviation).
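A minimal sketch of the same idea for the targets, assuming training targets `y` and predictions `yhat_std` made on the standardized scale (illustrative names):

```python
import numpy as np

def standardize_targets(y):
    """Return standardized targets plus the statistics needed to undo it."""
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, mu, sigma

# Predictions made on the standardized scale are mapped back with:
# yhat = yhat_std * sigma + mu
```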
Gaussian RBF: A sum of bumps
What are radial basis functions (RBFs)?
A set of non-parametric bases that depend on distances to the training points. Replace $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ with $z_i = (g(||x_i-x_1||), g(||x_i-x_2||), \ldots, g(||x_i-x_n||))$.
Gaussian RBF: $g(\epsilon) = \exp(-\frac{\epsilon^2}{2\sigma^2})$, where $\sigma$ controls the width of the bumps.
How do we avoid using a huge number of features?
- we regularize $w$ and use the validation error to choose $\sigma$ and $\lambda$ (see the sketch below)
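A minimal numpy sketch of this pipeline, assuming training data `X`, `y` and a validation set `Xval`, `yval` (illustrative names), with arbitrary candidate grids for $\sigma$ and $\lambda$:

```python
import numpy as np

def gaussian_rbf_features(Xtilde, X, sigma):
    """Z[i, j] = g(||Xtilde[i] - X[j]||): one Gaussian bump per training point in X."""
    sq_dists = (
        np.sum(Xtilde**2, axis=1)[:, None]
        - 2 * Xtilde @ X.T
        + np.sum(X**2, axis=1)[None, :]
    )
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_rbf_ridge(X, y, Xval, yval, sigmas=(0.5, 1, 2, 4), lams=(0.01, 0.1, 1, 10)):
    """Choose (sigma, lambda) by validation error; fit regularized least squares on Z."""
    best_err, best = np.inf, None
    for sigma in sigmas:
        Z = gaussian_rbf_features(X, X, sigma)        # n x n training features
        Zval = gaussian_rbf_features(Xval, X, sigma)  # centres are the training points
        for lam in lams:
            # Ridge normal equations on the RBF features.
            w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
            err = np.mean((Zval @ w - yval) ** 2)
            if err < best_err:
                best_err, best = err, (sigma, lam, w)
    return best
```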
L1-Regularization (Lasso)
Simultaneously regularizes and selects features:
$f(w) = \frac{1}{2}||Xw-y||^2 + \lambda||w||_1$
Regularization and Sparsity
L2-regularization does not give sparsity, but L1-regularization does: many elements of $w$ are set exactly to zero.
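A small sketch of this difference using scikit-learn's Ridge and Lasso on synthetic data with many irrelevant features (the data and regularization strengths are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 shrinks weights but leaves them nonzero; L1 sets many weights exactly to zero.
print("zero weights (ridge):", np.sum(ridge.coef_ == 0))
print("zero weights (lasso):", np.sum(lasso.coef_ == 0))
```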
L1 Loss Vs. L1 Regularization
L1 loss is robust to outlier data points. L1 regularization is robust to irrelevant features.
Combining the L1 loss with L1 regularization gives $f(w) = ||Xw-y||_1 + \lambda||w||_1$ (a subgradient sketch follows below).
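Both terms are non-smooth, so plain gradient descent does not directly apply; one simple (if slow) option is subgradient descent. A minimal numpy sketch under that assumption, with arbitrary step size and iteration count:

```python
import numpy as np

def l1_loss_l1_reg(X, y, lam=1.0, alpha=1e-3, num_iters=5000):
    """Minimize f(w) = ||Xw - y||_1 + lam*||w||_1 by subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        # A subgradient of the L1 loss is X^T sign(Xw - y);
        # a subgradient of lam*||w||_1 is lam*sign(w).
        sub = X.T @ np.sign(X @ w - y) + lam * np.sign(w)
        w = w - alpha * sub
    return w
```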
L* Regularization