This is a summary of the CPSC 340 course. There are five main parts:
- supervised learning - based on counting and distances
- unsupervised learning - based on counting and distances
- supervised learning - based on linear models
- unsupervised learning - based on linear models
- neural networks
Part 1 Supervised Learning
- Cross Validation - L5
- KNN
- Naive Bayes
- Laplace Smoothing
Part 2 Unsupervised Learning
Part 3 Supervised Learning based on Linear Models
Linear regression (L11, L12)
- least squares
- set the partial derivatives to zero to find the best w vector
- can be solved via the normal equations (see the sketch below)
- cost: O(d^3 + nd^2)
- least squares has some issues
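A minimal numpy sketch of solving least squares via the normal equations; the data here is made up purely for illustration:

```python
import numpy as np

# Hypothetical data: n = 100 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal equations: solve (X^T X) w = X^T y.
# Forming X^T X costs O(n d^2); solving the d x d system costs O(d^3),
# giving the O(d^3 + n d^2) total noted above.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```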
Non-linear regression (L12)
- change of basis: polynomial (large p might result in overfitting)
- Gradient descent (L13) - see the sketch below
- Normal equations are slow when d is large
- cost for t iterations: O(ndt)
- how to determine whether a function is convex
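A rough sketch of gradient descent for the least-squares objective; the fixed step size `alpha` and iteration count are illustrative choices, not values from the course:

```python
import numpy as np

def least_squares_gradient_descent(X, y, alpha=0.01, t=1000):
    """Minimize f(w) = 0.5 * ||Xw - y||^2 by gradient descent.

    Each iteration computes the gradient X^T (Xw - y), which costs O(nd),
    so t iterations cost O(ndt) overall.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(t):
        grad = X.T @ (X @ w - y)
        w -= alpha * grad
    return w
```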
Different loss functions (L14) - Robust regression, Brittle regression
- Robust regression: focus less on large errors (outliers)
- L1-norm loss: $f(w) = \|Xw-y\|_1$
- Smooth approximation: Huber loss (sketched below)
- Non-convex losses can be even more robust to outliers but are hard to minimize globally, so the L1-norm is the most robust convex loss function
- Brittle regression: really care about getting outliers right
- minimize the worst error
- very sensitive to outliers
- $f(w) = \|Xw-y\|_{\infty}$, where $\|r\|_{\infty} = \max_i\{|r_i|\}$
- Smooth approximation: $\max_i\{z_i\} \approx \log(\sum_{i}\exp(z_i))$ (log-sum-exp; see the sketch below)
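A hedged sketch of the two smooth approximations mentioned above; the Huber threshold `eps` is a hypothetical parameter:

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic for |r| <= eps, linear for |r| > eps,
    a smooth approximation to the absolute value."""
    a = np.abs(r)
    return np.where(a <= eps, 0.5 * a**2, eps * (a - 0.5 * eps))

def smooth_max(z):
    """Log-sum-exp approximation to max(z); used to smooth the L_infinity loss."""
    return np.log(np.sum(np.exp(z)))

print(huber(np.array([-0.5, 3.0])))            # small residual -> quadratic, large -> linear
print(smooth_max(np.array([1.0, 2.0, 3.0])))   # slightly above the true max of 3
```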
- Complexity Penalties
- information criteria: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + \lambda k$, where k is the number of estimated parameters (for a degree-p polynomial, k = p + 1)
- $\lambda = 1$ gives AIC
- penalized score for a degree-p polynomial: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + (p+1)\lambda$ (see the sketch below)
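A small sketch combining the polynomial change of basis with the complexity-penalized score above; `poly_basis` and `poly_score` are hypothetical helper names:

```python
import numpy as np

def poly_basis(x, p):
    """Map a 1-D feature vector x (length n) to the polynomial basis [1, x, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

def poly_score(x, y, p, lam=1.0):
    """Fit a degree-p polynomial by least squares and return
    score(p) = 0.5 * ||Zv - y||^2 + lam * (p + 1); lam = 1 is the AIC-style penalty."""
    Z = poly_basis(x, p)
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return 0.5 * np.sum((Z @ v - y) ** 2) + lam * (p + 1)
```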
Feature selection (L15)
- Select features that are relevant
- Approach 1: association - measure how each feature relates to y on its own (e.g. correlation)
- Approach 2: regression weight - fit weights using all features, then compare each $|w_j|$ to a threshold
- Approach 3: search and score
- define score function
- cannot be the training error - we would always end up selecting all features
- use the validation error instead
- problem #1: too many choices
- forward selection: start from the empty set S = {}; for each feature j not in S, compute the score of $S \cup \{j\}$; add the best feature to S if it improves the score (see the sketch below)
- problem #2: might have false positives
- add complexity penalty to avoid false positives
- L0-norm: number of non-zero values - $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_0$
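A sketch of forward selection under an assumed `score_fn` (e.g. validation error plus a complexity penalty); the helper names are hypothetical:

```python
import numpy as np

def forward_selection(X, y, score_fn):
    """Greedy forward selection: start with S = {}, and repeatedly add the
    feature that most improves score_fn(X[:, S], y) until no feature helps.

    score_fn is assumed to return a value to minimize (e.g. validation error
    plus a complexity penalty such as lambda * |S|)."""
    d = X.shape[1]
    S = []
    best = np.inf
    improved = True
    while improved:
        improved = False
        best_j = None
        for j in range(d):
            if j in S:
                continue
            s = score_fn(X[:, S + [j]], y)
            if s < best:
                best, best_j = s, j
        if best_j is not None:
            S.append(best_j)
            improved = True
    return S
```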
L16, 17 Regularization, Standardization
- L2-regularization to avoid overfitting
- L2-regu + least square = ridge regression
- $f(w) = \frac{1}{2}\|Xw-y\|^2 + \frac{\lambda}{2}\|w\|^2$ - solved by a normal equation (L16, p56; see the sketch below) or gradient descent (L16, p58; cost is still O(ndt))
- advantage - L16, p68 bonus slides
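A minimal sketch of the ridge (L2-regularized least squares) closed-form solution:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: minimize 0.5*||Xw - y||^2 + (lam/2)*||w||^2.

    Closed-form solution of the regularized normal equations
    (X^T X + lam*I) w = X^T y; the added lam*I makes the system
    well-posed, so the solution is always unique."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```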
- L1-regularization
- disadvantages of the “search and score” method:
- the L0-norm is hard to minimize
- even forward selection still requires scoring O(d^2) candidate models
- we can instead use L1-regularization, which combines feature selection and regularization
- $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_1$ - also called “LASSO” regularization
- faster than “search and score”
- compared with L2-regularization:
- the solution is not unique
- no closed-form solution, so it must be solved iteratively
- Sparsity
- L1-regularization tends to set some weights $w_j$ exactly to 0
- Loss Vs. regularization
- the L1 loss function is robust to (cares less about) outliers
- L1-regularization is robust to (cares less about) irrelevant features
- different regularizations
- L0-regularization: for selecting features
- L2-regularization: avoid overfitting
- L1-regularization: both
- L17, p31
- Standardization
- not required: decision tree, naive Bayes, least squares
- required: KNN, regularized least squares
- $\mu_{j} = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$
- $\sigma_{j} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_{ij} - \mu_{j})^2}$
- $X_{ij} \leftarrow \frac{X_{ij} - \mu_{j}}{\sigma_{j}}$ (see the sketch below)
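A sketch of standardization; note that the test set is transformed with the training-set statistics:

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize each column: subtract the training mean and divide by the
    training standard deviation; apply the SAME statistics to the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```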
Radial Basis Functions (RBF) - L17
- besides polynomial bases, we can use other bases
- Non-parametric bases
- size of basis grows with ‘n’
- can model complicated functions
- Gaussian RBF (L17, p7)
- a sum of “bumps”, one centred at each training example (see the sketch below)
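A sketch of building Gaussian RBF features; the bump width `sigma` is a hyperparameter to tune:

```python
import numpy as np

def rbf_basis(X, X_train, sigma=1.0):
    """Gaussian RBF features: Z[i, j] = exp(-||x_i - x_train_j||^2 / (2 sigma^2)).

    Non-parametric: one basis function ('bump') per training example,
    so the size of the basis grows with n."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(X_train**2, axis=1)[None, :]
                - 2 * X @ X_train.T)
    return np.exp(-sq_dists / (2 * sigma**2))
```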
Linear Classifiers (classification with regression) - L18, 19
- Loss functions
- we shouldn’t penalize predictions that are “too right”
- Binary class
- hinge loss
- $f(w) = \sum_{i=1}^n\max\{0, 1-y_iw^Tx_i\}$
- add L2-reg = SVM
- logistic loss
- $f(w) = \sum_{i=1}^n\log(1+\exp(-y_iw^Tx_i))$
- probabilistic classifier
- map $w^Tx_i$ to [0, 1]
- binary case: the sigmoid function (see the sketch below)
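A sketch of the binary hinge loss, logistic loss, and sigmoid probabilities, assuming labels y in {-1, +1}:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Sum of hinge losses max(0, 1 - y_i * w^T x_i); adding L2-regularization gives the SVM objective."""
    return np.sum(np.maximum(0, 1 - y * (X @ w)))

def logistic_loss(w, X, y):
    """Sum of logistic losses log(1 + exp(-y_i * w^T x_i))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def predict_proba(w, X):
    """Sigmoid maps w^T x_i to (0, 1), interpreted as p(y_i = +1 | x_i)."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```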
- Multi-class
- one-vs-all classification
- training: for each class “c”, train a binary classifier to predict whether example is “c”
- prediction: get a score for each class, predict c with the highest score
- Multi-class hinge loss
- $\sum_{c\neq y_i} \max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}$ or
- $\max_{c\neq y_i} \{\max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}\}$
- add L2-reg = multi-class SVM
- Multi-class softmax loss ( L20, p5)
- Multi-class Logistic regression: $f(w) = \sum_{i=1}^{n}[-w_{y_i}^Tx_i + \log (\sum_{c=1}^k \exp(w_c^Tx_i))] + \frac{\lambda}{2} \sum_{c=1}^k \sum_{j=1}^d w_{cj}^2$
- probabilities: softmax function (see the sketch below)
- unlike one-vs-all, all the $w_c$ are trained simultaneously
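A sketch of the softmax mapping from per-class scores to probabilities; the max-subtraction shift is an implementation detail for numerical stability, not something from the notes:

```python
import numpy as np

def softmax_probs(W, X):
    """Softmax maps the k scores w_c^T x_i to a probability distribution over classes:
    p(y_i = c) = exp(w_c^T x_i) / sum_c' exp(w_c'^T x_i).

    W is k x d (one row of weights per class), X is n x d."""
    scores = X @ W.T                              # n x k matrix of w_c^T x_i
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)
```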
L20 Feature engineering
L21 Kernel Trick
- For linear regression, what we do:
- from X, form Z
- compute $V = (Z^TZ+\lambda I)^{-1}(Z^Ty)$
- For kernel regression, what we do:
- form inner product K from X
- compute $U = (K+\lambda I)^{-1}y$
- Polynomial kernels
- Gaussian RBF kernels (see the sketch below)
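A sketch of kernel (ridge) regression with a Gaussian RBF kernel, following the $U = (K+\lambda I)^{-1}y$ formula above; the function names are hypothetical:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2 * A @ B.T)
    return np.exp(-sq / (2 * sigma**2))

def kernel_regression_fit(X, y, lam=1.0, sigma=1.0):
    """Kernel ridge regression: u = (K + lam*I)^{-1} y with K = k(X, X)."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_regression_predict(X_test, X_train, u, sigma=1.0):
    """Predictions are K_test @ u, where K_test = k(X_test, X_train)."""
    return rbf_kernel(X_test, X_train, sigma) @ u
```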
L22 Stochastic Gradient
L23,24 Boosting, MLE, MAP
MLE maximizes the likelihood; MAP also includes a prior (regularization corresponds to the prior)
Part 4 Unsupervised Learning based on Linear Models (Latent-Factor Models)
latent-factor model uses:
- L25, 26 PCA
- Given X and k, output two matrices: Z (n×k) and W (k×d)
- each row of W - $w_c$: the factor/PC
- each row of Z - $z_i$: weights / features/ new coordinates
- $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2$ - REMEMBER to center the data (see the sketch below)
- variance remaining
- non-uniqueness of PCA
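One standard way to obtain a PCA solution is via the SVD of the centered data; a minimal sketch:

```python
import numpy as np

def pca(X, k):
    """PCA on centered data: returns Z (n x k) and W (k x d) minimizing
    0.5 * ||ZW - X||_F^2, computed from the SVD of the centered X."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # REMEMBER to center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k]                               # top-k principal components (factors)
    Z = Xc @ W.T                             # new coordinates of each example
    return Z, W, mu                          # mu is needed to center new data
```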
- L27 Sparse Matrix Factorization, Robust PCA
- NMF leads to sparsity
- Robust PCA - use the absolute error instead of the squared error
- Regularized PCA
- $L_2$-regularized PCA: $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2 + \frac{\lambda_1}{2}\|Z\|_F^2 + \frac{\lambda_2}{2}\|W\|_F^2$
- $L_1$-regularized PCA: use the L1-norm instead
- also leads to sparsity
- advantages and disadvantages over NMF
- L28 Collaborative filtering
Loss Vs. Regularization
- L0-norm: number of non-zero values
- L1-norm: sum of absolute values
- L2-norm: square root of the sum of squared values
- $L_{\infty}$ norm: largest absolute value (see the sketch below)
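A quick numeric illustration of the four norms on a toy vector:

```python
import numpy as np

w = np.array([0.0, -3.0, 4.0])
l0 = np.count_nonzero(w)        # L0 "norm": number of non-zeros -> 2
l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values -> 7
l2 = np.sqrt(np.sum(w**2))      # L2 norm: sqrt of sum of squares -> 5
linf = np.max(np.abs(w))        # L_inf norm: largest absolute value -> 4
print(l0, l1, l2, linf)
```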
Regularizations
- L0-reg: feature selection
- L1-reg(LASSO): regularization and feature selection
- L2-reg: avoid overfitting; make least squares solution unique
- $L_{\infty}$ reg: make variables have the same magnitude
Losses
- least squares loss
- Gaussian likelihood
- L1-loss (absolute loss)
- robust regression
- robust to outliers
- huber loss (smooth approximation)
- $L_{\infty}$ loss
- max of the absolute errors
- brittle regression
- cares about getting outliers right
- 0-1 loss
- counts whether each classification is wrong (1) or correct (0)
- hinge loss
- upper bound on 0-1 loss
- logistic loss
- a smooth, convex approximation to the 0-1 loss
- multi-class hinge loss
- hinge loss extended to multiple classes
- softmax loss
- logistic loss extended to multiple classes
- trains all $w_c$ simultaneously for multi-class classification