This is a summary of the CPSC 340 course. There are five main parts:
- supervised learning - based on counting and distances
- unsupervised learning - based on counting and distances
- supervised learning - based on linear models
- unsupervised learning - based on linear models
- neural networks
Part 1 Supervised Learning
- Cross Validation - L5
- KNN
- Naive Bayes
- Laplace Smoothing
Part 2 Unsupervised Learning
Part 3 Supervised Learning based on Linear Models
Linear regression (L11, L12)
- least squares
- set the partial derivatives to zero to find the best w vector
- can be solved via the normal equations (see the sketch below)
- cost: O(d^3 + nd^2)
- least squares has some issues
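A minimal numpy sketch of solving least squares via the normal equations; the data here is made up purely for illustration:

```python
import numpy as np

# Hypothetical data: n = 100 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal equations: solve (X^T X) w = X^T y.
# Forming X^T X costs O(n d^2); solving the d x d system costs O(d^3),
# giving the O(d^3 + n d^2) total noted above.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```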
Non-linear regression (L12)
- change of basis: polynomial (large p might result in overfitting)
- Gradient descent (L13) - see the sketch below
- Normal equations are slow when d is large
- cost for t iterations: O(ndt)
- how to determine whether a function is convex
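A rough sketch of gradient descent for the least-squares objective; the fixed step size `alpha` and iteration count are illustrative choices, not values from the course:

```python
import numpy as np

def least_squares_gradient_descent(X, y, alpha=0.01, t=1000):
    """Minimize f(w) = 0.5 * ||Xw - y||^2 by gradient descent.

    Each iteration computes the gradient X^T (Xw - y), which costs O(nd),
    so t iterations cost O(ndt) overall.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(t):
        grad = X.T @ (X @ w - y)
        w -= alpha * grad
    return w
```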
Different loss functions (L14) - Robust regression, Brittle regression
- Robust regression: focus less on large errors (outliers)
- L1-norm loss: $f(w) = \|Xw-y\|_1$
- Smooth approximation: Huber loss (sketched below)
- Non-convex losses can be even more robust to outliers but are hard to minimize globally, so the L1-norm is the most robust convex loss function
- Brittle regression: really care about getting outliers right
- minimize the worst error
- very sensitive to outliers
- $f(w) = \|Xw-y\|_{\infty}$, where $\|r\|_{\infty} = \max_i\{|r_i|\}$
- Smooth approximation: $\max_i\{z_i\} \approx \log(\sum_{i}\exp(z_i))$ (log-sum-exp; see the sketch below)
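A hedged sketch of the two smooth approximations mentioned above; the Huber threshold `eps` is a hypothetical parameter:

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic for |r| <= eps, linear for |r| > eps,
    a smooth approximation to the absolute value."""
    a = np.abs(r)
    return np.where(a <= eps, 0.5 * a**2, eps * (a - 0.5 * eps))

def smooth_max(z):
    """Log-sum-exp approximation to max(z); used to smooth the L_infinity loss."""
    return np.log(np.sum(np.exp(z)))

print(huber(np.array([-0.5, 3.0])))            # small residual -> quadratic, large -> linear
print(smooth_max(np.array([1.0, 2.0, 3.0])))   # slightly above the true max of 3
```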
- Complexity Penalties
- information criteria: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + \lambda k$, where k is the number of estimated parameters (for a degree-p polynomial, k = p + 1)
- $\lambda = 1$ gives AIC
- penalized score for a degree-p polynomial: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + (p+1)\lambda$ (see the sketch below)
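A small sketch combining the polynomial change of basis with the complexity-penalized score above; `poly_basis` and `poly_score` are hypothetical helper names:

```python
import numpy as np

def poly_basis(x, p):
    """Map a 1-D feature vector x (length n) to the polynomial basis [1, x, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

def poly_score(x, y, p, lam=1.0):
    """Fit a degree-p polynomial by least squares and return
    score(p) = 0.5 * ||Zv - y||^2 + lam * (p + 1); lam = 1 is the AIC-style penalty."""
    Z = poly_basis(x, p)
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return 0.5 * np.sum((Z @ v - y) ** 2) + lam * (p + 1)
```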
Feature selection (L15)
- Select features that are relevant
- Approach 1: association - measure how each feature relates to y on its own (e.g. correlation)
- Approach 2: regression weight - fit weights using all features, then compare each $|w_j|$ to a threshold
- Approach 3: search and score
- define score function
- cannot be the training error - we would always end up selecting all features
- use the validation error instead
- problem #1: too many choices
- forward selection: start from the empty set S = {}; for each feature j not in S, compute the score of $S \cup \{j\}$; add the best feature to S if it improves the score (see the sketch below)
- problem #2: might have false positives
- add complexity penalty to avoid false positives
- L0-norm: number of non-zero values - $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_0$
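A sketch of forward selection under an assumed `score_fn` (e.g. validation error plus a complexity penalty); the helper names are hypothetical:

```python
import numpy as np

def forward_selection(X, y, score_fn):
    """Greedy forward selection: start with S = {}, and repeatedly add the
    feature that most improves score_fn(X[:, S], y) until no feature helps.

    score_fn is assumed to return a value to minimize (e.g. validation error
    plus a complexity penalty such as lambda * |S|)."""
    d = X.shape[1]
    S = []
    best = np.inf
    improved = True
    while improved:
        improved = False
        best_j = None
        for j in range(d):
            if j in S:
                continue
            s = score_fn(X[:, S + [j]], y)
            if s < best:
                best, best_j = s, j
        if best_j is not None:
            S.append(best_j)
            improved = True
    return S
```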
L16, 17 Regularization, Standardization
- L2-regularization to avoid overfitting
- L2-regu + least square = ridge regression
- $f(w) = \frac{1}{2}\|Xw-y\|^2 + \frac{\lambda}{2}\|w\|^2$ - solved by a normal equation (L16, p56; see the sketch below) or gradient descent (L16, p58; cost is still O(ndt))
- advantage - L16, p68 bonus slides
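A minimal sketch of the ridge (L2-regularized least squares) closed-form solution:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: minimize 0.5*||Xw - y||^2 + (lam/2)*||w||^2.

    Closed-form solution of the regularized normal equations
    (X^T X + lam*I) w = X^T y; the added lam*I makes the system
    well-posed, so the solution is always unique."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```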
- L1-regularization
- disadvantages of the “search and score” method:
- the L0-norm is hard to minimize
- even forward selection still requires scoring O(d^2) candidate models
- we can instead use L1-regularization, which combines feature selection and regularization
- $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_1$ - also called “LASSO” regularization
- faster than “search and score”
- compared with L2-regularization:
- the solution is not unique
- no closed-form solution, so it must be solved iteratively
- Sparsity
- L1-regularization tends to set some weights $w_j$ exactly to 0
- Loss Vs. regularization
- the L1 loss function is robust to (cares less about) outliers
- L1-regularization is robust to (cares less about) irrelevant features
- different regularizations
- L0-regularization: for selecting features
- L2-regularization: avoid overfitting
- L1-regularization: both
- L17, p31
- Standardization
- not required: decision tree, naive Bayes, least squares
- required: KNN, regularized least squares
- $\mu_{j} = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$
- $\sigma_{j} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_{ij} - \mu_{j})^2}$
- $X_{ij} \leftarrow \frac{X_{ij} - \mu_{j}}{\sigma_{j}}$ (see the sketch below)
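A sketch of standardization; note that the test set is transformed with the training-set statistics:

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize each column: subtract the training mean and divide by the
    training standard deviation; apply the SAME statistics to the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```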
Radial Basis Functions (RBF) - L17
- besides polynomial bases, we can use other bases
- Non-parametric bases
- size of basis grows with ‘n’
- can model complicated functions
- Gaussian RBF (L17, p7)
- a sum of “bumps”, one centred at each training example (see the sketch below)
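A sketch of building Gaussian RBF features; the bump width `sigma` is a hyperparameter to tune:

```python
import numpy as np

def rbf_basis(X, X_train, sigma=1.0):
    """Gaussian RBF features: Z[i, j] = exp(-||x_i - x_train_j||^2 / (2 sigma^2)).

    Non-parametric: one basis function ('bump') per training example,
    so the size of the basis grows with n."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(X_train**2, axis=1)[None, :]
                - 2 * X @ X_train.T)
    return np.exp(-sq_dists / (2 * sigma**2))
```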
Linear Classifiers (classification with regression) - L18, 19
- Loss functions
- we shouldn’t penalize predictions that are “too right”
- Binary class
- hinge loss
- $f(w) = \sum_{i=1}^n\max\{0, 1-y_iw^Tx_i\}$
- add L2-reg = SVM
- logistic loss
- $f(w) = \sum_{i=1}^n\log(1+\exp(-y_iw^Tx_i))$
- probabilistic classifier
- map $w^Tx_i$ to [0, 1]
- binary case: the sigmoid function (see the sketch below)
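A sketch of the binary hinge loss, logistic loss, and sigmoid probabilities, assuming labels y in {-1, +1}:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Sum of hinge losses max(0, 1 - y_i * w^T x_i); adding L2-regularization gives the SVM objective."""
    return np.sum(np.maximum(0, 1 - y * (X @ w)))

def logistic_loss(w, X, y):
    """Sum of logistic losses log(1 + exp(-y_i * w^T x_i))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def predict_proba(w, X):
    """Sigmoid maps w^T x_i to (0, 1), interpreted as p(y_i = +1 | x_i)."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```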
- Multi-class
- one-vs-all classification
- training: for each class “c”, train a binary classifier to predict whether example is “c”
- prediction: get a score for each class, predict c with the highest score
- Multi-class hinge loss
- $\sum_{c\neq y_i} \max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}$ or
- $\max_{c\neq y_i} \{\max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}\}$
- add L2-reg = multi-class SVM
- Multi-class softmax loss ( L20, p5)
- Multi-class Logistic regression: $f(w) = \sum_{i=1}^{n}[-w_{y_i}^Tx_i + \log (\sum_{c=1}^k \exp(w_c^Tx_i))] + \frac{\lambda}{2} \sum_{c=1}^k \sum_{j=1}^d w_{cj}^2$
- probabilities: softmax function (see the sketch below)
- unlike one-vs-all, all the $w_c$ are trained simultaneously
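A sketch of the softmax mapping from per-class scores to probabilities; the max-subtraction shift is an implementation detail for numerical stability, not something from the notes:

```python
import numpy as np

def softmax_probs(W, X):
    """Softmax maps the k scores w_c^T x_i to a probability distribution over classes:
    p(y_i = c) = exp(w_c^T x_i) / sum_c' exp(w_c'^T x_i).

    W is k x d (one row of weights per class), X is n x d."""
    scores = X @ W.T                              # n x k matrix of w_c^T x_i
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)
```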
L20 Feature engineering
L21 Kernel Trick
- For linear regression, what we do:
- from X, form Z
- compute $V = (Z^TZ+\lambda I)^{-1}(Z^Ty)$
- For kernel regression, what we do:
- form inner product K from X
- compute $U = (K+\lambda I)^{-1}y$
- Polynomial kernels
- Gaussian RBF kernels (see the sketch below)
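A sketch of kernel (ridge) regression with a Gaussian RBF kernel, following the $U = (K+\lambda I)^{-1}y$ formula above; the function names are hypothetical:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2 * A @ B.T)
    return np.exp(-sq / (2 * sigma**2))

def kernel_regression_fit(X, y, lam=1.0, sigma=1.0):
    """Kernel ridge regression: u = (K + lam*I)^{-1} y with K = k(X, X)."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_regression_predict(X_test, X_train, u, sigma=1.0):
    """Predictions are K_test @ u, where K_test = k(X_test, X_train)."""
    return rbf_kernel(X_test, X_train, sigma) @ u
```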
L22 Stochastic Gradient
L23,24 Boosting, MLE, MAP
MLE maximizes the likelihood; MAP also includes a prior (regularization corresponds to the prior)
Part 4 Unsupervised Learning based on Linear Models (Latent-Factor Models)
latent-factor model uses:
- L25, 26 PCA
- Given X and k, output two matrices: Z (n×k) and W (k×d)
- each row of W - $w_c$: the factor/PC
- each row of Z - $z_i$: weights / features/ new coordinates
- $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2$ - REMEMBER to center the data (see the sketch below)
- variance remaining
- non-uniqueness of PCA
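One standard way to obtain a PCA solution is via the SVD of the centered data; a minimal sketch:

```python
import numpy as np

def pca(X, k):
    """PCA on centered data: returns Z (n x k) and W (k x d) minimizing
    0.5 * ||ZW - X||_F^2, computed from the SVD of the centered X."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # REMEMBER to center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k]                               # top-k principal components (factors)
    Z = Xc @ W.T                             # new coordinates of each example
    return Z, W, mu                          # mu is needed to center new data
```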
- L27 Sparse Matrix Factorization, Robust PCA
- NMF leads to sparsity
- Robust PCA - use the absolute error instead of the squared error
- Regularized PCA
- $L_2$-regularized PCA: $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2 + \frac{\lambda_1}{2}\|Z\|_F^2 + \frac{\lambda_2}{2}\|W\|_F^2$
- $L_1$-regularized PCA: use the L1-norm instead
- also leads to sparsity
- advantages and disadvantages over NMF
- L28 Collaborative filtering
Loss Vs. Regularization
- L0-norm: number of non-zero values
- L1-norm: sum of absolute values
- L2-norm: square root of the sum of squared values
- $L_{\infty}$ norm: largest absolute value (see the sketch below)
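A quick numeric illustration of the four norms on a toy vector:

```python
import numpy as np

w = np.array([0.0, -3.0, 4.0])
l0 = np.count_nonzero(w)        # L0 "norm": number of non-zeros -> 2
l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values -> 7
l2 = np.sqrt(np.sum(w**2))      # L2 norm: sqrt of sum of squares -> 5
linf = np.max(np.abs(w))        # L_inf norm: largest absolute value -> 4
print(l0, l1, l2, linf)
```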
Regularizations
- L0-reg: feature selection
- L1-reg(LASSO): regularization and feature selection
- L2-reg: avoid overfitting; make least squares solution unique
- $L_{\infty}$ reg: make variables have the same magnitude
Losses
- least squares loss
- Gaussian likelihood
- L1-loss (absolute loss)
- robust regression
- robust to outliers
- huber loss (smooth approximation)
- $L_{\infty}$ loss
- max of the absolute errors
- brittle regression
- cares about getting outliers right
- 0-1 loss
- counts whether each classification is wrong (1) or correct (0)
- hinge loss
- upper bound on 0-1 loss
- logistic loss
- a smooth, convex approximation to the 0-1 loss
- multi-class hinge loss
- hinge loss extended to multiple classes
- softmax loss
- logistic loss extended to multiple classes
- trains all $w_c$ simultaneously for multi-class classification