CPSC 340 Course Summary

Posted by Luna on April 19, 2022

This is my summary of CPSC 340. The course has five main parts:

  • supervised learning - based on counting and distances
  • unsupervised learning - based on counting and distances
  • supervised learning - based on linear models
  • unsupervised learning - based on linear models
  • neural networks

Part 1 Supervised Learning

  • Cross Validation - L5
  • KNN
  • Naive Bayes
    • Laplace Smoothing

Part 2 Unsupervised Learning

Part 3 Supervised Learning based on Linear Models

Linear regression (L11, L12)

  • least squares
  • take partial derivatives to find the best w vector
  • can be solved by the normal equations (see the sketch below)
    • cost: O(d^3 + nd^2)
  • least squares has some issues
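
A minimal numpy sketch of the normal-equation solution; the data below is made up purely for illustration:

```python
import numpy as np

# Minimal least-squares sketch: solve the normal equations X^T X w = X^T y.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Forming X^T X and solving the system costs O(nd^2 + d^3), matching the cost above.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to [1.0, -2.0, 0.5]
```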

Non-linear regression (L12)

  • change of basis: polynomial (a large p might result in overfitting)
  • Gradient descent (L13)
    • the normal equations are slow when d is large
    • cost for t iterations: O(ndt) (see the sketch below)
    • how to determine whether a function is convex
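
A gradient-descent sketch for the (averaged) least-squares objective; the step size, iteration count, and data are illustrative choices, not values from the lectures:

```python
import numpy as np

# Gradient descent for least squares: each iteration costs O(nd),
# so t iterations cost O(ndt).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(d)
alpha = 0.1                        # fixed step size (illustrative)
for _ in range(1000):              # t iterations
    grad = X.T @ (X @ w - y) / n   # gradient of (1/2n)||Xw - y||^2
    w -= alpha * grad
print(w)                           # converges towards [1.0, -2.0, 0.5]
```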

Different loss functions (L14) - Robust regression, Brittle regression

  • Robust regression: focuses less on large errors (outliers)
    • L1-norm: $f(w) = \|Xw-y\|_1$
    • Smooth approximation: Huber loss (see the sketch after this list)
    • Non-convex losses can be very robust to outliers but are hard to minimize globally, so the L1-norm is the most robust convex loss function
  • Brittle regression: really cares about getting outliers right
    • minimize the worst-case error
    • very sensitive to outliers
    • $f(w) = \|Xw-y\|_{\infty}$
    • $\|r\|_{\infty} = \max_i\{|r_i|\}$
    • Smooth approximation: $\max_i\{z_i\} \approx \log(\sum_{i}\exp(z_i))$
  • Complexity penalties
    • information criteria: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + \lambda k$
      • k is the number of estimated parameters; for a degree-p polynomial basis, k = p + 1
      • $\lambda = 1$ gives AIC
    • penalize the polynomial degree p: $\text{score}(p) = \frac{1}{2}\|Zv-y\|^2 + (p+1)\lambda$
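
A small sketch of the two smooth approximations mentioned above; the residual values and the Huber threshold are made up for illustration:

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic for small residuals, linear (L1-like) for large ones."""
    small = np.abs(r) <= eps
    return np.where(small, 0.5 * r**2, eps * (np.abs(r) - 0.5 * eps))

def log_sum_exp(z):
    """Smooth approximation of max_i{z_i}, stabilized by subtracting the max."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

r = np.array([-3.0, 0.2, 5.0])
print(huber(r))                   # large residuals are penalized roughly linearly
print(log_sum_exp(r), np.max(r))  # the smooth max is close to (slightly above) the true max
```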

Feature selection (L15)

  • Select features that are relevant
  • Approach 1: association - calculate a weight for each feature individually
  • Approach 2: regression weight - calculate weights using all features, compare $w_j$ with a threshold
  • Approach 3: search and score (see the sketch after this list)
    • define a score function
      • it cannot be training error - we would end up selecting all features
      • validation error
        • problem #1: too many subsets to choose from
          • forward selection: start from the empty set S = {}; for each feature j not in S, compute the score of $S \cup \{j\}$; if the best of these improves on S, add that feature to S and repeat
        • problem #2: might include false positives
          • add a complexity penalty to avoid false positives
          • L0-norm: number of non-zero values - $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_0$
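
A forward-selection sketch that scores subsets with the L0-penalized objective above; the data, $\lambda$, and the stopping rule are illustrative:

```python
import numpy as np

# Forward selection with the score f(w) = (1/2)||Xw - y||^2 + lam * ||w||_0.
def l0_score(X, y, S, lam=1.0):
    if not S:
        return 0.5 * np.sum(y ** 2)
    Xs = X[:, sorted(S)]
    w = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)   # least squares on the selected features
    return 0.5 * np.sum((Xs @ w - y) ** 2) + lam * len(S)

def forward_selection(X, y, lam=1.0):
    S, best = set(), l0_score(X, y, set(), lam)
    improved = True
    while improved:
        improved = False
        scores = {j: l0_score(X, y, S | {j}, lam) for j in range(X.shape[1]) if j not in S}
        if scores:
            j = min(scores, key=scores.get)
            if scores[j] < best:               # only add a feature if the score improves
                best, S, improved = scores[j], S | {j}, True
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + 0.1 * rng.standard_normal(100)  # only features 0 and 2 matter
print(forward_selection(X, y))                                  # expected: {0, 2}
```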

L16,17 Regularization, Standardization

  • L2-regularization to avoid overfitting
    • L2-reg + least squares = ridge regression (see the sketch after this list)
    • $f(w) = \frac{1}{2}\|Xw-y\|^2 + \frac{\lambda}{2}\|w\|^2$
    • solved by the normal equations (L16, p56) or gradient descent (L16, p58; cost is still O(ndt))
    • advantages - L16, p68 bonus slides
  • L1-regularization
    • disadvantages of the “search and score” method:
      • the L0-norm is hard to minimize
      • even with forward selection there are still O(d^2) models to fit
    • we can use L1-regularization instead, which combines feature selection and regularization
      • $f(w) = \frac{1}{2}\|Xw-y\|^2 + \lambda \|w\|_1$
      • also called “LASSO” regularization
      • faster than “search and score”
    • compared with L2-reg:
      • the solution is not unique
      • has no closed form, so it must be solved iteratively
  • Sparsity
    • L1-regularization tends to set many $w_j$ exactly to 0
  • Loss vs. regularization
    • an L1 loss function is robust to (cares less about) outliers
    • L1-regularization is robust to (cares less about) irrelevant features
  • different regularizations
    • L0-regularization: for selecting features
    • L2-regularization: avoids overfitting
    • L1-regularization: both
    • L17, p31
  • Standardization
    • not required: decision trees, naive Bayes, least squares
    • required: KNN, regularized least squares
    • mean of feature j: $\mu_{j} = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$
    • standard deviation of feature j: $\theta_{j} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_{ij} - \mu_{j})^2}$
    • replace each entry: $X_{ij} \leftarrow \frac{X_{ij} - \mu_{j}}{\theta_{j}}$
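
A minimal ridge-regression sketch on standardized features, combining the standardization formulas and the L2-regularized closed form above; the data and $\lambda$ are illustrative:

```python
import numpy as np

# Ridge regression (L2-regularized least squares) on standardized features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])  # columns on different scales
y = X @ np.array([1.0, 0.1, 5.0, -0.2]) + 0.1 * rng.standard_normal(100)

# Standardize: subtract each column's mean and divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

# Closed form: solve (Z^T Z + lambda * I) w = Z^T y.
lam = 1.0
w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(w)
```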

Radial Basis Functions (RBF) - L17

  • besides polynomial bases, we can use other bases
  • Non-parametric bases
    • the size of the basis grows with ‘n’
    • can model complicated functions
  • Gaussian RBF (L17, p7)
    • a sum of bumps (see the sketch below)
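
A sketch of building a Gaussian RBF basis matrix, with one bump centred on each training example; the width sigma is an illustrative choice:

```python
import numpy as np

def gaussian_rbf_basis(Xtest, Xtrain, sigma=1.0):
    """Return the (n_test, n_train) matrix of Gaussian RBF features."""
    # squared distances between every test and training example
    d2 = ((Xtest[:, None, :] - Xtrain[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

Xtrain = np.linspace(0, 5, 20).reshape(-1, 1)
Z = gaussian_rbf_basis(Xtrain, Xtrain)
print(Z.shape)  # (20, 20): the basis is non-parametric, one function per training example
```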

Regression Classifier - L18,19

  • Loss functions
    • we shouldn’t penalize predictions that are “too right”
    • Binary classification
      • hinge loss
        • $f(w) = \sum_{i=1}^n\max\{0, 1-y_iw^Tx_i\}$
        • add L2-reg = SVM
      • logistic loss (see the sketch after this list)
        • $f(w) = \sum_{i=1}^n\log(1+\exp(-y_iw^Tx_i))$
      • probabilistic classifier
        • map $w^Tx_i$ to [0, 1]
        • binary class - sigmoid function
    • Multi-class
      • one-vs-all classification
        • training: for each class “c”, train a binary classifier to predict whether an example is in class “c”
        • prediction: compute a score for each class, predict the class c with the highest score
      • Multi-class hinge loss
        • $\sum_{c\neq y_i} \max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}$ or
        • $\max_{c\neq y_i}\{\max\{0, 1- w_{y_i}^Tx_i + w_c^Tx_i\}\}$
        • add L2-reg = multi-class SVM
      • Multi-class softmax loss (L20, p5)
        • Multi-class logistic regression: $f(w) = \sum_{i=1}^{n}[-w_{y_i}^Tx_i + \log (\sum_{c=1}^k \exp(w_c^Tx_i))] + \frac{\lambda}{2} \sum_{c=1}^k \sum_{j=1}^d w_{cj}^2$
      • probabilities: softmax function
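
A minimal binary logistic-regression sketch: gradient descent on the logistic loss, then the sigmoid to turn $w^Tx_i$ into a probability. The data, step size, and iteration count are illustrative:

```python
import numpy as np

# Binary logistic regression: minimize (1/n) * sum_i log(1 + exp(-y_i w^T x_i)).
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(n))  # labels in {-1, +1}

w = np.zeros(d)
alpha = 0.5
for _ in range(500):
    yXw = y * (X @ w)
    grad = -X.T @ (y / (1 + np.exp(yXw))) / n   # gradient of the averaged logistic loss
    w -= alpha * grad

probs = 1 / (1 + np.exp(-(X @ w)))  # sigmoid: P(y_i = +1 | x_i)
print(w, probs[:5])
```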

L20 Feature engineering

L21 Kernel Trick

  • For linear regression, what we do:
    • from X, form Z
    • compute $V = (Z^TZ+\lambda I)^{-1}(Z^Ty)$
  • For kernel regression, what we do (see the sketch after this list):
    • form the kernel (inner product) matrix K from X
    • compute $U = (K+\lambda I)^{-1}y$
  • Polynomial kernels
  • Gaussian RBF kernels
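
A kernel-regression sketch with a Gaussian RBF kernel, following the two steps above; sigma, lambda, and the toy data are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# Step 1: form K from X.  Step 2: solve (K + lambda * I) u = y.
lam = 0.1
K = rbf_kernel(X, X)
u = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at test points with the test-vs-train kernel.
Xtest = np.linspace(-3, 3, 5).reshape(-1, 1)
yhat = rbf_kernel(Xtest, X) @ u
print(yhat)
```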

L22 Stochastic Gradient

L23,24 Boosting, MLE, MAP

MLE maximizes the likelihood; MAP also includes a prior (regularizers correspond to priors).

Part 4 Unsupervised Learning based on Linear Models (Latent-Factor Models)

Latent-factor models:

  • L25, 26 PCA
    • Given X and k, output two matrices: Z (n×k) and W (k×d)
    • each row of W - $w_c$: a factor / principal component (PC)
    • each row of Z - $z_i$: weights / features / new coordinates
    • $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2$
    • REMEMBER to center the data (see the sketch after this list)
    • variance remaining
    • non-uniqueness of PCA
  • L27 Sparse Matrix Factorization, Robust PCA
    • NMF leads to sparsity
    • Robust PCA - use the absolute error instead of the squared error
    • Regularized PCA
      • $L_2$-regularized PCA: $f(W,Z) = \frac{1}{2}\|ZW-X\|_F^2 + \frac{\lambda_1}{2}\|Z\|_F^2 + \frac{\lambda_2}{2}\|W\|_F^2$
      • $L_1$-regularized PCA: use the L1-norm
      • also leads to sparsity
      • advantages and disadvantages compared with NMF
  • L28 Collaborative filtering
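
A PCA sketch via the SVD of the centred data matrix; k and the toy data are illustrative:

```python
import numpy as np

# PCA: rows of W are the principal components (factors), Z holds the new coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 5))  # rank-2 data in 5 dimensions

mu = X.mean(axis=0)
Xc = X - mu                      # remember to center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
W = Vt[:k]                       # (k, d) principal components
Z = Xc @ W.T                     # (n, k) new coordinates
variance_remaining = 1 - (s[:k]**2).sum() / (s**2).sum()
print(Z.shape, W.shape, variance_remaining)  # remaining variance is ~0 for this rank-2 data
```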

Loss Vs. Regularization

  • L0-norm: number of non-zero values
  • L1-norm: sum of absolute values
  • L2-norm: square root of the sum of squared values
  • $L_{\infty}$ norm: largest absolute value (see the sketch below)
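
A quick sketch of these norms on a toy vector:

```python
import numpy as np

w = np.array([0.0, -2.0, 3.0])
print(np.count_nonzero(w))    # L0 "norm": number of non-zeros -> 2
print(np.abs(w).sum())        # L1 norm: sum of absolute values -> 5.0
print(np.sqrt((w**2).sum()))  # L2 norm: sqrt of the sum of squares -> ~3.61
print(np.abs(w).max())        # L-infinity norm: largest absolute value -> 3.0
```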

Regularizations

  • L0-reg: feature selection
  • L1-reg(LASSO): regularization and feature selection
  • L2-reg: avoid overfitting; make least squares solution unique
  • $L_{\infty}$ reg: make variables have the same magnitude

Losses

  • least squares loss
    • Gaussian likelihood
  • L1-loss (absolute loss)
    • robust regression
    • robust to outliers
    • huber loss (smooth approximation)
  • $L_{\infty}$ loss
    • max of the absolute errors
    • brittle regression
    • cares a lot about outliers
  • 0-1 loss
    • correct classification
  • hinge loss
    • upper bound on 0-1 loss
  • logistic loss
    • smooth convex approximation to the 0-1 loss
  • multi-class hinge loss
    • hinge loss for multi-class problems
  • softmax loss
    • logistic loss for multi-class problems
    • trains all $w_c$ simultaneously for multi-class classification