We need to find the features that are important for predicting $y$.
- Association approach - for each feature $j$, compute the correlation between feature $x^j$ and $y$; however, this ignores interactions between variables (see the first sketch after this list)
- Regression Weight approach - fit $w$ using all features, take the features where $|w_j|$ is large
    - has a major problem with collinearity: with two copies of the same relevant feature, the weight can be split between them so neither $w_j$ looks large, and an irrelevant copy of a relevant feature can end up with a large weight (L15, p.13); see the second sketch after this list
- Search and Score
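As a concrete illustration of the association approach, here is a minimal sketch in Python; the function name `association_screen` and the `threshold` value are illustrative assumptions, not from the notes:

```python
import numpy as np

def association_screen(X, y, threshold=0.5):
    """For each feature j, compute corr(x^j, y) and keep features above a cutoff.
    The name and threshold are illustrative, not from the notes."""
    # Pearson correlation between each column of X and the target y
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.where(np.abs(corrs) > threshold)[0]  # indices of "relevant" features
```

Because each feature is scored in isolation, a feature that only helps in combination with others would be missed, which is exactly the interaction problem noted above.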
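And a small demo of the collinearity problem with regression weights, on assumed synthetic data: duplicating the single relevant feature makes least squares split its weight between the copies, so thresholding $|w_j|$ misjudges both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.1, size=n)   # y depends only on x, with weight 2

# Two exact copies of the one relevant feature
X = np.column_stack([x, x])

# lstsq returns the minimum-norm solution for this rank-deficient problem:
# the true weight 2 is split evenly, w ~ [1, 1], so neither w_j stands out
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```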
Search and Score
- Define a score function $f(S)$ measuring the quality of a feature set $S$
- Search for the set of features $S$ with the best score
Score Function
The score shouldn't be training error - training error only goes down as you add features, so it would always favour using all of them.
Validation error? Yes! (a sketch of such an $f(S)$ follows)
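A minimal sketch of a validation-error score function, assuming a train/validation split and least-squares fitting; the name `f` and the convention of predicting 0 for the empty set are assumptions:

```python
import numpy as np

def f(S, X_train, y_train, X_val, y_val):
    """f(S): fit least squares on the features in S, score by validation error."""
    S = list(S)
    if not S:                        # empty set: predict 0 for every example
        return 0.5 * np.sum(y_val ** 2)
    # Fit w on the training rows, restricted to the selected columns
    w, *_ = np.linalg.lstsq(X_train[:, S], y_train, rcond=None)
    # Squared error on the validation rows
    return 0.5 * np.sum((X_val[:, S] @ w - y_val) ** 2)
```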
“Number of Features” Penalties
$\text{score}(S) = \frac{1}{2}\sum_{i=1}^n(w_S^Tx_{iS} - y_i)^2 + \text{size}(S)$
We can use the L0-norm (the number of nonzero weights) in place of $\text{size}(S)$: $\text{score}(S) = \frac{1}{2}\sum_{i=1}^n(w_S^Tx_{iS} - y_i)^2 + \lambda\|w\|_0$
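A sketch of this penalized score, assuming $w_S$ is fit by least squares on the selected columns; the name `score_l0` and the default `lam` are illustrative:

```python
import numpy as np

def score_l0(S, X, y, lam=1.0):
    """score(S) = 1/2 * sum_i (w_S^T x_iS - y_i)^2 + lam * |S|."""
    S = list(S)
    if not S:
        return 0.5 * np.sum(y ** 2)  # empty set: residual is y itself
    w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    # For w fit only on S, ||w||_0 equals the number of selected features
    # (barring exact-zero fitted weights)
    return 0.5 * np.sum((X[:, S] @ w - y) ** 2) + lam * len(S)
```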
How to handle the $2^d$ possible choices for $S$ - Forward Selection (see the sketch after these steps)
- Start with the empty set $S = \{\}$
- Compute the score of $S \cup \{j\}$ for each feature $j$ not in $S$
- Find the feature $j$ whose addition gives the best score
- Check if adding it improves on the current best score
- If yes, add it to $S$ and go back to the second step, else stop.
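Putting the steps together, a sketch of forward selection that reuses the (assumed) `score_l0` from the previous sketch:

```python
def forward_selection(X, y, lam=1.0):
    """Greedily add the feature that most improves score_l0; stop when none helps."""
    S = set()
    best = score_l0(S, X, y, lam)            # score of the empty set
    while True:
        candidates = [j for j in range(X.shape[1]) if j not in S]
        if not candidates:
            break
        # Score adding each remaining feature to the current set
        trial = {j: score_l0(S | {j}, X, y, lam) for j in candidates}
        j_best = min(trial, key=trial.get)
        if trial[j_best] < best:             # improvement: keep it and repeat
            S.add(j_best)
            best = trial[j_best]
        else:                                # no single addition helps: stop
            break
    return S
```

Greedy search checks at most $O(d^2)$ subsets instead of all $2^d$, at the cost of possibly missing the globally best $S$.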