Clustering
- input: a set of examples described by features $x_i$
- output: an assignment of examples to groups
K-Means
The most popular clustering method
Input:
- the number of clusters $k$ (a hyper-parameter)
- an initial guess of the mean for each cluster
Algorithm:
- Assign $x_i$ to its closest mean
- Update the means
- Repeat until convergence (the assignments stop changing)
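A minimal NumPy sketch of this loop (the function name, the random initialization, and the convergence test are choices made for the example, not fixed by the notes):

```python
import numpy as np

def k_means(X, k, means=None, max_iter=100, seed=0):
    """Minimal K-Means sketch. X is an (n, d) array of examples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if means is None:
        # Initial guess: k randomly chosen examples serve as the starting means
        means = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: each x_i goes to its closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # shape (n, k)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the average of the examples assigned to it
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        # Repeat until convergence (the means stop moving)
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```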
K-Means Issues
- how to determine K
- each example is assigned to one and only one cluster
- may converge to a sub-optimal solution
- can only find convex clusters
KNN vs. K-Means
Property | KNN | K-Means |
---|---|---|
Task | Supervised Learning | Unsupervised Learning |
Meaning of “K” | Number of neighbours | Number of clusters |
Initialization | No training phase | Training is sensitive to initialization |
Model Complexity | More complicated for small K | Simpler for small K |
Parametric? | Non-parametric (stores all $n$ training examples) | Parametric (a fixed number of parameters: the $k$ means) |
Cost of K-Means
Each iteration assigns each of the $n$ examples to one of the $k$ clusters, and computing one distance costs $O(d)$, so the distance calculations cost $O(ndk)$ per iteration.
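As an illustration (with numbers chosen only for the example), $n = 10{,}000$ examples, $k = 10$ clusters, and $d = 100$ features give on the order of $10^7$ basic operations per iteration.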
Use of K-Means - Vector Quantization
Given a set of data, run K-Means, then replace each example with the mean of the cluster it belongs to (the $k$ means act as a compressed “codebook” for the data).
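A small sketch of the idea, reusing the hypothetical `k_means` function above; a classic instance is colour quantization, where every pixel is replaced by one of $k$ representative colours:

```python
# Vector quantization sketch: each example is replaced by its cluster's mean,
# so the whole dataset is summarized by only k "codebook" vectors.
pixels = np.random.rand(10_000, 3)        # e.g. RGB values of an image's pixels
means, labels = k_means(pixels, k=16)     # learn a 16-colour codebook
quantized = means[labels]                 # replace every pixel by its cluster mean
```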
Density-Based Clustering (DBSCAN)
- clusters are defined by “dense” regions; points in non-dense regions are not assigned to any cluster
- can be non-convex
- non-parametric (no fixed number of clusters k)
DBSCAN Algorithm
Two hyper-parameters:
- Epsilon ($\epsilon$): the distance we use to decide whether another point is a “neighbour”
- MinNeighbours: the number of neighbours needed to say a region is dense
For each example $x_i$:
- If $x_i$ is already in a cluster, do nothing
- Else, test whether $x_i$ is a core point (at least MinNeighbours neighbours within $\epsilon$)
- if not, do nothing
- else, make a new cluster and call expand cluster function
Expand Cluster Function
- assign all of $x_i$'s neighbours to its cluster
- for each new “core” point, call “expand cluster” recursively
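A compact sketch of the whole procedure (names are illustrative; the recursive “expand cluster” step is written iteratively with a frontier list, which does the same thing):

```python
import numpy as np

def dbscan(X, eps, min_neighbours):
    """DBSCAN sketch. Returns a label per example; -1 means 'not assigned' (noise)."""
    n = X.shape[0]
    labels = np.full(n, -1)
    # Precompute each point's neighbourhood: all points within distance eps
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1:                       # already in a cluster: do nothing
            continue
        if len(neighbours[i]) < min_neighbours:   # not a core point: do nothing
            continue
        labels[i] = cluster                       # core point: make a new cluster
        frontier = list(neighbours[i])            # "expand cluster", done iteratively
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster               # assign the neighbour to x_i's cluster
                if len(neighbours[j]) >= min_neighbours:
                    frontier.extend(neighbours[j])  # j is also a core point: keep expanding
        cluster += 1
    return labels
```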
Issues
- some points are not assigned to any cluster
- sensitive to the choice of $\epsilon$ and MinNeighbours (but not sensitive to initialization)
- finding cluster for a new point is expensive (need to compute distances to all core points)
- in high dimensions, need a lot of points to fill the space
Hierarchical Clustering
Sometimes different clusters have different densities:
we can use hierarchical clustering to produce a tree of clusterings instead of a single partition.
Agglomerative (Bottom-Up) Clustering
- Starts with each point in its own cluster
- Each step merges the two “closest” clusters
- Stop when one big cluster contains all the points
Cost: $O(n^3 d)$; each step costs $O(n^2 d)$, and in the worst case each step only merges one new point, so there can be up to $n$ steps.
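A naive sketch of this procedure, using single linkage (the closest pair of points across two clusters) as the notion of “closest”; the nesting of merge steps, cluster pairs, and point pairs is what produces the $O(n^3 d)$ cost. A library routine such as scipy.cluster.hierarchy.linkage would be used in practice.

```python
import numpy as np

def agglomerative(X):
    """Naive bottom-up clustering sketch; returns the sequence of merges (a dendrogram)."""
    clusters = {i: [i] for i in range(X.shape[0])}   # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:                          # stop with one big cluster
        best = None
        ids = list(clusters)
        for a_idx, a in enumerate(ids):               # scan all pairs of current clusters
            for b in ids[a_idx + 1:]:
                # single-linkage distance: closest pair of points across the two clusters
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters[b])               # merge the two closest clusters
        del clusters[b]
        merges.append((a, b, d))
    return merges
```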