20 Apr

Introduction to K-means in R

Quoting Wikipedia:

“k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into `k` clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Description: Given a set of observations `(x1, x2, ..., xn)`, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into `k (<= n)` sets `S = {S1, S2, ..., Sk}` so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

`argmin_S Σ_{i=1}^{k} Σ_{x ∈ Si} ||x − μi||^2`

where `μi` is the mean of points in `Si`.”

This section will show the very basics of the K-means function in R using an artificial dataset. The purpose of this post is to teach the basics of the function and its calls in R.

**Setting up an artificial dataset**

First, we are going to simulate a dataset using the `rnorm` function to generate data entries:

- Population `x` of length `100` with a mean of `1` and a standard deviation of `0.5`
- Population `y` of length `100` with a mean of `5` and a standard deviation of `1.0`
- Population `z` of length `100` with a mean of `4` and a standard deviation of `0.4`
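A minimal sketch of this simulation is below; the seed value and object names (`df`, `value`, `source`) are my own assumptions, not taken from the original post:

```r
set.seed(42)  # assumed seed, for reproducibility only
# Three populations as described above
x <- rnorm(100, mean = 1, sd = 0.5)
y <- rnorm(100, mean = 5, sd = 1.0)
z <- rnorm(100, mean = 4, sd = 0.4)
# Combine into one data frame, keeping track of the source population
df <- data.frame(value  = c(x, y, z),
                 source = rep(c("x", "y", "z"), each = 100))
str(df)
```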

**Plot the distributions (omitting population source)**
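Since the original figure is not reproduced here, a sketch of such a plot might look like the following (seed and object names are assumptions):

```r
set.seed(42)  # assumed seed
# Pool all 300 observations, ignoring which population each came from
values <- c(rnorm(100, mean = 1, sd = 0.5),
            rnorm(100, mean = 5, sd = 1.0),
            rnorm(100, mean = 4, sd = 0.4))
d <- density(values)
plot(d, main = "All observations (population source omitted)", xlab = "value")
```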

**Plot the distributions (considering population source)**
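One way to sketch this plot is to overlay one density curve per population; the colors and axis limits below are my own choices:

```r
set.seed(42)  # assumed seed
x <- rnorm(100, mean = 1, sd = 0.5)
y <- rnorm(100, mean = 5, sd = 1.0)
z <- rnorm(100, mean = 4, sd = 0.4)
# One density curve per population, overlaid on the same axes
plot(density(x), col = "red", xlim = range(x, y, z), ylim = c(0, 1.2),
     main = "Distributions by population source", xlab = "value")
lines(density(y), col = "blue")
lines(density(z), col = "darkgreen")
legend("topright", legend = c("x", "y", "z"),
       col = c("red", "blue", "darkgreen"), lty = 1)
```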

We can note that the most dispersed population is `y`, which overlaps population `z` since their mean values are similar.

**Plot the dataset as a scatter-plot**
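A sketch of such a scatter-plot, plotting each observation's value against its index and coloring by the true population (my own layout choice, since the original figure is not shown):

```r
set.seed(42)  # assumed seed
values <- c(rnorm(100, mean = 1, sd = 0.5),
            rnorm(100, mean = 5, sd = 1.0),
            rnorm(100, mean = 4, sd = 0.4))
source <- factor(rep(c("x", "y", "z"), each = 100))
# Observation index vs. value, coloured by true population
plot(seq_along(values), values, col = source, pch = 19,
     xlab = "observation index", ylab = "value")
legend("bottomright", legend = levels(source), col = 1:3, pch = 19)
```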

Again, we can note that the most dispersed population is `y`, which overlaps population `z` since their mean values are similar.

**Plot K-means results when k=2**
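A minimal sketch of running and plotting K-means with `k=2` on the simulated values (the `nstart` value and object names are my assumptions):

```r
set.seed(42)  # assumed seed
values <- c(rnorm(100, mean = 1, sd = 0.5),
            rnorm(100, mean = 5, sd = 1.0),
            rnorm(100, mean = 4, sd = 0.4))
# K-means with k = 2; nstart restarts from several random initial centers
km2 <- kmeans(values, centers = 2, nstart = 25)
plot(seq_along(values), values, col = km2$cluster, pch = 19,
     xlab = "observation index", ylab = "value",
     main = "K-means, k = 2")
```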

**Plot K-means results when k=3**

**Plot K-means results when k=4**
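The `k=3` and `k=4` cases follow the same pattern; a sketch looping over both (again with assumed seed and names):

```r
set.seed(42)  # assumed seed
values <- c(rnorm(100, mean = 1, sd = 0.5),
            rnorm(100, mean = 5, sd = 1.0),
            rnorm(100, mean = 4, sd = 0.4))
# Fit and plot K-means for k = 3 and k = 4
for (k in 3:4) {
  km <- kmeans(values, centers = k, nstart = 25)
  plot(seq_along(values), values, col = km$cluster, pch = 19,
       xlab = "observation index", ylab = "value",
       main = paste("K-means, k =", k))
}
```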

By looking at the last three figures, how can we tell which is the best clustering solution for our problem? We have `k=2`, `k=3`, and `k=4`, but which parameter is best? To decide, we can compute a “calibration plot” comparing the error value (the “within groups sum of squares”) against the `k` values.

**Plot calibration curve**
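One common way to sketch this curve is to record `tot.withinss` (the total within-groups sum of squares reported by `kmeans`) over a range of `k` values; the range `1:10` is my own assumption:

```r
set.seed(42)  # assumed seed
values <- c(rnorm(100, mean = 1, sd = 0.5),
            rnorm(100, mean = 5, sd = 1.0),
            rnorm(100, mean = 4, sd = 0.4))
# Total within-groups sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(values, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k (number of clusters)",
     ylab = "within groups sum of squares", main = "Calibration plot")
```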

Here we can see that the largest error values come from `k<3`, suggesting underfitting. Since we know that we have three different populations, we can naturally pick the parameter `k=3`. This decision is also supported by comparing the error values at `k=3` and `k=4`: there does not seem to be an interesting gain from the further reduction in error, so `k>=4` would probably represent overfitting of our dataset.

The code is available by clicking here

Tags: Computer Science, Machine Learning, R