Data Scientist - Benjamin Tovar

Introduction to K-means in R

20 Apr


Quoting Wikipedia:

“k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Description: Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||²

where μi is the mean of points in Si.”

 

Methods

This section shows the very basics of the kmeans function in R using an artificial dataset. The purpose of this post is to teach the basic function calls in R.

Setting an artificial dataset

First, we are going to simulate a dataset using the rnorm function to generate three populations:

Population x of length 100 with a mean of 1 and a standard deviation of 0.5
Population y of length 100 with a mean of 5 and a standard deviation of 1.0
Population z of length 100 with a mean of 4 and a standard deviation of 0.4
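The three populations above can be simulated as follows. This is a minimal sketch: the seed value and the dataset/source variable names are my own choices, not taken from the original code.

```r
# Fix the seed so the simulation is reproducible (assumption: any seed works)
set.seed(42)

# Three populations of 100 values each, matching the means and
# standard deviations listed above
x <- rnorm(100, mean = 1, sd = 0.5)
y <- rnorm(100, mean = 5, sd = 1.0)
z <- rnorm(100, mean = 4, sd = 0.4)

# Combine into a single data frame, keeping track of the population source
dataset <- data.frame(
  value  = c(x, y, z),
  source = factor(rep(c("x", "y", "z"), each = 100))
)
```

The source column is only used for plotting and for checking the clustering afterwards; k-means itself never sees it.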

Results

Plot the distributions (omitting population source)

[Figure: k_means_intro1]

Plot the distributions (considering population source)

[Figure: k_means_intro2]

We can note that the most dispersed population is y, which overlaps population z because their mean values are similar.

Plot the dataset as a scatter-plot

[Figure: k_means_intro3]

Again, we can see that the most dispersed population is y, which overlaps population z because their mean values are similar.
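A scatter plot like this one can be drawn with base R graphics. This sketch rebuilds the simulated data inline so it runs on its own; the seed and variable names are assumptions.

```r
# Rebuild the simulated dataset (assumed seed, not from the original post)
set.seed(42)
dataset <- data.frame(
  value  = c(rnorm(100, 1, 0.5), rnorm(100, 5, 1.0), rnorm(100, 4, 0.4)),
  source = factor(rep(c("x", "y", "z"), each = 100))
)

# Scatter plot of all 300 values, colored by the true population source
plot(dataset$value, col = dataset$source,
     xlab = "observation index", ylab = "value",
     main = "Simulated dataset by population")
legend("topleft", legend = levels(dataset$source), col = 1:3, pch = 1)
```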

Plot K-means results when k=2

[Figure: k_means_intro4]

Plot K-means results when k=3

[Figure: k_means_intro5]

Plot K-means results when k=4

[Figure: k_means_intro6]
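The three clustering runs above can be reproduced with the kmeans function. This is a sketch, not the post's original code: the seed and the nstart value are my assumptions (nstart restarts the algorithm from several random initializations and keeps the best result).

```r
# Rebuild the simulated values (assumed seed, not from the original post)
set.seed(42)
dataset <- data.frame(
  value = c(rnorm(100, 1, 0.5), rnorm(100, 5, 1.0), rnorm(100, 4, 0.4))
)

# Run k-means for k = 2, 3, 4 and plot the resulting cluster assignments
for (k in 2:4) {
  fit <- kmeans(dataset, centers = k, nstart = 25)
  plot(dataset$value, col = fit$cluster,
       xlab = "observation index", ylab = "value",
       main = paste("K-means with k =", k))
}
```

Each fit object also carries the cluster centers (fit$centers) and the within-cluster sum of squares (fit$tot.withinss), which is the error value used in the next section.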

Looking at the last three figures, how can we decide which is the best clustering solution for our problem? We have k=2, k=3, and k=4, but which parameter is best? To answer this, we can compute a "calibration plot" comparing the error value (the within-groups sum of squares) across k values.

Plot calibration curve

[Figure: k_means_intro7]

Here we can see that the largest error values come from k<3, suggesting underfitting. Since we know the data contain three different populations, we can naturally pick k=3, and support that decision by comparing the error values at k=3 and k=4: there is no interesting gain from further minimizing the error. Therefore we can suggest that k>=4 would probably overfit our dataset.
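A calibration (elbow) curve like this can be computed by running kmeans over a range of k values and collecting tot.withinss. This sketch again rebuilds the data inline; the seed, the k range of 1 to 10, and nstart are my assumptions.

```r
# Rebuild the simulated values (assumed seed, not from the original post)
set.seed(42)
dataset <- data.frame(
  value = c(rnorm(100, 1, 0.5), rnorm(100, 5, 1.0), rnorm(100, 4, 0.4))
)

# Within-groups sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(dataset, centers = k, nstart = 25)$tot.withinss
})

# Calibration plot: error versus number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-groups sum of squares",
     main = "Calibration plot")
```

The "elbow" of the curve, where the error stops dropping sharply, marks the candidate k; here it sits at k=3, matching the three simulated populations.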

Code

The code is available by clicking here

 


