Data Scientist - Benjamin Tovar

Introduction to text mining in R





I was browsing some machine learning challenges on HackerRank and found one on document classification. The source is over here. I downloaded the dataset and decided to do my own text mining analysis instead. The dataset consists of 5485 documents distributed among 8 different classes, perfect for computing word clouds and cool ggplots in R.


First of all, I decided to plot the frequency of each document class in the dataset. We observe that class_1 and class_2 are over-represented, with more than 1500 samples each, while the remaining classes are under-represented, with fewer than 500 samples each.
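A minimal sketch of the class count behind such a plot, assuming the class labels sit in a character vector with one entry per document. The vector `labels` below is a hypothetical toy example, not the real dataset, and the chart uses base graphics rather than ggplot.

```r
# Hypothetical class labels, one per document (the real dataset has 5485 of them).
labels <- c("class_1", "class_2", "class_1", "class_5", "class_2", "class_1")

# Count documents per class and sort from most to least frequent.
class_counts <- sort(table(labels), decreasing = TRUE)
print(class_counts)

# Draw the counts as a bar chart with base graphics.
barplot(class_counts, las = 2, ylab = "Number of documents",
        main = "Documents per class")
```

Sorting the counts first makes the imbalance between classes easy to see at a glance.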

[Figure text_mining_intro_1: number of documents per class]

Now, the “meat” of the plate. Plot the first 500 specific words per class in a comparison cloud (see wordcloud package documentation).

[Figure text_mining_intro_2: comparison cloud of the first 500 specific words per class]

Cool, right? Now plot the first 2000 specific words per class in a comparison cloud. Looking at the cloud, we can make some guesses about the classes, because the challenge did not include a description of the input data (for example, what does each class mean?). First, class_1 seems to be related to money topics because it highlights words such as profit, share, sales and so on. Another notable guess is that class_5 contains documents about agricultural topics, since it highlights words such as grain, farmers, USDA and so on.
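A comparison cloud needs a matrix with words as rows and classes as columns. The sketch below builds one from a hypothetical toy corpus (`docs` and `labels` are made up for illustration) and computes a per-class "specificity" score: the relative frequency of a word in a class minus its average across classes. This is my understanding of how `comparison.cloud` picks which class a word belongs to, so treat it as an assumption rather than a description of the package internals. The cloud itself is drawn only if the wordcloud package is installed.

```r
# Hypothetical toy corpus with one class label per document.
docs <- c("profit share sales rise", "sales profit fall",
          "grain farmers usda report", "farmers grain harvest sales")
labels <- c("class_1", "class_1", "class_5", "class_5")

# Tokenize on whitespace and count each word per class.
tokens <- strsplit(tolower(docs), "\\s+")
words <- unlist(tokens)
word_class <- rep(labels, lengths(tokens))
term_matrix <- unclass(table(words, word_class))  # rows = words, columns = classes

# Relative frequency of each word within a class, minus its mean across classes.
rates <- sweep(term_matrix, 2, colSums(term_matrix), "/")
specificity <- rates - rowMeans(rates)
print(round(specificity, 3))

# Draw the cloud only if the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(term_matrix, max.words = 500)
}
```

Words with a large positive score in a column are the ones that stand out for that class, which is why profit lands with class_1 and grain with class_5 in this toy example.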

[Figure text_mining_intro_3: comparison cloud of the first 2000 specific words per class]

To change the direction of the analysis a little bit, let's plot a commonality cloud, which is a cloud of words shared across documents (see wordcloud package documentation).
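To make "shared across documents" concrete, here is a small sketch that scores each word by its lowest relative frequency over all classes, so a word missing from any class scores zero. As far as I know this matches the default measure of `commonality.cloud`, but that is an assumption on my part, and the counts in `term_matrix` are hypothetical.

```r
# Hypothetical word-by-class counts (rows = words, columns = classes).
term_matrix <- matrix(c(20, 5, 0,
                        10, 8, 6),
                      nrow = 3,
                      dimnames = list(c("said", "profit", "grain"),
                                      c("class_1", "class_5")))

# Relative frequency of each word within its class.
rates <- sweep(term_matrix, 2, colSums(term_matrix), "/")

# A word is only as "common" as its rarest appearance across classes.
commonality <- apply(rates, 1, min)
shared <- sort(commonality[commonality > 0], decreasing = TRUE)
print(shared)

# Draw the cloud only if the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::commonality.cloud(term_matrix)
}
```

Here grain drops out because it never appears in class_1, while said and profit survive because both classes use them.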

[Figure text_mining_intro_4: commonality cloud of words shared across documents]

Finally, I decided to adjust each word frequency per class to remove the bias related to sample size (class_1 has 2840 samples while class_5 has 41 samples). The adjustment divides each column (a column represents a class, while rows represent words) by the maximum value found in that column (download code and data, link at the bottom of the post). A value of 1.0 marks the most frequent word in a class, and the remaining values are expressed as a fraction of that top word's frequency.
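The column-wise adjustment can be sketched in a couple of lines of base R. The counts in `term_matrix` are hypothetical and only there to show the arithmetic; this is not the code from the repository.

```r
# Hypothetical word-by-class counts (rows = words, columns = classes).
term_matrix <- matrix(c(400, 100, 0,
                        2, 10, 5),
                      nrow = 3,
                      dimnames = list(c("said", "profit", "grain"),
                                      c("class_1", "class_5")))

# Divide every column by its own maximum so the top word of each class is 1.0.
col_max <- apply(term_matrix, 2, max)
adjusted <- sweep(term_matrix, 2, col_max, "/")
print(adjusted)
```

After the adjustment, a large class and a small class sit on the same 0 to 1 scale, so their top words can be compared directly.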


Code and dataset

Code, full HP plots in PDF format (because word clouds are cool) and datasets are available on GitHub.



