Introduction to text mining in R
I was browsing some machine learning challenges on HackerRank and found one about document classification. The source is over here. I downloaded the dataset and decided to do my own text mining analysis instead. The dataset consists of 5485 documents distributed among 8 different classes, perfect for computing
wordclouds and cool
First of all, I decided to plot the frequency of each document class in the dataset. We observe that
class_1 and class_2 are over-represented, with more than 1500 samples each, while the remaining classes are under-represented, with fewer than 500 samples each.
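A class frequency plot like this takes only a few lines of base R. The sketch below is a minimal illustration, not the code used for the post: the `labels` vector is a made-up stand-in for the real class labels.

```r
# Toy labels standing in for the real dataset (hypothetical values).
labels <- c("class_1", "class_1", "class_1", "class_2", "class_2", "class_5")

# Count documents per class and sort from most to least frequent.
class_counts <- sort(table(labels), decreasing = TRUE)

# Bar chart of the class frequencies.
barplot(class_counts, las = 2,
        ylab = "Number of documents",
        main = "Documents per class")
```

Sorting the counts first makes the imbalance between classes easy to see at a glance.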
Now for the “meat” of the dish: plot the first 500 specific words per class in a comparison cloud (see the wordcloud package documentation).
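For readers who want to try this, here is a rough sketch of how a comparison cloud can be fed. It assumes a words-by-classes count matrix, which is what `comparison.cloud` from the wordcloud package expects. The toy `docs` and `labels` are invented for illustration, and the tokenizer is deliberately naive (no stop word removal or stemming).

```r
# Toy documents and labels (hypothetical; the real data has 5485 documents).
docs <- c("profit rose as share sales grew",
          "net profit and sales per share",
          "grain exports help farmers",
          "farmers sold grain to usda")
labels <- c("class_1", "class_1", "class_5", "class_5")

# Naive tokenizer: lowercase, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# One row per (word, class) occurrence, then a words x classes count matrix.
pairs <- data.frame(word  = unlist(tokens),
                    class = rep(labels, lengths(tokens)))
term_matrix <- unclass(table(pairs$word, pairs$class))

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(term_matrix, max.words = 500,
                              random.order = FALSE, title.size = 1.5)
}
```

The `max.words` argument is the knob that controls how many words end up in the plot.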
Cool, right? Now plot the first 2000 specific words per class in a comparison cloud. Because the challenge did not include a description of the input data (for example, what each class means), we can make some assumptions by looking at the cloud. First, we can suggest that
class_1 is related to money topics because it highlights words such as
profit, share, sales and so on. Another remarkable assumption is that
class_5 includes documents related to agricultural topics because it highlights words such as
grain, farmers, USDA and so on.
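The same kind of guess can be checked without a plot by listing the most class-specific words as text. The sketch below scores each word by its share of a class minus its average share across all classes, which is my understanding of how the comparison cloud ranks words. Treat that as an assumption; the counts are invented.

```r
# Toy words x classes count matrix (hypothetical counts).
term_matrix <- matrix(c(30, 20, 10, 0,
                         0,  1, 10, 9),
                      ncol = 2,
                      dimnames = list(c("profit", "share", "said", "grain"),
                                      c("class_1", "class_5")))

# Turn counts into per-class proportions so class size does not dominate.
props <- sweep(term_matrix, 2, colSums(term_matrix), "/")

# Specificity: how far a word's share in a class is above its average share.
specificity <- props - rowMeans(props)

# Top 2 most specific words per class (one column per class).
top_words <- apply(specificity, 2,
                   function(s) names(sort(s, decreasing = TRUE))[1:2])
print(top_words)
```

A word like `said`, which is frequent everywhere, scores low in a class where it is no more common than average.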
To change the direction of the analysis a little, let's plot a commonality cloud, which is a cloud of words shared across documents (see the wordcloud package documentation).
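To make the idea of "shared" words concrete, here is a small sketch. It assumes the commonality score is the minimum per-class proportion of each word, which is how I understand the default in the wordcloud package. The matrix values are invented.

```r
# Toy words x classes count matrix (hypothetical counts).
term_matrix <- matrix(c(5, 3, 0, 2,
                        1, 0, 4, 2),
                      ncol = 2,
                      dimnames = list(c("said", "profit", "grain", "year"),
                                      c("class_1", "class_5")))

# Per-class proportions, then the smallest proportion of each word.
props <- sweep(term_matrix, 2, colSums(term_matrix), "/")
common <- apply(props, 1, min)

# Keep only words that occur in every class, most common first.
shared <- sort(common[common > 0], decreasing = TRUE)
print(shared)

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::commonality.cloud(term_matrix, max.words = 100,
                               random.order = FALSE)
}
```

A word missing from even one class gets a score of zero and drops out, so only vocabulary common to all classes survives.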
Finally, I decided to adjust each word frequency per document to remove the bias related to sample size (
class_1 has 2840 samples while
class_5 has 41 samples). The adjustment divides each column (a column represents a class, while rows represent words) by the maximum value found in that column (download code and data, link at the bottom of the post). Words with values equal to
1.0 are the most frequent words in their class, and the remaining values are expressed relative to that top word.
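The column-wise adjustment can be sketched in two lines of base R. This is an illustration with invented counts, not the code from the repository, and it assumes every class has at least one non-zero count (otherwise the division would produce NaN).

```r
# Toy words x classes count matrix (hypothetical counts).
term_matrix <- matrix(c(40, 10, 0,
                         2,  0, 8),
                      ncol = 2,
                      dimnames = list(c("profit", "share", "grain"),
                                      c("class_1", "class_5")))

# Maximum count in each column (class).
col_max <- apply(term_matrix, 2, max)

# Divide every column by its own maximum.
adjusted <- sweep(term_matrix, 2, col_max, "/")
print(adjusted)
```

Here `profit` becomes 1.0 in class_1 and `grain` becomes 1.0 in class_5, even though the raw counts differ by a factor of five.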
Code, full HP plots in PDF format (because word clouds are cool) and datasets are available on GitHub.