Spam comment analysis in R
Imagine logging into your blog and finding more than a hundred spam messages. Not cool! I am not letting the spammers win, so I decided to crack some patterns and try to learn something about these annoying little bots.
For this post, I am performing a spam comment analysis in R of 162 comments posted on my blog from August 4th to October 17th. I wish there were more, but it looks like the rest had already been deleted from the database. Additionally, I am including a CSV dump and the code (bottom section).
First things first: check which features are shared by more than two spam messages (for example, ip_country_name has 8 distinct values, indicating that all 162 messages come from 8 unique countries).
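A minimal sketch of this first check in base R. Only ip_country_name and comment_author are column names from the post; the toy rows below are made up, and the real dump on GitHub may be laid out differently.

```r
# Toy stand-in for the comment dump (values are invented for illustration).
spam <- data.frame(
  ip_country_name = c("United States", "France", "United States",
                      "Germany", "United States"),
  comment_author  = c("www.google.com", "free porn", "www.google.com",
                      "someone", "www.google.com"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column.
n_unique <- sapply(spam, function(x) length(unique(x)))

# Values that occur in more than two messages, per column.
shared <- lapply(spam, function(x) {
  counts <- table(x)
  names(counts[counts > 2])
})

print(n_unique)
print(shared)
```

A column with few distinct values relative to the number of messages is a good candidate for the grouped comparisons that follow.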
Next, compare the frequency of each country (each message has a country of origin) and the average number of hours between comments. The majority of messages come from the US, followed by France. Additionally, France has a mean of ~8 hours between messages, winning first place in the fastest spammer championship. For this analysis I excluded groups with fewer than 3 events.
Same analysis as above, this time by IP address: most messages from the same IP arrived within about 5 hours of each other.
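One way to get both numbers (message count and mean gap in hours) per group is sketched below. The comment_date column name, the timestamps and the helper names mean_gap_hours and summarise_gaps are assumptions for illustration, not taken from the original code.

```r
# Toy data: the dates are invented; comment_date is an assumed column name.
spam <- data.frame(
  ip_country_name = c("France", "France", "France",
                      "United States", "United States", "United States",
                      "Germany"),
  comment_date = as.POSIXct(c(
    "2013-08-04 00:00:00", "2013-08-04 08:00:00", "2013-08-04 16:00:00",
    "2013-08-05 00:00:00", "2013-08-06 00:00:00", "2013-08-07 00:00:00",
    "2013-08-08 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Mean number of hours between consecutive comments (hypothetical helper).
mean_gap_hours <- function(times) {
  times <- sort(times)
  if (length(times) < 2) return(NA_real_)
  mean(as.numeric(diff(times), units = "hours"))
}

# Count and mean gap per group, dropping groups with fewer than min_events.
summarise_gaps <- function(data, group_col, time_col = "comment_date",
                           min_events = 3) {
  groups <- split(data[[time_col]], data[[group_col]])
  out <- data.frame(
    group      = names(groups),
    n          = vapply(groups, length, integer(1)),
    mean_hours = vapply(groups, mean_gap_hours, numeric(1)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
  out <- out[out$n >= min_events, ]
  out[order(-out$n), ]
}

by_country <- summarise_gaps(spam, "ip_country_name")
print(by_country)
```

The same function can be pointed at any other grouping column, such as the IP address or comment_author, by changing the group_col argument.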
Same analysis (again), this time by comment author.
comment_author: www.google.com spammed me more frequently and with less time between messages.
comment_author: free porn showed an average of ~60 hours between messages.
Now, compute a heatmap based on the contingency matrix of country versus email domain. Here I found that the US, France and Germany have a preference for gmail domains. Additionally, the US spamming accounts use other domains not present in comments from other countries.
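A rough sketch of how such a contingency matrix and heatmap could be built with base R. The comment_author_email column name and the addresses are assumptions; the post seems to use the bare domain name (gmail), while this sketch keeps the full domain.

```r
# Toy data: addresses are invented; comment_author_email is an assumed name.
spam <- data.frame(
  ip_country_name = c("United States", "United States", "France",
                      "Germany", "United States"),
  comment_author_email = c("a@gmail.com", "b@yahoo.com", "c@gmail.com",
                           "d@gmail.com", "e@gmail.com"),
  stringsAsFactors = FALSE
)

# Keep everything after the last "@" and lower-case it.
spam$email_domain <- tolower(sub("^.*@", "", spam$comment_author_email))

# Country x domain contingency table, then a plain integer matrix.
counts <- table(country = spam$ip_country_name, domain = spam$email_domain)
mat <- unclass(counts)

# Heatmap of raw counts; rows and columns are reordered by clustering.
heatmap(mat, scale = "none", margins = c(8, 8))
```

Setting scale = "none" keeps the colours tied to raw counts; row scaling would instead highlight each country's relative preference.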
While hunting down clusters of email domains, I found something interesting: two perfect clusters. The first one includes three domains:
care2 and (ironically)
nospammail, the second includes
Same cluster hunting season, but now looking at countries. As noticed in the other figures, the US, France and Germany behave similarly when it comes to email domain preferences. Also, two perfect clusters were found: one is formed between
Russian Federation and
China, the other one is formed between
Taiwan and the
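The cluster hunting on both domains and countries can be sketched with hierarchical clustering on the contingency matrix. The distance and linkage used here (Euclidean, complete linkage, the defaults of dist and hclust) are assumptions, and the matrix uses placeholder labels rather than the real counts.

```r
# Placeholder country x domain counts (not the real data).
mat <- matrix(
  c(5, 1, 0,
    4, 1, 0,
    0, 0, 2,
    0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    country = c("A", "B", "C", "D"),
    domain  = c("dom1", "dom2", "dom3")
  )
)

# Cluster countries by their domain profile (rows of the matrix).
country_tree <- hclust(dist(mat))

# Cluster domains by their country profile (columns of the matrix).
domain_tree <- hclust(dist(t(mat)))

# Cut the country tree into two groups and draw the dendrogram.
country_groups <- cutree(country_tree, k = 2)
print(country_groups)
plot(country_tree, main = "Countries clustered by email domain usage")
```

In this toy matrix, C and D have identical rows, so they merge at height zero, which is one way to read a "perfect" cluster.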
Now, computing the distinctive words in the spam corpus by country of origin. It looks like Italian spammers want me to play/buy/download video games, United States spammers want me to blog and improve my site, German spammers want me to relax on a beach vacation, and the French spammers also look like they are inviting me to play video games.
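One simple way to score distinctive words, sketched in base R: compare each word's share within a country against its average share across countries. This is an assumed scoring, similar in spirit to a comparison word cloud, and not necessarily the method behind the figure. The comment_content column name and the comment texts are made up.

```r
# Toy corpus: texts are invented; comment_content is an assumed column name.
spam <- data.frame(
  ip_country_name = c("Italy", "Italy", "Germany", "Germany"),
  comment_content = c("download free games", "play free games now",
                      "relax at the beach", "free beach vacations"),
  stringsAsFactors = FALSE
)

# Lower-case, split on non-letters, drop very short tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nchar(words) > 2]
}

tokens_by_country <- lapply(
  split(spam$comment_content, spam$ip_country_name), tokenize)
vocab <- sort(unique(unlist(tokens_by_country)))

# Word x country count matrix.
freq <- vapply(tokens_by_country,
               function(w) as.numeric(table(factor(w, levels = vocab))),
               numeric(length(vocab)))
rownames(freq) <- vocab

# Share of each word within a country, minus its mean share across countries.
prop <- sweep(freq, 2, colSums(freq), "/")
distinct <- prop - rowMeans(prop)

# Three highest-scoring words per country.
top_words <- apply(distinct, 2,
                   function(s) names(sort(s, decreasing = TRUE))[1:3])
print(top_words)
```

Words that every country uses, like "free" in this toy corpus, score low because their share is close to the cross-country average.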
Looking at the last figure, it seems the spammers are not that bad: they just want me to feel really great by downloading the best free goodies from these people.
Code, full HP plots in PDF format (because word clouds are cool) and datasets are available on GitHub.