Data Scientist - Benjamin Tovar

Spam comment analysis in R

18 Oct

 


Imagine logging into your blog and finding more than a hundred spam messages. Not cool! I am not letting the spammers win, so I decided to crack some patterns and try to learn something about these annoying little bots.

For this post, I am performing a spam comment analysis in R on 162 comments posted on my blog from August 4th to October 17th. I wish there were more, but it looks like the rest had already been deleted from the database. Additionally, I am including a CSV dump and the code (bottom section).

Results

First things first: count the unique values of each feature to see which features are shared between spam messages (for example, ip_country_name has 8 unique values, indicating that all 162 messages come from just 8 countries).

unique_entries_per_feature
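A minimal sketch of how such a per-feature count could be computed in base R. The data frame below is made-up toy data, not the real dump; the column names are taken from the figure names in this post, and everything else is an assumption for illustration.

```r
# Toy data frame standing in for the real CSV dump (all values are made up).
comments <- data.frame(
  comment_author    = c("a", "b", "a", "c"),
  comment_author_IP = c("1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3"),
  ip_country_name   = c("France", "United States", "France", "United States"),
  stringsAsFactors  = FALSE
)

# Number of distinct values in each column.
unique_entries <- sapply(comments, function(x) length(unique(x)))

# Sort ascending, so the features shared by the most messages come first.
unique_entries <- sort(unique_entries)
print(unique_entries)
```

A feature with few distinct values relative to the number of rows is one that many spam messages have in common, which is what makes it worth grouping by.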

Next, compare the frequency of each country (each message has a country of origin) and the average number of hours between comments. The majority of messages come from the US, followed by France. Additionally, France has a mean of ~8 hours between messages, winning first place in the fastest spammer championship. For this analysis I excluded entries with fewer than 3 events.

message_stats_by_ip_country_name

Same analysis as above, but grouped by IP address: the majority of messages from the same IP arrived within about 5 hours of each other.

message_stats_by_comment_author_IP
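The per-group statistics behind this figure and the previous one (message count and mean hours between consecutive messages) could be sketched as follows. This is an illustration under assumptions, not the code from the repository: `message_stats` is a hypothetical helper, `comment_date` is an assumed column name, and the data is made up.

```r
# Toy data (made up): timestamps of comments and their country of origin.
comments <- data.frame(
  ip_country_name = c("France", "France", "France",
                      "United States", "United States", "United States",
                      "Italy"),
  comment_date = as.POSIXct(c("2000-01-01 00:00:00", "2000-01-01 02:00:00",
                              "2000-01-01 06:00:00", "2000-01-02 00:00:00",
                              "2000-01-03 00:00:00", "2000-01-04 00:00:00",
                              "2000-01-05 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: message count and mean gap (in hours) per group,
# dropping groups with fewer than `min_events` messages.
message_stats <- function(df, group_col, min_events = 3) {
  groups <- split(df$comment_date, df[[group_col]])
  stats <- lapply(names(groups), function(g) {
    # Seconds since the epoch, sorted, so diff() gives gaps in seconds.
    secs <- sort(as.numeric(groups[[g]]))
    gaps <- diff(secs) / 3600
    data.frame(
      group = g,
      n_messages = length(secs),
      mean_hours_between = if (length(gaps) > 0) mean(gaps) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  stats <- do.call(rbind, stats)
  stats <- stats[stats$n_messages >= min_events, ]
  stats[order(-stats$n_messages), ]
}

stats <- message_stats(comments, "ip_country_name")
print(stats)
```

Passing `"comment_author_IP"` or `"comment_author"` as `group_col` would give the same statistics grouped by IP or by author name, provided those columns exist in the data frame.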

Same analysis (again), this time grouped by comment_author. comment_author: www.google.com spammed me most frequently and with the least time between messages. comment_author: free porn showed an average of ~60 hours between messages.

message_stats_by_comment_author

Now, compute a heatmap based on the contingency matrix of country versus email domain. Here I found that the US, France and Germany have a preference for gmail domains. Additionally, the US spamming accounts use other domains that are not present in comments from other countries.

heatmap_M
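A sketch of how a country-versus-domain contingency matrix like this could be built in base R. The column name `comment_author_email`, the rule for extracting the domain, and the toy rows are all assumptions for illustration.

```r
# Toy data (made up): country of origin and author email per comment.
comments <- data.frame(
  ip_country_name = c("United States", "United States", "France",
                      "Germany", "United States"),
  comment_author_email = c("a@gmail.com", "b@fastmail.fm", "c@gmail.com",
                           "d@gmail.com", "e@gmail.com"),
  stringsAsFactors = FALSE
)

# Keep the part after "@", then drop everything from the first dot onwards,
# so "a@gmail.com" becomes "gmail".
domain <- sub("^.*@", "", comments$comment_author_email)
domain <- sub("\\..*$", "", domain)

# Cross-tabulate country against domain; unclass() leaves a plain matrix.
M <- unclass(table(country = comments$ip_country_name, domain = domain))
print(M)
```

A matrix of this shape can be passed to base R's `heatmap()`, for example `heatmap(M, scale = "none")`, which also reorders rows and columns by hierarchical clustering.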

While hunting down clusters of email domains, I found something interesting: two perfect clusters. The first one includes three domains: fastmail, care2 and (ironically) nospammail. The second one includes moose-mail and allmail.

heatmap_D2

Same cluster hunting season, but now looking at countries. As noticed in the other figures, the US, France and Germany behave similarly when it comes to email domain preferences. Also, two perfect clusters were found: one formed by the Russian Federation and China, the other by Taiwan and the Virgin Islands.

heatmap_D1
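One plausible way to look for such clusters is to compute pairwise distances between the rows (countries) or columns (domains) of the contingency matrix and cluster on them. The sketch below assumes Euclidean distance and complete linkage, and reads "perfect cluster" as a group whose members have identical profiles (distance 0). The post does not state the distance measure or linkage that were used, so these are assumptions, and the matrix is toy data with placeholder names.

```r
# Toy contingency matrix (made up): rows are countries, columns are domains.
M <- matrix(
  c(3, 0, 0,
    3, 0, 0,
    0, 2, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("country_A", "country_B", "country_C", "country_D"),
    c("domain_x", "domain_y", "domain_z")
  )
)

# Distances between countries (rows) and between domains (columns).
D1 <- dist(M)
D2 <- dist(t(M))

# Hierarchical clustering of the countries.
hc <- hclust(D1, method = "complete")

# Pairs of countries with identical domain profiles (distance exactly 0).
D <- as.matrix(D1)
zero_pairs <- which(D == 0 & upper.tri(D), arr.ind = TRUE)
perfect <- data.frame(
  a = rownames(D)[zero_pairs[, "row"]],
  b = colnames(D)[zero_pairs[, "col"]],
  stringsAsFactors = FALSE
)
print(perfect)
```

In the toy matrix, country_A and country_B use exactly the same domains with the same counts, so they merge at height 0 in the dendrogram (`plot(hc)` would show this).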

Now, computing the distinctive words in the spam corpus by country of origin. It looks like Italian spammers want me to play/buy/download video games, United States spammers want me to blog and improve my site, German spammers want me to relax by taking vacations at the beach, while the French spammers also look like they are inviting me to play video games.

comparison_cloud_top_2000_words

Finally, a small word cloud with the words shared among these countries.

commonality_cloud
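Both word clouds start from the same kind of object: a term-frequency matrix with words in rows and one column per country. A base R sketch of building one is shown below. The column name `comment_content`, the hypothetical `tokenize` helper with its very simple rule, and the toy comments are assumptions for illustration; a real analysis would usually also remove stop words.

```r
# Toy data (made up): comment text and country of origin.
comments <- data.frame(
  ip_country_name = c("Italy", "Italy", "Germany"),
  comment_content = c("Download free games",
                      "Play free games now",
                      "Relax on the beach, free vacations"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase the text and split on anything
# that is not a letter, dropping empty tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nchar(words) > 0]
}

# One bag of words per country.
words_by_country <- lapply(
  split(comments$comment_content, comments$ip_country_name),
  tokenize
)

# Term-frequency matrix: rows are words, columns are countries.
vocab <- sort(unique(unlist(words_by_country)))
term_matrix <- sapply(words_by_country, function(w) {
  as.integer(table(factor(w, levels = vocab)))
})
rownames(term_matrix) <- vocab

# Words that appear in every country's comments.
common_words <- rownames(term_matrix)[
  rowSums(term_matrix > 0) == ncol(term_matrix)
]
print(common_words)
```

A matrix in this shape is what `comparison.cloud()` and `commonality.cloud()` from the `wordcloud` package accept as their `term.matrix` argument, assuming that package is the one behind the figures here.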

Conclusions

Looking at the last figure, it looks like the spammers are not that bad: they just want me to feel really great by downloading the best free goodies from these people.

Code and dataset

Code, full HP plots in PDF format (because word clouds are cool) and datasets are available on GitHub.
