Consider a scenario where a security agency has millions of blog posts collected from users across the internet. Some of the posts are from the dark web, and the agency wants to find out whether it can analyze the stylometric properties of the text to attribute those posts to users who have regular blogs. This may seem like a far-fetched scenario, but it is entirely possible that a malicious user maintains a personal blog as well as a blog on the dark web where they perform malicious activities. The stylometric patterns in their writing can be leveraged to catch them. The goal of this post is to show how we can use machine learning to successfully attribute authors to text in a large-scale setting. For those not familiar with authorship attribution, it refers to the task of identifying authors based on the stylometric properties of their textual data. It is used in many security tasks, such as forensic linguistics and exposing users on underground forums.
Before we dive into the machine learning part, let us take a look at the data set we will use for authorship attribution. We will use a cleaned and pruned version of the Blog Authorship Corpus, in which textual posts from 19,320 bloggers were gathered from blogger.com in August 2004. According to the data set’s description, “The corpus incorporates a total of 681,288 posts and over 140 million words – or approximately 35 posts and 7250 words per person.”
Looking at the histogram above, most authors have around 200 or fewer articles. Therefore, for experimentation, we randomly sample up to 250 articles for each author. Different numbers of authors are selected to train the machine learning model, in order to illustrate how authorship attribution methods behave as the number of authors increases. Since this post is about large-scale authorship attribution, up to 100 authors are selected for the experiments. For all experiments, the data are split into 80% training and 20% testing. Each experiment is run 5 times, and the final results are the average of all runs. Cross-validation is not done since no parameter tuning was performed on the model.
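The sampling and splitting protocol described above can be sketched as follows. This is a minimal illustration assuming the corpus is available as (author, text) pairs; the function names are invented for this example and are not from the post's repository:

```python
import random
from collections import defaultdict

def sample_corpus(posts, max_per_author=250, n_authors=2, seed=0):
    """Keep up to `max_per_author` randomly chosen posts for each of the
    first `n_authors` authors. `posts` is a list of (author, text) pairs."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for author, text in posts:
        by_author[author].append(text)
    sampled = []
    for author in sorted(by_author)[:n_authors]:
        texts = by_author[author][:]
        rng.shuffle(texts)
        sampled.extend((author, t) for t in texts[:max_per_author])
    return sampled

def split_80_20(samples, seed=0):
    """Shuffle and split the sampled posts into 80% train / 20% test."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    cut = int(0.8 * len(samples))
    return samples[:cut], samples[cut:]
```

Running this 5 times with different seeds and averaging the resulting scores reproduces the evaluation protocol used in the experiments.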
Authorship Attribution Classifier
The purpose of this article is to illustrate, with a simple example, the effectiveness of machine learning in authorship attribution. Therefore, instead of using more complex models like Recurrent Neural Networks or Transformers, we will focus on a simple Random Forest classifier with TF-IDF scores as features. There are many models that work better than what we will be trying, but for the sake of simplicity, we will use a simple, albeit reasonably accurate, model. TF-IDF stands for term frequency-inverse document frequency, and it is a popular method for converting text into numerical features. Term frequency (tf) is the count of a term i in a document j, while inverse document frequency (idf) indicates the rarity, and hence the importance, of each term in the corpus. Document frequency (df) is the number of documents in which term i appears; idf is the (typically log-scaled) inverse of that fraction. The TF-IDF weight of each term in a document is the product of its tf and idf.
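As a concrete illustration, Scikit-Learn's TfidfVectorizer computes these weights directly. The three toy documents here are made up for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a quiet dog slept all day",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# A common word like "the" appears in most documents, so its idf is lower
# than that of a rare word like "mat", down-weighting it in the features.
```

Each row of `X` is the TF-IDF vector for one document, which is exactly the feature representation fed to the classifier below.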
Each article of an author is represented as a vector consisting of the tf-idf weights of the words in the article. Using this feature representation as input, we train a Random Forest (RF) model with default parameters from Scikit-Learn. Again, default values are kept in order to illustrate whether a simpler model like RF can do authorship attribution. Better performance can be obtained by tuning the parameters, but that is outside the scope of this post.
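A minimal sketch of this setup, using Scikit-Learn's TfidfVectorizer and RandomForestClassifier with default parameters; the four toy posts and author labels are invented for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: in the real experiment each string is a blog post
# and each label is a blogger id. These examples are invented.
texts = [
    "i love hiking and mountain trails every weekend",
    "mountain trails and fresh air make hiking wonderful",
    "the stock market rallied as tech shares surged today",
    "tech shares and bonds moved the stock market higher",
]
labels = ["author_a", "author_a", "author_b", "author_b"]

# TF-IDF features feeding a Random Forest with default parameters.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(texts, labels)
```

On the real corpus, `texts` and `labels` would come from the sampled blog posts, and `model.predict` attributes unseen articles to one of the trained authors.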
Number of Articles per Author
First, experiments are performed to look at the performance as we increase the amount of data for each author. For this experiment, we keep the number of authors at 2 and perform binary classification with different numbers of articles per author. The figure below shows that adding more data for each author improves performance significantly. Given only 50 articles per blog author, we obtain 70% accuracy, but increasing the data to 250 articles per author gives us more than 90% accuracy, precision, and recall. This demonstrates that during authorship attribution, more data per author leads to better classification.
Number of Authors
This section is the crux of this post: the scalability of authorship attribution using machine learning. To measure how well Random Forest performs on a large number of authors, we increase the number of authors incrementally while keeping 250 articles per author. Besides measuring accuracy, recall, and precision, we also measure the top-5 accuracy of the model, which indicates the model's ability to narrow down a good set of candidates during attribution. A higher top-5 accuracy means the model is effective at placing the right author among its top 5 predictions, which can help an analyst narrow down the set of candidate authors significantly.
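Top-5 accuracy can be computed from the class-probability matrix that a classifier's predict_proba returns. The helper below is a small sketch of that computation (recent versions of Scikit-Learn also provide sklearn.metrics.top_k_accuracy_score for the same purpose):

```python
import numpy as np

def top_k_accuracy(probs, y_true, classes, k=5):
    """Fraction of samples whose true label is among the k classes with the
    highest predicted probability. `probs` is the (n_samples, n_classes)
    array returned by predict_proba; `classes` matches its column order
    (e.g. model.classes_ in Scikit-Learn)."""
    top_k_idx = np.argsort(probs, axis=1)[:, -k:]   # indices of k largest
    hits = [y in classes[row] for y, row in zip(y_true, top_k_idx)]
    return float(np.mean(hits))
```

With 100 candidate authors, this metric rewards the model for ranking the true author near the top even when its single best guess is wrong.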
The figure above shows the effect of the number of authors. As we increase the number of authors, accuracy decreases but remains far better than a random guess. For instance, a random guess among 100 authors would yield 1% accuracy, but our model achieves more than 40% accuracy and more than 60% top-5 accuracy, which demonstrates the effectiveness of machine learning in authorship attribution. These results can be improved considerably by using better models such as XGBoost, Transformers, Recurrent Neural Networks, and Convolutional Neural Networks. One can also try better feature engineering schemes, such as sentence embeddings.
This was a short post where I wanted to highlight how machine learning can solve a cybersecurity task, i.e., authorship attribution. You can repeat all experiments and run the above model on your own data set by following the instructions in the GitHub repository for this post. Going forward, each post will have a code repository on GitHub associated with it.
GitHub repository: https://github.com/faizann24/Authorship-Attribution