Hardly a week goes by without a news article about phishing on Google News. In the last week alone, hackers sent phishing emails to Disney+ subscribers, ‘Shark Tank’ star Barbara Corcoran lost almost $400K in a phishing scam, a bank issued phishing warnings, and almost three-quarters of all phishing websites now use SSL. Since phishing is such a widespread problem in the cybersecurity domain, let us take a look at the application of machine learning to phishing website detection. Although there have been many articles and research papers on this topic [Malicious URL Detection] [Phishing Website Detection by Visual Whitelists] [Novel Techniques for Detecting Phishing], they do not always provide open-source code or dive deeper into the analysis. This post is written to address these gaps. We will use a large phishing website corpus and apply a few simple machine learning methods to obtain highly accurate results.
The best part about tackling this problem with machine learning is the availability of well-collected phishing website data sets, one of which was collected by folks at the Universiti Malaysia Sarawak. The ‘Phishing Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking’ dataset consists of 30,000 websites, of which 15,000 are phishing and 15,000 are legitimate. Each website in the data set comes with its HTML code, whois info, URL, and all the files embedded in the web page. This is a goldmine for someone looking to apply machine learning to phishing detection. There are several ways this data set can be used. We could try to detect phishing websites by looking at the URLs and whois information and manually extracting features, as some previous studies have done. However, we are going to use the raw HTML code of the web pages to see if we can effectively combat phishing websites by building a machine learning system. Among URLs, whois information, and HTML code, the last is the most difficult for attackers to obfuscate or change when they are trying to keep a system from detecting their phishing websites, hence the use of HTML code in our system. Another approach is to combine all three sources, which should give better and more robust results, but for the sake of simplicity we will only use HTML code and show that it alone gives effective results for phishing website detection. One final note on the data set: we will only be using 20,000 samples in total because of computing constraints. We will also only consider websites written in English since data for other languages is sparse.
Byte Pair Encoding for HTML Code
To the untrained eye, HTML code is not as easy to read as natural language. Moreover, developers often do not follow all the good practices while writing code. This makes it hard to parse HTML code and extract words/tokens. Another challenge is that many words and tokens in HTML code are rare. For instance, if a web page is using a special library with a complex name, we might not find that name on other websites. Finally, since we want to deploy our system in the real world, there might be new web pages using completely different libraries and coding practices that our model has not seen before. This makes it harder to use simple language tokenizers that split code into tokens based on spaces or any other tag or character. Fortunately, we have an algorithm called Byte Pair Encoding (BPE) that splits text into sub-word tokens based on frequency and solves the challenge of unknown words. In BPE, we start by considering each character as a token and iteratively merge the pair of adjacent tokens that occurs most frequently. For instance, if a new word “googlefacebook” comes along, BPE will split it into “google” and “facebook”, as these words are likely to be frequent in the corpus. BPE has been widely used in recent deep learning models.
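To make the merge procedure concrete, here is a minimal pure-Python sketch of BPE on a toy word list. It is an illustration only, not the tokenizer used in this post: the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) and the two-word corpus are made up for the example, and production implementations add details such as byte-level handling and special tokens.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every occurrence of `pair` in a symbol list, scanning left to right."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn at most `num_merges` merge rules from a list of words."""
    # Every distinct word starts out as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite all words with the most frequent pair fused into one symbol.
        rewritten = Counter()
        for symbols, freq in words.items():
            rewritten[tuple(apply_merge(list(symbols), best))] += freq
        words = rewritten
    return merges


def tokenize(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy corpus (invented for illustration): two frequent words.
toy_corpus = ["google"] * 5 + ["facebook"] * 5
toy_merges = learn_bpe_merges(toy_corpus, num_merges=20)
print(tokenize("googlefacebook", toy_merges))  # ['google', 'facebook']
```

Because the unseen word is built from character sequences that were frequent during training, it is split into known sub-word tokens rather than being mapped to a single unknown token.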
There are numerous libraries for training BPE on a text corpus. We will use a great one called tokenizers by Hugging Face. It is extremely easy to follow the instructions in the GitHub repository of the library. We train BPE with a vocabulary size of 10,000 tokens on top of the raw HTML data. The beauty of BPE is that it automatically separates HTML keywords such as “tag”, “script”, and “div” into individual tokens even though these tags are mostly written with brackets in an HTML file, e.g., <tag>, <script>. After training, we get a saved instance of the tokenizer which we can use to tokenize any HTML file into individual tokens. These tokens are then used with machine learning models.
TFIDF with Byte Pair Encoding
Once we have tokens from an HTML file, we can apply any model. However, contrary to what most people do these days, we will not be using a deep learning model such as a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN). This is mainly because of the computational complexity and the relatively small size of the data set for deep learning models. The figure above shows a histogram of the number of BPE tokens in 1,000 HTML files. We can see that these files contain thousands of tokens, and processing them would incur a high computational cost in more complex models like CNNs and RNNs. Moreover, token order does not necessarily matter for phishing detection. This will be empirically evident once we look at the results. Therefore, we will simply apply TFIDF weights on top of each token from the BPE.
As explained in the previous post on Authorship Attribution, TFIDF stands for term frequency, inverse document frequency and can be calculated by the formula given below. Term frequency (tf) is the count of a term i in a document j, while inverse document frequency (idf) indicates the rarity and importance of each term in the corpus. Document frequency is the number of documents in which a term i appears; the fewer documents contain the term, the higher its idf. TF-IDF gives us a weight for each term in a document, which is the product of tf and idf.
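The definitions above can be sketched in a few lines of Python. This is a minimal illustration that assumes the common textbook form, weight(i, j) = tf(i, j) × log(N / df(i)) with N the number of documents; the exact variant used in the experiments is not stated in the text. The function name `tfidf` and the token lists are invented for the example. Library implementations such as scikit-learn's `TfidfVectorizer` smooth the idf term and normalize each document vector by default, so their numbers will differ from this sketch.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per tokenized document.

    Assumed formula: weight = tf * log(N / df), where tf is the raw count
    of the token in the document, df is the number of documents that
    contain the token, and N is the total number of documents.
    """
    n_docs = len(documents)
    # Document frequency: count each token once per document it appears in.
    df = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw count of each token in this document
        weights.append(
            {token: count * math.log(n_docs / df[token]) for token, count in tf.items()}
        )
    return weights


# Toy "HTML token" documents, invented for illustration.
toy_docs = [
    ["div", "script", "login"],
    ["div", "script", "div"],
    ["div", "form", "password", "login"],
]
for doc_weights in tfidf(toy_docs):
    print(doc_weights)
```

In this toy corpus, "div" appears in every document and therefore gets a weight of zero, while "password" appears in only one document and receives the largest idf.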
Machine Learning Classifier
Sticking with simplicity, we will use a Random Forest (RF) classifier from scikit-learn. For training the classifier, we split the data into 90% training and 10% testing. No cross-validation is done since we are not trying to extensively tune any hyperparameters. We will stick with the default hyperparameters of Random Forest from the scikit-learn implementation. Unlike deep learning models, which take a long time to train, RF takes less than 2 minutes to train on a CPU and demonstrates effective results, as shown next. To show robustness in performance, we train the model 5 times on different splits of the data and report the average test results.
The table above shows the results on test data averaged across 5 experiments. On the surface, these seem like great results, especially without any hyperparameter tuning and with a simple model. However, they are not so great. The model has 98% precision for both classes, which means that around 2% of the websites it flags as phishing are false positives. That is a huge number in the security context. False positives are websites that the machine learning model deems to be phishing but that are in fact legitimate. If users frequently encounter false positives, they have a bad user experience and might not want to use the model anymore. Moreover, security teams experience threat alert fatigue when dealing with false positives. False positives are further quantified in the confusion matrix below, where the x-axis shows the actual classes and the y-axis shows the predicted classes. Even though the model achieves a high accuracy score, there are 11 instances where the model predicted “Phishing” for a website that was in reality safe.
| Actual: Phishing | Actual: Legitimate | Predicted class |
|---|---|---|
| 16 (False Negative) | 912 (True Negative) | Legitimate |
| 920 (True Positive) | 11 (False Positive) | Phishing |
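As a quick sanity check, the usual metrics can be recomputed directly from the four counts in the confusion matrix. The sketch below treats “Phishing” as the positive class; the helper name `summarize_confusion` is made up for this example.

```python
def summarize_confusion(tp, fp, fn, tn):
    """Derive standard metrics from confusion-matrix counts (positive class = phishing)."""
    return {
        # Of all websites flagged as phishing, how many really were phishing?
        "precision": tp / (tp + fp),
        # Of all real phishing websites, how many did the model catch?
        "true_positive_rate": tp / (tp + fn),
        # Of all legitimate websites, how many were wrongly flagged?
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts taken from the confusion matrix above.
metrics = summarize_confusion(tp=920, fp=11, fn=16, tn=912)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.988
# true_positive_rate: 0.983
# false_positive_rate: 0.012
# accuracy: 0.985
```

Even a false positive rate of around 1% translates into a steady stream of wrongly flagged legitimate websites once the model is run over a large volume of traffic.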
Now that we know there is still a problem with the model and we cannot deploy it as it is, let us look at a potential solution. We are going to use the Receiver Operating Characteristic (ROC) curve to look at the false and true positive rates. In the figure below, it is easy to see that up to a true positive rate of about 80%, we have a 0% false positive rate, which is something we can use for decision making.
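For readers who want to see how the points on a ROC curve come about, here is a small pure-Python sketch. It sweeps a threshold over the model's phishing scores and records the false and true positive rates at each step. The labels and scores are toy values invented for illustration, not results from the experiments, and the helper names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs.

    labels: 1 for phishing, 0 for legitimate.
    scores: the model's phishing score for each website.
    Thresholds are swept from the strictest to the loosest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a threshold above every score flags nothing
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def best_tpr_at_zero_fpr(points):
    """Highest true positive rate reachable without any false positive."""
    return max(tpr for fpr, tpr in points if fpr == 0.0)


# Toy data, invented for illustration: five phishing and five legitimate websites.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.99, 0.95, 0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10, 0.05]

points = roc_points(toy_labels, toy_scores)
print(points)
print(best_tpr_at_zero_fpr(points))  # 0.8
```

The flat stretch of the curve along the left edge, where the false positive rate is still zero, is the region of interest for a security deployment.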
The ROC curve demonstrates that for a particular confidence threshold (red dot), the true positive rate would be around 80-90% while the false positive rate would be close to zero. To verify this, let us look at different confidence thresholds and plot metrics against them. To apply a confidence threshold of x%, we only keep websites where the model is more than x% confident that the website is either legitimate or phishing. When we do this, the share of phishing websites we can identify (the true positive rate) decreases, but our accuracy increases considerably and precision also becomes close to 100%.
The above figure demonstrates the effect of the confidence threshold on test accuracy, the number of false positives, and the true positive rate. We can see that when we use the default threshold of 0.5, we have 11 false positives. As we start to increase the confidence threshold, our true positive rate decreases but the number of false positives drops sharply. Finally, at the last point in the graph, we have zero false positives, i.e. 100% precision. This means that whenever our model says a website is trying to phish, it is correct on this test set. However, since our true positive rate has declined to 82%, the model can now only detect around 82% of phishing websites. This is how machine learning can be used in cybersecurity: by looking at the tradeoff between false positives and true positives. Most of the time, we want an extremely low false positive rate. In such settings, one can adopt the approach above to get effective results from the model.
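The thresholding procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the model outputs a phishing probability p for each website, its confidence is taken to be max(p, 1 - p), and websites below the threshold are simply left undecided. The function name, the use of >= so that a threshold of 0.5 keeps every website, and the toy probabilities are all choices made for this example rather than details from the experiments.

```python
def evaluate_at_confidence(labels, phishing_probs, threshold):
    """Score only the predictions whose confidence reaches `threshold`.

    labels: 1 for phishing, 0 for legitimate.
    phishing_probs: predicted probability that each website is phishing.
    """
    total_phishing = sum(labels)
    kept = correct = true_positives = false_positives = 0
    for y, p in zip(labels, phishing_probs):
        confidence = max(p, 1.0 - p)  # confidence in whichever class is predicted
        if confidence < threshold:
            continue  # not confident enough: leave this website undecided
        kept += 1
        predicted = 1 if p >= 0.5 else 0
        if predicted == y:
            correct += 1
        if predicted == 1 and y == 1:
            true_positives += 1
        if predicted == 1 and y == 0:
            false_positives += 1
    return {
        "kept": kept,
        "accuracy": correct / kept if kept else None,
        "false_positives": false_positives,
        # Measured against ALL phishing websites, including the undecided ones.
        "true_positive_rate": true_positives / total_phishing,
    }


# Toy data, invented for illustration.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.99, 0.95, 0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.75, 0.9):
    print(threshold, evaluate_at_confidence(toy_labels, toy_probs, threshold))
```

In a real deployment, the undecided websites would still need some fallback, such as a blacklist lookup or manual review, so the threshold also controls how much work is pushed to that fallback.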
Before concluding this post, let us discuss a few limitations of the methods we have seen above. First, our data set is decently sized but far from comprehensive when it comes to all the types of phishing websites out there. There may have been millions of phishing websites in the last couple of years, but the data set contains only 15,000. As hackers advance their techniques, newly made phishing websites might not make the same mistakes that the old ones made, which might make them hard to detect using the model above. Second, since the TFIDF feature representation does not take into account the order in which code is written, we can potentially lose information. This problem does not arise in deep learning methods that process their input sequentially and thus take the order of the code into account. Moreover, since we are using raw HTML code, an attacker can observe the predictions of the model and spend some time coming up with obfuscations in the code that render the model ineffective. Finally, someone can use off-the-shelf code obfuscators to obfuscate the HTML code, which will again render the model useless since it has only seen plain HTML code files. However, despite these limitations, machine learning can still be very effective in complementing phishing blacklists such as the ones used by Google Safe Browsing. Combining blacklists with machine learning systems can provide better results than relying on blacklists alone.
As I discussed in the first post of this blog, I will always open-source the code for the projects I discuss in this blog. Keeping the tradition alive, here is the link for replicating all experiments, training your own phishing detection models, and testing new websites using my pre-trained model.
GitHub Repository: https://github.com/faizann24/phishytics-machine-learning-for-phishing