The challenge
Increasing fraudulent attempts to obtain sensitive information
Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords, and credit card details by disguising oneself as a trustworthy entity in an electronic communication such as email, instant messaging or SMS. Users are often lured by communications purporting to be from trusted parties, such as social networking websites, auction sites, banks, online payment processors, and IT administrators.
This malicious activity often directs users to enter personal information via a fake website that matches the look and feel of the legitimate website.
To mitigate the negative impacts of phishing, security software providers, financial institutions and academic researchers have studied various approaches to building automated phishing website detection systems. These methods have included the use of blacklists and the investigation of website content, URLs and other web-related features. Typically, these algorithms consider the HTML code of a webpage, its hyperlink (e.g. www.csiro.au) and its formatting, such as colour and bold or italic text.
Our response
Compression based algorithms
Our goal is to detect and predict phishing websites before they can do any harm to users. Previous phishing detection methods employed traditional machine learning classifiers such as naive Bayes, logistic regression, k-nearest neighbours, support vector machines, decision trees and artificial neural networks. These algorithms struggle to cope with the dynamic nature of phishing, as fraudsters change their webpage designs and hyperlinks every couple of hours.
By combining different algorithmic techniques, researchers at CSIRO's Data61 and UNSW have designed a novel and more effective phishing detection solution, PhishZip [1]. The PhishZip algorithm uses file compression to distinguish phishing websites from legitimate ones. File compression is the process of encoding information using fewer bits than the original representation, reducing the file size (e.g. from 10 MB to 4 MB). We use the DEFLATE compression algorithm to compress both legitimate and phishing websites and separate them by examining how much they compress: legitimate and phishing websites have different compression ratios.
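The core measurement can be sketched in a few lines using Python's `zlib` module, which implements DEFLATE. This is an illustrative sketch, not PhishZip's implementation; here the ratio is defined as original size over compressed size, so more compressible (lower-entropy) pages score higher:

```python
import zlib

def compression_ratio(html: str, level: int = 9) -> float:
    """Original size divided by DEFLATE-compressed size.

    Higher values mean the page compressed more, i.e. its content
    has lower entropy (more repetition and predictable structure).
    """
    raw = html.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level))

# Repetitive markup compresses far better than varied text.
repetitive = "<div class='offer'>click here</div>" * 200
varied = "abcdefghijklmnopqrstuvwxyz0123456789"
print(compression_ratio(repetitive))
print(compression_ratio(varied))
```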
We introduce a systematic process for selecting meaningful words associated with phishing and non-phishing websites, by analysing the likelihood of word occurrences and choosing an optimal likelihood threshold. These words form the pre-defined dictionary for our compression models, training the algorithm to identify instances where a proliferation of these key words indicates a malicious website. The compression ratio measures the distance, or cross-entropy, between the predicted website and the phishing and non-phishing website content distributions. A high compression ratio is associated with low entropy, which indicates that the content distribution is similar to the common word distribution in phishing and non-phishing websites.
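DEFLATE supports seeding the compressor with a preset dictionary, which is one way to realise this idea: a page sharing many words with the dictionary compresses further. The word list below is a made-up example for illustration; PhishZip derives its dictionary from the word-likelihood analysis described above:

```python
import zlib

# Hypothetical word list for illustration only; PhishZip selects its
# dictionary words by analysing word-occurrence likelihoods on real data.
PHISH_WORDS = b"verify account password login suspended confirm billing secure update"

def dict_compression_ratio(text: str, zdict: bytes) -> float:
    """Original/compressed size when DEFLATE is seeded with a preset
    dictionary. Text sharing many words with the dictionary can be
    encoded as back-references into it, so it compresses further."""
    raw = text.encode("utf-8")
    comp = zlib.compressobj(level=9, zdict=zdict)
    compressed = comp.compress(raw) + comp.flush()
    return len(raw) / len(compressed)

phishy = "Please verify your account password and confirm your billing details"
benign = "Rainfall totals exceeded seasonal averages across the northern basin"
print(dict_compression_ratio(phishy, PHISH_WORDS))
print(dict_compression_ratio(benign, PHISH_WORDS))
```

The phishing-like sentence scores a higher ratio against the phishing-word dictionary than the unrelated one, which is the signal the classifier exploits.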
Unlike machine learning models, PhishZip's approach requires no model training or HTML parsing; we simply compress the HTML file to determine whether it is a phishing website. Classification with compression algorithms is therefore faster and simpler.
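End to end, classification then reduces to compressing the raw HTML against each class's dictionary and comparing the ratios. The following is a minimal sketch under assumed, illustrative dictionaries and a bare ratio comparison; PhishZip itself uses dictionaries and thresholds tuned on real data:

```python
import zlib

def ratio_with_dict(raw: bytes, zdict: bytes) -> float:
    """Original/compressed size with a DEFLATE preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(raw) / len(comp.compress(raw) + comp.flush())

def classify(html: str, phish_dict: bytes, legit_dict: bytes) -> str:
    """Label a page by which preset dictionary compresses it better.
    A bare comparison for illustration; no training or HTML parsing."""
    raw = html.encode("utf-8")
    if ratio_with_dict(raw, phish_dict) > ratio_with_dict(raw, legit_dict):
        return "phishing"
    return "legitimate"

# Illustrative dictionaries only, not PhishZip's selected word lists.
phish_dict = b"verify suspended account password confirm billing urgent login"
legit_dict = b"research publications news careers about contact privacy terms"

page = "<p>Your account is suspended, verify your password now</p>"
print(classify(page, phish_dict, legit_dict))
```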
The results
Preventing financial loss and protecting privacy
The project has a significant impact on detecting phishing and spam emails and websites. We have applied the algorithm to several phishing websites that clone PayPal, Facebook, Microsoft, ING Direct and other popular sites, and found that PhishZip correctly identifies phishing websites with more than 83 per cent accuracy. This will help prevent large financial losses and protect the privacy of users' personal data, including passwords and credit card numbers. The tools are targeted towards organisations running email and website servers.
We have tested our algorithms on large datasets, including PhishTank [2], a comprehensive repository of phishing websites. An example of the legitimate and phishing PayPal websites is shown below.
The project is a joint collaboration with the University of NSW; the co-authored paper, 'PhishZip: A New Compression-based Algorithm for Detecting Phishing Websites', was published at the IEEE Conference on Communications and Network Security (CNS 2020). PhishZip can be used as a web service to detect and block phishing websites.
The next step is to build a complete suite of software tools and services that detect, predict and prevent phishing and spam websites for mobile, laptop and desktop users.
PhishZip is an evolving research project; if you are interested in early access, please contact us.
References
- [1] Rizka Purwanto, Arindam Pal, Alan Blair and Sanjay Jha, 'PhishZip: A New Compression-based Algorithm for Detecting Phishing Websites', IEEE Conference on Communications and Network Security (CNS 2020), Avignon, France.
- [2] PhishTank: a comprehensive repository of phishing websites.