Intelligent phishing URL detection using association rule mining
© The Author(s). 2016
Received: 1 September 2015
Accepted: 2 May 2016
Published: 10 July 2016
Phishing is an online criminal act in which a malicious webpage impersonates a legitimate webpage in order to acquire sensitive information from the user. Phishing attacks continue to pose a serious risk for web users and remain a persistent threat in the field of electronic commerce. This paper focuses on discerning the significant features that discriminate between legitimate and phishing URLs. These features are then subjected to association rule mining with the apriori and predictive apriori algorithms. The rules obtained are interpreted to emphasize the features that are more prevalent in phishing URLs. Analyzing the knowledge available on phishing URLs and taking confidence as the indicator, features such as transport layer security, unavailability of the top level domain in the URL and a keyword within the path portion of the URL were found to be sensible indicators of a phishing URL. In addition, the number of slashes in the URL, dots in the host portion of the URL and the length of the URL are also key factors for a phishing URL.
Keywords: Phishing, Web security, Association rule mining
A phishing site is a malicious website that impersonates a legitimate one to obtain sensitive data such as credit card numbers or bank account passwords. A phisher uses social engineering and technical deception to extract private information from the web user. Phishing web pages generally use page layouts, blocks and fonts similar to those of legitimate web pages in an attempt to persuade web users to disclose personal details such as usernames and passwords. Over the last few years, online banking has become very popular as more financial institutions have begun to offer free online services. With the increase in online theft, financial crimes have changed from direct attacks to indirect attacks. Phishing is a quickly growing type of fraud and is regarded as one of the most dangerous threats on the web, one that causes people to lose confidence in online transactions. It is a relatively recent web crime compared with viruses and hacking, and it remains an ominous threat to customers and businesses around the world.
According to the RSA online fraud report, 2013 was confirmed to be a record year in which many phishing attacks were launched globally. Additionally, RSA estimates that over USD 5.9 billion was lost by global organizations due to phishing attacks in the same period. The Internet Security Threat Report 2014 reports that cybercrime remains prevalent and that damaging threats from cybercriminals continue to loom over businesses and customers. According to the RSA monthly fraud report of January 2014, big data analytics and broader intelligence will lead to faster detection, resulting in lower financial losses. Data mining techniques are used to extract helpful information by analyzing past information and then predicting future incidents.
In this paper, rules are generated using association rule mining to detect phishing URLs. The remaining sections of the paper are organised as follows. The second section outlines the literature survey. The third section illustrates the system architecture. The fourth section discusses the features that are generated from the URL. The fifth section explains the methodology used in detecting phishing URLs. The sixth section presents an association rule mining technique to discover the rules concerning phishing URLs, and the last section presents the conclusions.
Phishing is a major danger to web users. The fast growth and progress of phishing techniques create an enormous challenge in web security. Zhang et al. proposed CANTINA, a novel HTML content based method for identifying phishing websites. It inspects the source code of a webpage and makes use of TF-IDF to find the highest ranking keywords. The keywords obtained are given as input to the Google search engine; if the domain name of the URL matches one of the top N search results, the page is considered legitimate. This approach relies fully on the Google search engine. CANTINA+, proposed by Xiang et al., is an upgraded version of CANTINA in which new features are included to achieve better results. In particular, the authors combine the HTML Document Object Model, third party services and the Google search engine with a machine learning technique to identify phishing web pages.
Huang et al. proposed an SVM based technique to detect phishing URLs. The features used are structural, lexical and brand names that exist in the URL. Liébana-Cabanillas et al. proposed a different technique that searches for the variables most often used in financial institutions in order to predict trust in electronic banking. Li et al. proposed a semi-supervised method for the detection of phishing web pages. The features of the web image and DOM properties are considered, and a Transductive Support Vector Machine is applied to detect and classify phishing web pages. Islam et al. proposed filtering phishing email by message content and header using a multi-tier classification model.
Chen et al. have proposed a hybrid approach that combines the extraction of key phrases, textual data and financial data to assess the severity of phishing attacks using supervised classification strategies. Nishanth et al. have proposed a method in which structured financial data are mined using machine learning algorithms. Liu et al. have proposed a visual approach to identify phishing web pages. The similarity between the pages is assessed by block, layout and overall style. Medvet et al. also adopted visual similarity between webpages to calculate the similarity between a legitimate site and the suspected phishing website. The features used to verify page similarity are text pieces, their style, images and the overall appearance of the webpage.
Fu et al. have proposed a technique that uses earth mover's distance (EMD) to decide the resemblance of webpages. In this methodology, the webpages are converted into images, and features such as color and coordinates are used to generate signatures. The distance between webpage image signatures is computed using EMD. The authors use a trained EMD threshold value to differentiate legitimate and phishing webpages. Lam et al. have proposed an image based approach for detecting phishing webpages. The authentic and the suspected pages are transformed into black and white images. The size and location of each blob are recorded and compared. The matched pair is selected by comparing the block pairs and choosing the one with the maximum similarity degree. The classifier categorizes the page based on the similarity score. Chen et al. have proposed an approach that uses the Contrast Context Histogram (CCH) to compute the degree of resemblance between fake and legitimate pages.
Pshark is an approach proposed by Shah et al. to detect an identified phishing web page and eliminate it from the host server. The WHOIS database is used to retrieve information about the page. A notification is sent to the host server stating that a phishing page resides on that server. He et al. adopted a heuristic method to categorize the legitimacy of a web page. The heuristics used in this system are based on the term identity set of a webpage. Aburrous et al. have proposed a fuzzy data mining technique to identify phishing websites. Zhang et al. adopted a domain feature enhanced classification model for the detection of Chinese phishing e-business websites.
Both CANTINA (Zhang et al.) and CANTINA+ (Xiang et al.), described above, rely on the Google search engine and download the content of the webpage. In our work, only features related to the URL are considered, so downloading the content of the webpage is avoided. Moreover, the system's prediction is not based exclusively on querying search engine results.
The SVM based technique of Huang et al. uses structural, lexical and brand name features of the URL; the proposed work considers more URL-related features. Abdelhamid et al. proposed a rule based classification algorithm to detect phishing URLs. However, the rules used there are based on human experience rather than on an intelligent data mining technique. In the approach proposed by Han et al., the system warns the user when the user submits a username and password for the first time, even when the current website is legitimate. This is because information about the legitimate website is not maintained. This login problem is eliminated in our system, as a white list repository is maintained.
System architecture for prediction of phishing URL
The proposed work focuses on identifying the relevant features that differentiate phishing websites from legitimate websites and then subjecting them to association rule mining. In order to identify the relevant features, statistical investigation and analysis were carried out on the PhishTank (http://www.phishtank.com) dataset and the legitimate dataset. Based on the heuristics, fourteen features were defined and subjected to association rule mining to distinguish legitimate from phished URLs.
Heuristic 1: length of the host URL
Heuristic 2: number of slashes in URL
Heuristic 3: dots in host name of the URL
Heuristic 4: number of terms in the host name of the URL
Heuristic 5: special characters
A URL is unique in cyberspace. The identity of a legitimate website is obtained from the host name of the URL. The host names in the URLs of the legitimate and phished datasets were investigated for the presence of special characters. It was found that 77.75 % of phished URLs contain special characters.
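The check described for Heuristic 5 can be sketched as follows. The paper does not list which characters count as special, so the sketch assumes that anything other than ASCII letters, digits, dots and hyphens is special; the function name and the allowed set are illustrative, not the authors' implementation.

```python
from urllib.parse import urlsplit

# Assumption: a "special character" is anything outside this set.
# The paper does not spell out its own definition.
ALLOWED_HOST_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789.-")


def host_has_special_chars(url: str) -> bool:
    """Return True if the host name contains a character outside the allowed set."""
    # urlsplit lowercases the host name, so only lowercase letters are listed.
    host = urlsplit(url).hostname or ""
    return any(ch not in ALLOWED_HOST_CHARS for ch in host)
```

Hyphens are left out of the special set here because the hyphen is treated as a separate feature (Heuristic 13).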
Heuristic 6: IP address
In general, legitimate websites are addressed by their domain name. In the dataset, the host name in the URL is examined for the presence of an IP address. It was found that 9.4 % of phished URLs contained an IP address.
Heuristic 7: Unicode in URL
Unicode provides a unique number for every character. On analyzing the input data set (1200 phishing URLs and 200 legitimate URLs), it was observed that most of the phishing URLs are encoded with Unicode characters. The presence of Unicode in the host name of the URL is therefore treated as an indication that the URL is phished. It was found that 65.16 % of phished URLs contained Unicode characters.
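A minimal sketch of the IP address check, using Python's standard ipaddress module. The function name is hypothetical, and the sketch only recognises literal IPv4 and IPv6 addresses; obfuscated forms such as hexadecimal or integer encoded addresses are not handled.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_address(url: str) -> bool:
    """Return True if the host part of the URL is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so it is treated as a domain name.
        return False
    return True
```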
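One possible reading of this heuristic is sketched below. It assumes that "Unicode in the host name" means either non-ASCII characters or a Punycode label (prefix xn--), which is how internationalised host names are encoded; the paper does not state its exact test, so both the function name and this interpretation are assumptions.

```python
from urllib.parse import urlsplit


def host_has_unicode(url: str) -> bool:
    """Return True if the host name has non-ASCII characters or a Punycode label."""
    host = urlsplit(url).hostname or ""
    if not host.isascii():
        return True
    # Punycode labels ("xn--...") are the ASCII encoding of Unicode labels.
    return any(label.startswith("xn--") for label in host.split("."))
```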
Heuristic 8: transport layer security
The URL is broken down into a host component and a path component. The use of Transport Layer Security indicates whether the connection to the URL is protected. The HTTPS protocol is required when sensitive information is transferred across the network. Therefore the existence of Transport Layer Security is examined for the input URL. On analyzing the PhishTank dataset, 99.16 % of the URLs were found to be without transport layer security.
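From the URL alone, the simplest way to examine this feature is to look at the scheme. The sketch below assumes that is what is meant; it does not contact the server or validate a certificate, and the function name is illustrative.

```python
from urllib.parse import urlsplit


def uses_tls(url: str) -> bool:
    """Return True if the URL scheme is https (a lexical check only)."""
    return urlsplit(url).scheme == "https"
```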
Heuristic 9: subdomain
SecurityWeek reports that the use of subdomains leads to increased uptime for phishing attacks. Fraudsters scam users by adding subdomains to make the link look legitimate. Adding a subdomain to the URL makes the user believe that the URL belongs to the authentic website. Hence, the number of subdomains present in the phishing URLs was analyzed, and 64 % of phished URLs were found to contain a subdomain.
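Counting subdomains could be sketched as below. The sketch naively assumes that the registered domain consists of the last two labels of the host name (for example example.com); suffixes with more labels, such as co.uk, would need a public suffix list. The function name and the registered_labels parameter are hypothetical.

```python
from urllib.parse import urlsplit


def count_subdomains(url: str, registered_labels: int = 2) -> int:
    """Count host name labels to the left of the (assumed) registered domain."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    # Assumption: the last `registered_labels` labels form the registered domain.
    return max(len(labels) - registered_labels, 0)
```

Under this assumption a leading www also counts as one subdomain, so a threshold rather than mere presence may be more useful when the count is turned into a binary attribute.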
Heuristic 10: certain keyword in the URL
Heuristic 11: top level domain
The host name of the URL includes the top level domain, the second level domain and the domain. The top level domain part of the URL is checked for existence. If the top level domain is not available in the host name, the URL is treated as phished; otherwise it is treated as legitimate. Hence, the existence of the top level domain in the phishing URLs was analyzed, and 66.5 % of phished URLs were found to be without a top level domain.
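The existence check might look like the following sketch. The set KNOWN_TLDS is a small illustrative sample chosen for this example, not the list used in the paper; a real check would load the full IANA list of top level domains.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real system would use the complete IANA TLD list.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "in", "uk"}


def has_known_tld(url: str, known_tlds=KNOWN_TLDS) -> bool:
    """Return True if the last label of the host name is a known top level domain."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        # A single-label host such as "localhost" has no top level domain part.
        return False
    return host.rsplit(".", 1)[-1] in known_tlds
```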
Heuristic 12: number of dots in the path of the URL
Heuristic 13: hyphen in the host name of the URL
Heuristic 14: URL length
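The purely lexical heuristics in the list above (host length, slashes, dots, terms, hyphen and URL length) can be computed directly from the URL string. The sketch below returns raw counts; the thresholds used to turn them into binary attributes are not given in this text, so none are assumed. The dictionary keys are hypothetical names, and a "term" is assumed to be a dot separated label of the host name.

```python
from urllib.parse import urlsplit


def structural_features(url: str) -> dict:
    """Compute raw lexical counts for a URL (no thresholds applied)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "host_length": len(host),                      # Heuristic 1
        "slash_count": url.count("/"),                 # Heuristic 2
        "host_dot_count": host.count("."),             # Heuristic 3
        # Assumption: a "term" is a dot separated label of the host name.
        "host_term_count": len([t for t in host.split(".") if t]),  # Heuristic 4
        "path_dot_count": parts.path.count("."),       # Heuristic 12
        "host_has_hyphen": "-" in host,                # Heuristic 13
        "url_length": len(url),                        # Heuristic 14
    }
```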
Association rule mining
Data mining is the process of discovering patterns in massive data sets. Its overall objective is to extract information from a data set and transform it into a comprehensible structure. The objective of association rule mining is to discover associations among items in a set by mining essential knowledge from the database. The algorithm was proposed by Agrawal et al. Support and confidence measures are used to assess association rules. For a rule A → C, support is the proportion of transactions in which the rule holds. Confidence is the conditional probability of C given A or, in other words, the relative cardinality of C with respect to A.
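The two measures can be made concrete with a short sketch. Each transaction is represented as a set of items; in this setting an item would be a URL feature that is present. The function names are illustrative, and returning 0.0 when the antecedent never occurs is a convention chosen for this example.

```python
def support(transactions, itemset) -> float:
    """Proportion of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Conditional probability of the consequent C given the antecedent A."""
    a = frozenset(antecedent)
    support_a = support(transactions, a)
    if support_a == 0:
        # Convention for this sketch: an antecedent that never occurs gives 0.
        return 0.0
    return support(transactions, a | frozenset(consequent)) / support_a
```

With four transactions in which the hypothetical items no_tls and no_tld occur together twice, no_tls occurs three times and no_tld twice, the rule no_tld → no_tls has support 0.5 and confidence 1.0, while the reverse rule has confidence 2/3.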
Association rule mining to detect phishing URL
Legitimate data source
Yahoo most visited sites
Google’s top 1000 most visited sites
Alexa’s top targeted sites
Netcraft’s most visited sites
Millersmile’s top targeted sites
Rule extracted from apriori
Association rules play a major role in finding interesting patterns. Association rules deliver information in the form of “if-then” statements. The rules are computed from the data and are probabilistic in nature. Association rule mining is used to explore the hidden relationships between attributes. In the proposed work, association rule mining has been used to detect the frequently occurring features in phishing URLs. All the attributes selected in the data set are binary attributes, and only the phishing URLs are mined using the apriori algorithm to identify the recurring patterns. Only the strong rules generated by apriori with 100 % confidence have been considered for further analysis; the other rules are discarded.
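The procedure described here, mining binary attributes with apriori and keeping only rules with 100 % confidence, can be illustrated with a compact sketch. This is a textbook level-wise apriori written for illustration, not the authors' implementation or tooling; the function names and the min_support parameter are assumptions, and each phishing URL is represented as the set of features that are present.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise apriori search; returns {itemset: number of transactions containing it}."""
    min_count = min_support * len(transactions)
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while current:
        # Count each candidate of size k and keep the frequent ones.
        level = {}
        for cand in current:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                level[cand] = count
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return frequent


def rules_with_full_confidence(frequent):
    """Return (antecedent, consequent) pairs whose confidence is exactly 100 %."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # Confidence is count(itemset) / count(antecedent); equal counts give 1.
                if frequent[a] == count:
                    rules.append((a, itemset - a))
    return rules
```

Comparing integer counts rather than floating point ratios keeps the 100 % confidence test exact.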
Rule extracted from predictive apriori
Execution time of algorithms (table columns: No. of instances; Predictive apriori (ms))
In this paper, the features of the URL are analyzed and subjected to association rule mining with the apriori and predictive apriori algorithms. The rules obtained are interpreted to emphasize the features that are more prevalent in phishing URLs. The results obtained from rule mining highlight the useful features present in phished URLs. Analyzing the information available on phishing URLs and taking confidence as the indicator, features such as transport layer security, unavailability of the top level domain in the URL and a keyword within the path portion of the URL were found to be sensible indicators of a phishing URL. In addition, the number of slashes in the URL, dots in the host portion of the URL and the length of the URL are also key factors for a phishing URL.
SCJ carried out the studies, and drafted the manuscript. EBR provided full guidance and revised the manuscript to high standards. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- https://ers.trendmicro.com/guide/en_us/AG/AppA/Phish_Attack.htm. Accessed May 2015
- http://www.consumeraffairs.com/news04/2005/gartner.html. Accessed June 2015
- http://www.emc.com/collateral/fraud-report/rsa-online-fraud-report-012014.pdf. Accessed June 2015
- http://www.symantec.com/content/en/us/enterprise/other_resources/bistr_main_report_v19_21291018.en-us.pdf. Accessed June 2015
- RSA Anti-Fraud Command Center. RSA monthly online fraud report. (2014). http://www.emc.com/collateral/fraud-report/rsa-online-fraud-report-012014.pdf. Accessed July 2015
- Zhang Y, Hong JI, Cranor LF (2007) CANTINA: a content based approach to detecting phishing web sites. In: Proceedings of the 16th international conference on world wide web, Banff, p 639–648
- Xiang G, Hong J, Rose CP, Cranor L (2011) CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 14:21
- Huang H, Qian L, Wang Y (2012) A SVM based technique to detect phishing URLs. Int Technol J 11(7):921–925
- Liébana-Cabanillas F, Nogueras R, Herrera LJ, Guillén A (2013) Analysing user trust in electronic banking using data mining methods. Expert Syst Appl 40:5439–5447
- Li Y, Xiao R, Feng J, Zhao L (2013) A semi-supervised learning approach for detection of phishing webpages. Optik 124:6027–6033
- Islam R, Abawajy J (2013) A multi-tier phishing detection and filtering approach. J Netw Comput Appl 36:324–335
- Chen X, Bose I, Leung ACM, Guo C (2011) Assessing the severity of phishing attacks: a hybrid data mining approach. Expert Syst Appl 50:662–672
- Nishanth KJ, Ravi V, Ankaiah N, Bose I (2012) Soft computing based imputation and hybrid data and text mining: the case of predicting the severity of phishing alerts. Expert Syst Appl 39:10583–10589
- Liu W, Deng X, Huang G, Fu AY (2006) An antiphishing strategy based on visual similarity assessment. IEEE Internet Comput
- Medvet E, Kirda E, Kruegel C (2008) Visual-similarity-based phishing detection. In: Proceedings of the 4th international conference on security and privacy in communication networks (SecureComm), pp 22–25
- Fu AY, Wenyin L, Deng X (2006) Detecting phishing web pages with visual similarity assessment based on earth mover’s distance. IEEE Trans Dependable Secure Comput 3(4):301–321
- Lam IF, Xiao WC, Wang SC, Chen KT (2009) Counteracting phishing page polymorphism: an image layout analysis approach. Springer-Verlag, Berlin
- Chen KT, Chen JY, Huang CR, Chen JY (2009) Fighting phishing with discriminative keypoint features of webpages. IEEE Internet Comput 13:56–63
- Shah R, Trevathan J, Read W, Ghodosi H (2009) A proactive approach to preventing phishing attacks using Pshark. In: Sixth international conference on information technology: new generations. IEEE, Las Vegas, pp 915–921
- He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL et al (2011) An efficient phishing webpage detector. Expert Syst Appl 38(10):18–27
- Aburrous M, Hossain MA, Dahal K, Thabtah F (2010) Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Syst Appl 37(12):7913–7921
- Zhang D, Yan Z, Jiang H, Kim T (2014) A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites. Inf Manag 51:845–853
- Abdelhamid N, Ayesh A, Thabtah F (2014) Phishing detection based associative classification data mining. ScienceDirect 41:5948–5959
- Han W, Cao Y, Bertino E, Yong J (2012) Using automated individual whitelist to protect web digital identities. Expert Syst Appl 39:11861–11869
- http://www.securityweek.com/use-subdomains-leads-increased-uptime-phishing-attacks. Accessed June 2015
- Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. ACM-SIGMOD 22:207–216
- Scheffer T (2001) Finding association rules that trade support optimally against confidence. In: Proceedings of the 5th European conference on principles and practice of knowledge discovery in databases (PKDD’01), Springer-Verlag, Freiburg