- Open Access
A novel lightweight URL phishing detection system using SVM and similarity index
Human-centric Computing and Information Sciences volume 7, Article number: 17 (2017)
Phishing is a technique used by cyber-criminals to impersonate legitimate websites in order to obtain personal information. This paper presents a novel lightweight phishing detection approach based entirely on the URL (uniform resource locator). The proposed system achieves a very satisfying recognition rate of 95.80%. The system is an SVM (support vector machine) tested on a 2000-record dataset consisting of 1000 legitimate and 1000 phishing URL records. In the literature, several works have tackled the phishing attack; however, those systems are not well suited to smartphones and other embedded devices because of their computational complexity and high battery usage. The proposed system uses only six URL features to perform the recognition: the URL size, the number of hyphens, the number of dots, the number of numeric characters, a discrete variable that indicates the presence of an IP address in the URL and, finally, the similarity index. As proven by the results of this study, the similarity index, the feature we introduce for the first time as an input to phishing detection systems, improves the overall recognition rate by 21.8%.
Phishing is a technique used by cybercriminals to mimic legitimate websites in order to obtain personal information such as logins, passwords and credit card numbers, which leads to identity theft. These criminals typically use phishing to steal money; for that purpose they target online banking, online payment systems, e-commerce (electronic commerce) websites and m-commerce (mobile commerce) applications.
Despite all the efforts made to counter the phishing threat, this attack still manages to cause serious damage: according to the FBI (Federal Bureau of Investigation), phishing cost $1.2 billion in the fourteen months between 1 October 2013 and 1 December 2014. Furthermore, the colossal financial losses are not the only damage caused by phishing, since the number of phishing websites detected by the Anti-Phishing Working Group increased by 250% from the last quarter of 2015 to the first quarter of 2016; moreover, the number of unique phishing websites detected between January and March 2016 is 289,371, which is more than enough reason to question whether the current anti-phishing systems are efficient.
Detecting the phishing attack proves to be a challenging task. The attack may take a sophisticated form and fool even the savviest users, such as substituting a few characters of the URL with look-alike Unicode characters. Conversely, it can come in sloppy forms, such as the use of an IP address instead of the domain name.
In this paper, after a summary of the key research in this field, we detail the characteristics of the URL that our system uses to perform the recognition. We then describe our recognition system; next, in the practical part, we test the proposed system and present the results obtained. Last but not least, we enumerate the implications and advantages that our system brings as a solution to the phishing attack.
In the literature, the cyber attack called phishing is treated in three different ways.
One approach to countering phishing is the blacklist. A blacklist contains known phishing websites acquired by techniques such as user votes; blacklists are typically deployed as browser plug-ins that check each URL against the blacklist and warn the user whenever he attempts a connection to one of the malicious websites it contains. Examples include the Internet Explorer phishing filter and Google Safe Browsing for Firefox. However, this approach offers no protection against new phishing websites that are not yet included in the blacklist, not to mention the slow update process of blacklists and the typically short lifetime (a few hours) of phishing websites.
The other solutions proposed in the literature do not try to distinguish between legitimate and phishing websites; instead, they opt to strengthen user authentication. Among them, the study by Huang et al. proposes replacing the permanent password with a one-time password provided to the user by a third party in the form of a message. The problem with this technique is its total dependence on the third party: when the latter is under attack, the security of the entire system is compromised. Another proposal, by Yue et al., is to send a set of deliberately wrong logins and passwords instead of the actual user credentials when connecting to a phishing website; the detection of the attack is done by a plug-in in the user's browser, and this is the Achilles heel of the proposal because it relies on a detection system made by a third party.
The URL based phishing detection system
Feature extraction and analysis
In the system we propose, we first carefully observed and studied a 2000-record database including 1000 phishing website records built from the PhishTank database. In this paper the targets of the phishing attack are vital; therefore, all 1000 retained phishing website records must contain their respective target. The studied database also consists of 1000 legitimate websites, which we collected ourselves by combining Alexa's 500 top global websites with 500 websites returned by queries to the Google search engine; the queries used to feed our database were (*.bank.*, *.commerce.*, *.trade.*), chosen in consideration of the phishing attack and the websites most likely to be targeted. Our analysis shows that the URL portion of interest is composed of several parts, as shown in Fig. 1.
For the remainder of this document, the word URL will designate only part 6 of Fig. 1; that is to say, we are only interested in the second-level domain name (4 in Fig. 1) and the first-level domain name (5 in Fig. 1), as well as all sub-domains except the default sub-domain (www). The first segment was removed from the zone of interest because HTTPS certificates are not part of the scope of this work.
Figure 2 shows only the URL part that interests us.
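To make this zone of interest concrete, the extraction can be sketched as follows in Python (a minimal illustration; the function name url_of_interest and the handling of scheme-less inputs are our choices, not the paper's implementation):

```python
from urllib.parse import urlparse

def url_of_interest(full_url: str) -> str:
    """Keep only the host name, minus the scheme and the default
    'www' sub-domain (the zone labelled 6 in Fig. 1)."""
    # urlparse puts a scheme-less input in .path, so retry with '//' prefixed
    host = urlparse(full_url).netloc or urlparse("//" + full_url).netloc
    host = host.split(":")[0]              # drop an explicit port, if any
    labels = host.split(".")
    if labels and labels[0] == "www":      # remove only the default sub-domain
        labels = labels[1:]
    return ".".join(labels)

print(url_of_interest("https://www.example.com/login"))        # example.com
print(url_of_interest("http://sub1.sub2.mcomerce.com/index"))  # sub1.sub2.mcomerce.com
```

Note that only the leading default sub-domain (www) is stripped; all other sub-domains are kept, since they often carry the impersonated brand name in phishing URLs.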
After a preliminary study on our database, we discarded the at sign (@) and the underscore (_) from the URL characteristics used to perform the recognition, because we found no occurrence of these two characters in the entire dataset; one thus deduces their irrelevance. Our approach is based on artificial intelligence to detect phishing websites; for this purpose we use the following URL features.
URL_Size: this is the number of characters in the URL; phishing websites usually have a larger size than legitimate websites.
Number_of_Hyphens: this feature counts the occurrences of the character ‘-’ in a URL. Legitimate websites rarely contain the character ‘-’.
Number_of_Dots: this attribute counts the occurrences of the character ‘.’ (dot) in a URL (for example, Number_of_Dots = 4 in the URL sub-domain2.sub-domain3.sub-domain4.mcomerce.com).
Number_of_Numeric_Chars: the number of numeric characters in a URL; generally there is no occurrence of numeric characters in the domain names of legitimate websites.
IP_presence: this feature takes two values: 1 whenever there is an IP address in the URL, and 0 otherwise.
Similarity_index: a mathematically calculated distance measuring the difference between two pieces of data (two strings in our case); it is equal to 100% when measured on two identical words. Several variants and algorithms have been developed to measure this similarity; among others, we cite the most prevalent in this field: Levenshtein, Jaro-Winkler, normalized Levenshtein, longest common subsequence, Q-Gram and Hamming.
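The six features above can be sketched in Python as follows (a minimal illustration, not the authors' extraction program; in particular, the right-padding convention used here for the Hamming distance on unequal-length strings is our assumption, since the classic definition requires equal lengths):

```python
import re

def hamming(a: str, b: str) -> int:
    """Hamming distance; the classic definition needs equal-length strings,
    so the shorter string is right-padded with spaces here (one possible
    convention, not necessarily the paper's)."""
    n = max(len(a), len(b))
    a, b = a.ljust(n), b.ljust(n)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# Crude dotted-quad pattern, enough to flag an IP-based URL
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def features(url: str, target: str) -> list:
    """The six URL features used by the system (names follow the paper)."""
    return [
        len(url),                                  # URL_Size
        url.count("-"),                            # Number_of_Hyphens
        url.count("."),                            # Number_of_Dots
        sum(c.isdigit() for c in url),             # Number_of_Numeric_Chars
        1 if IP_RE.search(url) else 0,             # IP_presence
        hamming(url, target),                      # Similarity_index (Hamming)
    ]

# Hypothetical phishing URL against its legitimate target
print(features("paypa1-secure.com", "paypal.com"))  # [17, 1, 1, 1, 0, 12]
```

The last feature is computed against the phishing website's target, which is why every phishing record in the database must carry its respective target.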
We calculated these characteristics for each pair formed by a phishing website and its corresponding legitimate website extracted from the database, as shown in Table 1. For presentation purposes, in Table 1 we coded the distances by the initial letters of their names, namely:
NH for Number_of_Hyphens
ND for Number_of_Dots
NNC for Number_of_Numeric_Chars
IP for IP_presence
L for the classic Levenshtein distance
NL for Normalized Levenshtein distance
JW for Jaro Winkler distance
LCS for the longest common subsequence
QG for the Q-Gram distance
And finally H for the Hamming distance.
As shown in Table 1, the calculation of the characteristics used by our system for the recognition of phishing websites is done on the entire database (i.e. on 2000 records). A first reading shows that phishing websites:
have an average of eight characters more than legitimate websites,
and may contain up to thirty-seven numeric characters, against only four for legitimate websites.
To compare these different distance metrics and make their relationships visible, we calculated the correlation that may exist between them.
As shown in the last row of Table 2, the relationship between the Hamming distance and the Q-Gram, Levenshtein and longest common subsequence distances is manifested by very strong correlations of 97, 98 and 98%, respectively, when our system is tested on the entire database.
In the same course of action, we studied the correlation that may exist between the other five URL features used in this work.
Overall, as shown in Table 3, we can note the decorrelation between the URL features; however, there is a relevant relationship between URL_Size and both Number_of_Dots (ND) and Number_of_Numeric_Chars (NNC), established by correlations of 75 and 63%, respectively.
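The paper does not state which correlation coefficient was computed; assuming the usual Pearson coefficient, the calculation behind Tables 2 and 3 can be sketched as:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature columns
    (an assumption: the paper does not name the coefficient used)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy illustration: perfectly correlated columns give r = 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 2))  # 1.0
```

In the study, each argument would be one feature column (e.g. URL_Size and ND) taken over all 2000 records.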
Phishing detection system
In this part, we will describe the components of our recognition system.
The support vector machine, also known as SVM, is a supervised learning algorithm that can solve classification problems as well as regression problems. SVM was developed in 1995, based on the Vapnik–Chervonenkis statistical learning theory.
The kernel we used for our system is the Gaussian kernel rather known as RBF (radial basis function).
Let x and x′ be two samples; the RBF kernel is defined as follows:

K(x, x′) = exp(−‖x − x′‖² / (2σ²))

where ‖x − x′‖² is the squared Euclidean distance between the two feature vectors and σ is a constant.
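The kernel above can be expressed directly (a minimal sketch; the name rbf_kernel and the default σ = 1.0 are illustrative choices):

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical vectors -> 1.0
```

The kernel value decays toward 0 as the two feature vectors move apart, and σ controls how quickly.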
Moreover, to validate the system we chose to use the fivefold cross-validation model.
In this model, the database is randomly split into five equal sub-samples; a single sub-sample is retained for the validation of the system while the other four sub-samples are used to train the model. The cross-validation is repeated five times, so that each sub-sample is used once for validation. Afterwards, a single estimate is calculated as the average of the estimates of the five iterations.
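The fivefold procedure above can be sketched as follows (the train_and_score callback is a hypothetical placeholder for the fit-then-evaluate step; this is not the paper's code, which relies on the Encog library):

```python
import random

def five_fold_indices(n, seed=0):
    """Randomly split n record indices into five equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]

def cross_validate(records, train_and_score):
    """Hold out each fold once for validation, train on the other four,
    and average the five validation scores."""
    folds = five_fold_indices(len(records))
    scores = []
    for k in range(5):
        valid = [records[i] for i in folds[k]]
        train = [records[i] for j in range(5) if j != k for i in folds[j]]
        scores.append(train_and_score(train, valid))
    return sum(scores) / 5
```

With the 2000-record database, each iteration trains on 1600 records and validates on the remaining 400.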
Figure 3 illustrates our phishing detection system.
Test of the system on the database
In this section, we describe the procedure of our tests, then present and interpret their results.
To extract the characteristics necessary to detect phishing websites, we developed our own program that extracts those features from the URL and its respective target. However, for the similarity distance calculation algorithms we used the Debatty Java string similarity library. Furthermore, we used the Encog library for all the artificial intelligence algorithms our system needed.
Table 4 shows a fragment of our database containing successively five examples of phishing websites and four examples of legitimate websites.
Tables 5, 6, 7 and 8 show the respective error rates of each recognition algorithm tested in this work with our database. Each column in these tables represents the number of records used for each test, and each line represents the recognition algorithm and the distance calculation method used, in addition to the other recognition features already introduced. The exception is the first line of each table, where no distance calculation method was used, in order to measure the impact of the similarity concept on the recognition rate. Tables 5, 6, 7 and 8 describe all the tests conceived and implemented in this study; we chose to share the results of Tables 6, 7 and 8 despite their unsatisfactory recognition rates because they demonstrate the exceptional impact of the similarity index on the recognition rate.
As shown in Table 5, we can notice that for the SVM the best associated similarity calculation method is the Hamming distance, with a recognition rate of 95.80%.
In contrast, in Table 6, the method based on Bayesian networks associated with the similarity calculation based on the Q-Gram distance ensures a recognition rate of 65.60%.
While analyzing Table 7, it can be noted that the best associated similarity calculation method for the algorithm based on Naive Bayes is the normalized Levenshtein distance, with a recognition rate of 67.20%.
In Table 8, the algorithm based on probabilistic neural networks (PNN) reached a recognition rate of 51.20% without association with any similarity calculation method, unlike the other tests.
The highest error rates are obtained in the absence of a similarity calculation method.
The Jaro-Winkler distance and the normalized Levenshtein distance are not optimal for this type of data, although they improve the error rate overall.
The Hamming, Q-Gram, Levenshtein and longest common subsequence distances improved the error rate drastically.
The best recognition rate achieved in this study is 95.80%, using the SVM provided with the Hamming distance and the five other features as input to our system.
Based on the results of this study, we can deduce that the positions of the substituted characters between the phishing websites and their legitimate counterparts are the most important aspect of the similarity index for improving the recognition rate, since the Hamming distance allowed us to reach a higher recognition rate than the longest common subsequence and the Q-Gram distance, which do not capture the positions of the substrings and the Q-Grams. We can also infer that counting the edit operations (insert, delete, replace) applied to the phishing website links, as the Levenshtein distance does, does not improve the recognition rate further.
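The contrast drawn here can be illustrated on a toy substitution phish (a simple illustration; the function names are ours, and hamming_eq assumes equal-length strings as in the classic definition):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, replace), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def hamming_eq(a: str, b: str) -> int:
    """Hamming distance for equal-length strings: count mismatching positions."""
    return sum(x != y for x, y in zip(a, b))

# A one-character substitution phish ('l' replaced by '1'): both distances
# report 1, but only Hamming is inherently tied to the mismatch position.
print(hamming_eq("paypa1.com", "paypal.com"),
      levenshtein("paypa1.com", "paypal.com"))  # 1 1
```

For the character-substitution attacks that dominate this dataset, the position-sensitive Hamming count is exactly the signal the classifier exploits.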
Confirmed by the results of our tests, we have demonstrated the potential impact of the similarity distance on the detection of phishing websites. Indeed, in three tests out of four, the introduction of the similarity distance significantly improved the recognition rate of our detection system. The only case where the similarity did not have a positive impact on the phishing website recognition rate is the test with probabilistic neural networks, which records the worst recognition rate among all our tests. This impact is most obvious in the tests performed with the SVM method, since the use of the Hamming distance as one of the input characteristics of our system improved the recognition rate by 21.8%.
We are confident that our phishing website detection system will play a key role in the fight against the scourge called phishing because, as shown in Table 9, it is lightweight and better suited to less powerful devices such as smartphones and embedded systems: it requires only six features as input, which makes it less demanding in terms of CPU and memory than other proposed systems. Furthermore, all the characteristics used by our system are extracted entirely from the URL; therefore, we need neither the HTML elements of a website nor image processing of its webpage to decide whether it is a phishing website. Besides, our system does not need an HTTPS certificate to work; in other words, one bad CA cannot compromise the security of our system.
In this paper, we have presented a phishing website detection system based entirely on the URL. Our system was tested on a database of 2000 records formed from legitimate websites and their phishing counterparts, and it gave very satisfactory and encouraging results, namely a 95.80% recognition rate. The approach used in this system rests on a powerful AI tool, the support vector machine, provided as input with the Hamming distance between the phishing website and its target and five other features extracted from the URL. The advantage of this system is its lightness, which allows it to be incorporated into smartphones and tablets.
As a perspective for this work, we plan to test this system continuously on very large phishing website databases and to improve it if necessary. We will also use probabilistic prediction methods on phishing websites to predict the potential target website based solely on the URL of the phishing website.
uniform resource locator
support vector machine
Federal Bureau of Investigation
hypertext markup language
particle swarm optimization support vector machine
domain name system
TID algorithm: Target Identification algorithm
hypertext transfer protocol secure
the classic Levenshtein distance
Normalized Levenshtein distance
Jaro Winkler distance
the longest common subsequence
the Q-Gram distance
the Hamming distance
Krebs B (2014) Report on the magnitude of the business money lost to the phishing attack. http://krebsonsecurity.com/2015/08/fbi-1-2b-lost-to-business-email-scams. Accessed 8 May 2017
APWG (2016) The phishing activity trends report by the Anti-Phishing Working Group for the first quarter of 2016. https://docs.apwg.org/reports/apwg_trends_report_q1_2016.pdf. Accessed 8 May 2017
Microsoft (2005) Anti-phishing white paper. http://www-pc.uni-regensburg.de/systemsw/ie70/Anti-phishing_White_Paper.doc. Accessed 8 May 2017
Schneider F, Provos N, Moll R, Chew M, Rakowski B (2007) Phishing protection design documentation. https://wiki.mozilla.org/Phishing_Protection:_Design_Documentation. Accessed 8 May 2017
Xiang G, Hong J, Rose CP, Cranor L (2011) CANTINA+: A feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 14(2):21. doi:10.1145/2019599.2019606
Fu AY, Wenyin L, Deng X (2006) Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (EMD). IEEE Trans Dependable Secur Comput 3(4):301–311. doi:10.1109/TDSC.2006.50
Li Y, Chu S, Xiao R (2015) A pharming attack hybrid detection model based on IP addresses and web content. Optik-Int J Light Electron Optics 126(2):234–239. doi:10.1016/j.ijleo.2014.10.001
Thomas K, Grier C, Ma J, Paxson V, Song D (2011) Design and evaluation of a real-time URL spam filtering service. In: proceedings of the 32nd IEEE symposium on security & privacy, California, 22–25 May 2011, p. 447–462
Jeeva SC, Rajsingh EB (2016) Intelligent phishing url detection using association rule mining. Human-centric Comput Inf Sci 6:10. doi:10.1186/s13673-016-0064-3
Ramesh G, Krishnamurthi I, Kumar KSS (2014) An efficacious method for detecting phishing webpages through target domain identification. Decis Support Syst 61:12–22. doi:10.1016/j.dss.2014.01.002
Huang C-Y, Ma S-P, Chen K-T (2011) Using one-time passwords to prevent password phishing attacks. J Netw Comput Appl 34(4):1292–1301
Yue C, Wang H (2010) BogusBiter: a transparent protection against phishing attacks. ACM Trans Int Technol 10(2):1–31. doi:10.1145/1754393.1754395
Phishtank phishing websites database. http://data.phishtank.com/data/online-valid.csv. Accessed 8 May 2017
The top accessed 500 websites on the web. http://www.alexa.com/topsites. Accessed 8 May 2017
Levenshtein V (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
Jaro MA (1995) Probabilistic linkage of large public health data files. Stat Med 14(5–7):491–498
Yujian L, Bo L (2007) A normalized levenshtein distance metric. IEEE Trans Pattern Anal Mach Intell 29(6):1091–1095. doi:10.1109/TPAMI.2007.1078
Apostolico A, Guerra C (1987) The longest common subsequence problem revisited. Algorithmica 2(1):315–336. doi:10.1007/BF01840365
Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1):191–211. doi:10.1016/0304-3975(92)90143-4
Hamming R (1950) Error-detecting and error-correcting codes. Bell Syst Tech J 29(2):147–160
Vapnik VN (1995) The nature of statistical learning theory. Springer-Verlag New York, Inc, New York
Debatty T (2015) The tdebatty java string similarity library. https://github.com/tdebatty/java-string-similarity. Accessed 8 May 2017
Heaton J (2015) Encog: library of interchangeable machine learning models for java and C#. J Mach Learn Res 16:1243–1247
MZ carried out the studies, and drafted the manuscript. BO provided full guidance and revised the manuscript to high standards. Both authors read and approved the final manuscript.
I would like to express my profound gratitude to my family for their constant support which was vital in order to finish this paper and the reviewers for their valuable comments.
The authors declare that they have no competing interests.
Availability of data and materials
The dataset used in this study was added as Additional file 1.
This work was supported by the Grant (018UM5S2014) from the Moroccan Center for Scientific and Technical Research.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.