Phishing is a technique used by cybercriminals to impersonate legitimate websites in order to obtain personal information. This paper presents a novel lightweight phishing detection approach based entirely on the URL (uniform resource locator). The proposed system, an SVM (support vector machine) classifier tested on a 2000-record dataset consisting of 1000 legitimate and 1000 phishing URL records, achieves a very satisfying recognition rate of 95.80%. Several works in the literature have tackled the phishing attack; however, those systems are not suitable for smartphones and other embedded devices because of their computational complexity and high battery usage. The proposed system uses only six URL features to perform the recognition: the URL size, the number of hyphens, the number of dots, the number of numeric characters, a discrete variable indicating the presence of an IP address in the URL, and the similarity index. As the results of this study show, the similarity index, a feature we introduce for the first time as input to phishing detection systems, improves the overall recognition rate by 21.8%.
Phishing is a technique used by cybercriminals to mimic legitimate websites in order to obtain personal information such as logins, passwords and credit card numbers, which can lead to identity theft. These criminals typically use phishing to steal money; for that purpose they target online banking, online payment systems, e-commerce (electronic commerce) websites and m-commerce (mobile commerce) applications.
Despite all efforts made to counter the phishing threat, this attack still manages to cause serious damage. According to the FBI (Federal Bureau of Investigation), phishing cost $1.2 billion over the fourteen months between 1 October 2013 and 1 December 2014. Furthermore, the colossal financial losses are not the only damage caused by phishing: the number of phishing websites detected by the Anti-Phishing Working Group increased by 250% from the last quarter of 2015 to the first quarter of 2016, and the number of unique phishing websites detected between January and March 2016 reached 289,371, which is more than enough of a reason to question whether current anti-phishing systems are efficient.
Detecting the phishing attack proves to be a challenging task. The attack may take a sophisticated form and fool even the savviest users, such as substituting a few characters of the URL with similar-looking Unicode characters. Conversely, it can come in sloppy forms, such as the use of an IP address instead of the domain name.
In this paper, after a summary of the key research in this field, we detail the characteristics of the URL that our system uses to perform the recognition. We then describe our recognition system; next, in the practical part, we test the proposed system and present the results obtained. Last but not least, we enumerate the implications and advantages that our system brings as a solution to the phishing attack.
In the literature, the cyber attack called phishing is treated in three different ways.
One approach to countering phishing is the blacklist: a list of known phishing websites acquired by techniques such as user votes. These blacklists are typically deployed as browser plug-ins that check each entered URL against the blacklist and warn the user whenever he attempts a connection to one of the malicious websites it contains. Examples include the Internet Explorer phishing filter and Google Safe Browsing for Firefox. However, this approach faces an issue: it offers no protection against new phishing websites that are not yet included in the blacklist, not to mention the slow update process of the blacklist and the typically short lifetime (a few hours) of phishing websites.
Other solutions proposed in the literature do not try to distinguish between legitimate and phishing websites; instead, they opt to strengthen user authentication in order to overcome the problem. Among them, the study by Huang et al. proposes to replace the permanent password with a one-time password delivered to the user by a third party in the form of a message. The problem with this technique is its total dependence on the third party: when the latter is under attack, the security of the entire system is compromised. Another proposal, by Yue et al., is to send a group of deliberately wrong logins and passwords instead of the actual user credentials when connecting to a phishing website. The detection of the attack is done by a plug-in in the user's browser, and this is the Achilles heel of the proposal because it relies on a detection system made by a third party.
The URL based phishing detection system
Feature extraction and analysis
For the system we propose, we first carefully studied a 2000-record database including 1000 phishing website records built from the PhishTank database. In this paper, the targeted websites of the phishing attack are vital; therefore all of the retained 1000 phishing website records must contain their respective target. The studied database also consists of 1000 legitimate websites which we collected ourselves by combining Alexa's 500 top global websites with 500 websites resulting from queries to the Google search engine; the queries used to feed our database were (*.bank.*, *.commerce.*, *.trade.*), chosen in consideration of the phishing attack and the websites most likely to be targeted. Our analysis shows that the URL portion of interest is composed of several parts, as shown in Fig. 1.
For the remainder of this document, the word URL designates only part 6 of Fig. 1; that is to say, we are only interested in the second-level domain name (4 in Fig. 1) and the first-level domain name (5 in Fig. 1), as well as all sub-domains except the default sub-domain (www). The first segment was removed from the zone of interest because HTTPS certificates are outside the scope of this work.
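As an illustration, the reduction of a raw link to this zone of interest could be sketched as follows (a minimal Python sketch under our own assumptions, not the extraction program used in the study):

```python
from urllib.parse import urlparse

def url_of_interest(raw):
    # Keep only the host name (part 6 of Fig. 1): the second-level and
    # first-level domain names plus any sub-domains, without the scheme,
    # the path, or the default "www" sub-domain.
    host = urlparse(raw).netloc or urlparse("//" + raw).netloc
    host = host.split("@")[-1].split(":")[0]  # drop credentials and port
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

For example, `url_of_interest("https://www.example.com/login")` reduces to `example.com`, while non-default sub-domains are kept.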
Figure 2 shows just the URL part that interests us.
After a preliminary study on our database, we discarded the at sign (@) and the underscore (_) from the URL characteristics used to perform the recognition: over the entire dataset we found no occurrence of these two characters, from which we deduce their irrelevance. Our approach is based on artificial intelligence to detect phishing websites; for this purpose we use the following URL features.
URL_Size: the number of characters in the URL. Phishing websites usually have a larger URL size than legitimate websites.
Number_of_Hyphens: the number of occurrences of the character ‘-’ in a URL. Legitimate websites rarely contain the character ‘-’.
Number_of_Dots: the number of occurrences of the character ‘.’ (dot) in a URL (for example, Number_of_Dots = 4 in the URL sub-domain2.sub-domain3.sub-domain4.mcomerce.com).
Number_of_Numeric_Chars: the number of numeric characters in a URL. Generally, there is no occurrence of numeric characters in the domain names of legitimate websites.
IP_presence: this feature takes two values: 1 whenever there is an IP address in the URL, 0 otherwise.
Similarity_index: a mathematically calculated distance measuring the difference between two data items (two strings in our case); it equals 100% when measured on two identical words. Several variants and algorithms have been developed to measure this similarity; among others we cite the most prevalent in this field: Levenshtein, Jaro-Winkler, Normalized Levenshtein, longest common subsequence, Q-Gram and Hamming.
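To make the first five features concrete, here is a minimal Python sketch of how they might be computed from a URL string (an illustration under our own assumptions, not the extraction program used in the study; the similarity index is computed separately against the target URL):

```python
import re

def url_features(url):
    # Compute the five simple lexical features described above.
    return {
        "URL_Size": len(url),
        "Number_of_Hyphens": url.count("-"),
        "Number_of_Dots": url.count("."),
        "Number_of_Numeric_Chars": sum(ch.isdigit() for ch in url),
        # 1 if a dotted-quad IP address appears anywhere in the URL, else 0.
        "IP_presence": 1 if re.search(r"\d{1,3}(\.\d{1,3}){3}", url) else 0,
    }
```

For instance, `url_features("secure-bank1.example.com")` would report one hyphen, one numeric character and no IP address.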
We calculate these characteristics for each pair consisting of a phishing website and its corresponding legitimate website extracted from the database, as shown in Table 1. For presentation purposes in Table 1, we abbreviated the distances by the initial letters of their names, namely:
NH for Number_of_Hyphens
ND for Number_of_Dots
NNC for Number_of_Numeric_Chars
IP for IP_presence
L for the classic Levenshtein distance
NL for Normalized Levenshtein distance
JW for Jaro Winkler distance
LCS for the longest common subsequence
QG for the Q-Gram distance
and finally H for the Hamming distance.
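Two of the distances above, the classic Levenshtein and the Hamming distance, can be sketched in a few lines (a minimal Python illustration; the study itself relied on the Debatty Java string similarity library, and extending the Hamming distance to strings of unequal length by counting the extra characters as mismatches is our own convention):

```python
def hamming(a, b):
    # Number of positions at which the characters differ; positions beyond
    # the shorter string are counted as mismatches (our convention).
    n = max(len(a), len(b))
    return sum(1 for i in range(n)
               if i >= len(a) or i >= len(b) or a[i] != b[i])

def levenshtein(a, b):
    # Classic edit distance: minimum number of insertions, deletions and
    # substitutions turning a into b, computed by dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

Both distances report 1 for the typical phishing substitution of "paypal" into "paypa1".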
As shown in Table 1, the calculation of the characteristics used by our system for the recognition of phishing websites is done on the entire database (i.e. on 2000 records). A first reading allows us to infer that phishing websites:
have an average of eight characters more than the legitimate websites,
and may contain up to thirty-seven numeric characters, against only four for legitimate websites.
To compare these different distance metrics and gain more visibility into their relationships, we opted to calculate the correlation that may exist between them.
As shown in the last row of Table 2, the relationship between the Hamming distance and the Q-Gram, Levenshtein and longest common subsequence distances is manifested by very strong correlations of 97, 98 and 98% respectively when our system is tested on the entire database.
In the same course of action, we studied the correlation that may exist between the other five URL features used in this work.
Overall, as shown in Table 3, we can note the disassociation between the URL features; however, there is a relevant relationship between the URL_Size and both the Number_of_Dots (ND) and the Number_of_Numeric_Chars (NNC), established by correlations of 75 and 63% respectively.
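Assuming the usual Pearson coefficient is meant, such a pairwise correlation between two feature columns can be computed as follows (a self-contained sketch for reference, not the tool used in the study):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length
    # feature columns; the result lies in [-1, 1].
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)
```

A coefficient near 1 (as between Hamming and Q-Gram in Table 2) means the two columns move together almost linearly.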
Phishing detection system
In this part, we will describe the characteristics used by our recognition system.
The support vector machine, also known as SVM, is a supervised learning algorithm that can solve classification problems as well as regression problems. SVM was developed in 1995, based on the statistical learning theory of Vapnik and Chervonenkis.
The kernel we used for our system is the Gaussian kernel, better known as the RBF (radial basis function) kernel.
Let x and x′ be two samples; the RBF kernel is defined as follows:

K(x, x′) = exp(−‖x − x′‖² / (2σ²))

where ‖x − x′‖² is the squared Euclidean distance between the two feature vectors and σ is a constant.
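Numerically, the kernel amounts to the following (a minimal Python sketch):

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)), where the argument
    # is the squared Euclidean distance between the two feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Identical vectors give K = 1, and the value decays toward 0 as the vectors move apart.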
Moreover, to validate the system we chose to use the fivefold cross-validation model.
In this model, the database is randomly split into five equal sub-samples; a single sub-sample is retained for validation of the system while the other four are used to train the model. The cross-validation is repeated five times so that each sub-sample is used once for validation, and a single estimate is then calculated as the average over the five iterations.
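The splitting procedure just described can be sketched as follows (an illustrative Python outline of the five-fold scheme; the study itself used the Encog library, and the `train_fn`/`eval_fn` callbacks are hypothetical placeholders):

```python
import random

def five_fold_split(n_records, seed=42):
    # Randomly partition the record indices into five equal folds.
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[k::5] for k in range(5)]

def cross_validate(records, train_fn, eval_fn):
    # Each fold serves once as the validation set while the other four
    # folds train the model; the final estimate is the mean of the five.
    folds = five_fold_split(len(records))
    scores = []
    for k in range(5):
        validation = [records[i] for i in folds[k]]
        training = [records[i] for j in range(5) if j != k for i in folds[j]]
        model = train_fn(training)
        scores.append(eval_fn(model, validation))
    return sum(scores) / 5.0
```

On the 2000-record database this yields five folds of 400 records each, so every record is validated exactly once.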
In this section, we will describe the procedure of our tests and then we will present and interpret the results of these tests.
To detect phishing websites, we developed our own program to extract the necessary characteristics from the URL link and its respective target.
However, for the similarity distance calculation algorithms we used the Debatty Java string similarity library. Furthermore, we used the Encog library for all the artificial intelligence algorithms our system needed.
Table 4 shows a fragment of our database containing successively five examples of phishing websites and four examples of legitimate websites.
Tables 5, 6, 7 and 8 show the respective error rates of each recognition algorithm tested in this work on our database. Each column in these tables represents the number of records used for the test, and each row represents the recognition algorithm used together with the distance calculation method, in addition to the other recognition features already introduced. The exception is the first row of each table, where no distance calculation method was used, in order to measure the impact of the similarity concept on the improvement of the recognition rate. Tables 5, 6, 7 and 8 describe all the tests conceived and implemented in this study; we chose to share the results of Tables 6, 7 and 8 despite their unsatisfactory recognition rates because they make a point about the exceptional impact of the similarity index on the recognition rate.
As shown in Table 5, for the SVM the best associated similarity calculation method is the Hamming distance, with a recognition rate of 95.80%.
In contrast, in Table 6, the method based on Bayesian networks associated with the Q-Gram distance ensures a recognition rate of 65.60%.
Analyzing Table 7, it can be noted that the best associated similarity calculation method for the Naive Bayes algorithm is the Normalized Levenshtein distance, with a recognition rate of 67.20%.
In Table 8, the algorithm based on Probabilistic Neural Networks (PNN) reached its best recognition rate, 51.20%, without any similarity calculation method, unlike the other tests.
As shown in Fig. 4, which gives more visibility to Table 5, we can notice the influence of similarity on the error rate of the recognition system. We can therefore note that:
The highest error rates are achieved in the absence of a similarity calculation method.
The Jaro-Winkler distance and the Normalized Levenshtein distance are not optimal for this type of data, even though they improve the error rate overall.
The Hamming, Q-Gram, Levenshtein and longest common subsequence distances improved the error rate drastically.
The best recognition rate achieved in this study is 95.80%, obtained using SVM provided with the Hamming distance and the other features as input to our system.
Based on the results of this study, we can deduce that the positions of the substituted characters between phishing websites and their legitimate counterparts are the most important aspect of the similarity index for improving the recognition rate: the Hamming distance, which underlines those positions, allowed us to reach a higher recognition rate than the longest common subsequence and the Q-Gram distance, which do not underline the positions of the substrings and the Q-Grams. We can also infer that counting the character edit operations (insert, delete, replace) in phishing website links, as the Levenshtein distance does, does not further improve the recognition rate.
As affirmed by the results of our tests, we have demonstrated the potential impact of the use of the similarity distance on the detection of phishing websites. Indeed, in three tests out of four, the introduction of the similarity distance significantly improved the recognition rate of our detection system. The only case where the similarity did not have a positive impact on the recognition rate is the test with Probabilistic Neural Networks, which records the worst recognition rate among all of our tests. The impact is most obvious in the tests performed using the SVM method, where the use of the Hamming distance as one of the input characteristics of our system improved the recognition rate by 21.8%.
We are confident that our phishing website detection system will play a key role in the war against the scourge called phishing because, as shown in Table 9, it is light and well suited to less powerful devices such as smartphones and embedded systems: it requires only six features as input parameters, which makes it less greedy in terms of CPU and memory than other proposed systems. Furthermore, all the characteristics used by our system are extracted entirely from the URL; therefore we need neither the HTML elements of a website nor image processing on its webpage to decide whether it is a phishing website or not. Besides, our system does not need an HTTPS certificate to work; in other words, one bad CA would not compromise the security of our system.
In this paper, we have presented a phishing website detection system based entirely on the URL. Our system has been tested on a database of 2000 records formed from legitimate websites and their phishing counterparts, and it has given very satisfactory and encouraging results, namely a 95.80% recognition rate, as shown by the test results. The approach used in this system rests on a powerful AI tool, the support vector machine, provided as input with the Hamming distance between the phishing website and its target plus five other features extracted from the URL. The advantage of this system is its lightness, and it can be incorporated into smartphones and tablets.
As a perspective for this work, we plan to test this system continually on very large phishing website databases and improve it if necessary. We will also use probabilistic prediction methods on phishing websites to predict the potential target website based solely on the URL of the phishing website.
URL: uniform resource locator
SVM: support vector machine
FBI: Federal Bureau of Investigation
HTML: hypertext markup language
PSO-SVM: particle swarm optimization support vector machine
Fu AY, Wenyin L, Deng X (2006) Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (EMD). IEEE Trans Dependable Secur Comput 3(4):301–311. doi:10.1109/TDSC.2006.50
Thomas K, Grier C, Ma J, Paxson V, Song D (2011) Design and evaluation of a real-time URL spam filtering service. In: proceedings of the 32nd IEEE symposium on security & privacy, California, 22–25 May 2011, p. 447–462
Jeeva SC, Rajsingh EB (2016) Intelligent phishing url detection using association rule mining. Human-centric Comput Inf Sci 6:10. doi:10.1186/s13673-016-0064-3
Additional file 1. The entire dataset used in the tests of this study.