Ranked selection of nearest discriminating features
© James and Dimitrijev; licensee Springer. 2012
Received: 20 August 2011
Accepted: 28 May 2012
Published: 24 June 2012
Feature selection techniques use a search-criteria driven approach for ranked feature subset selection. Often, selecting an optimal subset of ranked features using the existing methods is intractable for high dimensional gene data classification problems.
In this paper, an approach based on the individual ability of the features to discriminate between different classes is proposed. The area of overlap between feature-to-feature inter-class and intra-class distance distributions is used to measure the discriminatory ability of each feature. Features with an area of overlap below a specified threshold are selected to form the subset.
The reported method achieves higher classification accuracies with fewer features for high-dimensional micro-array gene classification problems. Experiments done on the CLL-SUB-111, SMK-CAN-187, GLI-85, GLA-BRA-180 and TOX-171 databases resulted in accuracies of 74.9±2.6, 71.2±1.7, 88.3±2.9, 68.4±5.1, and 69.6±4.4, with the corresponding selected number of features being 1, 1, 3, 37, and 89 respectively.
The area of overlap between the inter-class and intra-class distances is demonstrated as a useful technique for selection of most discriminative ranked features. Improved classification accuracy is obtained by relevant selection of most discriminative features using the proposed method.
Many of the contemporary databases used in data classification research [1–10] use a considerably large number of data points to represent an object sample. High dimensional feature vectors that result from these samples often contain intra-class natural variability reflected as noise and irrelevant information [11, 12]. The noise in feature vectors occurs due to inaccurate feature measurements, whereas irrelevancy of a feature depends on the natural variability and the redundancy within the feature vector. Further, the relevance of a feature is application dependent. For example, consider a hypothetical image consisting of image regions that correspond to faces and some other objects. When using this image in a face recognition application, the relevant pixels in the image are in the face regions while the pixels in the remaining regions are irrelevant. In addition, face regions themselves can have irrelevant information due to intra-class variability such as occlusions, facial expressions, illumination changes, and pose changes. Natural variability that occurs in high dimensional data significantly lowers the performance of pattern recognition methods. To improve the recognition performance of classification methods, most of the recent effort has gone into compensating for or removing intra-class natural variability from the data samples through various feature processing methods.
Dimensionality reduction [13–15] and feature selection [6–9] are two types of feature processing techniques that are used to automatically improve the quality of data by removing irrelevant information. Dimensionality reduction methods are popular because they reduce the number of features and the noise in a feature vector with the mathematical convenience of feature transformations and projections. However, the assumption of correlations between the features in the data is a core aspect of dimensionality reduction methods and can result in inaccurate feature descriptions. Further, it is not always possible to remove irrelevant information from the original data in a dimensionality reduction approach. Improving the quality of the resulting features using linear and, more recently, non-linear dimensionality reduction methods has consistently been a field of intense research and debate in the recent past. As an alternative to the dimensionality reduction approach, feature selection does not try to improve overall feature quality; instead, it removes irrelevant features from the high dimensional feature vector, thereby improving the performance of classification systems. Feature selection has been an intense field of study in recent years, gaining importance in parallel with the dimensionality reduction methods. Feature selection provides an advantage over dimensionality reduction methods because of its ability to distinguish and select the best available features in a data set [6–10, 16]. This means that feature selection methods can be applied to both the original feature vectors and to the feature vectors that result from the application of dimensionality reduction methods. From this point of view, feature selection can be considered as an essential component required for developing high performance pattern classification systems that use high dimensional data [1–3, 17].
Since higher dimensional feature vectors contain several irrelevant features that reduce the performance of pattern recognition methods, feature selection by itself can be used in most of the modern data classification methods to combat the issues resulting from the curse of high dimensionality [18, 19].
Feature selection problems revolve around the correct selection of a feature subset. In a search-criteria approach, feature selection is reduced to a search problem that detects an optimal feature subset based on the selected criteria. Exhaustive search ensures an optimal solution; however, as dimensionality increases such a search becomes computationally prohibitive. In the present literature, there exists no other distinct way to optimally select the features without reducing classification performance.
The existing research in feature selection has focused on excluding features that are determined to be most redundant using various search strategies and criteria assessment techniques [20–25]. In this paper, we propose a new method for feature selection based solely on individual feature discriminatory ability as an alternative to the existing search and criteria driven feature selection methods. The discriminatory ability of each feature is measured by the area of overlap between inter-class and intra-class distances that are obtained from feature-to-feature comparisons. Experimental results of a classification task based on microarray and image databases validate the effectiveness and accuracy of features obtained by our feature selection method.
Feature selection methods can be classified in three broad categories: filter model [26, 27], wrapper model [28, 29] and hybrid and embedded model [30, 31]. In order to evaluate and select features, filter models exclusively use characteristics of the data, wrapper models use mining algorithms, and hybrid models combine the use of characteristics of the data with data-mining algorithms. In general, these feature selection methods consist of three steps: (1) feature subset generation, (2) evaluation, and (3) stopping criteria. The subset generation process is used to arrive at a starting set of features using different types of forward, backward or bidirectional search methods. Some of the most common techniques employed are complete search such as branch and bound and beam search, sequential search such as sequential forward selection, sequential backward elimination and bidirectional selection, and random search such as random-start hill-climbing and simulated annealing. The generated subset is evaluated for goodness using either an independent or a dependent criterion. Independent criteria are generally used in the filter model; the popular ones are distance, dependency and consistency measures [35–37]. Dependent criteria are generally used in the wrapper model and require tuning of data-mining algorithms. The wrapper models perform better, but are computationally expensive and less robust to parameter changes in data-mining algorithms [38–41]. The goodness of the subsets under a selection criterion is assessed against stopping criteria such as a minimum number of features, an optimal number of iterations and lower classification error rates.
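To make the three steps above concrete, the following is a minimal sketch of sequential forward selection, one of the sequential search strategies mentioned. It is only an illustration of the conventional approach, not part of the proposed method. The function name, the `criterion` callback (any score where higher is better) and the stopping rule (stop when the score no longer improves or `max_features` is reached) are assumptions made for this example.

```python
from typing import Callable, List


def sequential_forward_selection(
    n_features: int,
    criterion: Callable[[List[int]], float],
    max_features: int,
) -> List[int]:
    """Greedy forward search over feature indices (illustrative sketch).

    criterion is a hypothetical scoring function: it receives a list of
    feature indices and returns a goodness score (higher is better).
    """
    selected: List[int] = []
    remaining = list(range(n_features))
    best_score = float("-inf")

    while remaining and len(selected) < max_features:
        # Subset generation + evaluation: try adding each remaining feature.
        scored = [(f, criterion(selected + [f])) for f in remaining]
        candidate, score = max(scored, key=lambda pair: pair[1])

        # Stopping criterion: no improvement over the current subset.
        if score <= best_score:
            break

        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score

    return selected
```

Each pass evaluates every remaining feature once, so the cost grows with both the number of features and the size of the selected subset, which is one reason such searches become expensive for high dimensional data.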
It can be noted that in conventional feature selection methods, features or subsets of features are selected based on the rank obtained by evaluating features against a selection criterion such that redundancy of features in the training set is minimized. The best performing methods for classification that rely on data-mining strategies include feature relevance calculations to select features holistically [20–22]. However, data-mining based solutions result in features that tend to be sensitive to minor changes in training data. Further, an increase in dimensionality makes the data-mining algorithms computationally intensive and often requires problem-specific optimization techniques. Contrary to data-mining based solutions, criteria driven methods based on filter models are computationally less complex and are more robust to minor changes in training data [23–25]. In such methods, the accuracy of the initial selection of subsets using exhaustive forward or backward search of the features would significantly impact the accuracy of features obtained with a given feature selection criterion. In addition, optimal selection of subsets has been pointed out to be intractable, and in some problems NP-hard. Further, variations in the nature of data from one database to another make the optimal selection of an objective function difficult, and a high classification accuracy using selected features from such methods is not always guaranteed. Because of such deficiencies, hybrids of filter and wrapper models also reflect these problems at various levels of feature selection.
The determination of inter-feature dependency as described by filter models and wrapper models lays the foundations of present day feature selection methods. These models arrive at features that are often tuned to suit a classifier using several machine learning strategies at the selection or criteria assessment stage. Some of the recent approaches that attempt to improve the performance of the conventional feature selection methods use the ideas of neighborhood margins [44–46] and manifold regularization using SVMs. However, similar to wrapper methods that use specific mining techniques, these recent methods are computationally complex and require additional optimization methods to speed up calculations. In addition, the optimal performance of the selected features on classifiers is highly sensitive to minor changes in training data and tuning parameters. Due to these reasons, the practical applicability and robustness of such methods on large sample high dimensional datasets are questionable.
Conventional feature selection methods apply multiple levels of processing to a given feature vector to find a subset of useful features for classification using several machine learning techniques and search strategies. The presented work, on the contrary, draws specific attention to selecting the most discriminating features through a single step process of discriminating subset selection. As distinct from the general idea of optimizing feature subsets for classification oriented filter and wrapper models, here we focus on developing an approach to determine relevant features from a training set solely by calculating their individual inter-class discriminatory ability.
Discriminant feature selection based on nearest features
Although not popular in the feature selection literature, perhaps the simplest way to understand the discriminatory nature of a feature in a training set with two classes is a search using a naive Bayes classifier. A low probability of error for an individual feature, as obtained using a Bayesian classifier, would indicate good discriminatory ability and assert the usefulness of the feature.
A standard approach in the feature selection literature is to directly apply training and selection criteria on the feature values. However, when natural variability in the data is high and the number of training samples is small, even minor changes in feature values would introduce errors in the Bayes probability calculations. Classification methods such as SVM, on the other hand, try to get around this problem by normalising the feature values and by parametric training of the classifiers against several possible changes in feature values. In classifier studies, this essentially shifts the focus from feature values to distance values. Instead of directly optimising the classifier parameters based on feature values, the distance function itself is trained and optimised.
In this work, we attempt to develop a technique of feature selection by using the new concept of distance probability distributions. This is a very different concept from that of filter methods, which apply various criteria such as inter-feature distance, Bayes error or correlation measures to determine a set of features having low redundancy. Instead of complicating the feature selection process with different search and filter schemes to remove redundant features and to maintain relevant features, we focus our work on using all features that are most discriminative and useful for a classifier. Further, rather than looking at feature selection as a problem of finding inter-feature dependencies for reducing the number of features, we treat each feature individually and arrive at features that have the ability to contribute to improving the classifier's performance.
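As an illustration of what a distance probability distribution for a single feature might look like, the sketch below builds intra-class and inter-class distance histograms for one feature and one class, and measures how much they overlap. The formulas of the original method are not available in this text, so the choices here are assumptions: absolute difference as the distance, a shared set of histogram bins, and the sum of bin-wise minima as the area of overlap. The function name `overlap_area` and the `bins` parameter are hypothetical.

```python
import numpy as np


def overlap_area(values, labels, target_class, bins=20):
    """Overlap in [0, 1] between intra-class and inter-class distance
    histograms of one feature (illustrative assumption, not the paper's
    exact formula). Needs >= 2 samples in the class and >= 1 outside it."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    in_class = values[labels == target_class]
    out_class = values[labels != target_class]

    # Intra-class distances: all unordered pairs within the target class.
    i, j = np.triu_indices(len(in_class), k=1)
    intra = np.abs(in_class[i] - in_class[j])
    # Inter-class distances: target-class samples against all other samples.
    inter = np.abs(in_class[:, None] - out_class[None, :]).ravel()

    upper = max(intra.max(), inter.max())
    if upper == 0:
        return 1.0  # every distance is zero: the distributions coincide

    # Shared bin edges so the two histograms are directly comparable.
    edges = np.linspace(0.0, upper, bins + 1)
    p_intra, _ = np.histogram(intra, bins=edges)
    p_inter, _ = np.histogram(inter, bins=edges)
    p_intra = p_intra / p_intra.sum()
    p_inter = p_inter / p_inter.sum()

    # Sum of bin-wise minima of two normalised histograms.
    return float(np.minimum(p_intra, p_inter).sum())
```

Under these assumptions, a feature whose values separate the classes well gives an overlap near 0, while a feature with the same spread inside and across classes gives an overlap closer to 1.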
Taking the minimum value of the measure across the different classes ensures that a feature that discriminates well for any one of the classes is retained, and such features can be considered useful for classification. The features are ranked in descending order of the resulting value: a value of 0 forces the feature to take a low rank, while a value of 1 forces the feature to take the top rank. Let R represent the set of these values, arranged in the order of their ranks, each rank representing a feature or a group of features. The set R can be used to form a rank based probability distribution by normalising its values.
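The ranking step can be sketched as follows. Because the defining equation is missing from this text, the score used here is an assumption: one minus the minimum per-class area of overlap, chosen only because it matches the description that 0 gives a low rank and 1 gives the top rank. The function name `rank_features` and the input layout (classes by features) are hypothetical.

```python
import numpy as np


def rank_features(overlap):
    """Rank features from a (n_classes, n_features) array of overlap
    values in [0, 1].

    Assumed score: 1 - min over classes of the overlap, so a feature that
    separates at least one class well scores close to 1.
    """
    overlap = np.asarray(overlap, dtype=float)
    score = 1.0 - overlap.min(axis=0)

    # Descending order of score; stable sort keeps ties in index order.
    order = np.argsort(-score, kind="stable")
    ranked = score[order]

    # Normalise the ranked scores into a rank-based distribution.
    total = ranked.sum()
    if total > 0:
        dist = ranked / total
    else:
        dist = np.full(len(ranked), 1.0 / len(ranked))
    return order, ranked, dist
```

Here `order` lists feature indices from top rank to lowest rank, and `dist` sums to 1 so that it can be read as a distribution over ranks.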
It is commonly observed that ranked distributions of an empirical nature originating from realistic back-end data tend to follow a power law distribution. The top ranked features in a ranked distribution often retain most of the information. This effect is observed in different problems and applications, and has formed the basis of the Winner-take-all and Pareto principles.
In other words, the features x_j corresponding to the ranks that fall below the cumulative area threshold θ are selected to form X with size L. The selection threshold θ for selecting the top ranked features is set using the proposed Definition 1.
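A minimal sketch of cumulative-area selection is given below, assuming a normalised ranked distribution (values sorted from top rank down, summing to 1) is already available. The threshold `theta` is passed in explicitly rather than computed, because the exact distribution from which the text derives it is not recoverable here. Always keeping the top ranked feature is an additional assumption of this sketch, as is the function name.

```python
import numpy as np


def select_by_cumulative_area(order, dist, theta):
    """Keep the top-ranked features whose cumulative share of the ranked
    distribution stays at or below theta (illustrative sketch).

    order: feature indices sorted from top rank to lowest rank.
    dist:  normalised ranked distribution aligned with `order`.
    theta: cumulative area threshold, supplied by the caller.
    """
    order = np.asarray(order)
    dist = np.asarray(dist, dtype=float)

    cumulative = np.cumsum(dist)
    keep = cumulative <= theta
    # Assumption: never return an empty subset; keep the top-ranked feature.
    keep[0] = True
    return order[keep]


# Usage with made-up numbers: four ranked features and a threshold of 0.85.
selected = select_by_cumulative_area([3, 0, 2, 1], [0.5, 0.3, 0.15, 0.05], 0.85)
```

Because the cumulative sum is non-decreasing, the kept features always form a prefix of the ranking, so the size L of the selected set follows directly from the threshold.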
The selection threshold θ is equal to the standard deviation σ of the distribution of , where .
where ε = 0.01 is a small number, and z_s is the set of most relevant discriminative independent features x_j, with s ≤ J.
The selected features z_s are ranked in descending order of the total number of correct class identifications w*. The top ranked features represent the most discriminant features, while the lower ranked ones are relatively lower in class discriminatory ability when using a nearest neighbour classifier. Such a ranking of the features for a given classifier identifies the best responding features for that classifier.
Results and discussion
The role of feature selection methods in a high dimensional pattern classification problem is to select the minimum number of features that maximize the recognition accuracy. In this section, we demonstrate how the newly proposed selection method performs this task on standard databases used for benchmarking feature selection methods.
Advancements in measurement techniques and computing methodologies have resulted in the use of microarray data in genetics, medicine, and patient diagnosis. The high dimensional feature vectors in microarray data often contain a large number of features that are not useful in the process of classification. The main role of our feature selection method is to identify the gene expressions from microarray data that are most useful for classification.
Five benchmark microarray based gene expression databases are used in this study: GLI-85 (also known as GSE4412), GLA-BRA-180 (also known as GDS1962), CLL-SUB-111 (also known as GSE2466), TOX-171 (also known as GDS2261), and SMK-CAN-187 (also known as GSE4115).
Selection threshold and classification
Feature ranking and classification
When the relative area of overlap for all the features is small, applying the threshold based selection results in the use of almost all available features for classification. Using the complete set of features in automatic classification is often not a feasible option because of the curse of dimensionality. In such situations, ranking the features and selecting a group of top ranked features can be used for both dimensionality reduction and selection of the best available features for classification. The simplest and most common approach for selecting the top ranks is by individual searches that evaluate each feature separately. Leave-one-out cross-validation is performed using the training set of individual features that are selected based on a specified value of the selection threshold. The selected features are ranked based on the recognition error obtained by evaluating each of them individually with a nearest neighbor classifier.
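The per-feature evaluation described above can be sketched as a leave-one-out test with a one-nearest-neighbour rule applied to each feature on its own. The details below are assumptions for illustration: absolute difference as the distance, ties between neighbours resolved by taking the first one found, and ranking by the count of correctly identified samples (equivalent to ranking by recognition error). The function names are hypothetical.

```python
import numpy as np


def loo_correct_counts(X, y):
    """For each feature (column of X), count how many samples are given
    the right label by their nearest other sample on that feature alone."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features, dtype=int)

    for j in range(n_features):
        column = X[:, j]
        # Pairwise absolute distances on this single feature.
        dist = np.abs(column[:, None] - column[None, :])
        # Leave-one-out: a sample may not be its own neighbour.
        np.fill_diagonal(dist, np.inf)
        nearest = dist.argmin(axis=1)
        counts[j] = int((y[nearest] == y).sum())
    return counts


def rank_by_loo(X, y):
    """Return feature indices in descending order of correct counts."""
    counts = loo_correct_counts(X, y)
    order = np.argsort(-counts, kind="stable")
    return order, counts[order]
```

Since each feature is scored independently, the cost is linear in the number of features and quadratic in the number of training samples, which suits microarray data with many genes and few samples.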
Table: The highest recognition accuracies on gene expression databases when selecting features within the top 100 ranked features obtained by three different classifiers. Recovered column headers: total number of features; selected number of features (one per classifier). Table values were not recovered.
Table: Comparison of maximum recognition accuracies on gene-expression databases using up to 100 top ranked features obtained by different feature-selection methods and a nearest neighbor classifier. Recovered headers: information gain; total number of features; selected number of features. Table values were not recovered.
In this paper, we presented a feature selection method for gene data classification that is based on the assessment of discriminatory ability of individual features within a class. The area of overlap between inter-class and intra-class distance distributions of individual features is identified as a useful measure for feature selection. A common framework to select the most important set of features is provided by applying a selection threshold. The ability of the proposed method to select the most discriminatory features resulted in improved classification performance with a smaller number of features, although the number of features that are required for achieving high recognition accuracy varies from one database to another. The presented feature selection technique can be used in the automatic identification of cancer causing genes and would help facilitate early detection of specific diseases or conditions.
We would like to thank the anonymous reviewers for their constructive comments, which have helped to improve the overall quality of the reported work.
- Guyon I, Elisseeff A: An introduction to variable and feature selection. J Machine Learning Res 2003, 3: 1157–1182.
- Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19): 2507–2517. 10.1093/bioinformatics/btm344
- Inza I, Larrañaga P, Blanco R, Cerrolaza A: Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intelligence Med 2004, 31: 91–103. 10.1016/j.artmed.2004.01.007
- Ma S, Huang J: Penalized feature selection and classification in bioinformatics. Brief Bioinform 2008, 9(5): 392–403. 10.1093/bib/bbn027
- James AP, Maan A: Improving feature selection algorithms using normalised feature histograms. IET Electron Lett 2011, 47(8): 490–491. 10.1049/el.2010.3672
- Liu H, Motoda H: Feature selection for knowledge discovery and data mining. 1998. Boston, Kluwer Academic Publishers
- Donoho D: For most large underdetermined systems of linear equations, the minimal l1-norm solution is also the sparsest solution. Comm Pure Appl Math 2006, 59: 907–934. 10.1002/cpa.20131
- Fan J, Samworth R, Wu Y: Ultrahigh dimensional feature selection: Beyond the linear model. J Machine Learning Res 2009, 10: 2013–2038.
- Glocer K, Eads D, Theiler J: Online feature selection for pixel classification. 2005. ACM New York, USA, pp 249–256
- Zhao Z, Liu H: Multi-source feature selection via geometry dependent covariance analysis. J Machine Learning Res, Workshop Conference Proc Volume 4: New Challenges Feature Sel Data Min Knowledge Discovery 2008, 4: 36–47.
- James AP, Dimitrijev S: Nearest Neighbor Classifier Based on Nearest Feature Decisions. Comput J 2012. doi:10.1093/comjnl/bxs001
- James A, Dimitrijev S: Inter-image outliers and their application to image classification. Pattern Recognit 2010, 43(12): 4101–4112. 10.1016/j.patcog.2010.07.005
- Lee JA, Verleysen M: Nonlinear Dimensionality Reduction. 2007. New York, Springer
- Thangavel K, Pethalakshmi A: Dimensionality reduction based on rough set theory: A review. Appl Soft Comput 2009, 9(1): 1–12. 10.1016/j.asoc.2008.05.006
- Sanguinetti G: Dimensionality Reduction of Clustered Data Sets. Pattern Anal Machine Intelligence, IEEE Trans 2007, 30(3): 535–540.
- Zhao Z, Wang J, Sharma S, Agarwal N, Liu H, Chang Y: An integrative approach to identifying biologically relevant genes. 2010, pp 838–849.
- Liu H, Yu L: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions Knowledge Data Eng 2005, 17(3): 1–12.
- Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expressions. Bioinformatics 2004, 20(15): 2429–2437. 10.1093/bioinformatics/bth267
- Liu H, Li J, Wong L: A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform 2002, 13: 51–60.
- Sikonja MR, Kononenko I: Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 2003, 53: 23–69. 10.1023/A:1025667309714
- Weston J, Elisseeff A, Schoelkopf B, Tipping M: Use of the zero norm with linear models and kernel methods. J Machine Learning Res 2003, 3: 1439–1461.
- Song L, Smola A, Gretton A, Borgwardt K, Bedo J: Supervised feature selection via dependence estimation. 2007. ACM New York, USA, pp 823–830
- Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat 2004, 32: 407–449. 10.1214/009053604000000067
- Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. 2003. NIPS foundation, La Jolla, CA, p 8
- Cawley GC, Talbot NLC, Girolami M: Sparse multinomial logistic regression via Bayesian L1 regularisation. 2007. NIPS foundation, La Jolla, CA, pp 209–216
- Hall MA: Correlation based feature selection for discrete and numeric class machine learning. 2000. San Francisco, Morgan Kaufmann, 17: 359–366
- Liu H, Setiono R: A probabilistic approach to feature selection: a filter solution. 1996. San Francisco, Morgan Kaufmann, pp 319–327
- Kohavi R, John G: Wrappers for Feature Subset Selection. Artif Intelligence 1997, 97(1–2): 273–324. 10.1016/S0004-3702(97)00043-X
- Caruana R, Freitag D: Greedy attribute selection. 1994. San Francisco, Morgan Kaufmann, pp 28–36
- Das S: Filters, wrappers and boosting: based hybrid for feature selection. 2001. San Francisco, Morgan Kaufmann, pp 74–81
- Ng AY: On feature selection: learning with exponentially many irrelevant features as training examples. 1998. San Francisco, Morgan Kaufmann, pp 404–412
- Dash M, Liu H: Feature selection for classification. Intell Data Anal 1997, 1(3): 131–156.
- Narendra PM, Fukunaga K: Branch and bound algorithm for feature subset selection. IEEE Trans Comput 1977, 26(9): 917–922.
- Doak J: An evaluation of feature selection methods and their application to computer security. 1992. Tech. rep., University of California, Davis
- Liu H, Motoda H: Feature selection for knowledge discovery and data mining. 1998. Boston, Kluwer Academic
- Almuallim H, Dietterich TG: Learning boolean concepts in the presence of many irrelevant features. Artif Intelligence 1994, 69(1–2): 278–305.
- Ben-Bassat M: Pattern recognition and reduction of dimensionality. 1982. North-Holland, pp 773–791
- Blum AL, Langley P: Selection of relevant features and examples in machine learning. Artif Intelligence 1997, 97: 245–271. 10.1016/S0004-3702(97)00063-5
- Dash M, Liu H: Feature selection for clustering. 2000, pp 110–121.
- Dy JG, Brodley CE: Feature subset selection and order identification for unsupervised learning. 2000. San Francisco, Morgan Kaufmann, pp 247–254
- Kim Y, Street W, Menczer F: Feature selection for unsupervised learning via evolutionary search. 2000. ACM New York, USA, pp 365–369
- Jain A, Zongker D: Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 1997, 19: 153–158. 10.1109/34.574797
- Blum A, Rivest R: Training a 3-node neural network is NP-complete. Neural Networks 1992, 5: 117–127. 10.1016/S0893-6080(05)80010-3
- John GH, Kohavi R, Pfleger K: Irrelevant features and the subset selection problem. 1994. San Francisco, Morgan Kaufmann, pp 121–129
- Abe S, Thawonmas R, Kobayashi Y: Feature selection by analysing class regions approximated by ellipsoids. IEEE Trans Syst, Man Cybernetics, Part C: App Rev 1998, 28: 282–287. 10.1109/5326.669573
- Neumann J, Schnorr C, Steidl G: Combined SVM-based feature selection and classification. Machine Learning 2005, 61: 129–150. 10.1007/s10994-005-1505-9
- Xu Z, King I, Lyu MR-T, Jin R: Discriminative semisupervised feature selection via manifold regularization. IEEE Trans Neural Networks 2010, 21(7): 1033–1047.
- Freije WA, Castro-Vargas FE, Fang Z, Horvath S, Cloughesy T, Liau LM, Mischel PS, Nelson SF: Gene expression profiling of gliomas strongly predicts survival. Cancer Res 2004, 64(18): 6503–6510. 10.1158/0008-5472.CAN-04-0452
- Sun L, Hui AM, Su Q, Vortmeyer A, Kotliarov Y, Pastorino S, James AP: Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell 2006, 9(4): 287–300. 10.1016/j.ccr.2006.03.003
- Haslinger C, Schweifer N, Stilgenbauer S, Döhner H, Lichter P, Kraut N, Stratowa C, Abseher R: Microarray gene expression profiling of B-cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and VH mutation status. J Clin Oncol 2004, 22(19): 3937–3949. 10.1200/JCO.2004.12.133
- Piloto S, Schilling T: Ovo1 links Wnt signaling with N-cadherin localization during neural crest migration. Development 2010, 137(12): 1981–1990. 10.1242/dev.048439
- Spira A, Beane JE, Shah V, Steiling K, Liu G, Schembri F, Gilman S, Dumas YM, Calner P, Sebastiani P, Sridhar S, Beamis J, Lamb C, Anderson T, Gerry N, Keane J, Lenburg ME, Brody JS: Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat Med 2007, 13(3): 361–366. 10.1038/nm1556
- Peng H, Long F, Ding C: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Machine Intell 2005, 27(8): 1226–1238.
- Cover TM, Thomas JA: Elements of Information Theory. 1991. New York, Wiley
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.