Although not popular in the feature selection literature, perhaps the simplest way to understand the discriminatory nature of a feature in a training set with two classes is to perform a search using a naive Bayes classifier. A low probability of error for an individual feature, as obtained with the Bayesian classifier, indicates good discriminatory ability and asserts the usefulness of that feature.
A standard approach in the feature selection literature is to apply training and selection criteria directly to the feature values. However, when the natural variability in the data is high and the number of training samples is small, even minor changes in feature values introduce errors into the Bayes probability calculations. Classification methods such as SVM, on the other hand, try to get around this problem by normalising the feature values and by parametric training of the classifier against several possible changes in feature values. In classifier studies, this essentially shifts the focus from feature values to distance values: instead of directly optimising the classifier parameters based on feature values, the distance function itself is trained and optimised.
Proposed method
In this work, we attempt to develop a feature selection technique based on the new concept of distance probability distributions. This is a very different concept from that of filter methods, which apply criteria such as inter-feature distance, Bayes error or correlation measures to determine a set of features with low redundancy. Instead of complicating the feature selection process with different search and filter schemes to remove redundant features and retain relevant ones, we focus on using all features that are most discriminative and useful for a classifier. Further, rather than looking at feature selection as a problem of finding inter-feature dependencies to reduce the number of features, we treat each feature individually and arrive at features that have the ability to contribute to an improvement in classifier performance.
Suppose there are M classes in a training set of patterns described by a set of J features, with ω_ij as the class label i associated with feature j, where i∈{1,…,M} and j∈{1,…,J}. Let x_jk be the value of feature j in the k-th training pattern, which can be used to calculate the inter-class and intra-class distance probability distributions. The intra-class distances of the j-th feature in a class with K training samples are the distances |x_jk − x_jk′|, where k, k′∈{1,…,K} and both samples lie within the same class. The inter-class distances of a feature x_jk belonging to a class ω_ij are the distances |x_jk − x_jk′|, where x_jk′ is the value of feature j for a sample in a class other than that of x_jk. We can represent the set of classes that do not belong to the class ω_ij as ω̄_ij. Then the intra-class distance probability distribution of feature j in class ω_ij is p^intra_ij(d) and the corresponding inter-class distance probability distribution is p^inter_ij(d). The area of overlap of these distributions can be seen as the probability of error of feature j for class label i, and represents the discriminatory ability of the feature. Since in practice we are dealing with samples in discrete form, the probability densities can be represented discretely with m bins, and the area of overlap P(j|i) can be represented as:
P(j|i) = \sum_{b=1}^{m} \min\big( p^{\mathrm{intra}}_{ij}(b),\ p^{\mathrm{inter}}_{ij}(b) \big)     (1)
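To make the construction of (1) concrete, the sketch below estimates the overlap for one feature and one class from a labelled training matrix, using Python with NumPy. The absolute difference as the per-feature distance, the shared bin edges and the bin count m = 20 are illustrative assumptions rather than choices fixed by the text.

import numpy as np

def overlap_area(X, y, j, i, m=20):
    """Estimate P(j|i): overlap between the intra-class and inter-class
    distance distributions of feature j for class label i (a sketch)."""
    f = X[:, j]
    same, other = f[y == i], f[y != i]
    # pairwise intra-class distances of feature j within class i
    intra = np.abs(same[:, None] - same[None, :])
    intra = intra[np.triu_indices_from(intra, k=1)]
    # pairwise inter-class distances between class i and the other classes
    inter = np.abs(same[:, None] - other[None, :]).ravel()
    # common bin edges so the two discrete densities are comparable
    edges = np.histogram_bin_edges(np.concatenate([intra, inter]), bins=m)
    p_intra, _ = np.histogram(intra, bins=edges)
    p_inter, _ = np.histogram(inter, bins=edges)
    p_intra = p_intra / max(p_intra.sum(), 1)
    p_inter = p_inter / max(p_inter.sum(), 1)
    # area of overlap: sum of bin-wise minima, Eq. (1)
    return float(np.minimum(p_intra, p_inter).sum())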
The relative area of overlap of feature j among all the classes can then be found as:
\hat{P}(j|i) = \frac{P(j|i)}{\sum_{i'=1}^{M} P(j|i')}     (2)
The minimum area of overlap for feature j across the different classes can then be used to calculate a measure that establishes the discriminatory ability of the feature:
\gamma_j = 1 - \min_{i \in \{1,\dots,M\}} \hat{P}(j|i)     (3)
Taking the minimum value of the relative overlap across the different classes ensures that features which can discriminate well for any one class among the many are captured, and such features can be considered useful for classification. The features are ranked in descending order based on the value of γ_j: a value of 0 forces the feature to take a low rank, while a value of 1 forces the feature to take the top rank. Let R represent the set of γ_j values arranged in the order of their ranks, each rank representing a feature or a group of features. The set R can be used to form a rank-based probability distribution by normalising the γ_j values.
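Following the reading of (2) and (3) given above, a per-feature score and ranking could be computed as follows; overlap_area is the helper sketched after Eq. (1), and the normalisation over classes is an assumption.

def discriminatory_scores(X, y, m=20):
    """Score every feature by gamma_j = 1 - min_i relative overlap,
    Eqs. (2)-(3), and rank the features in descending order of the score."""
    classes = np.unique(y)
    gamma = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        P = np.array([overlap_area(X, y, j, i, m) for i in classes])  # Eq. (1) per class
        P_rel = P / max(P.sum(), 1e-12)                               # relative overlap, Eq. (2)
        gamma[j] = 1.0 - P_rel.min()                                  # discriminatory score, Eq. (3)
    ranked = np.argsort(-gamma)   # descending: a score of 1 takes the top rank
    return gamma, ranked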
It is well known that many empirical ranked distributions originating from real-world data follow a power-law distribution. The top ranked features in such a ranked distribution often retain most of the information. This effect is observed across different problems and applications, and forms the basis of the winner-take-all and Pareto principles.
The ranked distribution is formed with γ_r representing the normalised value of γ_j for the feature at j having rank r. The cumulative ranked distribution is obtained as:
c_r = \sum_{r'=1}^{r} \gamma_{r'}     (4)
The top ranked values of c_r can be used to select the most discriminative set of features. Applying the winner-take-all principle, and along the lines of the 20–80 concept of rank-size distributions, it is logical to assume that the top ranked features carry the maximum amount of discriminative information. A subset of features X of size L∈[1,J] can be selected from the ranked features based on a selection threshold θ.
In other words, the features x_j corresponding to the ranks whose cumulative value c_r falls below the threshold θ are selected to form X of size L. The selection threshold θ for the top ranked features is set using the proposed Definition 1.
Definition 1
The selection threshold θ is equal to the standard deviation σ of the cumulative ranked distribution c_r, where r∈{1,…,J}.
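Under the reading of (4) and Definition 1 given here, with the threshold taken as the standard deviation of the cumulative ranked distribution, the selection of X could be sketched as:

def select_by_cumulative_threshold(gamma, ranked):
    """Form the rank-based distribution, its cumulative sum c_r (Eq. 4),
    and keep the top ranks whose c_r does not exceed theta = sigma (Def. 1)."""
    g = gamma[ranked]
    g = g / max(g.sum(), 1e-12)        # rank-based probability distribution
    c = np.cumsum(g)                   # cumulative ranked distribution, Eq. (4)
    theta = float(np.std(c))           # selection threshold, Definition 1 (as read here)
    keep = ranked[c <= theta]
    return keep if keep.size else ranked[:1]   # keep at least the top-ranked feature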
If each feature in X is uncorrelated and independent of the others, there will be few or no redundant features within X, and the selection of X based on discriminatory ability alone is sufficient to ensure good classification performance. However, in a feature selection problem there is a chance that the subset of discriminant features contains very similar features, and such features become redundant for improving classification performance. Identifying the independence of the discriminant features ensures the detection of the least redundant features. For two features {x_r, x_{r+1}}, ranked in order of their γ values, let p(x_r) and p(x_{r+1}) be their probability density functions, and p(x_r, x_{r+1}) their joint probability density function, where r∈[1,L] is the rank of a feature in X corresponding to an index j in the original feature space. The features are independent if it can be established that p(x_r, x_{r+1}) = p(x_r)p(x_{r+1}). This idea of independence testing is utilised to find an independence score for each feature. The area score between the probability densities p(x_r, x_{r+1}) and p(x_r)p(x_{r+1}) in the discrete domain is calculated as:
A(r, r+1) = \sum_{b_1=1}^{m} \sum_{b_2=1}^{m} \min\big( p_{b_1 b_2}(x_r, x_{r+1}),\ p_{b_1}(x_r)\, p_{b_2}(x_{r+1}) \big)     (6)
The independence score I_r of feature x_r with respect to the remaining L−1 features in X is determined as:
I_r = \min_{l \in \{1,\dots,L\},\ l \neq r} A(r, l)     (7)
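A two-dimensional histogram gives one way to realise the comparison in (6) and the aggregation in (7); taking the minimum area score over the other L−1 selected features is an assumption about how the per-pair scores are combined.

def area_score(a, b, m=20):
    """Overlap between the joint density p(a,b) and the product of its
    marginals p(a)p(b) on an m x m grid, Eq. (6)."""
    joint, _, _ = np.histogram2d(a, b, bins=m)
    joint = joint / max(joint.sum(), 1)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # p(a)p(b)
    return float(np.minimum(joint, product).sum())

def independence_scores(X, keep, m=20):
    """Independence score I_r of each selected feature against the
    remaining L-1 selected features, Eq. (7)."""
    L = len(keep)
    I = np.ones(L)
    for r in range(L):
        scores = [area_score(X[:, keep[r]], X[:, keep[l]], m)
                  for l in range(L) if l != r]
        if scores:
            I[r] = min(scores)
    return I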
A value of I_r = 1 indicates that x_r is an independent feature in X (or x_j in the original feature set, with the j-th feature in the original feature space corresponding to the r-th ranked feature in X), while a value of I_r less than 1 indicates that x_r is redundant and could be removed. The independence score I_r corresponding to the feature at j, together with the discriminatory score, can be used to select the most independent set of discriminant features:
z_s = \{ x_j : I_r \geq 1 - \varepsilon,\ r \in [1, L] \}     (8)
where ε = 0.01 is a small constant, and z_s is the set of the most relevant discriminative and independent features x_j, with s ≤ J.
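The pruning in (8) then reduces to a simple tolerance test on I_r; ε = 0.01 follows the text, while the comparison I_r ≥ 1 − ε is the reading assumed here.

def prune_redundant(keep, I, eps=0.01):
    """Retain the selected features whose independence score is within
    eps of 1, Eq. (8)."""
    return [f for f, score in zip(keep, I) if score >= 1.0 - eps]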
This subset of top ranked features is considered useful for classification. However, the parameters and the nature of the decision boundary imposed by a specific classifier need to be considered before these features can be used for classification. Consider a nearest neighbour classifier: the relative importance of a feature z_s ∈ X can be rated based on the recognition performance obtained when the individual feature z_s alone is used for classification. Assuming the independence of the features, and using leave-one-out cross validation, the class of the k-th sample in a training set of K samples, based on the s-th selected feature alone and with l∈{1,…,K}, l≠k, is identified as:
\omega^{*} = \omega\big( \arg\min_{l \neq k} \lvert z_{sk} - z_{sl} \rvert \big)     (9)
where z_{sk} denotes the value of the selected feature s for the k-th sample and ω(·) denotes the class label of a sample.
The selected features z_s are then ranked in descending order based on the total number of correct class identifications ω∗. The top ranked features represent the most discriminant features, while the lower ranked ones have relatively lower class discriminatory ability when using a nearest neighbour classifier. Such a ranking of the features for a given classifier identifies the features that respond best to that classifier.
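A per-feature leave-one-out nearest-neighbour evaluation in the spirit of (9) could be sketched as below; each retained feature is scored by how many training samples its single-feature 1-NN rule identifies correctly, and the features are re-ranked by that count. The function and variable names are illustrative.

def rank_for_nearest_neighbour(X, y, selected):
    """Rank the selected features by their single-feature 1-NN
    leave-one-out accuracy, Eq. (9)."""
    counts = []
    for s in selected:
        f = X[:, s].astype(float)
        correct = 0
        for k in range(len(f)):
            d = np.abs(f - f[k])
            d[k] = np.inf                        # leave the k-th sample out
            correct += int(y[np.argmin(d)] == y[k])
        counts.append(correct)
    order = np.argsort(-np.asarray(counts))      # descending by correct identifications
    return [selected[o] for o in order], counts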