A multilevel features selection framework for skin lesion classification

Melanoma is considered one of the deadliest types of skin cancer, and its incidence has risen in the last few years; early diagnosis, however, significantly increases a patient's chances of survival. To this end, a few computer-based methods capable of diagnosing skin lesions at their initial stages have recently been proposed. Despite some success, room for improvement remains, and the machine learning community still considers this an outstanding research challenge. In this work, we propose a novel framework for skin lesion classification that integrates deep feature information to generate the most discriminant feature vector, with the advantage of preserving the original feature space. We use recent deep models for feature extraction, taking advantage of transfer learning. Initially, the dermoscopic images are segmented and the lesion region is extracted; the extracted region is then used to retrain the selected deep models, from which fused feature vectors are generated. In the second phase, a framework for discriminant feature selection and dimensionality reduction is proposed: entropy-controlled neighborhood component analysis (ECNCA). This hierarchical framework optimizes the fused features by selecting the principal components and discarding redundant and irrelevant data. The effectiveness of our design is validated on four benchmark dermoscopic datasets: PH2, ISIC MSK, ISIC UDA, and ISBI-2017. To validate the proposed method, a fair comparison with existing techniques is also provided. The simulation results show that the proposed design classifies skin lesions with 98.8%, 99.2%, 97.1%, and 95.9% accuracy on the four datasets with the selected classifiers, while using less than 3% of the features.


Introduction
Melanoma belongs to the category of inoperable skin cancers, and its occurrence rate has increased tremendously over the past three decades [1]. According to statistics provided by the World Health Organization (WHO), almost 132,000 new cases of melanoma are reported each year worldwide. It has been reported [2] that diagnosing melanoma in its early stages significantly increases the patient's chances of survival. Dermatoscopy, also known as dermoscopy, is a non-invasive clinical procedure used for melanoma detection, in which physicians apply gel to the affected skin prior to examining it with a dermoscope. It allows recognition of sub-surface structures of the infected skin that are invisible to the naked eye. With this clinical procedure, the skin lesion is magnified up to 100 times, thereby easing the examination [3].
For the diagnosis of melanoma, dermatologists mostly rely on the ABCD rule [4], the seven-point checklist [5], and the Menzies method [6]. These methods were formally approved at the 2000 Consensus Net Meeting on Dermoscopy (CNMD) [7], and are widely used by physicians for diagnostics. Even though these methods of manual inspection have shown improved performance, they have not proven feasible due to a number of constraints, including the large number of patients, human error, and infrastructure. Additionally, melanoma at its initial stages exhibits features similar to those of benign lesions, which makes it difficult to recognize; Fig. 1 presents two such examples. Furthermore, a physician's analysis may be quite subjective, since it depends on clinical experience as well as human vision, making the diagnosis procedure quite challenging.
To handle such constraints, there remains a need for an automated system capable of differentiating melanoma from benign lesions at its very initial stages. A computer-aided diagnosis (CAD) system may help physicians make use of technological developments in the field of dermoscopy, and it may also provide a second opinion. CAD systems adopt various machine learning techniques, for example, extracting various features (color, shape, and texture) from each dermoscopic image, followed by applying a state-of-the-art classifier [8,9]. These classification approaches mostly rely on
1 We examine the effect of the selected layers of deep architectures, including DenseNet-201, Inception-ResNet-V2, and Inception-V3, on the performance of classifiers.
2 We propose to fine-tune the existing pre-trained models with a smaller learning rate and to keep the weights of the initial layers frozen to avoid distortion of the complete model. We exploit a feature fusion technique, which takes advantage of all three selected architectures to generate a denser feature space.
3 We propose a hierarchical architecture for feature selection and dimensionality reduction, which in the initial step relies upon entropy for feature selection, followed by dimensionality reduction using neighborhood component analysis (NCA).
The rest of the article is organized as follows. The "Literature review" section presents a detailed overview of the existing literature in this domain. The "Mathematical model" section presents the mathematical model, whereas materials and methods are discussed in the "Materials and methods" section. The proposed framework is detailed in the "Proposed framework" section, and the "Results and discussion" section contains the experimental results and discussion. We conclude the manuscript in "Conclusion".

Literature review
In the literature, several CAD systems [16,17] have been proposed for melanoma detection which, to some extent, try to mimic the procedure performed by dermatologists, based on a range of features extracted using machine learning approaches. These systems mostly follow four primary steps [18]: (1) preprocessing, (2) lesion segmentation, (3) feature extraction and selection, and (4) classification.
Lesion image segmentation is one of the primary steps and has a lasting effect [19] on the classification process. Accurate segmentation of a lesion is an arduous task for a number of reasons. First, lesions vary widely in size, shape, color, and skin texture. Second, there is sometimes a smooth transition between skin color and lesion [19,20]. Further constraints include specular reflection, presence of hair, falloff towards the edges, and air and immersion-fluid bubbles. Sumithra [21] proposed to remove unwanted hair from the lesion before applying the segmentation algorithm. Feature extraction was subsequently performed using color and texture features. For classification, both support vector machine (SVM) and K-nearest neighbor (KNN) were used. Similarly, Attia et al. [22] implemented a hybrid framework for hair segmentation by combining convolutional and recurrent layers. They utilized deep encoded features for hair delineation, which are then fed into recurrent layers to encode the spatial dependencies among the incoherent image patches. The segmentation accuracy, calculated using the Jaccard index, is 77.8%, compared to 66.5% for the existing methods.
Joseph [23] used a fast marching and 2D derivative of Gaussian inpainting algorithm for hair artifact removal. Cheerla et al. [24] proposed an automatic method for segmentation. They used Otsu's thresholding for segmentation, and local binary patterns (LBP) [25] for texture feature extraction. Neural network classifiers were used for classification, which yielded 97% sensitivity and 93% specificity. Hawas et al. [26] proposed an optimized clustering estimation using neutrosophic graph-cut (OCE-NGC) algorithm for skin lesion segmentation. They made use of a bio-inspired technique (genetic algorithm) to optimize the histogram-based clustering procedure, which searches for the optimal centroid/threshold values. In the following step, they grouped the pixels according to the generated threshold value using the neutrosophic c-means algorithm. Finally, a graph-cut methodology [27] is applied to separate the foreground and background regions in the dermoscopic image. The authors reported 97.12% average accuracy and 86.28% average Jaccard values. Similarly, the authors of [28] implemented a novel scheme (transform domain representation-driven CNN) for skin lesion segmentation. They trained the model from scratch and managed to cope with constraints including a small dataset, artifact removal, excessive data augmentation, and contrast stretching. The authors reported a 6% higher Jaccard index and a shorter training time on the publicly available ISBI 2016 and 2017 datasets. Euijoon et al. [29] proposed a saliency [30] based segmentation algorithm, in which detection of the background was based on the spatial layout, including color and boundary information. To minimize detection error, they implemented a Bayesian framework.
Features play a vital role in classification and are extracted by following local, global, or local-global scenarios [7]. Barata et al. [31] adopted a local-global method for detecting melanoma from dermoscopic images. Local methods were applied to extract features using bag-of-words, whilst global methods were explored for the classification of skin lesions. Promising results were achieved in terms of greater sensitivity and specificity. Abbas et al. [32] suggested a perceptually oriented framework for border identification, combining the strengths of both edge and region based segmentation. Later, a hill-climbing [33] approach was used to identify the region-of-interest (ROI), followed by an adaptive threshold mechanism to detect the optimal lesion border.
Chatterjee et al. [34] proposed a cross-correlation based technique for feature extraction with an application to skin lesion classification. The authors considered both spatial and spectral features of the lesion region based on visual coherency, using a cross-correlation technique. Kernel patches are then selected based on the skin disease categories and classified using the proposed multi-label ensemble multi-class classifier. The acquired sensitivities for a set of classes including nevus, melanoma, BCC and SK diseases are 99.01%, 98.7%, 98.87%, and 99.41%. Lei et al. [35] proposed a lesion detection and recognition methodology built on a multi-scale lesion-biased representation (MLR) and joint reverse classification. The proposed algorithm takes advantage of scales and rotations to detect the lesion, in contrast to the conventional single rotation method. Omer et al. [36] provided a unique solution for skin lesion segmentation using global thresholding based on color features. As the following feature extraction step, they utilized the 2D fast Fourier transform (2D-FFT) and the 2D discrete Fourier transform (2D-DFT). Mahbod et al. [37] introduced an ensemble technique combining inter- and intra-architecture CNNs. The deep features extracted from each CNN are then used for classification with multi-SVM classifiers. The proposed method proved to be robust in terms of feature extraction, fusion and classification for skin lesion images. Kahn et al. [18] presented a technique for classification of skin lesions using a probabilistic distribution, with an entropy based method for feature selection. Al-masni et al. [38] investigated a set of deep frameworks for both segmentation and classification. Initially, they implemented a full resolution convolution network for lesion segmentation. Later, the lesion regions are used to extract features using multiple deep architectures, including Inception-ResNet-V2 and DenseNet-201. Their framework is trained on three datasets, ISIC 2016, ISIC 2017, and ISIC 2018, and achieves promising results. Similarly, several groups of researchers [39][40][41] are utilizing deep frameworks to detect multiple abnormalities, with an application to skin lesion classification.
From the detailed review, it is concluded that various existing methods show improved performance on dermoscopic images, but only when the following conditions are already satisfied:
1 High contrast between the lesion area and the surrounding region.
2 Color uniformity inside the lesion area.
3 Marginal presence or absence of artifacts such as dark corners, hair, and color charts, to name but a few.
Therefore, our primary focus is to develop a technique that efficiently handles the cases in which the aforementioned conditions do not hold.

Mathematical model
Given a dermoscopic image database, we are required to assign a label to each image, which belongs to either the benign or the malignant class. Let $D \subset \mathbb{R}^{(r \times c \times p)}$ be a dermoscopic image and $\psi = \{\psi(j) \mid j \in \mathbb{R}\}$ be a formally specified image dataset, where $\psi_1(j), \ldots, \psi_k(j) \subset \psi \in \mathbb{R}$ are the pixel values of the $k$ channels. The number of classes $C$ is provided by the user; a class is therefore discriminated as $\tilde{\psi}$, a modified version of $\psi$, interpreted as the mapping $\psi \to \tilde{\psi}$. The modeling of $\psi$ to achieve the output $\tilde{\psi}$ is described in terms of three quantities: $\psi_f$, the features extracted after applying transfer learning; $\psi_{fu}$, the fused features from the fully connected layers of different architectures; and $\kappa(\psi_{fu})$, the selected features' representation after processing through a hierarchical structural design.

Convolutional neural networks
CNNs are among the most powerful deep feedforward neural network models used for object detection and classification [42]. In a CNN, all neurons are connected to a set of neurons in the next layer in a feedforward fashion. The basic CNN architecture, as given in Fig. 2, incorporates three primary sub-blocks: convolution, pooling, and fully connected layers.
1 Convolution layer. The fundamental unit of the CNN architecture, the convolution layer, detects and extracts local features from an input image sample $X_p^{(r \times c \times p)}$, where $r = c$ for a square input. Let us consider the input image samples $X_p = \{x_1, x_2, \ldots, x_n\}$, where $n$ represents the size of the training dataset. For each input image, the corresponding output is $y_p = \{y_1, y_2, \ldots, y_n\}$, where $y_p \in \{1, 2, \ldots, C\}$ and $C$ represents the number of classes. The convolution layer includes a kernel that slides across the input image as $X^{(r \times c \times p)} * H^{(r' \times c' \times p)}$, and the local features $f \in f_l$ are extracted through a relation in which $F_i^l$ is the feature map output of layer $l$, $\omega_i^l$ and $b_j^l$ are the trainable parameters of layer $l$, and $\delta(\cdot)$ is an activation function.
2 Pooling layer. The pooling layer is another substantial concept in CNNs and is considered a non-linear down-sampling technique. It is a meaningful combination of two fundamental concepts, max pooling and convolution.
3 Fully connected layer. The convolution and pooling layers are followed by a fully connected feedforward layer, FC. It follows the same principle as a traditional fully connected feedforward network, with a set of input and output units. This layer extracts responses based on the feature weights calculated from the previous layer.
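The three sub-blocks described above can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in this work: the function names are hypothetical, a single channel is assumed, the activation δ(·) is taken to be ReLU, and the sliding-window operation is the cross-correlation form commonly used in CNN libraries.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (stride 1, no padding), add the bias,
    and apply an activation. ReLU is an assumption made for this sketch."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(max(0.0, s + bias))
        out.append(row)
    return out


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def fully_connected(features, weights, biases):
    """Weighted sum of the input features for every output unit."""
    return [sum(w * f for w, f in zip(ws, features)) + b
            for ws, b in zip(weights, biases)]


# Toy 4x4 single-channel input and a 2x2 kernel (illustrative values only).
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_valid(image, kernel)
pooled = max_pool2d(feature_map)
flat = [v for row in pooled for v in row]
scores = fully_connected(flat, [[0.5], [-1.0]], [0.0, 1.0])
```

In a real network the kernel weights and biases are learned; here they are fixed so that each stage can be followed by hand.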

Transfer learning
Conventional algorithms assume that the feature characteristics of the training and testing data are nearly identical and can be comfortably approximated [43]. Several pre-trained models are trained on natural images, and are hence not directly suitable for specialized applications. Additionally, data collection for real-world applications is a tedious task. Transfer learning (TL) is therefore a solution that provides accurate classification with a limited number of training samples. The concept is briefly defined as a system's capability to transfer the skills and knowledge learnt while solving one class of problems to a different class of problems (source-target relation), Fig. 3. The real potential of TL is best leveraged when the target and source domain datasets are highly disparate in size, such that the target domain dataset is significantly smaller than the source domain dataset [44]. Consider a source domain $D_S = \{(x_1^S, y_1^S), \ldots, (x_i^S, y_i^S), \ldots, (x_n^S, y_n^S)\}$, where $(x_n^S, y_n^S) \in \mathbb{R}$, with a specified learning task $L_S$, and a target domain $D_T$ with learning task $L_T$, where $(x_n^T, y_n^T) \in \mathbb{R}$. Let $(m, n)$, with $n \ll m$, be the training data sizes, and let $y_1^D$ and $y_1^T$ be their respective labels. The fundamental function of TL is to boost the learning capability of the target function $D_T$ by utilizing the knowledge gained from the source $D_S$ and the target $D_T$.
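As a rough illustration of the fine-tuning idea (transferred weights in the early layers stay fixed while later layers are updated with a small learning rate), the following sketch applies one gradient-descent step to a toy list of layers. The function name, the flat per-layer weight lists, and the numeric values are assumptions made for illustration; the actual models are deep networks trained with a framework-specific optimizer.

```python
def fine_tune_step(layers, grads, lr=1e-4, n_frozen=2):
    """Return updated layer weights after one gradient-descent step.

    layers, grads: lists of per-layer weight lists and matching gradients.
    The first `n_frozen` layers keep their transferred weights unchanged.
    """
    updated = []
    for idx, (weights, grad) in enumerate(zip(layers, grads)):
        if idx < n_frozen:
            updated.append(list(weights))  # frozen: copy as-is
        else:
            updated.append([w - lr * g for w, g in zip(weights, grad)])
    return updated


# Toy example: three "layers"; only the last one is trainable.
pretrained = [[1.0, 2.0], [3.0], [4.0, 5.0]]
gradients = [[9.0, 9.0], [9.0], [2.0, -2.0]]
tuned = fine_tune_step(pretrained, gradients, lr=0.5, n_frozen=2)
```

The learning rate of 0.5 is chosen only to make the arithmetic easy to follow; fine-tuning in practice uses a much smaller value.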

Pre-trained CNN models
Several researchers have proposed a set of CNN architectures for computer vision applications such as segmentation and classification [53,54]. In this work, we utilize three widely used pre-trained models for feature extraction: Inception-V3, Inception-ResNet-V2, and DenseNet-201. These models were selected on the basis of their Top-1 accuracy, Table 1.

Inception-V3
Inception-V3 is trained on the ImageNet database. It comprises two fundamental units: feature extraction and classification. Inception-V3 employs inception units that allow the framework to increase the depth and width of the network while also lowering the number of computational parameters.

Inception-ResNet-V2
Inception-ResNet-V2 is an extension of Inception-V3, and is also trained on the ImageNet database. At its core, it combines the inception module with the ResNet module. The residual connections allow bypasses in the model, which makes the network behave more robustly. Inception-ResNet-V2 fuses the computational efficiency of the inception units with the optimization leverage contributed by the residual connections.

DenseNet-201
DenseNet-201 is also trained on the ImageNet database. It is designed around a more sophisticated connectivity pattern that iteratively integrates all output features in a regular feedforward fashion. Moreover, it mitigates the vanishing-gradient problem, reduces the number of input/functional parameters, and strengthens feature propagation.

Dataset
In this work, we have performed our simulations on four publicly available datasets:
1 PH2 [55]: The ground truth is also provided, segmented manually with the help of physicians; the images are classified as normal, atypical nevus (benign), or melanoma.
2 ISIC-MSK: The second dataset used in this research is from the International Skin Imaging Collaboration (ISIC) [56]. This dataset contains 225 RGB dermoscopic images, acquired from various international hospitals with the help of different devices.
3 ISIC-UDA: This is another sub-dataset of ISIC. We have collected 557 images, with 446 training and 111 testing samples, from the ISIC-UDA dataset.
4 ISBI-2017: ISBI-2017 [57] is another publicly available dataset used for the characterization of skin cancer in dermoscopic images. It contains 2750 images, with 2200 training and 550 testing samples. The ISBI-2017 dataset has three disease classes: melanoma, keratosis, and benign; however, since keratosis is a common benign skin condition, we have divided the samples into two classes: malignant and benign.
Manual annotations of all the datasets discussed above have been provided by dermatologists as ground truths for evaluation purposes. The partitioning of these datasets is shown in Table 2. Note that we have divided each target dataset into two sets, with a pre-defined 80% for training and 20% for testing. The 80% training portion comprises a training set (70%), used to train the models, and a validation set (10%), used for model evaluation and fine-tuning.
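The 70/10/20 partition can be sketched as follows. The helper name, the random shuffling, and the fixed seed are assumptions made for illustration; the text does not state how samples were assigned to each subset or whether the split was stratified by class.

```python
import random


def split_dataset(items, train=0.7, val=0.1, seed=0):
    """Shuffle `items` and split them into train / validation / test lists.
    The test set receives whatever remains after train and validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Image identifiers stand in for the 557 ISIC-UDA images.
train_ids, val_ids, test_ids = split_dataset(range(557))
```

With 557 items this yields 390 + 56 = 446 samples for training and validation and 111 for testing, which agrees with the ISIC-UDA counts quoted above; 2750 items yield 2200 and 550, as for ISBI-2017.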

Proposed framework
In dermoscopy, cancer classification is still an outstanding challenge, which the proposed design, discussed below, is intended to address efficiently. Most of the constraints enumerated in the "Literature review" section are addressed by a cascaded framework comprising four fundamental blocks: preprocessing, lesion segmentation, feature extraction and selection, and labeling/classification. Figure 4 summarizes the adopted methodology.

Preprocessing
The preprocessing step copes with image imperfections introduced at the initial acquisition step by eliminating artifacts such as hair or ruler markings. If left in place, these artifacts may affect segmentation, which in turn leads to inaccurate classification. Ideally, the collected image should be free from these artifacts; however, due to certain complications, it is strenuous to remove the hair before acquisition. Therefore, an algorithmic approach is preferred. In this work, a widely used software, Dull Razor [58], is utilized, which localizes the hair and removes it using bilinear interpolation. Additionally, it applies an adaptive median filter to smooth the replaced hair pixels.

Lesion/image segmentation
Segmentation is a critical step that plays a primary role in the classification of skin lesions. In addition to handling various problems, including color variations, hair presence, and lesion irregularity, a robust segmentation method can identify infected regions with improved accuracy. Once the images have been transformed to keep the same aspect ratio, the following two steps are performed in turn to complete the segmentation process:
1 Contrast stretching, to make the lesion (foreground) region distinct from the background.
2 Segmentation of the lesion region based on a mean and mean deviation based segmentation procedure.
The immediate objective of the contrast stretching scheme is to make the foreground (lesion region) maximally differentiable from the background. Additionally, this pre-processing step refines the images considerably, which leads to improved classification accuracy [59]. Initially, each channel of a three-dimensional RGB image ($I_D \in \mathbb{R}^{r \times c \times p}$) is processed independently to make the foreground region visually distinguishable. Each channel passes through the series of interlinked steps enumerated below:
1 Gradients are computed for each channel using the Sobel-Feldman operator, with a fixed kernel size of (3 × 3).
2 Each channel is divided into equal-sized blocks (4, 8, 12, …), which are rearranged in descending order. Weights are then assigned to each block according to gradient magnitude, where $w_b^i$ $(i = 1, \ldots, 4)$ is a weight coefficient and $\xi$ represents the threshold values against the computed gradient.
3 The overall weighted gray value is computed for each block, where $n_k(b)$ represents the number of gray pixels encased in block $k$.
To get improved results, a few aspects are strictly considered: (a) standard block size, (b) optimized weight criteria, and (c) selection of regions with maximum information. Upon careful examination of dermoscopic images, the regions with maximum information (lesion) are found to lie in the range of 25% to 75%. Therefore, the worst case is considered and we partition the image into 12 basic cells, each covering about 8.3% of the image. These cells are then selected based on the criterion of maximum information (summation of pixels in each cell). Finally, weights $E_p^c$ are assigned to each block according to its edge points, where $E_{\max}^c$ represents the cells with maximum edges. A subsequent log operation further refines the channel [18], giving $I_c(x, y)$ from the original $I_s(x, y)$, where $\beta$ is chosen to be 3 by trial and error.
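The first two steps of the scheme above (3 × 3 Sobel-Feldman gradients, followed by ordering equal-sized blocks by gradient content) can be sketched as below. This is an illustrative approximation only: the function names are hypothetical, blocks are scored by their summed gradient magnitude, and the weighting, thresholding, and log-refinement formulas of the framework are not reproduced.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(channel):
    """Gradient magnitude of one channel; border pixels are left at 0."""
    h, w = len(channel), len(channel[0])
    mag = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    px = channel[r - 1 + i][c - 1 + j]
                    gx += SOBEL_X[i][j] * px
                    gy += SOBEL_Y[i][j] * px
            mag[r][c] = math.hypot(gx, gy)
    return mag


def rank_blocks(mag, grid_rows, grid_cols):
    """Split `mag` into a grid of equal blocks and return
    (block_index, summed_gradient) pairs in descending order."""
    bh = len(mag) // grid_rows
    bw = len(mag[0]) // grid_cols
    scores = []
    for br in range(grid_rows):
        for bc in range(grid_cols):
            total = sum(mag[r][c]
                        for r in range(br * bh, (br + 1) * bh)
                        for c in range(bc * bw, (bc + 1) * bw))
            scores.append((br * grid_cols + bc, total))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy channel: dark left half, bright right half (one vertical edge).
channel = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
ranking = rank_blocks(sobel_magnitude(channel), 1, 3)
```

In this toy channel the middle block contains the edge and is therefore ranked first. A 3 × 4 grid would give the 12 cells mentioned in the text.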
The contrast stretching block helps the segmentation step extract the lesion area with improved accuracy. The probabilistic methods (mean segmentation and mean deviation based segmentation) are applied independently to the same image, and their outputs are subjected to image fusion in the following step.
Mean segmentation is calculated from $\phi_{thresh}$, Otsu's threshold; $\varsigma$, a scaling factor selected to be 7 by trial and error; and $C$, a constant whose value lies in the range 0 to 1. Similarly, mean deviation based segmentation is calculated on the enhanced image by following an activation function, with $\sigma_{MD}$ calculated to be 0.7979 by trial and error.
The segmented images from both distributions are later fused to obtain the resultant image.
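Otsu's threshold, which enters the mean segmentation step as $\phi_{thresh}$, can be computed as in the sketch below. Only the standard Otsu criterion (maximizing the between-class variance of the gray-level histogram) is shown; the mean and mean-deviation segmentation formulas and the fusion rule of the framework are not reproduced, and the function name and 8-bit gray-level assumption are illustrative.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance
    when pixels <= t form one class and pixels > t form the other."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    w_low = 0          # number of pixels in the lower class
    sum_low = 0.0      # sum of gray levels in the lower class
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Toy bimodal "image": 50 dark pixels and 50 bright pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]
```

For a flattened image, `mask` marks the pixels above the threshold; which side corresponds to the lesion depends on the image and is not fixed by this sketch.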
Sample segmentation results are provided in Fig. 5, where it can be observed that they are visually similar to the available ground truths. In some cases, the foreground and background are not distinct enough, and the resulting segmentation is not sufficiently accurate; examples of such cases are given in Fig. 6.

Deep features extraction
The proposed framework can be observed in Fig. 4, showing various stages from extraction to the final classification. Following the segmentation step, the proposed hierarchical design is applied on the extracted set of features to conserve the salient deep features.

Feature layers
It has been observed that systems relying on deep features extracted from a single layer of a single pre-trained model are not robust enough [60]. Therefore, an alternative strategy is adopted: multiple models and even multiple layers are utilized. The most discriminant features from all three re-trained (transfer learning) models are selected by exploiting three fundamental output layers, named fc1000 and predictions. During the training phase, transferred weights are kept frozen at their initial values to extract off-the-shelf deep features. Complete information regarding the selected deep layers, along with their notations, is provided in Table 3. The fully connected layers of DenseNet-201, Inception-ResNet-V2, and Inception-V3 are denoted FV0, FV1, and FV2, respectively.

Fusion mechanism
Rather than utilizing independent features from the selected pre-trained models, we adopt a feature fusion strategy. Feature sets originating from different re-trained models are consolidated to generate a fused feature set that retains the most discriminant features. Our objective here is to explore the classifier's behavior upon fusing multiple ConvNet features. A rudimentary fusion strategy is adopted: the feature sets are serially concatenated to construct a resultant feature vector, which takes advantage of all feature spaces. Let us consider a joint vector $FV \in \mathbb{R}^{1 \times 3} = \{FV_k^i\}$, where $i \in \{1, 2, 3\}$ represents the selected pre-trained architecture and $k \in \{1, 2, 3\}$ is the selected layer.
The fused feature vector $FV_\kappa = FV_k^i \,\|\, FV_l^j$ combines two or three pre-trained models, giving $\kappa = \{1, \ldots, 4\}$ combinations. Systems that adopt a feature fusion strategy do not necessarily perform better than those using a single layer: fusion increases feature redundancy, which makes the classifier behave inefficiently. Therefore, adding feature selection and dimensionality reduction steps decreases not only the redundancy but also the computation time, and the overall classification accuracy increases as a result.
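Serial concatenation and the enumeration of the four combinations (three pairs and one triple) can be sketched as follows. The helper names and the tiny three-element vectors are placeholders chosen for illustration; the labels FV0, FV1, and FV2 follow the notation introduced earlier for the three architectures.

```python
from itertools import combinations


def fuse(*vectors):
    """Serially concatenate feature vectors into one fused vector."""
    fused = []
    for vec in vectors:
        fused.extend(vec)
    return fused


def all_fusions(named_vectors):
    """Return a fused vector for every combination of two or more models.
    Keys join the model names with '||' to mirror the notation in the text."""
    names = sorted(named_vectors)
    result = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            result["||".join(combo)] = fuse(*(named_vectors[n] for n in combo))
    return result


# Placeholder features for one image (real vectors are far longer).
features = {
    "FV0": [0.1, 0.2, 0.3],   # DenseNet-201
    "FV1": [0.4, 0.5, 0.6],   # Inception-ResNet-V2
    "FV2": [0.7, 0.8, 0.9],   # Inception-V3
}
fusions = all_fusions(features)
```

The length of each fused vector is the sum of the lengths of its parts, which is why a selection and reduction stage is applied afterwards.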

Entropy-controlled NCA
Our proposed strategy revolves around one core concept: achieving the best classification accuracy with the minimum number of features. To this end, a hierarchical framework is implemented that consolidates both feature selection and dimensionality reduction, so as to avoid the curse of dimensionality.

Feature selection
The resultant fused vector $FV_\kappa$ may include redundant or irrelevant features, and is therefore passed through an attribute or variable selection procedure. This process of selecting a subset of the most discriminant variables is termed feature selection [60]. In the proposed work, the concept of entropy [61] is utilized, which can analyze uncertain data and reveal a signal's randomness by exhibiting the system's disorder.
Let the training set contain $N$ labeled samples, where $X = \{x_j\}_{j=1}^{N}$, $x_j \in \mathbb{R}^{\nu}$, are $\nu$-dimensional feature vectors and $T = \{t_j\}_{j=1}^{N}$ are the class labels, with $t_j \in \{0, 1\}$ for the binary case. This feature space has a measure $\varphi$ with probability $\varphi(X) = 1$; the entropy is then calculated from $\varphi(x_j)$, the observation probability of a particular feature $x_j \in X$. The purpose of applying entropy is to identify a set of unique features with natural variability; the entropy value tends towards 0 when feature variability is minimal. The concept of entropy has been adopted in one of the recent works [18], where the authors proposed to apply entropy to a distance matrix generated from the feature space, yielding restricted overall accuracy (OA). In the proposed approach, by contrast, we assign ranks to the features, $FV_E$, having $(R < N)$ dimensions. The top 80% of features with maximum entropy value are included to generate the resultant set. This rank-based selection criterion only down-samples the original feature space at this stage, while conserving the original information for the next level, dimensionality reduction.
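The rank-and-keep step can be sketched as follows. Several details are assumptions made for illustration rather than taken from the text: the entropy is taken to be Shannon entropy, the observation probabilities are estimated with an equal-width histogram (10 bins) over each feature column, and the function names are hypothetical. Only the 80% retention ratio comes from the description above.

```python
import math


def feature_entropy(values, bins=10):
    """Shannon entropy (natural log) of one feature column, using an
    equal-width histogram as the probability estimate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no variability
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def select_by_entropy(X, keep=0.8, bins=10):
    """Rank feature columns by entropy and keep the top `keep` fraction.
    Returns the kept column indices and the reduced feature matrix."""
    n_features = len(X[0])
    scores = [feature_entropy([row[j] for row in X], bins)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    n_keep = max(1, round(keep * n_features))
    kept = sorted(order[:n_keep])
    return kept, [[row[j] for j in kept] for row in X]


# Four samples, five features; the third feature is constant.
X = [[0, 1, 5, 2, 9],
     [1, 3, 5, 4, 7],
     [2, 5, 5, 6, 5],
     [3, 7, 5, 8, 3]]
kept, X_reduced = select_by_entropy(X, keep=0.8)
```

Here the constant column has zero entropy and is the one discarded, leaving four of the five features for the subsequent reduction stage.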

Dimensionality reduction
Classifiers behave poorly when there are too many variables or when the variables are highly correlated. At this stage, dimensionality reduction techniques play a vital role by reducing the number of random variables and retaining the resultant vectors in lower dimensions, $FV_S$, where $(S \ll R)$. For this application, we use NCA as a dimensionality reduction technique, although it is mostly used as a feature selection method. NCA, originally introduced by Goldberger et al. [62], is a distance metric learning algorithm that selects the projection by optimizing the performance of a nearest neighbor classifier in the projected space. NCA learns, from both the features and their associated labels, projections that are effective at partitioning the classes in the projected space. To this end, NCA optimizes a criterion related to the leave-one-out (LOO) accuracy of a stochastic nearest neighbor (NN) classifier in the projected space induced by the training set. The selected entropy-controlled fused training vector, $FV_E$, consists of $\{(x_1, t_1), \ldots, (x_R, t_R)\}$, where $x_j \in \mathbb{R}^m$. NCA learns a projection matrix $Q \in \mathbb{R}^{s \times m}$, representing a transformation that projects $x_j$ into an $s$-dimensional space, $\varpi_j = Qx_j \in \mathbb{R}^s$, with $s \le m$. The projection matrix $Q$ defines a Mahalanobis distance metric, calculated between two samples $x_j$ and $x_k$ in the projected space.
The primary objective of this method is to learn a projection $Q$ that maximizes the separation of labeled data by constructing a cost function in the transformed space based on soft-neighbor assignments. The rationale is that every sample $x_j$ keeps a neighboring sample $x_k$ as its reference with some associated probability $p_{jk}$, where $\Upsilon(\varphi) = \exp(-\varphi/\varsigma)$ represents a kernel function with kernel width $\varsigma$. The kernel width has a clear influence on the sample probabilities, and this additional step makes the model more robust. Under this stochastic selection rule, the optimization criterion can be defined using the soft-neighbor assignments, through the probability that the sample $x_j$ is assigned the correct class label.
The optimization criterion seeks to maximize the number of correctly labeled samples under the leave-one-out policy. To perform feature reduction, as well as to avoid overfitting, a positive regularization term is introduced as a standard weight in the cost function, which can be tuned via cross-validation [63]. The complete criterion gives rise to a gradient rule, obtained by differentiating $\Xi(Q)$ with respect to $q_k$, which is used to optimize the projection matrix $Q$. Several gradient optimizers can be employed to maximize the objective function; in this article, we employ the conjugate gradient method. Algorithm 1 explains the proposed approach from feature extraction (after transfer learning) to final classification.
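The quantities involved in the criterion can be made concrete with a small sketch. It follows the standard NCA construction of Goldberger et al. [62] with an exponential kernel of width ς and a squared-norm penalty on $Q$; the exact form of the equations used in this work is an assumption, and the names `nca_objective`, `width`, and `reg` are hypothetical. The sketch only evaluates the objective for a given $Q$ and does not perform the conjugate gradient optimization.

```python
import math


def project(Q, x):
    """Apply the projection matrix Q (list of rows) to the sample x."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]


def nca_objective(Q, X, labels, width=1.0, reg=0.0):
    """Leave-one-out soft-neighbor objective for a projection Q.

    For each sample j, the probability of picking neighbor k is
    exp(-||Qx_j - Qx_k||^2 / width), normalized over all k != j.
    The objective sums the probability mass assigned to same-class
    neighbors and subtracts reg * (sum of squared entries of Q).
    """
    Z = [project(Q, x) for x in X]
    total = 0.0
    for j in range(len(Z)):
        kernel = []
        for k in range(len(Z)):
            if k == j:
                kernel.append(0.0)  # a sample never selects itself
            else:
                dist = sum((a - b) ** 2 for a, b in zip(Z[j], Z[k]))
                kernel.append(math.exp(-dist / width))
        denom = sum(kernel)
        if denom == 0.0:
            continue
        same = sum(v for v, t in zip(kernel, labels) if t == labels[j])
        total += same / denom
    penalty = reg * sum(q * q for row in Q for q in row)
    return total - penalty


# Two classes separated along the first coordinate only.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
good = nca_objective([[1.0, 0.0]], X, labels)   # keeps the useful axis
poor = nca_objective([[0.0, 1.0]], X, labels)   # keeps the useless axis
```

Projecting onto the first coordinate gives an objective close to the number of samples, whereas projecting onto the second mixes the classes and scores much lower, which is the behavior the optimizer exploits when searching for $Q$.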

Results and discussion
Simulations are performed on four publicly available datasets, Table 2. Three families of state-of-the-art classifiers are utilized for classification: KNN, SVM, and Ensemble (ES). The proposed framework is evaluated using three simulation setups. In the first, the classification results are obtained from a few selected individual layers of the pre-trained models. The second setup incorporates two cases: in the first case, we simply fuse the selected layers; in the second, we combine the NCA technique with the proposed feature reduction approach. Finally, we have also tested the proposed technique with other state-of-the-art classifiers. All the base parameters for the selected classifiers are given in Table 4. Additionally, a fair comparison with recent methods is provided, with remarks on the effectiveness of the proposed technique relative to the state-of-the-art approaches. Figure 7 presents the classification results for each of the layers used on the four datasets discussed in the "Dataset" section. It has been observed that the pre-trained CNN architectures are powerful feature extractors. Among the selected pre-trained models, DenseNet-201 and Inception-ResNet-V2 show almost similar performance on all datasets. For example, on the ISIC-UDA dataset, the overall accuracy (OA) of FV0 is found to be 80.5%, whereas the OA of FV1 is 81.6%. It has also been observed that Inception-V3 shows a decline in performance; hence, on its own it is not a suitable candidate for skin cancer detection.

Evaluation of the proposed technique
Prior to the feature selection and dimensionality reduction step, the features extracted from various architectural layers are concatenated. Table 5 shows the reduction percentage of the fused feature vectors achieved after applying the hierarchical framework of entropy and NCA, before the classification phase. These values show that the maximum reduction achieved is 98.50% on the PH2 dataset, whilst the average reduction over all datasets is 95.17%. We create four combinational feature vectors from each dataset. Table 6 presents a comparison of classification results, in terms of OA, for two different cases: (1) simple fusion approach, (2) entropy-controlled NCA (proposed). The two cases are implemented on the fused feature vector, on four different datasets, using the selected classifiers. Table 7 shows the average classification time and average accuracy over all datasets. From this table it is evident that the proposed technique outperforms the simple fusion approach, requiring substantially less classification time while achieving the highest classification accuracy. Additionally, a confidence interval is plotted in Fig. 8 for all selected datasets using the two classifiers (F-KNN, ES-KNN) that perform best compared to the others. Moreover, to provide better insight and to facilitate researchers working in this domain, a comprehensive comparison of a set of classifiers is also provided (Table 8). From these statistics, it is quite clear that the classifiers belonging to the KNN family perform best, both in terms of average classification accuracy (94.73%) and average computational time (1.30 s). The second best family is SVM, with an average classification accuracy of 93.83% and an average computational time of 1.96 s. The Ensemble and Tree families do not show improved results in terms of average classification accuracy (89.87% and 84.91%, respectively); the average computational time of the Ensemble family is 6.05 s, whereas the Tree family is time efficient, taking only 1.57 s. The same trend is observed for AUC.
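For readers who wish to reproduce figures of the kind reported in Table 5, the following sketch shows one plausible way to compute a reduction percentage. It assumes the percentage is the share of fused features discarded before classification; the function name `reduction_percentage` and the feature counts are hypothetical and are not taken from Table 5.

```python
def reduction_percentage(total_features, kept_features):
    """Share of the fused feature vector discarded by selection, in percent."""
    if total_features <= 0 or not 0 <= kept_features <= total_features:
        raise ValueError("expected total_features > 0 and 0 <= kept_features <= total_features")
    return 100.0 * (total_features - kept_features) / total_features


# Hypothetical sizes chosen only for illustration.
fused, kept = 2000, 60
print(f"{reduction_percentage(fused, kept):.2f}% of the fused features removed")
```

Under this reading, a reduction of more than 97% corresponds to retaining less than 3% of the fused feature vector.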

Comparison with state of the art techniques
A comprehensive comparison with existing techniques on the PH2, ISBI-2017, and ISIC-MSK datasets is given in Table 9. It can be clearly observed that the proposed methodology achieves the best classification accuracy on all the given datasets. The maximum classification accuracy achieved by previous works on the PH2 dataset is 96.00%, using color and texture features, while the proposed methodology achieves 98.80%. Similarly, on the ISBI-2017 dataset, the maximum accuracy achieved by the proposed methodology is 95.90%, compared with 94.08% achieved by [64] on the same dataset. On ISIC-MSK, the accuracy achieved by [18] is 97.20%, while the proposed methodology gives 99.20%.

Conclusion
Considering the recent success of deep architectures, we presented an effective approach for the classification of skin lesions. In contrast to conventional techniques, we introduced a hierarchical framework of discriminant feature selection followed by a dimensionality reduction step. We exploited information extracted from the selected pre-trained models after fine-tuning, which contributed significantly to the improvement of classification accuracy. With the proposed method, we utilized less than 3% of the total features, which not only improves the classification accuracy by removing redundancy but also minimizes the computational time. Based on these results, we put forth two claims: (a) fusing the features extracted from a set of pre-trained models improves the overall accuracy, and (b) adding a feature selection and dimensionality reduction step significantly improves the classification results. As future work, an improved segmentation criterion will be our primary focus, along with extended feature selection criteria. Moreover, we will include additional, more challenging datasets in order to provide a comprehensive comparison.