 Research
 Open Access
Improving clustering performance using independent component analysis and unsupervised feature learning
 Eren Gultepe^{1}Email author and
 Masoud Makrehchi^{1}
https://doi.org/10.1186/s1367301801483
© The Author(s) 2018
 Received: 29 April 2018
 Accepted: 5 August 2018
 Published: 23 August 2018
Abstract
Objective
To provide a parsimonious clustering pipeline that provides comparable performance to deep learningbased clustering methods, but without using deep learning algorithms, such as autoencoders.
Materials and methods
Clustering was performed on six benchmark datasets, consisting of five image datasets used in object, face, digit recognition tasks (COIL20, COIL100, CMUPIE, USPS, and MNIST) and one text document dataset (REUTERS10K) used in topic recognition. Kmeans, spectral clustering, Graph Regularized Nonnegative Matrix Factorization, and Kmeans with principal components analysis algorithms were used for clustering. For each clustering algorithm, blind source separation (BSS) using Independent Component Analysis (ICA) was applied. Unsupervised feature learning (UFL) using reconstruction cost ICA (RICA) and sparse filtering (SFT) was also performed for feature extraction prior to the cluster algorithms. Clustering performance was assessed using the normalized mutual information and unsupervised clustering accuracy metrics.
Results
Performing, ICA BSS after the initial matrix factorization step provided the maximum clustering performance in four out of six datasets (COIL100, CMUPIE, MNIST, and REUTERS10K). Applying UFL as an initial processing component helped to provide the maximum performance in three out of six datasets (USPS, COIL20, and COIL100). Compared to stateoftheart nondeep learning clustering methods, ICA BSS and/or UFL with graphbased clustering algorithms outperformed all other methods. With respect to deep learningbased clustering algorithms, the new methodology presented here obtained the following rankings: COIL20, 2nd out of 5; COIL100, 2nd out of 5; CMUPIE, 2nd out of 5; USPS, 3rd out of 9; MNIST, 8th out of 15; and REUTERS10K, 4th out of 5.
Discussion
By using only ICA BSS and UFL using RICA and SFT, clustering accuracy that is better or on par with many deep learningbased clustering algorithms was achieved. For instance, by applying ICA BSS to spectral clustering on the MNIST dataset, we obtained an accuracy of 0.882. This is better than the wellknown Deep Embedded Clustering algorithm that had obtained an accuracy of 0.818 using stacked denoising autoencoders in its model.
Conclusion
Using the new clustering pipeline presented here, effective clustering performance can be obtained without employing deep clustering algorithms and their accompanying hyperparameter tuning procedure.
Keywords
 Spectral clustering
 Independent component analysis
 Sparse filtering
 Reconstruction ICA
 Unsupervised feature learning
 Clustering
 Graph Regularized Nonnegative Matrix Factorization
Introduction
Grouping observed data into cohesive clusters without any prior label information is an important task. Especially, in the era of bigdata, in which very large and complex amounts of data from various platforms are collected, such as image content from Facebook or vital signs and genomic sequences measured from patients in hospitals [1]. Often, these data are not labeled and a significant undertaking is typically required (usually by individuals with domain knowledge). Even in simple tasks, such as labeling images or video data can require thousands of hours [2, 3]. Therefore, using the unsupervised learning technique of cluster analysis can aide in the process of providing labels to observed data [4].
Classical clustering algorithms
Classical clustering algorithms used for analysis are Kmeans [5], Gaussian Mixture Models [6], and hierarchical clustering [4], all of which are based on using a distance measure to assess the similarity of observations. The choice for distance measure is typically data dependent. For instance, in image data, the similarity between pixels can be represented by the Euclidean distance, where as in text documents cosine distance matrix is typically used [7]. Moreover, appropriate feature representation of the observations is even more critical in order to obtain correct clusters of the data [8], since improved features provide a better representative similarity matrix.
Deep learningbased clustering
Early approaches for learning the appropriate feature space in clustering algorithms implemented deep autoencoders (DAEs) [9]. Song et al. [10] used DAEs to directly learn the data representations and cluster centers. Huang et al. [11] employed a DAE with locality and sparsity preserving constraints, which is followed by a Kmeans to obtain the cluster memberships. A more recent and popular approach by Xie et al. [8] learned the feature space and cluster membership directly using a stacked denoising autoencoder [12]. Following Xie et al. [8], there have been many studies proposing deep clustering algorithms to learn the feature space and cluster membership simultaneously using some form of an autoencoder [13–15]. A departure from the autoencoder framework was demonstrated by Yang et al. [16], who used recurrent and convolutional neural networks with agglomerative (hierarchical) clustering.
Spectral clustering
Another class of clustering algorithms, called spectral clustering [17, 18], is based on embedding the graph structure of the data through eigendecomposition (also known as spectral decomposition) of the Laplacian matrix [19]. Spectral clustering usually performs better than Kmeans and the aforementioned classical algorithms due to its ability to cluster nonspherical data [4]. A key issue in spectral clustering is to solve the multiclass clustering problem. This is accomplished by representing the graph Laplacian in terms of k eigenvectors, k being the number classes [20]. Then, either Kmeans clustering [18], exhaustive search [17], or discretization [21] is applied to this lower dimensional representation of the Laplacian to determine the final cluster memberships. Recently, autoencoders have been applied on the Laplacian to obtain the spectral embedding provided by the eigenvectors [22]. Another approach has been to use a deep learning network that directly maps the input data into the lower dimensional eigenvector representation, which is then followed by a simple clustering algorithm [23].
Research aim
The drawback of deep learning methods is that they tend to have many hyperparameters, such as learning rates, momentum, sparsity parameters, and number of features and layers [14, 24, 25]. All of which can make deep learning models difficult to train, since hyperparameters can severely effect performance [24]. Typically, choosing the correct hyperparameters requires expertise and ad hoc selection [14, 25]. However, the high degree of complexity in implementing deep learningbased algorithms [24] may be a limiting factor of their application in noncomputer science based research fields. To have realworld applicability, clustering applications need to have as few hyperparameters as possible [14].
In this study, the aim is to provide a parsimonious and accessible clustering processing scheme that incorporates deep learningstyle feature extraction, but without the complex hyperparameter tuning procedure. The goal is to bridge the gap between deep learningbased clustering methods and widely available standard clustering techniques. This is accomplished by using two procedures. First, we improve the clustering accuracy of standard clustering algorithms by applying independent component analysis (ICA) [26] blind source separation (BSS) after the initial matrix factorization step in principal component analysis (PCA) and graphbased clustering algorithms. Second, we improve the features used for constructing the distance matrix in graphbased clustering techniques by performing feature extraction using deep learninginspired feature learning techniques. Prior to any clustering algorithm, we implement the unsupervised feature learning (UFL) algorithms of ICA with reconstruction cost (RICA) [27] and sparse filtering (SFT) [28], both of which have only one tunable hyperparameter—the number features [28].
By implementing these two procedures we demonstrate that effective clustering performance that is on par with more complex deep learning clustering models can be achieved. Thus, the clustering methodologies provided herein are designed to be simple to implement and train in different data applications.
Materials and methods
Data
Description of datasets used in experiments
Dataset  COIL20  COIL100  CMUPIE  USPS  MNIST  REUTERS10K 

# Observations  1440  7200  2856  9298  70,000  10,000 
# Classes  20  100  68  10  10  4 
Dimensions  32 × 32 × 1  32 × 32 × 3  32 × 32 × 1  16 × 16 × 1  28 × 28  2000 
Type  Image, pixel  Image, pixel  Image, pixel  Image, pixel  Image, pixel  Text, tfidf 
Task  Object rec.  Object rec.  Face rec.  Digit rec.  Digit rec.  Topic rec. 
Feature extraction and unsupervised feature learning
For all image datasets, L_{2}normalization (each observation input feature vector is transformed to have a unit norm) was performed on the image pixel intensities (each observation input feature vector is transformed to have a unit norm). L_{2}normalization has been empirically shown to improve clustering performance [16]. The L_{2}normalization was not performed directly on the raw input data of the REUTERS10K text document dataset because for text documents, the tfidf matrix [33] was constructed. Nonetheless, each observation still has unit norm [7]. At this stage, the normalized image and text data can be used as input into the clustering algorithms.
Following normalization, the feature extraction component diverges into two separate stages. The first stage is performed by principal components analysis (PCA), which is a linear method of feature extraction and data compression [34]. PCA is performed on the normalized datasets by using singular value decomposition (SVD). SVD has been shown to be successful in topic recognition for text documents (Latent Semantic Indexing [35]) and face recognition in images (using eigendecomposition in Eigenfaces [36]). SVD provides a good baseline for comparison to more complex feature extraction algorithms [7, 37]. To determine the number of principal components, we used spectral clustering as our inspiration. In spectral clustering, the number eigenvectors used to embed the graph Laplacian is set to the number of clusters expected in the observed/input data. The goal is to embed the signal that is unique to each class into a vector. Thus, in our processing pipeline the number of principal components extracted is set to the number of classes in each dataset, as in spectral clustering [19]. Then the principal components obtained from PCA are used in Kmeans clustering. PCA with Kmeans clustering has been used in other studies as well, such as in Cai et al. [7]. However, in Cai et al. [7], the number of principal components is not limited to the number of classes, as in our processing. Rather, the full dimensionality of the original input data is preserved. For example in the CMUPIE dataset, for the pipeline proposed in our study, 68 principal components are extracted, whereas in [7] 32 × 32 = 1024 principal components are extracted.
After PCA is applied, blind source separation (BSS) can be applied to the extracted components. BSS is the problem of resolving the mixed signal sources, without knowing the nature of the mixture [38, 39]. ICA is mathematical model of BSS, where the mixed signals are separated into their original sources. This is accomplished by determining the mixing matrix that the maximizes nonGaussianity and minimizes the mutual information [26]. ICA has been shown to successfully separate source signals into biological meaningful signals in electroencephalography (EEG) [40, 41] and functional magnetic resonance imaging (fMRI) data [42, 43]. In our pipeline, ICA is applied to the components obtained by PCA, with the goal to provide more salient embedding vectors to the Kmeans clustering algorithm. In a recent study by Nascimento et al. [44] in 2017, without any PCA dimension reduction, applied ICA directly to the input data prior to hierarchical clustering. However, ICA following PCA has been shown to be more effective at separating the source signals [45]. In our study, the popular FastICA algorithm is used for the application of ICA [46].
The second stage of our feature extraction, which is separate from the first stage, uses unsupervised feature learning (UFL). UFL is a general term that is used to describe deep neural networktype feature extraction methods that do not use labels during model training. Examples of UFL are the deep learning methods of Restricted Boltzmann Machine (RBM) and autoencoders neural networks [9]. Many of the deep clustering methods use autoencoders in the core of their models [8, 10, 11, 13–15].
A disadvantage of deep neural networks such as autoencoders is their parameter complexity during training [24]. To overcome this complexity, RICA and SFT algorithms have been used as substitutes for the feature learning algorithm. These algorithms have greatly reduced the number of parameters needed to train an effective model, thus decreasing training time [27, 28]. Both algorithms have only one hyperparameter, which is the number of features or hidden nodes that need to be learned for proper feature extraction. A summary of RICA [27] and SFT [28] is given below, where \({\mathbf{X}} = \left( {x_{ij} } \right) \in {\mathbb{R}}^{q \times s}\) is the input data (with q as the of # observations and s as the # of features in a random image patch) and \({\mathbf{W}} = \left( {w_{ij} } \right) \in {\mathbb{R}}^{K \times N}\) is the weight matrix (with K as the # of learned features and N as the # of input features):
Reconstruction ICA
RICA relaxes the orthonormal constraint of ICA to learn an overcomplete feature representation. Overcomplete features are more robust to noise, sparse, and may capture the underlying structure of the data better [47]. The RICA algorithm minimizes W using the following objective function, \(\frac{1}{2}\left\ {{\mathbf{W}}^{T} {\mathbf{Wx}}  x} \right\_{2}^{2} + \lambda \left\ {{\mathbf{Wx}}} \right\\), where λ is a sparsity parameter.
Sparse filtering
SFT directly learns the kth feature f_{k} of the data by L_{1}minimization, without ever modeling the data itself. The feature f_{k} for the ith input \({\mathbf{x}}^{{(\varvec{i})}}\) is defined by the soft absolute function \(f_{k} = \sqrt {({\mathbf{w}}_{k}^{T} {\mathbf{x}}^{{(\varvec{i})}} )^{2} \,+ \in }\) where ∊ = 10^{−8}. The feature f_{k} is L_{2}normalized first by the rows and then the columns. Then L_{1}minimization is performed following the normalization steps to ensure equally distributed sparse feature activation.
The second stage of UFL feature extraction is only applied to the image datasets. This is due to the 2D image patch extraction process that is required to implement RICA and SFT [27, 28, 48]. A detailed discussion about how features are extracted and convolved to represent a new feature space can be found in [24, 48, 49].
In summary, there are three types of feature extraction streams for the input data: (1) L_{2}normalization of the input features, (2) using the L_{2}normalized data, PCA feature extraction with the added option of ICA BSS, and (3) unsupervised feature learning with RICA or SFT using the L_{2}normalized features. At this stage, Kmeans clustering can be directly applied to each of the three types of feature representations, without using any of the graphbased clustering techniques described in the subsequent subsection.
Graph embedding and clustering
After the feature representation of the input data is finalized, the graph embedding procedure can be performed on the new representations of the data. This is a twostep process in which the similarity graph is first constructed and then followed by a matrix factorization step. Once the matrix factorization is obtained, clustering can be performed.
We briefly review the process for constructing the similarity graph (for a detailed review see [19]), as used in this study. First, using the Gaussian similarity function, \(s\left( {x_{i} ,x_{j} } \right)\, = \,{ \exp }\left( {  \,\left\ {x_{i} \,  \,x_{j} } \right\^{ 2} / 2\sigma^{ 2} } \right)\), all pairwise similarities among the observations \({\mathbf{X}} = \left( {x_{ij} } \right) \in {\mathbb{R}}^{m \times n}\) (where m is the # of observations and n is the # of input features) are computed, controlled by the scaling factor σ. Using the similarities \({\mathbf{S}} = \left( {s_{ij} } \right) \in {\mathbb{R}}^{m \times m}\) (where m is the # of observations) and the knearest neighbor (kNN) criterion, the similarity graph, otherwise known as the weighted adjacency matrix W, is constructed [19]. Essentially, the kNN criterion simplifies the similarity matrix to a sparse matrix where each observation is connected to only its knearest neighbors and all other entries are set to 0.
To perform clustering using the similarity graph W, it needs to be represented by k eigenvectors or basis vectors by way of matrix factorization [7]. One method to accomplish this is through the standard spectral clustering algorithm [19]. In this study, the two popular spectral clustering algorithms, both of which require the construction of the normalized graph Laplacian, are used. The unnormalized graph Laplacian is defined as L = D − W, where \({\mathbf{D}} = \left( {d_{ij} } \right) \in {\mathbb{R}}^{m \times m}\) (where m is the # of observations) is the diagonal degree matrix with diagonal elements set to \(d_{ii} = \sum\nolimits_{i = 1}^{\text{m}} {w_{ij} }\) and all other entries set to 0. The first version of spectral clustering used, is the method formulated by Ng et al. [18] and called SPCSYM in our study, normalizes the graph Laplacian according to the following equation, \({\mathbf{L}}_{{\varvec{sym}}} = {\mathbf{D}}^{  1/2} {\mathbf{LD}}^{  1/2}\), where L_{sym} is a symmetric matrix. The second formulation by Shi and Malik [17], called SPCRW subsequently (also known as the Normalized Cut algorithm), performs the normalization by \({\mathbf{L}}_{{\varvec{rw}}} = {\mathbf{D}}^{  1} {\mathbf{L}}\), where L_{rw} is a related to a random walk matrix [19]. Both algorithms obtain embedding of the similarity graph by using the k eigenvectors (where k is the number of classes) computed from the eigendecomposition of their normalized graph Laplacians. SPCSYM and SPCRW both conclude with Kmeans clustering on the obtained eigenvectors [19].
The second method used in our study to obtain embedding of the similarity graph W, is the Graph Regularized Nonnegative Matrix Factorization (GNMF) algorithm [7]. GNMF extends Nonnegative Matrix Factorization (NMF) by including the geometrical information of the input data during the minimization phase of the NMF algorithm [7]. In NMF, the data and the decomposed components are assumed to be nonnegative [50], whereas in SVD the components may take on negative values. GNMF incorporates the unnormalized Laplacian L while solving for the NMF basis vectors of the input data [7]. Thus, GNMF embeds the graph structure into k basis vectors. To cluster with the GNMF, Kmeans clustering is performed on the k basis vectors. For comparison purposes and as in Cai et al. [7], we also provide the cluster memberships obtained from NMF that is performed directly on the data feature matrix.
On each of the aforementioned graph embedding algorithms, we apply ICA BSS immediately after the k eigenvectors (from SPCSYM and SPCRW) and basis vectors (from GNMF) are obtained. Then, Kmeans is applied to obtain the cluster memberships. ICA BSS is also directly applied on the symmetric Laplacian \({\mathbf{L}}_{{\varvec{sym}}} \varvec{ }\) matrix, which is called the ICASYM method. This is done to compare to ICA BSS performed on the preembedded vectors obtained from SPCSYM, SPCRW, and GNMF. In ICASYM, Kmeans clustering was applied on the k independent components as the final step.
Overall, graph embedding is obtained from four different methods, (1) SPCSYM, (2) SPCRW, (3) GNMF, and (4), ICASYM. Since the end goal of each method is to obtain clustering using Kmeans, these four methods are referred to as clustering methods rather than specific embedding methods.
Experimental setup
Cluster processing pipeline parameters used in experiments
Component^{a}  Feature extraction  Graph construction  Graph embedding  Clustering  

PCA  ICA BSS  UFL  kNN graph  Spectral clustering^{b}  GNMF  NMF  Kmeans  
Hyperparam. 1  # PC’s = # classes  # IC’s = # classes  Method: RICA & SFT  k: 5  L_{sym} & L_{rw}  Max iters. = 100  Max iters. = 100  200 repetitions 
Hyperparam. 2  –  –  # feats. = 256  σ = mean of kth nearest neighbors  –  α = 100  –  – 

Without graph embedding: Kmeans, PCA with Kmeans, and NMF with Kmeans.

With graph embedding: SPCSYM, SPCRW, ICASYM, and GNMF—all of which used Kmeans as a final step to obtain the clusters.
The clustering experiments were implemented in the academic and industry standard numerical computing environment MATLAB (version 2017b) on a Linux based operating system of Ubuntu (version 14.04). The computational environment was setup on a computer with 20 GB of memory and AMDFX6300 3.5 GHz 6Core CPU.
Performance evaluation
To evaluate the clusters, the standard protocol and performance metrics provided in other studies [8, 14, 52] are used. The number of clusters in all algorithms is provided by the number of unique groundtruth labels in each dataset. The first metric used to assess clustering performance is the normalized mutual information (NMI), which measures the dependence of two labels of the same data [53]. NMI is independent of the label permutations of the clusters and its values range from 0 (completely independent) and 1 (completely identical). The second measure, unsupervised clustering accuracy (ACC), is the common accuracy metric computed for the best matching permutation between clustered labels and groundtruth labels, provided by the Hungarian algorithm [54]. Implementation details about the two metrics can be found in Xu et al. [53]. Calculating the ACC and NMI allows the results obtained in this study to be compared with almost any other clustering study, since these two metrics are the most widely used for evaluation. This is evidenced especially in the most recent stateoftheart clustering methods [14, 23].
Results
Comparison of maximum performance across processing components
Method  Performance of applied processing (NMI, ACC)  

L_{2}  L_{2}, ICA  L_{2}, RICA  L_{2}, RICA, ICA  L_{2}, SFT  L_{2}, SFT, ICA  
COIL20  0.918, 0.857^{b}  0.92, 0.856^{b}  0.929, 0.894^{b}  0.914, 0.885^{b}  0.965, 0.93 ^{b}  0.946, 0.912^{a} 
COIL100  0.914, 0.774^{b}  0.914, 0.784^{a}  0.943, 0.813^{b}  0.962, 0.897 ^{a}  0.932, 0.765^{b}  0.954, 0.849^{a} 
CMUPIE  0.941, 0.85^{b}  0.986, 0.937 ^{c}  0.816, 0.716^{b}  0.848, 0.721^{d}  0.844, 0.759^{b}  0.866, 0.774^{b} 
USPS  0.845, 0.828^{a}  0.854, 0.81^{a}  0.868, 0.926 ^{a}  0.85, 0.794^{e}  0.852, 0.817^{b}  0.853, 0.813^{b} 
MNIST  0.774, 0.787^{a}  0.824, 0.882 ^{b}  0.79, 0.828^{b}  0.787, 0.822^{b}  0.79, 0.824^{b}  0.794, 0.853^{a} 
REUTERS10K  0.446, 0.656^{c}  0.46, 0.714 ^{c} 

COIL20: Feature learning using SFT with SPCSYM provided the highest clustering performance (NMI = 0.965, ACC = 0.93), which is better than the baseline SPCSYM clustering performance (NMI = 0.918, ACC = 0.857). With SFT and ICA BSS, GNMF also showed an improvement in both performance measures (NMI = 0.946, ACC = 0.912) compared to its baseline GNMF clustering performance (NMI = 0.913, ACC = 0.844).

COIL100: The best clustering performance (NMI = 0.962, ACC = 0.897) was obtained with GNMF clustering using RICA feature learning and ICA BSS. Without feature learning and ICA BSS, GNMF performance is (NMI = 0.9, ACC = 0.713). The addition of the RICA feature learning increases the performance to (NMI = 0.926, ACC = 0.74). The application of ICA BSS to GNMF using only L_{2}normalized features increases the clustering performance to (NMI = 0.914, ACC = 0.784).

CMUPIE: Clustering performance obtained with SPCSYM and ICA BSS is (NMI = 0.947, ACC = 0.879), which is an improvement over the baseline SPCSYM implementation (NMI = 0.941, ACC = 0.85). However, when the simpler method of clustering on the PCA reduced representation of the pixels, followed by ICA BSS, Kmeans clustering provides the top performance (NMI = 0.985, ACC = 0.937). Without any ICA BSS, clustering in the PCA representation provides only (NMI = 0.534, ACC = 0.231) performance. The best clustering performance using feature learning was obtained with SFT combined with ICA BSS using SPCSYM (NMI = 0.866, ACC = 0.774), which is still lower than baseline SPCSYM performance.

USPS: The best clustering performance was obtained with RICA feature learning and GNMF (NMI = 0.868, ACC = 0.926). The baseline GNMF performance was (NMI = 0.854, ACC = 0.828), which was higher than the baseline SPCSYM performance (NMI = 0.842, ACC = 0.814). When the ICA BSS was applied to GNMF, the performance of the NMI increased, but the accuracy decreased (NMI = 0.854, ACC = 0.810).

MNIST: The best clustering performance was obtained with SPCSYM/SPCRW using ICA BSS (NMI = 0.824, ACC = 0.882). The performance of the baseline SPCSYM and SPCRW were (NMI = 0.779, ACC = 0.756) and (NMI = 0.757, ACC = 0.67), respectively. When RICA feature learning was applied to SPCSYM, the performance increased to (NMI = 0.79, ACC = 0.828). When ICA BSS was combined with RICA feature learning for SPCSYM, the performance decreased slightly (NMI = 0.787, ACC = 0.822). With GNMF clustering, the baseline performance (NMI = 0.774, ACC = 0.787) improved when ICA source was applied (NMI = 0.813, ACC = 0.845). Improvement to the GNMF baseline performance was also achieved when SFT feature learning was combined with ICA BSS (NMI = 0.794, ACC = 0.853).

REUTERS10K: When PCA dimension reduction followed by ICA BSS is applied directly applied to the tfidf matrix, Kmeans clustering provides the top clustering performance (NMI = 0.46, ACC = 0.714). Without ICA BSS, PCA dimension reduction provides an accuracy of (NMI = 0.446, ACC = 0.656). The next best clustering performance was provided by NMF without (NMI = 0.318, ACC = 0.546) and with ICA BSS (NMI = 0.428, ACC = 0.638).
Comparison of clustering performance across different datasets and clustering techniques
Baseline  Performance of applied processing on different datasets (NMI, ACC)  

COIL20  COIL100  CMUPIE  USPS  MNIST  REUTERS10K  
Method  
Kmeans  0.735, 0.597  0.822, 0.615  0.532, 0.239  0.659, 0.694  0.527, 0.553  0.356, 0.541 
Deep learning  
AE+Kmeans (2016)  –, 0.818  –, 0.666  
NMFD (2014)  0.692, –  0.719, –  0.920, 810  0.287, 0.382  0.152, 0.75  
TSCD (2016)  0.928, 0.899  0.651, 0.692  
DEN (2014)  0.870, 0.725  
DBC (2017)  0.895, 0.793  0.905, 0.775  0.724, 0.743  0.917, 0.964  
IEC (2016)  0.787, 0.546  0.641, 0.767  0.542, 0.609  
AEC (2013)  0.651, 0.715  0.669, 0.760  
DCN (2016)  0.810, 0.830  
DEC (2016)  0.924, 0.801  0.586, 0.619  –, 0.818  –, 0.722  
DCEC (2017)  0.826, 0.790  0.885, 0.890  
DEPICT (2017)  0.974, 0.883  0.927, 0.964  0.917, 0.965  
JULESF (2016)  1.000,–  0.978, –  0.984, 0.980  0.858, 0.922  0.906, 0.959  
JULERC (2016)  1.000, –  0.985, –  1.000, 1.000  0.913, 0.950  0.913, 0.964  
VaDE (2016)  –, 0.945  –, 0.798  
IMSAT (2017)  –, 0.984  –, 0.719  
SpectralNet (2018)  0.924, 0.971  
Nondeep learning  
ACGDL (2012)  0.865, –  0.797, –  0.934, 0.842  0.824, 0.867  0.017, 0.113  
ACPIC (2013)  0.855, –  0.840, –  0.902, 0.797  0.840, 0.855  0.017, 0.015  
SEC (2011)  0.511, 0.544  0.779, 0.804  
LDMGI (2010)  0.563, 0.580  0.802, 0.842  
Ours^{a}  0.965, 0.93  0.962, 0.897  0.986, 0.937  0.868, 0.926  0.824, 0.882  0.460, 0.714 

COIL20: Our SPCSYM with SFT feature learning was 3.5 percentage points (p.p.) less than JULE (both JULESF and JULERC had perfect NMI) in NMI performance. The JULE algorithms denoted by JULESF and JULERC, are the same method except that the “–RC” is the finetuned version of the “straightforward” “SF” version [16]. Overall, SPCSYM with SFT ranked 2nd (out of 6) compared to 5 other deep clustering methods.

COIL100: GNMF with RICA feat learning and ICA BSS performed 1.6 p.p. and 2.3 p.p. less than JULESF and JULERC, respectively, in NMI. GNMF with RICA and ICA BSS ranked 2nd (out of 5) compared to 4 other deep clustering methods.

CMUPIE: The simple combination of PCA with ICA BSS performed almost on par with the much more complex JULE algorithm, which uses a combination of recurrent and convolutional neural networks. Specifically, before any model finetuning is applied to the JULE algorithm (JULESF), our PCA and ICA combination was 0.2 p.p. higher in NMI performance. However, this advantage is lost once the JULE algorithm is finetuned (JULERC). PCA with ICA BSS ranked 2nd (out of 5) compared to 4 other deep cluster methods.

USPS: Using RICA with GNMF performed third best after the JULERC and DEPICT algorithms, outperforming seven deep learning clustering methods. This new combination even outperformed JULESF by 1 p.p. in NMI and 0.4 p.p. in ACC. Overall, RICA with GNMF ranked 3rd (out of 9) compared to 8 other deep clustering methods.

MNIST: Our clustering algorithm performed better than seven other deep learning clustering algorithms by simply using ICA BSS after eigenvector decomposition of the normalized Laplacian used in SPCSYM. However, it ranked 8th (out of 15) compared to 14 other deep clustering methods. Nonetheless, it performed better than the popular algorithms of DEC by 0.8 p.p. in NMI and 3.8 p.p. in ACC and DCN by 1.4 p.p. in NMI and 5.2 p.p in ACC.

REUTERS10K: Our implementation of clustering in the PCA space with ICA BSS, performed better than AE+Kmeans by 4.8 p.p. in NMI, however it did not exceed any other deep learning clustering algorithms, ranking 4th (out of 5) in comparison to 4 other deep learning methods. Nonetheless, our PCA and ICA combination performed on par with the DEC and IMSAT algorithms, which were 0.5 and 0.8 p.p. in NMI, respectively, better than our clustering performance.
The running times for the new clustering algorithms using the different processing components are found to be similar to other studies. For the largest dataset, (MNIST), the running time for our best performing algorithm (SPCSYM with ICA BSS) is 20 min. With feature learning, such as SFT, in addition to graphbased clustering, the running increases to 5 h for GNMF and 4.9 h for SPCSYM. These times are comparable to JULESF (4.4 h) and JULERC (8.3 h) and DEPICT (2.8 h) [14]. For the USPS dataset (the second largest dataset), our best result using RICA feature learning with GNMF clustering took 18 min to run. Adding a separate ICA BSS (~ 0.5 s), does not noticeably increase the running time because clustering time is reduced by one second on the ICA components. Clustering with JULE (either SF or RC), and DEPICT takes about 28 min to run on the USPS dataset [14].
Discussion
It is demonstrated that standard clustering techniques, especially graphbased methods, can be used to achieve equivalent or better clustering performance than deep learning clustering algorithms when ICA blind source separation and unsupervised feature learning algorithms are applied. ICA BSS applied to either the PCA features vectors or to the graphembedding vectors increased the saliency of the classspecific feature embedding. This is the first study to apply ICA BSS as an improvement to the multiclass problem in graphbased clustering algorithms. UFL using RICA and SFT helped build an improved similarity graph representation of the original input data, which is critical in graphbased clustering algorithms such as spectral clustering and GNMF.
Furthermore, limiting the number principal components to only the number of classes in Kmeans combined with PCA, helped to extract only the pertinent classspecific features. First, PCA captures only the necessary variance by eliminating the extra information, and then BSS using ICA separates the signals pertaining to each class. Although unexpected, clustering success in the REUTERS10K text document dataset is not altogether surprising. Since, ICA has been shown to perform reasonably well on text documents in the task of topic classification and information retrieval [64–66].
The limitations of the clustering methodology presented in our study are mainly due to the processing speed of the similarity graph and the matrix factorizations [23]. However, the speed limitation of the matrix factorization is due to the initial eigendecomposition and NMF computations. Otherwise, once the k basis vectors or eigenvectors have been obtained, the ICA computation is fast since the dimensionality of these vectors are small. Nonetheless, as a simple and effective boost to spectral clustering and GNMF, ICA blind source separation can be used alongside UFL with RICA or SFT.
Given the multiple stages of processing in the proposed clustering pipeline, guidelines for its recommended usage are provided. Initially, PCA with the number components set to the number of classes and ICA BSS using Kmeans should be applied as a simple baseline. This should be an effective baseline for text documents. In image datasets, another effective baseline would be to use GNMF or SPCSYM using ICA BSS. GNMF or SPCSYM using UFL with RICA and SFT for feature extraction can also be used as alternate clustering algorithm. Finally, GNMF or SPCSYM using ICA BSS and UFL feature extraction can be used as a last attempt to obtain high clustering performance.
The proposed pipeline is effective for small (1000–2000 observations, such as COIL20, COIL100 and CMUPIE) to mediumlarge datasets (10,000–50,000 observations, such as USPS and REUTERS10K). Beyond 50,000 observations, deep clustering methods may be more suitable due to computation time of the similarity graphs, basis vectors, and eigenvectors [23]. The pipeline can be very useful for medical data, specifically in the analysis of electronic health records, where the number of observed patients is typically ~ 1000 [67], due to the difficulty of collecting pertinent patient data. On a dataset with less than 50,000 observations, deep clustering algorithms may require too much overhead in terms of hyperparameter setup, without providing a large improvement in terms of clustering performance.
Conclusions

Graphbased clustering performance can easily be improved by applying ICA blind source separation during the graph Laplacian embedding step.

Applying unsupervised feature learning to input data using either RICA or SFT, improves clustering performance.

Surprisingly for some cases, high clustering performance can be achieved by simply performing Kmeans clustering on the ICA components after PCA dimension reduction on the input data. However, the number of PCA and ICA signals/components needs to be limited to the number of unique classes.

Compared to stateoftheart nondeep learning clustering methods, ICA BSS and/or UFL with graphbased clustering algorithms outperformed all other methods.

With respect to deep learningbased clustering algorithms, our new pipeline obtained the following rankings on the six different datasets: (1) COIL20, 2nd out of 5; (2) COIL100, 2nd out of 5; (3) CMUPIE, 2nd out of 5; (4) USPS, 3rd out of 9; (5) MNIST, 8th out of 15; and (6) REUTERS10K, 4th out of 5.
These findings demonstrate the robustness of standard clustering methods when implemented with careful processing. Instead of developing complex deep learning clustering implementations, equivalent results can be achieved from existing standard techniques with the modifications presented herein. Furthermore, the new clustering implementation presented in this study may also be used as an empirical baseline to justify use of sophisticated deep clustering learning networks and to reduce computation times. For future studies, increasing the speed of building the k nearest neighbor graph used in spectral clustering and GNMF will be investigated. Also, application of the clustering pipeline to handwritten image data in another language such as Korean [68], would help to demonstrate the generalizability of the methods provided herein.
Declarations
Authors’ contributions
EG conceptualized the study, performed all experiments, and analyzed all the data. EG wrote the manuscript. MM advised on the manuscript preparation and technical knowledge. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
None.
Availability of data and materials
The dataset supporting the conclusions of this article will be available on the Mendeley Data repository.
Funding
This research was supported in part by the Natural Sciences and Engineering Research Council of Canada.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Kevin M (2012) Machine learning: a probabilistic perspective. MIT Press, CambridgeMATHGoogle Scholar
 Deng J, Dong W, Socher R, Li LJ, Li K, FeiFei L (2009) ImageNet: a largescale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 248–255. https://doi.org/10.1109/cvpr.2009.5206848
 Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115:211–252. https://doi.org/10.1007/s112630150816y MathSciNetView ArticleGoogle Scholar
 Friedman J, Hastie T, Tibshirani R (2008) The Elements of statistical learning. Springer, New York. https://doi.org/10.1007/b94608 MATHGoogle Scholar
 MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, pp 281–297Google Scholar
 Bishop CM (2006) Pattern recognition and machine learning. Springer, New YorkMATHGoogle Scholar
 Cai D, He X, Han J, Huang TS (2011) Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell 33:1548–1560. https://doi.org/10.1109/tpami.2010.231 View ArticleGoogle Scholar
 Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International Conference on Machine Learning, pp 478–487Google Scholar
 Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507. https://doi.org/10.1126/science.1127647 MathSciNetView ArticleMATHGoogle Scholar
 Song C, Liu F, Huang Y, Wang L, Tan T (2013) Autoencoder based data clustering. In: Iberoamerican Congress on Pattern Recognition, Springer, pp 117–124. https://doi.org/10.1007/9783642418228_15
 Huang P, Huang Y, Wang W, Wang L (2014) Deep embedding network for clustering. In: Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, pp 1532–1537. https://doi.org/10.1109/icpr.2014.272
 Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408MathSciNetMATHGoogle Scholar
 Yang B, Fu X, Sidiropoulos ND, Hong M (2017) Towards Kmeansfriendly spaces: simultaneous deep learning and clustering. In: Precup D, Teh YW (eds) Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia, pp 3861–3870Google Scholar
 Dizaji KG, Herandi A, Deng C, Cai W, Huang H (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, pp 5747–5756. https://doi.org/10.1109/iccv.2017.612
 Jiang Z, Zheng Y, Tan H, Tang B, Zhou H (2016) Variational deep embedding: An unsupervised and generative approach to clustering. ArXiv Prepr. ArXiv161105148Google Scholar
 Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5147–5156. https://doi.org/10.1109/cvpr.2016.556
 Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22:888–905. https://doi.org/10.1109/34.868688 View ArticleGoogle Scholar
 Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, pp 849–856Google Scholar
 Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416MathSciNetView ArticleGoogle Scholar
 Zhang Z, Jordan MI (2008) Multiway spectral clustering: a marginbased perspective. Stat Sci 23:383–403MathSciNetView ArticleMATHGoogle Scholar
 Stella XY, Shi J (2003) Multiclass spectral clustering. In: Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, pp 313–319. https://doi.org/10.1109/iccv.2003.1238361
 Tian F, Gao B, Cui Q, Chen E, Liu TY (2014) Learning deep representations for graph clustering. In: AAAI, pp 1293–1299Google Scholar
 Shaham U, Stanton K, Li H, Nadler B, Basri R, Kluger Y (2018) SpectralNet: spectral clustering using deep neural networks. ArXiv Prepr. ArXiv180101587Google Scholar
 Coates A, Ng A, Lee H (2011) An analysis of singlelayer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 215–223Google Scholar
 Chan TH, Jia K, Gao S, Lu J, Zeng Z, Ma Y (2015) PCANet: a simple deep learning baseline for image classification? IEEE Trans Image Process 24:5017–5032. https://doi.org/10.1109/TIP.2015.2475625 MathSciNetView ArticleGoogle Scholar
 Hyavrinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley, New York. https://doi.org/10.1002/0471221317 View ArticleGoogle Scholar
 Le QV, Karpenko A, Ngiam J, Ng AY (2011) ICA with reconstruction cost for efficient overcomplete feature learning. In: Advances in neural information processing systems, pp 1017–1025Google Scholar
 Ngiam J, Chen Z, Bhaskar SA, Koh PW, Ng AY (2011) Sparse filtering. In: Advances in neural information processing systems, pp 1125–1133Google Scholar
 Nene SA, Nayar SK, Murase H (1996) Columbia object image library (coil20)Google Scholar
 Sim T, Baker S, Bsat M (2002) The CMU pose, illumination, and expression (PIE) database, IEEE, pp 53–58. https://doi.org/10.1109/afgr.2002.1004130
 Matlab Codes and Datasets for Subspace Learning and Dimensionality Reduction. http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html. Accessed 12 Apr 2018
 LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist. Accessed 15 September 2017
 Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781139924801 View ArticleGoogle Scholar
 Oja E (1992) Principal components, minor components, and linear neural networks. Neural Netw 5:927–935View ArticleGoogle Scholar
 Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41:391View ArticleGoogle Scholar
 Turk MA, Pentland AP (1991) Face recognition using eigenfaces, IEEE, pp 586–591. https://doi.org/10.1109/cvpr.1991.139758
 Gultepe E, Conturo TE, Makrehchi M (2018) Predicting and grouping digitized paintings by style using unsupervised feature learning. J Cult Herit 31:13–23. https://doi.org/10.1016/j.culher.2017.11.008 View ArticleGoogle Scholar
 Choi S, Cichocki A, Park HM, Lee SY (2005) Blind source separation and independent component analysis: a review. Neural Inf Process Lett Rev 6:1–57Google Scholar
 Belouchrani A, AbedMeraim K, Cardoso JF, Moulines E (1997) A blind source separation technique using secondorder statistics. IEEE Trans Signal Process 45:434–444View ArticleGoogle Scholar
 Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of singletrial EEG dynamics including independent component analysis. J Neurosci Methods 134:9–21View ArticleGoogle Scholar
 Vigário RN (1997) Extraction of ocular artefacts from EEG using independent component analysis. Electroencephalogr Clin Neurophysiol 103:395–404View ArticleGoogle Scholar
 Calhoun VD, Adali T, Pearlson GD, Pekar J (2001) A method for making group inferences from functional MRI data using independent component analysis. Hum Brain Mapp 14:140–151View ArticleGoogle Scholar
 McKeown MJ, Sejnowski TJ (1998) Independent component analysis of fMRI data: examining the assumptions. Hum Brain Mapp 6:368–372View ArticleGoogle Scholar
 Nascimento M, e Silva FF, Sáfadi T, Nascimento ACC, Ferreira TEM, Barroso LMA, Ferreira Azevedo C, Guimarães SEF, Serão NVL (2017) Independent component analysis (ICA) basedclustering of temporal RNAseq data. PLoS ONE 12:e0181195View ArticleGoogle Scholar
 Bartlett MS, Movellan JR, Sejnowski TJ (2002) Face recognition by independent component analysis. IEEE Trans Neural Netw Publ IEEE Neural Netw Counc 13:1450View ArticleGoogle Scholar
 Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13:411–430View ArticleGoogle Scholar
 Lewicki MS, Sejnowski TJ (2000) Learning overcomplete representations. Neural Comput 12:337–365View ArticleGoogle Scholar
 Coates A (2012) Demystifying unsupervised feature learning. Dissertation, Stanford UniversityGoogle Scholar
 Coates A, Ng AY (2012) Learning feature representations with kmeans. In: Montavon G, Orr GB, Müller KR (eds) Neural networks: tricks of the trade. Springer, Berlin, pp 561–580. https://doi.org/10.1007/9783642352898_30 View ArticleGoogle Scholar
 Lee DD, Seung HS (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401:788View ArticleMATHGoogle Scholar
 Thompson SK (2012) Sampling, 3rd edn. Wiley, Hoboken. https://doi.org/10.1002/9781118162934 View ArticleMATHGoogle Scholar
 Yang Y, Xu D, Nie F, Yan S, Zhuang Y (2010) Image clustering using local discriminant models and global integration. IEEE Trans Image Process 19:2761–2773MathSciNetView ArticleMATHGoogle Scholar
 Xu W, Liu X, Gong Y (2003) Document clustering based on nonnegative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM, pp 267–273. https://doi.org/10.1145/860484.860485
 Kuhn HW (1955) The Hungarian method for the assignment problem. Nav Res Logist Q 2:83–97MathSciNetView ArticleMATHGoogle Scholar
 Li F, Qiao H, Zhang B, Xi X (2017) Discriminatively boosted image clustering with fully convolutional autoencoders. ArXiv Prepr. ArXiv170307980Google Scholar
 Liu H, Shao M, Li S, Fu Y (2016) Infinite ensemble for image clustering. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1745–1754Google Scholar
 Guo X, Liu X, Zhu E, Yin J (2017) Deep clustering with convolutional autoencoders. In: International Conference on Neural Information Processing, Springer, pp 373–382. https://doi.org/10.1007/9783319700960_39
 Hu W, Miyato T, Tokui S, Matsumoto E, Sugiyama M (2017) Learning discrete representations via information maximizing selfaugmented training. In: Precup D, Teh YW (eds) Proceedings of the 34th International Conference on Machine Learning. PMLR, International Convention Centre, Sydney, Australia, pp 1558–1567Google Scholar
 Trigeorgis G, Bousmalis K, Zafeiriou S, Schuller B (2014) A deep seminmf model for learning hidden representations. In: International Conference on Machine Learning, pp 1692–1700Google Scholar
 Wang Z, Chang S, Zhou J, Wang M, Huang TS (2016) Learning a taskspecific deep architecture for clustering. In: Proceedings of the 2016 SIAM International Conference on Data Mining, SIAM, pp 369–377. https://doi.org/10.1137/1.9781611974348.42
 Zhang W, Wang X, Zhao D, Tang X (2012) Graph degree linkage: agglomerative clustering on a directed graph. In: European Conference on Computer Vision, Springer, pp 428–441. https://doi.org/10.1007/9783642337185_31
 Zhang W, Zhao D, Wang X (2013) Agglomerative clustering via maximum incremental path integral. Pattern Recognit 46:3056–3065View ArticleMATHGoogle Scholar
 Nie F, Zeng Z, Tsang IW, Xu D, Zhang C (2011) Spectral embedded clustering: a framework for insample and outofsample spectral clustering. IEEE Trans Neural Netw 22:1796–1808View ArticleGoogle Scholar
 Mirowski P, Ranzato M, LeCun Y (2010) Dynamic autoencoders for semantic indexing. In: Proceedings of the NIPS 2010 Workshop on Deep Learning 2010, pp 1–9Google Scholar
 Kolenda T, Hansen LK, Sigurdsson S (2000) Independent components in text. In: Advances in Independent Component Analysis, Springer, pp 235–256. https://doi.org/10.1007/9781447104438_13
 Chagnaa A, Ock CY, Lee CB, Jaimai P (2007) Feature extraction of concepts by independent component analysis. J Inf Process Syst 3:33–37View ArticleGoogle Scholar
 Gultepe E, Green JP, Nguyen H, Adams J, Albertson T, Tagkopoulos I (2013) From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. J Am Med Inform Assoc 21:315–325. https://doi.org/10.1136/amiajnl2013001815 View ArticleGoogle Scholar
 Ma M, Park DW, Kim SK, An S (2012) Online recognition of handwritten korean and english characters. J Inf Process Syst 8:653–668. https://doi.org/10.3745/JIPS.2012.8.4.653 View ArticleGoogle Scholar