Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)
Kittakorn Sriwanna^{1}, Tossapon Boongoen^{1} and Natthakan Iam-On^{1}
https://doi.org/10.1186/s13673-017-0103-8
© The Author(s) 2017
Received: 13 December 2016
Accepted: 30 May 2017
Published: 3 August 2017
Abstract
Discretization plays a major role as a data preprocessing technique used in machine learning and data mining. Recent studies have focused on multivariate discretization, which considers relations among attributes. The general goal of this approach is to obtain discrete data that preserves most of the semantics exhibited by the original continuous data. However, many techniques generate final discrete data that may be less useful, with natural groups of data not being maintained. This paper presents a novel graph clustering-based discretization algorithm that encodes different similarity measures into a graph representation of the examined data. This intuition allows more refined data-wise relations to be obtained and used with an effective graph clustering technique based on normalized association to discover natural groups accurately. The goodness of this approach is empirically demonstrated over 30 standard datasets and 20 imbalanced datasets, compared with 11 well-known discretization algorithms using 4 classifiers. The results suggest that the new approach is able to preserve the natural groups of data and usually achieves better classifier performance and a more desirable number of intervals than the comparative methods.
Background
Discretization is a data reduction preprocessing technique in data mining. It transforms a numeric or continuous attribute into a nominal or categorical attribute by replacing the raw values of the continuous attribute with non-overlapping interval labels (e.g., 0–5, 6–10, etc.). Different data mining algorithms are designed to handle different data types. Some are designed to handle only numerical data or only nominal data, while some can cope with both. Because real datasets are almost always a combination of numeric and nominal values, for an algorithm that only takes nominal data, numerical attributes need to be discretized into nominal attributes before learning. After discretization, the subsequent mining process may be more efficient, as the data is reduced and simplified, resulting in more noticeable patterns [1–3]. Moreover, discretization is also expected to improve the predictive accuracy for classification [4] and label ranking [5].
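As a minimal illustration of the mapping described above, replacing raw continuous values with interval labels can be sketched as follows. The cut points here are arbitrary, chosen purely for demonstration, not produced by any discretization algorithm in this paper.

```python
import numpy as np

# Minimal sketch of discretization as an interval-labelling step.
# The cut points are arbitrary illustrations, not the output of any
# particular discretizer.
def discretize(values, cut_points):
    """Replace each continuous value with the index of the interval it falls into."""
    return np.digitize(values, bins=cut_points)

ages = np.array([2.0, 7.0, 4.0, 11.0, 9.0])
cuts = [5.5, 10.5]                 # intervals: (-inf, 5.5), [5.5, 10.5), [10.5, inf)
labels = discretize(ages, cuts)
print(labels)                      # [0 1 0 2 1]
```

A real discretizer differs only in how the cut points themselves are chosen, which is the subject of the rest of the paper.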
Two main goals of discretization are to find the best intervals or the best cut points^{1} and to find the finite number of intervals or number of cut points that is best adapted to the learning task. Therefore, modelling a discretization requires the development of the following two components. First, the discretization criterion is the criterion for choosing the best cut points in order to split a set of distinct numeric values into intervals (splitting, top–down methods) or merge a pair of adjacent intervals (merging, bottom–up methods). Second, the stopping criterion is a criterion for stopping the discretization process in order to yield a finite number of intervals.
The process of discretization typically consists of 3 main steps: sorting the attribute values; finding the cut-point model or discretization scheme of the attribute by iterative splitting or merging; and finally, assigning each value of the attribute a discrete label corresponding to the interval it falls into. Hence, the number of intervals obtained for the attributes under question may differ, depending on, for instance, the number of possible cut points and the relation with the target class.
Discretization techniques can be classified in several different ways, such as supervised versus unsupervised, univariate versus multivariate, splitting versus merging, direct versus incremental, and more [6–9]. Supervised methods consider the class information, whereas unsupervised ones do not. Splitting algorithms start from one interval and recursively select the best cut point to split the instances into two intervals, while merging methods begin with a set of single-value intervals and iteratively merge adjacent intervals. The univariate category discretizes each attribute independently without considering its relationship with other attributes, whereas multivariate methods also consider other attributes to determine the best cut points. Direct techniques require the number of intervals to be supplied by the user; examples of this type are the equal-width and equal-frequency discretization algorithms [10], in which the number of intervals is equal for all attributes. In contrast, incremental methods do not require the number of intervals, but they require a stopping criterion to terminate the discretization process and yield the best number of intervals for each attribute.
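The two direct, unsupervised methods named above are simple enough to sketch in a few lines. This is a hedged illustration of the general idea, not a reimplementation of any cited system; the helper names are my own.

```python
import numpy as np

# Sketch of the two direct, unsupervised discretizers mentioned above.
# The user supplies the number of intervals k for every attribute.
def equal_width_cuts(x, k):
    """Cut points that split the value range into k equal-length intervals."""
    lo, hi = float(np.min(x)), float(np.max(x))
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]

def equal_frequency_cuts(x, k):
    """Cut points at quantiles, so each interval holds roughly n/k values."""
    return list(np.quantile(np.sort(x), [i / k for i in range(1, k)]))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(equal_width_cuts(x, 4))        # [2.75, 4.5, 6.25]
print(len(equal_frequency_cuts(x, 4)))   # k-1 = 3 cut points
```

Both return k − 1 cut points; the difference is only whether interval length or interval frequency is held constant.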
Most discretization algorithms proposed in the past were univariate methods [3]. ChiMerge [11], Chi2 [12], Modified Chi2 [13], and the area-based method [14] are examples of univariate, supervised, incremental and merging methods. They use statistic-based criteria to determine the similarity of adjacent intervals: they divide the instances into intervals and then merge adjacent intervals based on the \(\chi ^2\) distribution. CAIM [15], urCAIM [16], and CADD [17] are univariate, supervised, incremental and splitting methods. These algorithms use class-attribute interdependence information as the discretization criterion and pursue the optimal discretization with a greedy approach, by finding locally maximum values of the criterion. EntMDLP [18], D2 [19], and FEBFP [20], which are univariate, supervised, incremental and splitting methods, recursively select the best cut points to partition the instances based on the class information entropy [10]. PKID and FFD [21] are univariate and unsupervised methods that manage discretization bias and variance by tuning the interval frequency and the number of intervals; they are especially used with the Naive-Bayes classifier.
The fact that univariate methods discretize only a single attribute at a time and do not consider interactions among attributes may lead to important information loss [3] and fail to reach a globally optimal result [22]. ICA [22], ClusterEntMDLP [23], HDD [3], and ClusterRSDisc [24] are examples of multivariate discretization methods. ICA tries to reach a globally optimal result by transforming the original attributes into a new attribute space that takes the other attributes into account; it then discretizes the new attribute space using a univariate discretization method (EntMDLP). ClusterEntMDLP, similar to ICA, finds pseudo-classes by clustering the original data via the k-means [25] and SNN [26] clustering algorithms and then uses both the target class and the pseudo-class to discretize with a univariate discretization method (EntMDLP). It finds the best cut point by averaging the two entropies computed with respect to the target class and the pseudo-class. HDD extends a univariate discretization method (CAIM) by improving the stopping criterion and taking into account information specific to the other attributes. ClusterRSDisc attempts to obtain natural intervals of attribute values by first partitioning the data with the DBSCAN [27] clustering algorithm; it then discretizes each attribute based on rough set theory.
Although several multivariate discretization algorithms overcome the drawback of univariate methods by considering interactions among attributes, they have some weaknesses. Some of them make use of cluster labels as part of the discretization criterion. As a matter of fact, different clustering algorithms, or even the same algorithm over multiple trials, may produce different cluster labels; therefore, it is extremely difficult for a user to decide on the proper algorithm and parameters [28–30]. Despite the reported improvements, some multivariate discretization algorithms such as EMD [31] do not consider the natural groups of data. In fact, EMD is an evolutionary discretization algorithm that defines its fitness function based only on high predictive accuracy and a low number of intervals. Hence, the identified cut points may damage the natural groups of data.
In order to solve the aforementioned problems, this study presents a novel graph clustering-based algorithm that allows encoding of different similarity measures into a graph representation of the examined data. Instead of using cluster labels, the proposed method uses the similarity between data pairs, which is a weighted combination of distance and target-class agreement. The data pairs form a graph, which is then partitioned in order to find the appropriate set of cut points. The insightful observations and the benefits of graph clustering are presented as follows.
In the clustering process, the instances are partitioned into clusters based on their similarity. Instances in the same cluster are similar to one another and dissimilar to instances in other clusters [1, 32]. Clustering is a method for recognizing and discovering natural groups of similar elements in a dataset [33, 34]. Recently, clustering with graphs has been widely studied and become very popular. The method is to treat the entire clustering problem as a graph problem. The graph vertices are grouped into clusters based on the edge structure and property [35]. Graph clustering algorithms are suitable for data that does not comply with a Gaussian or a spherical distribution [36]. They can be used to detect clusters of any size and shape. Moreover, graph clustering is a very useful technique for detecting densely connected groups [37]. The goal of graph clustering is the separation of sparsely connected dense subgraphs from one another based on various criteria such as vertex connectivity or neighborhood similarity [38].
The main graph clustering formulations are based on graph cut and partitioning problems [39, 40]. In the research of Foggia et al. [36], the performance of five graph clustering algorithms, fuzzy c-means MST (minimum spanning tree) clustering [41], Markov clustering [34, 42], the iterative conductance cutting algorithm [43], geometric MST clustering [44], and normalized cut clustering [40], was evaluated and compared. Based on this empirical study, it is confirmed that normalized cut clustering provides good performance and appears to be robust across application domains. The normalized cut criterion avoids any unnatural bias toward partitioning out small sets of points [40]. It is used in many applications such as image segmentation and spectral clustering [45, 46].

The proposed discretization algorithms are incremental and multivariate methods. They find the number of cut points automatically and preserve the natural groups of data by considering the correlation and dependence among attributes.

The algorithm encodes information of the data under examination as a graph, where the weight of each edge is evaluated by both natural distance and class similarity. This is a novel approach with the capability to address both data-wise relations and class-specific information in the same decision-making process.

The normalized cut criterion prevents an unnatural bias toward partitioning out small sets of points. It helps to avoid acquiring too-small intervals with only a few members, thus mitigating the overfitting problem.
Graph clustering and partitioning problems
In this paper, the focus is on undirected weighted graphs with no self-loops and on global graph clustering. Global graph clustering assigns all of the vertices to clusters, whereas local graph clustering only assigns a certain subset of the vertices [35]. In brief, the reviews in this section cover these two aspects: undirected weighted graphs and global graph clustering.
Throughout this paper, let \(G = (V, E, \omega )\) be an undirected graph, where \(V = \{v_1, \ldots , v_n\}\) is a set of vertices with \(n=|V|\), each vertex \(v_i\) representing a data point \(x_i\); \(E=\{(u,v) \mid u,v \in V\}\) is a set of edges with \(m=|E|\); and \(\omega (u,v)\) is the positive weight of edge \((u,v)\). Let a partitioning or clustering of G be \(\pi _k=\{C_1, \ldots , C_k\}\), where \(C_i\) is one of the k clusters.
Clustering measures
Measures with vertices
This type of measure is based on similarities or distances of object pairs. Such measures are applied in many areas such as pattern recognition, information retrieval, and clustering [47]. The most common distance measures used in clustering are the Euclidean and Manhattan distances [48].
Measures with clusters
Graph clustering divides a graph into groups (clusters, subgraphs) such that vertices within the same group are highly connected [50]. To compare the quality of a given cluster C, cluster fitness measures (quality functions) or indices for graph clustering [33, 44, 51] are used [35, 52]. This study categorizes the clustering indices into two groups: density measures and cut-based measures.
Global graph clustering
The problem of global graph clustering is to divide or group the vertices of a graph into clusters of predefined size, such that vertices in the same cluster are highly related and less related to vertices in the other clusters. The main clustering problem is NP-hard; therefore, the algorithms selected are approximation, heuristic, or greedy algorithms [35], so that the computation time is reduced.
Spectral clustering
The main tool for spectral clustering is the Laplacian matrix technique [56]. The algorithm transforms the affinity matrix (similarity matrix) into the Laplacian matrix of the graph and then finds the eigenvectors and eigenvalues of that matrix. Each row of the eigenvector matrix represents a vertex (data point). The final stage is clustering or partitioning of those vertices, either by recursive two-way partitioning (bipartition) or by direct k-way partitioning [39].
The unnormalized spectral clustering algorithm [57] is an example of direct k-way clustering. It clusters the vertices using the k-means algorithm with the desired number of clusters. The two-way normalized spectral clustering (2NSC) and k-way normalized spectral clustering (KNSC) algorithms were proposed by [40] to solve the generalized eigenvalue system using the NCut criterion. 2NSC is a recursive two-way clustering in which the number of clusters is controlled directly by the maximum allowed NCut value, whereas KNSC is a direct k-way clustering based on the k-means algorithm.
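The two-way pipeline described above can be sketched with numpy alone: build a normalized Laplacian from the affinity matrix, take the eigenvector of its second-smallest eigenvalue, and split vertices by its sign. This is a hedged, minimal illustration of the general spectral approach; real 2NSC adds recursion and the NCut-threshold stopping rule, which are omitted here.

```python
import numpy as np

# Minimal numpy-only sketch of two-way spectral bipartitioning:
# normalized Laplacian -> eigenvector of the second-smallest eigenvalue
# -> split vertices by sign. Recursion and NCut thresholding omitted.
def two_way_spectral(W):
    d = W.sum(axis=1)                                   # vertex degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)                  # eigenvalues ascending
    fiedler = vecs[:, 1]                                # second-smallest eigenvalue
    return (fiedler > 0).astype(int)                    # bipartition by sign

# two dense triangles {0,1,2} and {3,4,5}, weakly joined by one 0.1 edge
W = np.array([[0, 1, 1, .1, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [.1, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels = two_way_spectral(W)   # one label per triangle
```

The sign split recovers the two weakly connected triangles regardless of which side is labelled 1.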
Markov chains and random walks
The Markov cluster algorithm (MCL) [34] finds the cluster structure of a graph using a mathematical bootstrapping procedure. The main idea of MCL is to simulate flow within the graph, promoting flow where the current is strong and demoting flow where the current is weak. If there is any natural grouping in the graph, the current between groups will wither away. The algorithm simulates random walks within the graph, starting from a vertex and randomly travelling to connected vertices over many steps. Travelling within the same cluster is more likely than crossing into another cluster.
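The expansion/inflation loop at the heart of MCL can be sketched compactly. Expansion (matrix powering) spreads flow along random walks; inflation (elementwise powering followed by column renormalisation) promotes strong currents and demotes weak ones, so flow between natural groups withers, as described above. This is an illustrative sketch, not van Dongen's full implementation (which adds pruning and convergence checks).

```python
import numpy as np

# Compact sketch of the MCL expansion/inflation loop.
def mcl_flow(W, expansion=2, inflation=2.0, iters=30):
    M = W + np.eye(len(W))              # add self-loops (standard MCL practice)
    M = M / M.sum(axis=0)               # column-stochastic transition matrix
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)   # expansion: flow spreads
        M = M ** inflation                          # inflation: strong flow wins
        M = M / M.sum(axis=0)                       # renormalise columns
    return M

# a triangle {0,1,2} joined to the edge {3,4} by a single bridge 0-3
W = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
M = mcl_flow(W)
# after convergence, the flow across the bridge has withered: columns of
# {3,4} keep essentially no mass on rows {0,1,2}, and vice versa
```

Clusters are then read off from the non-zero structure of the converged matrix.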
A novel graph clustering-based approach
Pairwise affinity matrix
Graph clusteringbased discretization algorithm
Typically, a discretization algorithm discretizes attributes one by one. The process consists of three main steps: first, sorting an attribute \(A_i\) and finding all the possible cut points; second, finding the cut-point model of the attribute by iteratively building the model until discretization is complete on all of the numeric attributes; and finally, transforming the numeric attribute into a nominal attribute.

After sorting the attribute \(A_1\), a discretization algorithm that is an unsupervised and univariate method only considers the cut points, as shown in Fig. 4a. The algorithm selects the cut points so that all intervals have a similar number of data points or a similar length.

For a discretization algorithm that is a supervised and univariate method, after sorting \(A_1\), the algorithm only considers the class labels, as shown in Fig. 4b. The algorithm discretizes based on class purity: data points with the same class label are grouped. However, this method does not consider the other attributes; therefore, it tends to lose the natural groups.

ClusterEntMDLP [23] is an example of a supervised and multivariate method. This type of algorithm takes other attributes into consideration. The data are clustered and labelled; then, the algorithm discretizes by considering the class labels (see Fig. 4b) and the cluster labels (see Fig. 4c) together.

In the proposed algorithm, the data-point view is based on the graph shown in Fig. 4d. The vertices represent the data points and the edges' weights represent the similarity scores of the vertex pairs (see "Pairwise affinity matrix"). As both class and attribute-wise information are encoded, this data view preserves much more information than the other views.
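To make this edge weighting concrete, here is a hedged numpy sketch of the idea: the affinity of two instances combines a distance-based similarity with target-class agreement. The Gaussian kernel and the mixing weight `alpha` are illustrative assumptions of this sketch, not the exact formula of the paper's "Pairwise affinity matrix" section.

```python
import numpy as np

# Hedged sketch of a pairwise affinity combining distance similarity and
# target-class agreement. The kernel and `alpha` are assumptions for
# illustration, not the paper's exact definition.
def affinity_matrix(X, y, alpha=0.5, sigma=1.0):
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                               # no self-loops in G
            d = np.linalg.norm(X[i] - X[j])            # Euclidean distance
            sim = np.exp(-(d ** 2) / (2 * sigma ** 2)) # distance -> similarity
            agree = 1.0 if y[i] == y[j] else 0.0       # target-class agreement
            W[i, j] = alpha * sim + (1 - alpha) * agree
    return W

X = np.array([[0.0], [0.1], [5.0]])
y = np.array([0, 0, 1])
W = affinity_matrix(X, y)   # close same-class pair gets a heavy edge
```

Close instances that share a class end up with heavy edges, so the subsequent graph partitioning tends to respect both the natural groups and the class structure.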
The proposed splitting discretization algorithm

Firstly, reorder the pairwise affinity matrix of \(A_1\) in ascending order, as shown in the top matrix in Fig. 5. The matrix has \(10\times 10\) elements. Each element contains the pair-similarity value, which in this example is represented by color: a dark color indicates high similarity and a light color indicates low similarity.

Then, calculate the NAsso values of all possible cut points (vertical dashed lines). An example of calculating the NAsso value of the cut point 3.5 is shown in the middle pairwise affinity matrix. The two square boxes in the matrix are the intra-cluster weights of the 2 partitions; the regions outside the boxes are the inter-cluster weights. The algorithm calculates the NAsso values of all 6 cut points and selects the cut point at 3.5 according to its highest NAsso value. Then, it bipartitions the graph into 2 clusters.

Finally, iteratively find the next best cut point until the stopping criterion is satisfied. In the figure, the second cut point is 9, with the highest NAsso value of 1.039. Because this value does not improve on the NAsso value at the previous cut point (\(cut=3.5\), \(NAsso=1.076\)), the stopping criterion is satisfied and the process stops at the previous step (see the next paragraph for details). Hence, attribute \(A_1\) has 2 intervals with the cut point at 3.5.
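The splitting step above can be sketched as follows. NAsso here takes the normalized-association form, summing assoc(C, C)/assoc(C, V) over clusters, in the spirit of Shi and Malik's criterion; the 6×6 affinity matrix is a made-up stand-in for the 10×10 example of Fig. 5, with rows already ordered by ascending attribute value.

```python
import numpy as np

# Sketch of the splitting step: score every candidate cut of a value-ordered
# affinity matrix by the normalized association of the induced bipartition.
def nasso(W, clusters):
    total, all_idx = 0.0, np.arange(len(W))
    for C in clusters:
        C = np.asarray(list(C))
        within = W[np.ix_(C, C)].sum()        # assoc(C, C): intra-cluster weight
        to_all = W[np.ix_(C, all_idx)].sum()  # assoc(C, V): weight to all vertices
        total += within / to_all
    return total

def best_cut(W):
    """Evaluate all candidate cuts of the ordered matrix, keep the best."""
    n = len(W)
    scores = {c: nasso(W, [range(0, c), range(c, n)]) for c in range(1, n)}
    return max(scores, key=scores.get)

W = np.full((6, 6), 0.1)        # weak affinity everywhere...
W[:3, :3] = W[3:, 3:] = 0.9     # ...but two strongly similar value ranges
np.fill_diagonal(W, 0.0)        # no self-loops
print(best_cut(W))              # the cut falls between the two blocks: 3
```

The recursive algorithm re-applies `best_cut` to each new interval and stops once the best new NAsso no longer improves on the previous one.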
The proposed merging discretization algorithm
In contrast to splitting methods, the basic merging (bottom–up) methods start from many initial clusters. Each initial cluster contains the data points that have the same attribute value. The algorithm iteratively merges the adjacent pair of clusters whose data points are most similar until the stopping criterion is met.
To demonstrate this process, the Toy dataset is used again, as shown in Fig. 8. In the first step, the initial clusters are created by grouping the data points of attribute \(A_1\) that have the same attribute values and reordering the values in ascending order. There are 7 initial clusters, \(\pi _7=\lbrace C_1,C_2,\ldots ,C_7\rbrace\). After that, the two adjacent clusters with the highest NAsso value are merged, repeatedly, until all clusters are grouped into one cluster \((\pi _1)\). In Fig. 8, at the cut point 1.5, the two initial clusters that contain the attribute values 1 and 2 possess the highest NAsso value; therefore, this pair is merged first. The adjacent clusters (intervals) are merged further until only one cluster, whose NAsso value is 1, remains. In the second step, the best set of clusters \((\pi _*)\) is found by evaluating the NAsso values top–down. This bottom–up algorithm uses the same stopping criterion as the top–down algorithm (see Eq. 13). As one can see, for the set of 3 clusters \((\pi _3)\) separated by the cut point 9, the \(NAsso(\pi _3)\) value is 1.039, which is less than the \(NAsso(\pi _2)\) value of 1.076 for the set of 2 clusters; hence \(\pi _*=\pi _2\).
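The bottom-up variant can be sketched in the same hedged style: start with one cluster per distinct attribute value, greedily merge the adjacent pair giving the highest NAsso until one cluster remains, then scan the recorded levels top-down and keep the last level at which NAsso still improves. The NAsso form and the toy affinity matrix are the same illustrative assumptions as in the splitting sketch.

```python
import numpy as np

# Sketch of the bottom-up (merging) discretizer described above.
def nasso(W, clusters):
    total, all_idx = 0.0, np.arange(len(W))
    for C in clusters:
        C = np.asarray(C)
        total += W[np.ix_(C, C)].sum() / W[np.ix_(C, all_idx)].sum()
    return total

def merge_discretize(W, clusters):
    levels = [clusters]
    while len(clusters) > 1:
        # try merging every adjacent pair, keep the merge with the best NAsso
        candidates = [clusters[:i] + [clusters[i] + clusters[i + 1]] + clusters[i + 2:]
                      for i in range(len(clusters) - 1)]
        clusters = max(candidates, key=lambda cand: nasso(W, cand))
        levels.append(clusters)
    best = levels[-1]                      # pi_1: one cluster, NAsso = 1
    for lvl in reversed(levels[:-1]):      # top-down scan over recorded levels
        if nasso(W, lvl) <= nasso(W, best):
            break                          # NAsso stopped improving
        best = lvl
    return best

W = np.full((6, 6), 0.1)
W[:3, :3] = W[3:, 3:] = 0.9                # two natural value ranges
np.fill_diagonal(W, 0.0)
intervals = merge_discretize(W, [[i] for i in range(6)])
print(intervals)                           # [[0, 1, 2], [3, 4, 5]]
```

On this toy matrix the merge sequence collapses the two similarity blocks first, and the top-down scan stops at \(\pi_2\), mirroring the Fig. 8 walkthrough.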
Numeric to nominal transformation
Performance evaluation
This section presents the performance evaluation of the graph clustering-based approaches: the splitting approach (GraphS) and the merging approach (GraphM). To show that the proposed approaches can handle real-world applications, this study investigated 30 real standard datasets and 20 imbalanced datasets. The proposed methods are evaluated against 11 discretization algorithms using 4 classifiers.
Investigated datasets
Description of 30 standard datasets
Dataset  n  d  \(d^u\)  \(d^o\)  c 
(n: number of instances, d: number of attributes, \(d^u\): numeric attributes, \(d^o\): nominal attributes, c: number of classes)

Australian  690  14  10  4  2 
Autos  205  24  14  10  6 
Banknote  1372  4  4  0  2 
Biodeg  1055  41  38  3  2 
Blood  748  4  4  0  2 
Bupa  345  6  6  0  2 
Cleve  295  13  6  7  2 
Column2C  310  6  6  0  2 
Column3C  310  6  6  0  3 
Ecoli  336  8  5  3  8 
Faults  1941  27  25  2  7 
Glass  214  9  9  0  6 
Haberman  306  3  3  0  2 
Hayes  132  5  5  0  3 
Heart  270  13  10  3  2 
Hepatitis  155  19  6  13  2 
ILPD  583  10  9  1  2 
Ionosphere  351  34  32  2  2 
Iris  150  4  4  0  3 
Liver  345  6  6  0  2 
Pima  768  8  8  0  2 
Seeds  210  7  7  0  3 
Segment  2310  19  18  1  7 
Sonar  208  60  60  0  2 
Tae  151  5  3  2  3 
Transfusion  748  4  4  0  2 
Vowel  990  13  11  2  11 
Wine  178  13  13  0  3 
Wisconsin  683  9  9  0  2 
Yeast  1484  9  7  2  10 
Description of 20 imbalanced datasets
Dataset  n  \(n_{{\text {min}}}\)  \(k_{{\text {min}}}\)  d  \(d^u\)  \(d^o\)  c 
(\(n_{{\text {min}}}\): minority-class instances, \(k_{{\text {min}}}\): minority-class proportion; n, d, \(d^u\), \(d^o\), c as in the previous table)

Abalone19  4174  32  0.008  8  7  1  2 
Abalone918  731  42  0.057  8  7  1  2 
Ecoli0_vs_1  220  77  0.350  7  5  2  2 
Ecoli0137_vs_26  281  7  0.025  7  5  2  2 
Ecoli1  336  77  0.229  7  5  2  2 
Ecoli2  336  52  0.155  7  5  2  2 
Ecoli3  336  35  0.104  7  5  2  2 
Ecoli4  336  20  0.060  7  5  2  2 
Glass0  214  70  0.327  9  9  0  2 
Glass1  214  76  0.355  9  9  0  2 
Glass2  214  17  0.079  9  9  0  2 
Glass4  214  13  0.061  9  9  0  2 
Glass5  214  9  0.042  9  9  0  2 
Pageblocks0  5472  559  0.102  10  10  0  2 
Pima  768  268  0.349  8  8  0  2 
Segment0  2308  329  0.143  19  18  1  2 
Vehicle0  846  199  0.235  18  18  0  2 
Vowel0  988  90  0.091  13  11  2  2 
Wisconsin  683  239  0.350  9  9  0  2 
Yeast05679_vs_4  528  51  0.097  8  7  1  2 
Experiment design
An experiment is set up to investigate the performance of the proposed algorithms compared to 11 discretization algorithms based on a variety of techniques. For comparison, 4 classifiers are examined: C4.5 (J48) [61], K-Nearest Neighbors (KNN) [62], Naive Bayes (NB) [63], and Support Vector Machine (SVM) [64], which are among the top 10 algorithms in data mining [65, 66].
The standard datasets are partitioned using the tenfold cross-validation procedure [67]; each discretization algorithm is run on the 10 training/testing pairs of each dataset. The imbalanced datasets are already separated into five folds, so this study evaluates them using the fivefold cross-validation procedure.
The performance measures
To evaluate the quality of the discretization algorithms, the following 3 measures are employed: the number of intervals, the running time, and the predictive accuracy.
The number of intervals The number of intervals is expected to be small, as a large number of intervals may cause the learning to be slow and ineffective [7, 19], and a small number of intervals is easier to understand. However, overly simple discretization schemes with too few intervals may lead to a loss of classification performance [16].
The predictive accuracy A successful discretization algorithm should perform discretization such that the predictive accuracy is increased, or at least not significantly reduced. This study evaluates the classification performance on the 30 standard datasets using predictive accuracy. The predictive accuracy result for each dataset is summarized as the mean over its 10 folds.
AUC Because predictive accuracy is not a suitable measure for imbalanced data, this study evaluates the 20 imbalanced datasets using the AUC (area under the ROC curve) [68, 69]. The AUC is widely used for classification evaluation on imbalanced data [70, 71]; it measures the diagnostic accuracy of a test. The AUC value lies between 0 and 1, and a higher value indicates better average accuracy. This study summarizes the AUC results of the 20 imbalanced datasets by the mean over their 5 folds.
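For a binary problem, the AUC equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (the normalized Mann-Whitney U statistic), which gives a compact reference sketch; the scores and labels below are illustrative.

```python
import numpy as np

# AUC as the normalized Mann-Whitney U statistic: the fraction of
# positive/negative pairs in which the positive is ranked higher
# (ties count as half a win).
def auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0])
print(auc(scores, labels))   # 5 of 6 pairs correctly ordered, about 0.833
```

Unlike accuracy, this pairwise formulation is insensitive to the class ratio, which is why it suits the imbalanced datasets above.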
Statistical analysis
 1.
The Nemenyi post-hoc test [74] is used to find the critical difference (CD). Two algorithms are significantly different if their average ranks differ by at least the CD (using a 95% confidence level).
 2.
The Holm post-hoc test [75, 76] is used to find the p-value (\(p_{Holm}\)) of each pairwise comparison. The discretization algorithm that obtains the lowest rank value is set as the control algorithm, which is then compared against the remaining algorithms.
Compared discretization algorithms
Ameva [77] The autonomous discretization algorithm (Ameva) is a univariate, supervised and splitting method. The discretization criterion of the algorithm is based on \(\chi ^{2}\) values. Ameva has two objectives: to maximize the dependency relationship between the target class and an attribute, and to minimize the number of intervals.
CAIM [15] The Class-Attribute Interdependence Maximization discretization algorithm (CAIM) is a splitting method proposed by Kurgan and Cios. The goal of the algorithm is to find the minimum number of discrete intervals while minimizing the loss of class-attribute interdependency, pursuing the optimal discretization with a greedy approach. It iteratively finds the best cut point to split an interval in two until the stopping criterion is satisfied. The algorithm mostly generates discretization schemes in which the number of intervals equals the number of classes [16, 78].
ChiMerge [11] ChiMerge is a merging method introduced by Kerber. It uses \(\chi ^{2}\) values as the discretization criterion. It divides the instances into intervals of distinct values and then iteratively merges the best adjacent intervals until the stopping criterion is fulfilled. The stopping criterion of ChiMerge is related to a \(\chi ^2\) threshold, and to compute this threshold the user must specify the significance level.
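ChiMerge's merging criterion, the \(\chi^2\) statistic of the class-frequency table of two adjacent intervals, can be sketched as follows. The small-constant substitution for zero expected counts is a safeguard commonly used in ChiMerge implementations; treat the exact constant as an assumption of this sketch.

```python
import numpy as np

# Chi-square statistic over the 2 x (number of classes) contingency table
# of two adjacent intervals. A low value means similar class distributions,
# i.e. a good candidate pair to merge.
def chi2_adjacent(counts_a, counts_b):
    """counts_a / counts_b: per-class frequency vectors of adjacent intervals."""
    table = np.array([counts_a, counts_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    expected[expected == 0] = 0.1          # safeguard against zero expected counts
    return float((((table - expected) ** 2) / expected).sum())

# identical class distributions -> chi2 = 0, so this pair merges first
print(chi2_adjacent([4, 2], [4, 2]))       # 0.0
# completely different distributions -> large chi2, so the cut point stays
print(chi2_adjacent([6, 0], [0, 6]))       # 12.0
```

The algorithm repeatedly merges the adjacent pair with the lowest statistic until every remaining value exceeds the user-chosen \(\chi^2\) threshold.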
EMD [31] The Evolutionary Multivariate Discretizer (EMD) is a multivariate method based on the CHC algorithm [79], a variant of the genetic algorithm that is a powerful search method. The algorithm defines a fitness function with two objectives: a lower classification error (based on the C4.5 and NB classifiers) and a lower number of cut points. A chromosome is encoded as a binary array over all possible cut points of the continuous attributes, where 1 means the cut point is selected and 0 otherwise. Consequently, the algorithm requires a long time to search for the optimal result, especially for data with high dimensionality and a large number of instances.
FFD and PKID [21] These algorithms are unsupervised methods proposed by Yang and Webb. The key idea is to manage discretization bias and variance by tuning the interval frequency and the number of intervals; they are especially used with the Naive-Bayes classifier. Fixed frequency discretization (FFD) sets a sufficient interval frequency m, then discretizes such that all intervals contain approximately the same number m of training instances with adjacent values. Proportional discretization (PKID) sets the interval frequency and the interval number to be proportional to the amount of training data in order to achieve both low variance and low bias.
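The two frequency strategies can be contrasted in a short sketch. Placing each cut point midway between adjacent sorted values, and setting PKID's interval frequency and interval number to roughly \(\sqrt{n}\), are illustrative simplifications here; the cited paper spells out the exact boundary rules.

```python
import numpy as np

# Sketch of the two interval-frequency strategies: FFD fixes the interval
# frequency at m; PKID scales frequency and interval count with the data
# size (roughly sqrt(n) each, so their product is about n).
def ffd_cuts(x, m=30):
    x = np.sort(x)
    # one cut point after every m values, midway between neighbours
    return [float((x[i - 1] + x[i]) / 2) for i in range(m, len(x), m)]

def pkid_cuts(x):
    n = len(x)
    m = int(np.sqrt(n))            # frequency ~ interval count ~ sqrt(n)
    return ffd_cuts(x, max(m, 1))

x = np.arange(100, dtype=float)
print(len(ffd_cuts(x, 30)))        # 3 cut points -> 4 intervals
print(len(pkid_cuts(x)))           # sqrt(100)=10 per interval -> 9 cut points
```

With more training data, PKID produces more intervals as well as fuller ones, which is exactly the bias/variance trade the authors describe.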
FUSINTER [80] This algorithm is a greedy merging method that uses the same strategy as ChiMerge. Its main characteristic is a discretization measure that is sensitive to the sample size, which avoids very thin partitioning. The algorithm first merges adjacent intervals whose instances all belong to the same target class, and continues until no improvement is possible or the number of intervals reaches 1. The user must specify 2 parameters, \(\alpha\) (the significance level) and \(\lambda\) (a tuning variable), in order to control the performance of the discretization procedure.
HDD [3] HDD extends CAIM by improving the stopping criterion and by taking other attributes into account (a multivariate method). The algorithm considers the distribution of both the target class and the continuous attributes. It divides the continuous attribute space into a finite number of hypercubes such that the objects within each hypercube belong to the same decision class. However, the algorithm mostly generates a large number of intervals and has a slow discretization time [16].
MChi2 [13] Modified Chi2 (MChi2), proposed by Tay and Shen, is a merging method using the \(\chi ^2\) statistic to determine the similarity of adjacent intervals. The algorithm enhances the Chi2 algorithm [12] by making the discretization process completely automatic. It replaces the inconsistency check in Chi2 with a level of consistency that is computed after each step of discretization. The algorithm also considers the degree of freedom in order to improve accuracy.
urCAIM [16] This algorithm was proposed by Cano et al. It improves CAIM by combining the CAIR [81], CAIU [82], and CAIM discretization criteria in order to generate more flexible discretization schemes, require a lower running time than CAIM, and improve predictive accuracy, especially on unbalanced data.
Zeta [83] This algorithm was introduced by Ho and Scott. It is a direct method in which the user must specify the number of intervals, k; each attribute is discretized into k intervals. The discretization criterion is based on lambda [84], which is widely used to measure the strength of association between nominal variables. The criterion is defined as the maximum accuracy achievable when each value of a feature predicts a different class value.
Parameter settings
Parameters of classifiers and discretizers
Method  Parameters 

Classifier  
C4.5  Pruned tree \(=true\), confidence \(=0.25\), minimum example per leaf \(=2\) 
KNN  \(k=3\), distance function \(=EuclideanDistance\) 
Discretizer  
ChiMerge  Confidence threshold \(=0.05\) 
EMD  Population size \(=50, ~ M_e=10,000, ~ \alpha =0.7, ~ R_{rate}=0.1, ~ R_{perc}=0.5\) 
FFD  Frequency size \(=30\) 
FUSINTER  \(\alpha =0.975, ~ \lambda =1\) 
HDD  Coefficient \(=0.8\) 
GraphS, GraphM  \(\beta =1.01\) 
Experiment results and analysis of standard datasets
Number of intervals
In the figure, the lowest average number of intervals per attribute over all datasets belongs to EMD (2.04), the second is Ameva (2.51), and the third is ChiMerge (2.99). Since EMD uses the predictive accuracy of C4.5 and NB as part of its fitness function, some attributes excluded from the final classification model will have no cut point, i.e., the whole attribute domain is treated as a single interval. For ChiMerge, the number of intervals depends on the user-specified significance level (\(\alpha\)). If the user sets this threshold high (\(\alpha\) close to 0), the algorithm will over-merge, leading to a low number of intervals. When EMD and ChiMerge generate a single interval for an attribute, that attribute is effectively removed from classification learning. Unlike EMD and ChiMerge, which implicitly couple supervised feature selection with discretization, GraphM and GraphS concentrate on the latter, with the possibility of being combined with many advanced feature selection methods. The average number of removed attributes for all discretization algorithms is summarized in Fig. 11c.
For many datasets, CAIM and Zeta produce the lowest number of intervals. CAIM discretizes with a number of intervals close to the number of target classes; if there are many target classes, the number of intervals of CAIM is correspondingly high (provided there are enough possible cut points to split). Zeta is a direct method that fixes the number of intervals to the number of target classes, resulting in all attributes having an equal number of intervals.
Running time
Figure 11b presents the actual computational time (in seconds) required to create the discretization scheme and discretize the experimented datasets. These algorithms are implemented in Java and all experiments are conducted on an Intel(R) Xeon(R) CPU@2.40 GHz with 4 GB RAM. The execution time is measured using the System.currentTimeMillis() Java method (\(endTime-startTime\)).
The execution time is averaged over tenfold discretization (classifier learning time not included). The fastest technique over the 30 standard datasets is PKID (0.109 s), while the slowest is EMD (113.435 s); EMD is thus more than 1000 times slower than PKID. This is because EMD is designed as an evolutionary process that uses a wrapper fitness function with a chromosome encoding all cut points of the examined attributes. Without any approximation heuristics, EMD naturally requires a long time to search for the optimal fitness value.
GraphS and GraphM are multivariate methods similar to EMD and HDD; however, their running times are comparable to those of the univariate discretization algorithms. The average running time of GraphM is more than ten times faster than that of GraphS, because GraphM iteratively merges only pairs of adjacent intervals, whereas GraphS evaluates all candidate cut points to find the best one and hence spends more time searching. More details are discussed in “Time complexity and parameter analysis”.
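To illustrate why the splitting search is the slower of the two, the following Python sketch (an illustration, not the authors' Java implementation) evaluates every candidate cut point of every interval against a normalized-association score of the form described for Eq. 7; the similarity matrix `W` and the representation of intervals as contiguous index ranges are illustrative assumptions:

```python
import numpy as np

def nasso(W, intervals):
    """Normalized association: sum over intervals Ci of
    w(Ci, Ci) / (w(Ci, Ci) + w(Ci, ~Ci)).  The denominator equals the
    total edge weight incident to Ci, i.e. the sum of Ci's rows of W."""
    row_sums = W.sum(axis=1)
    score = 0.0
    for lo, hi in intervals:             # interval = contiguous range [lo, hi)
        idx = np.arange(lo, hi)
        within = W[np.ix_(idx, idx)].sum()
        total = row_sums[idx].sum()      # within-weight plus cut-weight
        score += within / total if total > 0 else 0.0
    return score

def graphs_split_step(W, intervals):
    """GraphS-style step (sketch): try every candidate cut point inside
    every interval and return the split with the highest NAsso."""
    best, best_iv = -np.inf, None
    for i, (lo, hi) in enumerate(intervals):
        for cut in range(lo + 1, hi):    # exhaustive cut-point search
            trial = intervals[:i] + [(lo, cut), (cut, hi)] + intervals[i + 1:]
            s = nasso(W, trial)
            if s > best:
                best, best_iv = s, trial
    return best_iv, best
```

On a toy similarity matrix with two natural groups, the step recovers the cut between the groups; the exhaustive inner loop over cut points is exactly the search that GraphM avoids.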
Predictive accuracy
According to these results, the three discretization algorithms with the highest average predictive accuracy across the 30 standard datasets with the C4.5 classifier are EMD (78.20%), GraphS (77.88%), and GraphM (77.79%), respectively. The corresponding results with the KNN classifier are GraphS (78.04%), urCAIM (77.95%), and GraphM (77.73%). The three highest average accuracies with the NB classifier are urCAIM (77.24%), GraphS (77.20%), and EMD (77.12%). These results indicate that GraphS, GraphM, urCAIM, and EMD generally perform better than the other techniques included in this experiment. With respect to the overall measure across the three classifiers, the highest accuracy belongs to GraphS (77.42%), the second to urCAIM (77.2%), and the third to GraphM (77.16%). As such, the graph clustering-based discretization algorithms are usually more accurate than many other well-known counterparts.
Friedman rankings with critical differences (CD)
In the ranking of the average number of intervals, EMD and ChiMerge obtain the first and second lowest rankings, with no significant difference between the pair. However, EMD yields a significantly lower number of intervals than all remaining algorithms. In the ranking of average running times, urCAIM obtains the first ranking, whereas EMD appears last. In addition, urCAIM is not significantly different from GraphM, CAIM, ChiMerge, FFD, PKID, Zeta, and GraphS.
To compute the ranking of predictive accuracy of each discretization algorithm more finely, the 300 accuracy results over 30 standard datasets and tenfold evaluation are examined. In Fig. 14, the lowest average accuracy ranking for the C4.5 classifier belongs to EMD; however, it is not significantly better than GraphM, GraphS, urCAIM, and Ameva. GraphM obtains the lowest ranking for the KNN classifier, but it is not significantly different from urCAIM, GraphS, and EMD. The three lowest rankings for the NB classifier are GraphM, GraphS, and EMD, with no significant differences among them. In the average ranking for the SVM classifier, GraphS obtains the lowest ranking, but it is not significantly better than ChiMerge, urCAIM, GraphM, and CAIM. In summary, for the average ranking over all classifiers, GraphM, GraphS, and urCAIM obtain the first, second, and third lowest rankings, respectively, with no significant differences among them; however, GraphM and GraphS are significantly better than the remaining algorithms.
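The Friedman average ranks used throughout these comparisons can be sketched as follows. The `friedman_test` helper and the accuracy matrix are illustrative (the reported rankings come from the paper's own statistical tooling), and this simplified version ignores ties, which would normally receive averaged ranks:

```python
import numpy as np

def friedman_test(scores):
    """Friedman test over a (datasets x methods) score matrix, where
    higher scores are better.  Returns (average ranks, chi-square
    statistic); rank 1 is best.  Ties are not handled in this sketch."""
    n, k = scores.shape
    # Rank the methods within each dataset (row): best score -> rank 1.
    order = (-scores).argsort(axis=1)
    ranks = np.empty_like(order, dtype=float)
    ranks[np.arange(n)[:, None], order] = np.arange(1, k + 1)
    avg = ranks.mean(axis=0)
    # Friedman chi-square: 12n/(k(k+1)) * sum_j (Rbar_j - (k+1)/2)^2
    stat = 12 * n / (k * (k + 1)) * ((avg - (k + 1) / 2) ** 2).sum()
    return avg, stat

# Hypothetical accuracies: 4 datasets (rows) x 3 discretizers (columns).
acc = np.array([[0.78, 0.75, 0.70],
                [0.81, 0.80, 0.74],
                [0.69, 0.66, 0.65],
                [0.90, 0.88, 0.85]])
avg_ranks, stat = friedman_test(acc)
```

With a method that wins on every dataset, its average rank is exactly 1.0, and the statistic is compared against a chi-square distribution with \(k-1\) degrees of freedom before post-hoc tests such as Nemenyi or Holm are applied.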
Given these findings, GraphM and GraphS can be useful not only for data analysis with high accuracy, but also with a reasonable time requirement. Also, the resulting discretized dimensions can be coupled with many effective feature selection approaches found in the literature.
Friedman rankings with \(p_{Holm}\)
Average Friedman rankings and \(p_{Holm}\) of the number of intervals, running time, and predictive accuracy for standard datasets
Discretizer  Ranking  \(p_{Holm}\)  Discretizer  Ranking  \(p_{Holm}\) 

Number of intervals  Running times  
EMD  2.7533  –  urCAIM  3.05  – 
ChiMerge  3.3883  0.045827  GraphM  3.5167  0.642579 
Ameva  4.39  0.000001  CAIM  4.35  0.392135 
CAIM  4.4767  0  ChiMerge  4.85  0.220322 
Zeta  4.7017  0  FFD  5.5667  0.058782 
GraphM  5.735  0  PKID  5.5833  0.058782 
GraphS  5.98  0  Zeta  5.8333  0.033841 
urCAIM  6.1883  0  GraphS  6.15  0.014349 
MChi2  8.2133  0  FUSINTER  9.0667  0 
FFD  9.7033  0  Ameva  9.2667  0 
PKID  10.5367  0  MChi2  9.5  0 
FUSINTER  11.9667  0  HDD  11.2667  0 
HDD  12.9667  0  EMD  13  0 
Accuracy of C4.5  Accuracy of KNN  
EMD  5.3033  –  GraphM  4.5667  – 
GraphM  5.4417  0.888248  urCAIM  5.0667  0.05 
GraphS  5.5467  0.888248  GraphS  5.0833  0.025 
urCAIM  5.7417  0.504152  EMD  5.5167  0.016667 
Ameva  6.2867  0.007941  Ameva  5.7667  0.0125 
ChiMerge  6.615  0.000185  Zeta  6.4167  0.01 
Zeta  6.7117  0.000057  ChiMerge  6.8  0.008333 
CAIM  6.735  0.000047  CAIM  6.8333  0.007143 
MChi2  6.94  0.000002  MChi2  7.1833  0.00625 
FFD  7.7  0  FFD  8.25  0.005556 
PKID  8.4983  0  PKID  8.65  0.005 
HDD  9.5333  0  FUSINTER  9.5333  0.004545 
FUSINTER  9.9467  0  HDD  11.3333  0.004167 
Accuracy of NB  Accuracy of SVM  
GraphM  5.0333  –  GraphS  4.7667  – 
GraphS  5.4333  0.208413  ChiMerge  4.95  0.638626 
EMD  5.9  0.012839  urCAIM  5.0833  0.638626 
ChiMerge  6.1  0.002385  GraphM  5.45  0.094907 
urCAIM  6.1833  0.001194  CAIM  5.8167  0.003839 
CAIM  6.2833  0.000423  EMD  6.55  0 
Zeta  6.9  0  MChi2  6.7333  0 
FFD  7.0167  0  Zeta  6.7833  0 
Ameva  7.1833  0  Ameva  7.0333  0 
MChi2  7.4333  0  PKID  8.3167  0 
PKID  7.6  0  FFD  8.7333  0 
FUSINTER  9.6167  0  FUSINTER  10.1  0 
HDD  10.3167  0  HDD  10.6833  0 
Accuracy of all classifiers  
GraphM  5.1229  –  
GraphS  5.2075  0.594723  
urCAIM  5.5188  0.025572  
EMD  5.8175  0.000037  
ChiMerge  6.1163  0  
CAIM  6.4171  0  
Ameva  6.5675  0  
Zeta  6.7029  0  
MChi2  7.0725  0  
FFD  7.925  0  
PKID  8.2663  0  
FUSINTER  9.7992  0  
HDD  10.4667  0 
The results of the \(p_{Holm}\) test for the number of intervals, running times, and predictive accuracy of all classifiers are similar to those of the Nemenyi post-hoc test. EMD does not yield a significantly lower number of intervals than ChiMerge, and urCAIM is not significantly faster than GraphM, CAIM, ChiMerge, FFD, PKID, and GraphS. For the average ranking of predictive accuracy over all classifiers, GraphM, GraphS, and urCAIM obtain the first, second, and third lowest rankings. Using the significance level \(\alpha =0.01\), GraphM is not significantly more accurate than GraphS and urCAIM, but it is significantly more accurate than the remaining discretization algorithms. Using \(\alpha =0.05\), GraphM shows significantly better predictive accuracy than every other well-known discretization algorithm except GraphS.
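The \(p_{Holm}\) columns in the table are Holm step-down adjusted p-values. A minimal sketch of the adjustment (the `holm_adjust` name is ours; some tools report the unclamped product, which is why a few table entries exceed 1) is:

```python
def holm_adjust(pvals):
    """Holm step-down adjustment: sort the m raw p-values ascending,
    multiply the i-th smallest (0-based) by (m - i), and enforce
    monotonicity with a running maximum, clamping at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

A hypothesis is rejected at level \(\alpha\) exactly when its adjusted p-value is below \(\alpha\), which is how the significance statements in this section are read off the table.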
Experiment results and analysis of imbalanced datasets
Number of intervals
The average number of intervals per attribute is similar to that for the standard datasets: the lowest and highest averages belong to EMD (1.48) and HDD (154.51), respectively. Some discretization algorithms implicitly couple discretization with supervised feature selection, especially EMD, ChiMerge, and MChi2, as shown in Fig. 15c (average number of removed attributes). The proposed methods obtain a low number of intervals, with an average of 2.4 for both GraphS and GraphM, and they do not remove any attributes. Combining the proposed methods with advanced feature selection afterwards may further improve classification performance.
Running time
The running time is averaged over fivefold discretization, including both creating the discretization scheme and transforming the numeric dataset into a nominal one. The fastest running time belongs to urCAIM (0.0757 s), while the slowest is EMD (36.4306 s), as shown in Fig. 15b (average running time in seconds). Generally a multivariate method requires a higher running time than a univariate one. GraphS and GraphM are multivariate methods like EMD and HDD; however, they run faster than EMD and HDD, with running times comparable to the univariate methods. The detailed running time results are included as Additional file 3.
AUC
Friedman rankings with critical differences (CD)
The ranking of the average number of intervals is similar to that of the standard datasets: EMD and ChiMerge show the first and second lowest rankings, while HDD shows the highest. In the ranking of average running times, GraphM obtains the first ranking; however, it is not significantly different from urCAIM, Zeta, ChiMerge, FFD, PKID, and GraphS. The slowest running time still belongs to EMD.
The average rankings of AUC for the C4.5, KNN, NB, SVM, and all classifiers are shown in Fig. 18. In the AUC rankings for the C4.5 and SVM classifiers, the three lowest rankings belong to EMD, GraphM, and GraphS; these three discretizers are close in the rankings, with no significant differences among them. In the average AUC ranking for the KNN classifier, GraphS and GraphM obtain the lowest rankings, although they are not significantly different from FFD, PKID, urCAIM, and MChi2. For the NB classifier, the first and second lowest rankings belong to FFD and PKID, with GraphS and GraphM in the fourth and seventh positions, respectively. In summary, in the average ranking over all classifiers, GraphS and GraphM obtain the first and second best rankings; however, they are not significantly different from urCAIM.
These rankings suggest that the proposed methods are useful for imbalanced data, achieving the best AUC rankings while requiring lower running times and producing fewer intervals.
Friedman rankings with \(p_{Holm}\)
Average Friedman rankings and \(p_{Holm}\) of the number of intervals, running time, and AUC for imbalanced datasets
Discretizer  Ranking  \(p_{Holm}\)  Discretizer  Ranking  \(p_{Holm}\) 

Number of intervals  Running times  
EMD  1.51  –  GraphM  2.65  – 
ChiMerge  2.095  0.288157  urCAIM  2.875  0.855034 
CAIM  3.73  0.000167  Zeta  4.3  0.360623 
Zeta  3.73  0.000167  ChiMerge  5.05  0.15396 
GraphM  6.295  0  FFD  5.225  0.14615 
GraphS  6.325  0  PKID  5.55  0.092665 
urCAIM  6.47  0  GraphS  6.45  0.012189 
Ameva  6.91  0  FUSINTER  7.775  0.000221 
MChi2  8.715  0  CAIM  8.6  0.000011 
FFD  10.22  0  MChi2  8.9  0.000003 
PKID  11.1  0  Ameva  9.85  0 
FUSINTER  11.81  0  HDD  10.825  0 
HDD  12.09  0  EMD  12.95  0 
AUC of C4.5  AUC of KNN  
EMD  4.8  –  GraphS  4.525  – 
GraphM  4.8  1.855328  GraphM  4.525  1.501358 
GraphS  4.85  1.855328  FFD  4.7  1.501358 
Ameva  5.9  0.137394  PKID  5.425  0.306705 
urCAIM  6.05  0.092927  urCAIM  5.825  0.073023 
ChiMerge  7  0.000389  MChi2  6.1  0.021202 
MChi2  7  0.000389  Ameva  6.6  0.000989 
CAIM  7.275  0.000049  ChiMerge  6.8  0.000253 
FFD  7.875  0  CAIM  8.125  0 
Zeta  7.975  0  EMD  8.15  0 
PKID  8.45  0  FUSINTER  9.05  0 
HDD  9.025  0  HDD  10.05  0 
FUSINTER  10  0  Zeta  11.125  0 
AUC of NB  AUC of SVM  
FFD  3.375  –  EMD  5.05  – 
PKID  4.025  0.237923  GraphS  5.15  1.949658 
MChi2  5.3  0.000947  GraphM  5.225  1.949658 
GraphS  5.875  0.000017  urCAIM  5.3  1.949658 
urCAIM  6.05  0.000005  CAIM  5.675  1.025834 
FUSINTER  6.15  0.000002  ChiMerge  5.95  0.511174 
GraphM  6.375  0  Ameva  6.35  0.109535 
Ameva  6.825  0  Zeta  6.575  0.03937 
ChiMerge  8.2  0  MChi2  6.9  0.006258 
CAIM  8.325  0  FFD  9.6  0 
EMD  8.8  0  PKID  9.625  0 
HDD  9.475  0  FUSINTER  9.725  0 
Zeta  12.225  0  HDD  9.875  0 
AUC of all classifiers  
GraphS  5.1  –  
GraphM  5.2312  0.633635  
urCAIM  5.8062  0.020656  
MChi2  6.325  0.000026  
FFD  6.3875  0.000012  
Ameva  6.4187  0.000008  
EMD  6.7  0  
PKID  6.8812  0  
ChiMerge  6.9875  0  
CAIM  7.35  0  
FUSINTER  8.7312  0  
Zeta  9.475  0  
HDD  9.6062  0 
The \(p_{Holm}\) results for the average ranking of the number of intervals show that EMD is not significantly different from ChiMerge, but yields a significantly lower number of intervals than the remaining discretizers. In the average running time ranking, GraphM is not significantly faster than urCAIM, Zeta, ChiMerge, FFD, PKID, and GraphS. For the average AUC ranking over all classifiers, GraphS is not significantly better than GraphM; however, using \(\alpha =0.05\), GraphS shows significantly better AUC than all of the other well-known discretization algorithms.
Experiment results of Toy dataset
The results show that ChiMerge and FFD created no cut points. The number of cut points selected by ChiMerge depends on the significance level (\(\alpha\)), and in this study \(\alpha\) is set to 0.05 (recommended by the authors of ChiMerge). Demanding too much confidence (\(\alpha\) close to 0) leads to over-merging; to allow the algorithm to select cut points, the confidence requirement should be relaxed. FFD is an unsupervised method for which the user must specify the frequency size, here set to 30 (recommended by the authors of FFD). Since the number of data points is no more than 30, all data points are grouped into one interval. To obtain the desired cut points, the frequency size should be smaller than the number of data points.
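ChiMerge's merging decision is driven by a chi-square test of independence between adjacent intervals' class distributions; two intervals whose statistic falls below the threshold implied by \(\alpha\) are merged. A minimal sketch of that statistic (the function name is ours; intervals are given as per-class count vectors) is:

```python
def chi2_adjacent(a, b):
    """Chi-square statistic for two adjacent intervals, given their
    class-count vectors a and b (same length = number of classes).
    Cells with zero expected count are skipped, as in ChiMerge."""
    n_a, n_b = sum(a), sum(b)
    total = n_a + n_b
    chi2 = 0.0
    for j in range(len(a)):
        col = a[j] + b[j]                # total count of class j
        for obs, n in ((a[j], n_a), (b[j], n_b)):
            exp = n * col / total        # expected count under independence
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2
```

Identical class distributions give a statistic of 0 (always merged), while perfectly separated classes give the maximal value, so a stricter \(\alpha\) (closer to 0) raises the merge threshold and produces fewer intervals, as observed above.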
EMD selects only a single cut point, at 2.5 on attribute \(A_1\). Because the fitness function of EMD rewards both lower predictive error and a lower number of intervals, attributes not used to create the classifier model mostly receive no cut point. In this case, the selected cut point also damages the natural groups of data, as illustrated by the upper-left 5 data points.
The proposed GraphS and GraphM algorithms show the same discretization result: the cut points of \(A_1\) and \(A_2\) are both 3.5. Although the data points adjacent to the cut point 3.5 of \(A_1\) (3:7 and 4:1) share the same target class, they differ strongly in natural grouping: the upper-left 5 data points form one natural group and the lower 5 data points the other. By selecting this cut point, the graph-based algorithms preserve the natural groups of data points. In addition, the proposed algorithms do not isolate the rightmost data point (12:5), as FUSINTER, MChi2, and HDD do. This clearly shows that the graph clustering-based discretization algorithms give a finite number of cut points, prevent an unnatural bias toward undesirably partitioning out individual data points, and preserve the natural groups. This differs from CAIM, Zeta, Ameva, and urCAIM, which consider only the target class; based on class purity, they select the cut point at 2.5 of \(A_1\), which damages the upper-left natural group.
PKID is an unsupervised method that does not consider the target class and the natural groups. The algorithm discretizes by giving all intervals similar numbers of data points. Considering \(A_1\), there are 3 intervals and the number of data points in the intervals are 4, 4, and 2. For \(A_2\), the number of data points in the divided intervals are 3, 4, and 3.
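PKID's proportional scheme sizes both the number of intervals and the interval frequency at roughly \(\sqrt{n}\). The sketch below (the `pkid_bins` name is ours) reproduces the 4, 4, 2 split reported for \(A_1\) on the 10-point toy data; the actual split for \(A_2\) additionally depends on the sorted attribute values:

```python
import math

def pkid_bins(values):
    """PKID-style equal-frequency split (sketch): roughly sqrt(n)
    intervals of roughly sqrt(n) points each, ignoring value ties."""
    xs = sorted(values)
    n = len(xs)
    t = max(1, round(math.sqrt(n)))     # target number of intervals
    size = math.ceil(n / t)             # points per interval
    return [xs[i:i + size] for i in range(0, n, size)]
```

Because the rule looks only at counts, never at class labels or at gaps between values, it cannot respect either class purity or the natural groups, which is the behaviour criticized above.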
The FUSINTER, MChi2, and HDD algorithms produce very large numbers of intervals, especially HDD, which selects every single possible cut point.
Time complexity and parameter analysis
Time complexity
Besides the previous quality assessments, the computational requirements of the graph clustering-based methods are discussed here. To estimate the time complexity, only the time taken to discretize a single attribute is considered. Since the AF matrix is created once and shared when discretizing the other attributes, the complexity does not include the time to create the matrix. In addition, to ease computing the denominator of NAsso, \(\omega (C_i, C_i) + \omega (C_i, \bar{C_i})\) (see Eq. 7), the sum of each row is calculated once and stored in sumRowsVector.
For the GraphM algorithm (see Fig. 9 for details), the per-evaluation cost of NAsso is similar to that of GraphS, but GraphM spends no time searching for the best cut point. From line 7 to line 9, GraphM computes NAsso on small matrices many times; however, this process is very fast. After that, from line 10 to line 17, the algorithm iteratively merges until all intervals are merged into one. Each merge computes NAsso twice, once for each of the two adjacent intervals involved; therefore, the time complexity of GraphM is \(\mathcal {O}(2n^2)=\mathcal {O}(n^2)\).
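The overall shape of this merging loop can be sketched in Python as follows (an illustration under our own assumptions, not the Fig. 9 pseudocode): row sums are precomputed once, as with the paper's sumRowsVector, and the loop starts from singleton intervals and greedily merges the adjacent pair that maximizes NAsso until one interval remains:

```python
import numpy as np

def nasso(W, intervals, row_sums):
    """NAsso with the denominator taken from precomputed row sums."""
    score = 0.0
    for lo, hi in intervals:
        idx = np.arange(lo, hi)
        within = W[np.ix_(idx, idx)].sum()
        total = row_sums[idx].sum()
        score += within / total if total > 0 else 0.0
    return score

def graphm_merge_all(W):
    """GraphM-style loop (sketch): repeatedly merge the adjacent pair of
    intervals giving the highest NAsso; returns the partition sequence."""
    n = W.shape[0]
    row_sums = W.sum(axis=1)             # the paper's sumRowsVector
    intervals = [(i, i + 1) for i in range(n)]
    history = [intervals]
    while len(intervals) > 1:
        best, best_iv = -np.inf, None
        for i in range(len(intervals) - 1):   # adjacent pairs only
            trial = (intervals[:i]
                     + [(intervals[i][0], intervals[i + 1][1])]
                     + intervals[i + 2:])
            s = nasso(W, trial, row_sums)
            if s > best:
                best, best_iv = s, trial
        intervals = best_iv
        history.append(intervals)
    return history
```

Because each of the \(n-1\) merge rounds only rescans the adjacent pairs, the total work stays quadratic in the number of data points, in line with the \(\mathcal {O}(n^2)\) bound above.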
Parameter analysis
An important observation on the average accuracy for the standard datasets is that the proposed algorithms perform well when the \(\beta\) value is between 1.005 and 1.015. Accuracy drops if \(\beta\) is greater than 1.015, as the stopping criterion is then met quickly and the process stops too early, leaving only two intervals. In contrast, if \(\beta\) is not considered \((\beta =1)\), the algorithm over-partitions and produces too many intervals. Therefore, a good \(\beta\) value, giving a decent number of intervals, lies in the range 1.005–1.015. Similarly, the average AUC results on the imbalanced datasets are good when \(\beta\) lies in the range 1.000–1.015, especially \(\beta =1.005\).
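One plausible reading of the \(\beta\) criterion, consistent with the behaviour described above but only a sketch of the authors' exact rule, is that a new partition is accepted only when it improves NAsso by at least a factor of \(\beta\):

```python
def keep_partitioning(nasso_before, nasso_after, beta=1.01):
    """Hedged reading of the beta stopping rule (assumption, not the
    paper's exact formula): continue only if the NAsso of the new
    partition is at least beta times the current NAsso.  beta = 1
    accepts any improvement (over-partitioning); a large beta stops
    early, leaving few intervals."""
    return nasso_after >= beta * nasso_before
```

Under this reading, the reported sweet spot of 1.005–1.015 demands a 0.5–1.5% relative NAsso gain per step, which explains why larger values terminate after very few cuts.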
Conclusion
This paper presents two novel, highly effective graph clustering-based discretization algorithms: the splitting method (GraphS) and the merging method (GraphM). They aim to discretize while respecting the natural groups of data and preventing the creation of intervals containing only a few data points. The algorithms view the data as a graph, where the vertices are the data points (instances) and the weighted edges represent the similarity between them. The NAsso measure is used as the discretization criterion. The empirical study, with different discretization algorithms, classifiers, and two types of datasets (standard and imbalanced), suggests that the proposed graph clustering-based methods usually achieve superior discretization results compared with previous well-known discretization algorithms. Prominent future work includes an extensive study of the scoring of the weighted similarity edges using other distance measures. In addition, this methodology will be applied to specific domains such as biomedical datasets, where discretization is required not only for accurate prediction but also for an interpretable learning model.
Declarations
Authors' contributions
KS drafted this manuscript, developed the algorithms, conducted the experiments on the datasets, and analysed the results. TB provided guidelines and helped draft the manuscript. NI supervised this research, suggested the methods, and helped draft the manuscript. All authors corrected the manuscript. All authors read and approved the final manuscript.
Authors’ information
KS is a lecturer at the School of Computer and Information Technology, Chiang Rai Rajabhat University (CRRU), Thailand, since 2009. He received the BEng degree in computer engineering (first class honors) from Naresuan University (NU) in 2008 and MEng degree in computer engineering from Kasetsart University (KU) in 2012. Currently, he is a Ph.D. candidate in computer engineering at Mae Fah Luang University (MFU). His primary research interests are in the area of machine learning, data mining, data preprocessing, and data reduction. TB is a Lecturer at the School of Information Technology, Mae Fah Luang University, Chiang Rai, Thailand. He obtained Ph.D. in Artificial Intelligence from Cranfield University in 2003, and worked as PostDoctoral Research Associate (PDRA) at Aberystwyth University, during 20072010. His PDRA work focused on antiterrorism using data analytical and decision support synthesizes. He has been the leader of research projects in exploiting biometrics technology for antiterrorism in southernprovinces of Thailand, funded by Ministry of Defense. He also serves as a committee and reviewer of several venues, IEEE SMC, IEEE TKDE, Knowledge Based Systems, International Journal of Intelligent Systems Technologies and Applications, for instance. NI is an Assistant Professor at the School of Information Technology, Mae Fah Luang University. She received Ph.D. in Computer Science from Aberystwyth University in 2010, funded by Royal Thai Government. Her Ph.D. work won the Thesis Prize of 2012 by Thai National Research Council. Her present research of improving face classification for antiterrorism and crime protection has been funded by Ministry of Science and Technology. She serves as an editor for International Journal of Data Analysis Techniques and Strategies; as a committee and reviewer of several venues, IEEE SMC, IEEE TKDE, Machine Learning, for instance.
Acknowledgements
The authors would like to thank KEEL software [59, 60] for distributing the source code of discretization algorithms, and the authors of EMD [31] for EMD program, and the authors of urCAIM [16] for distributing the urCAIM program.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All datasets in this research, including 30 standard datasets and 20 imbalanced datasets can be found at website http://archive.ics.uci.edu/ml and http://sci2s.ugr.es/keel/category.php?cat=imb, respectively. In addition, these datasets are included in Additional file 4.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San FranciscoMATHGoogle Scholar
 Sriwanna K, Puntumapon K, Waiyamai K (2012) An enhanced classattribute interdependence maximization discretization algorithm. Springer, BerlinView ArticleGoogle Scholar
 Yang P, Li JS, Huang YX (2011) Hdd: a hypercube divisionbased algorithm for discretisation. Int J Syst Sci 42(4):557–566MathSciNetView ArticleMATHGoogle Scholar
 Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512View ArticleMATHGoogle Scholar
 de Sá CR, Soares C, Knobbe A (2016) Entropybased discretization methods for ranking data. Information Sciences 329:921–936 (special issue on Discovery Science) View ArticleGoogle Scholar
 RamírezGallego S, García S, MouriñoTalín H, MartínezRego D, BolónCanedo V, AlonsoBetanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdiscip Rev 6(1):5–21Google Scholar
 Garcia S, Luengo J, Sáez JA, López V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750View ArticleGoogle Scholar
 Sang Y, Li K (2012) Combining univariate and multivariate bottomup discretization. MultipleValued Logic and Soft Computing 20(1–2):161–187Google Scholar
 Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique. Data Min Knowl Discov 6(4):393–423MathSciNetView ArticleGoogle Scholar
 Dougherty J, Kohavi R, Sahami M et al (1995) Supervised and unsupervised discretization of continuous features. In: Machine learning: proceedings of the Twelfth international conference, vol 12, pp 194–202Google Scholar
 Kerber R (1992) Chimerge: discretization of numeric attributes. In: Proceedings of the tenth national conference on artificial intelligence. Aaai Press, San Jose, pp 123–128Google Scholar
 Liu H, Setiono R (1997) Feature selection via discretization. IEEE Trans Knowl Data Eng 9(4):642–645View ArticleGoogle Scholar
 Tay FE, Shen L (2002) A modified chi2 algorithm for discretization. IEEE Trans Knowl Data Eng 14(3):666–670View ArticleGoogle Scholar
 Sang Y, Qi H, Li K, Jin Y, Yan D, Gao S (2014) An effective discretization method for disposing highdimensional data. Inf Sci 270:73–91MathSciNetView ArticleMATHGoogle Scholar
 Kurgan LA, Cios KJ (2004) Caim discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145–153View ArticleGoogle Scholar
 Cano A, Nguyen DT, Ventura S, Cios KJ (2016) urcaim: improved caim discretization for unbalanced and balanced data. Soft Computing 20(1):173–188View ArticleGoogle Scholar
 Ching JY, Wong AK, Chan KCC (1995) Classdependent discretization for inductive learning from continuous and mixedmode data. IEEE Trans Pattern Anal Mach Intell 17(7):641–651View ArticleGoogle Scholar
 Fayyad UM, Irani KB (1993) Multiinterval discretization of continuousvalued attributes for classification learning. In: Proceedings of the 13th international joint conference on artificial intelligence. Chambéry, France, 28 Aug–3 Sept 1993, pp 1022–1029Google Scholar
 Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Kodratoff Y. (eds) Machine Learning — EWSL91. EWSL 1991. Lecture notes in computer science (Lecture notes in artificial intelligence), vol 482. Springer, BerlinGoogle Scholar
 Zeinalkhani M, Eftekhari M (2014) Fuzzy partitioning of continuous attributes through discretization methods to construct fuzzy decision tree classifiers. Inf Sci 278:715–735MathSciNetView ArticleMATHGoogle Scholar
 Yang Y, Webb GI (2009) Discretization for naivebayes learning: managing discretization bias and variance. Mach Learn 74(1):39–74View ArticleGoogle Scholar
 Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICAbased multivariate discretization algorithm. Springer, BerlinView ArticleGoogle Scholar
 Gupta A, Mehrotra KG, Mohan C (2010) A clusteringbased discretization for supervised learning. Stat Probab Lett 80(9):816–824MathSciNetView ArticleMATHGoogle Scholar
 Singh GK, Minz S (2007) Discretization using clustering and rough set theory. In: International conference on computing: theory and applications, 2007. ICCTA’07. IEEE, New York, pp 330–336Google Scholar
 Hartigan JA, Wong MA (1979) Algorithm as 136: a kmeans clustering algorithm. Appl Stat 28:100–108View ArticleMATHGoogle Scholar
 Ertoz L, Steinbach M, Kumar V (2002) A new shared nearest neighbor clustering algorithm and its applications. In: Workshop on clustering high dimensional data and its applications at 2nd SIAM international conference on data mining, pp 105–115Google Scholar
 Ester M, Kriegel HP, Sander J, Xu X (1996) A densitybased algorithm for discovering clusters in large spatial databases with noise. Kdd 96:226–231Google Scholar
 Sriwanna K, Boongoen T, IamOn N (2016) In: Lavangnananda K, PhonAmnuaisuk S, Engchuan W, Chan JH (eds) An enhanced univariate discretization based on cluster ensembles. Springer, Cham, pp 85–98Google Scholar
 IamOn N, Boongoen T, Garrett S, Price C (2011) A linkbased approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell 33(12):2396–2409View ArticleGoogle Scholar
 Huang X, Zheng X, Yuan W, Wang F, Zhu S (2011) Enhanced clustering of biomedical documents using ensemble nonnegative matrix factorization. Inf Sci 181(11):2293–2302View ArticleGoogle Scholar
 RamirezGallego S, Garcia S, Benitez JM, Herrera F (2016) Multivariate discretization based on evolutionary cut points selection for classification. IEEE Transactions on Cybernetics 46(3):595–608View ArticleGoogle Scholar
 Parashar A, Gulati Y (2012) Survey of di erent partition clustering algorithms and their comparative studies. International Journal of Advanced Research in Computer Science 3(3):675–680Google Scholar
 Brandes U, Gaertler M, Wagner D (2007) Engineering graph clustering: models and experimental evaluation. ACM J Exp Algorithm 12(1.1):1–26MathSciNetMATHGoogle Scholar
 Van Dongen SM (2001) Graph clustering by ow simulation. PhD thesis, University of UtrechtGoogle Scholar
 Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64View ArticleMATHGoogle Scholar
 Foggia P, Percannella G, Sansone C, Vento M (2009) Benchmarking graphbased clustering algorithms. Image Vis Comput 27(7):979–988View ArticleMATHGoogle Scholar
 Zhou Y, Cheng H, Yu JX (2009) Graph clustering based on structural/attribute similarities. Proc VLDB Endow 2(1):718–729View ArticleGoogle Scholar
 Cheng H, Zhou Y, Yu JX (2011) Clustering large attributed graphs: a balance between structural and attribute similarities. ACM Trans Knowl Discov Data 5(2):12MathSciNetView ArticleGoogle Scholar
 Nascimento MC, De Carvalho AC (2011) Spectral methods for graph clusteringa survey. Eur J Oper Res 211(2):221–231MathSciNetView ArticleMATHGoogle Scholar
 Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905View ArticleGoogle Scholar
 Foggia P, Percannella G, Sansone C, Vento M (2007) In: Escolano F, Vento M (eds) Assessing the performance of a graphbased clustering algorithm. Springer, Berlin, pp 215–227Google Scholar
 Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for largescale detection of protein families. Nucleic Acids Res 30(7):1575–1584View ArticleGoogle Scholar
 Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497–515MathSciNetView ArticleMATHGoogle Scholar
 Brandes U, Gaertler M, Wagner D (2003) Experiments on graph clustering algorithms. Springer, Berlin, pp 568–579Google Scholar
 Kong W, Hu S, Zhang J, Dai G (2013) Robust and smart spectral clustering from normalized cut. Neural Comput Appl 23(5):1503–1512View ArticleGoogle Scholar
 Sen D, Gupta N, Pal SK (2013) Incorporating local image structure in normalized cut based graph partitioning for grouping of pixels. Inf Sci 248:214–238MathSciNetView ArticleGoogle Scholar
 Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1(2):1MathSciNetGoogle Scholar
 Everitt B, Landau S, Leese M (1993) Cluster analysis (Edward Arnold, London). ISBN 0470220430Google Scholar
 Soman KP, Diwakar S, Ajay V (2006) Data mining: theory and practice [with CD]. PHI LearnGoogle Scholar
 Chapanond A (2007) Application aspects of data mining analysis on evolving graphs. PhD thesis, Troy
 Boutin F, Hascoet M (2004) Cluster validity indices for graph partitioning. In: Proceedings of the eighth international conference on information visualisation (IV 2004). IEEE, New York, pp 376–381
 Dua S, Chowriappa P (2012) Data mining for bioinformatics. CRC Press, Boca Raton
 Görke R, Kappes A, Wagner D (2014) Experiments on density-constrained graph clustering. J Exp Algorithmics 19:6
 Leighton T, Rao S (1988) An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In: 29th annual symposium on foundations of computer science. IEEE, New York, pp 422–431
 Ding CH, He X, Zha H, Gu M, Simon HD (2001) A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of the IEEE international conference on data mining (ICDM 2001). IEEE, New York, pp 107–114
 Mohar B, Alavi Y (1991) The Laplacian spectrum of graphs. Graph Theory Comb Appl 2:871–898
 Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
 Lichman M (2013) UCI machine learning repository
 Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Log Soft Comput 17(2–3):255–287
 Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318
 Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco
 Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66
 John GH, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence (UAI'95). Morgan Kaufmann Publishers Inc., San Francisco, pp 338–345
 Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
 Wu X, Kumar V (2009) The top ten algorithms in data mining, 1st edn. Chapman & Hall/CRC, Boca Raton
 Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
 Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14:1137–1145
 Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
 Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310
 Ruan J, Jahid MJ, Gu F, Lei C, Huang YW, Hsu YT, Mutch DG, Chen CL, Kirma NB, Huang THM (2016) A novel algorithm for network-based prediction of cancer recurrence. Genomics. doi:10.1016/j.ygeno.2016.07.005
 Lv J, Peng Q, Chen X, Sun Z (2016) A multi-objective heuristic algorithm for gene expression microarray data classification. Expert Syst Appl 59:13–19
 Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701
 Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
 Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
 García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064
 Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
 Gonzalez-Abril L, Cuberos FJ, Velasco F, Ortega JA (2009) Ameva: an autonomous discretization algorithm. Expert Syst Appl 36(3):5327–5332
 Tsai CJ, Lee CI, Yang WP (2008) A discretization algorithm based on class-attribute contingency coefficient. Inf Sci 178(3):714–731
 Eshelman LJ (1991) The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. Found Genet Algorithms 1:265–283
 Zighed DA, Rabaséda S, Rakotomalala R (1998) FUSINTER: a method for discretization of continuous attributes. Int J Uncertain Fuzziness Knowl Based Syst 6(3):307–326
 Wong AKC, Liu TS (1975) Typicality, diversity, and feature pattern of an ensemble. IEEE Trans Comput 100(2):158–181
 Huang W (1997) Discretization of continuous attributes for inductive machine learning. Department of Computer Science, University of Toledo, Toledo
 Ho KM, Scott PD (1997) Zeta: a global method for discretization of continuous variables. In: Proceedings of the third international conference on knowledge discovery and data mining (KDD-97), pp 191–194
 Healey J (2014) Statistics: a tool for social research. Cengage Learning