Generating descriptive model for student dropout: a review of clustering approach
 Natthakan Iam-On^{1} and
 Tossapon Boongoen^{2}
DOI: 10.1186/s1367301600830
© The Author(s) 2017
Received: 5 September 2016
Accepted: 8 December 2016
Published: 2 January 2017
Abstract
Data mining is widely regarded as a powerful instrument for acquiring new knowledge from historical data that is normally left unstudied. This data-driven methodology has proven effective in improving the quality of decision-making in several domains, such as business, medicine and complex engineering problems. Recently, educational data mining (EDM) has attracted a great deal of attention among educational researchers and computer scientists. In general, publications in the field of EDM focus on understanding student types and targeted marketing, using both descriptive and predictive models to maximize student retention. Inspired by previous attempts, this paper aims to establish the clustering approach as a practical guideline for exploring student categories and characteristics, with a working example on a real dataset to illustrate the analytical procedures and results.
Keywords
Educational data mining, Clustering, Student performance, Retention, Dropout
Background
Given an increasing number of higher education institutes and learning-assisted technologies, many universities have to adapt to changes in the business environment and student expectations. This leads to a critical revision of their strategies and effectiveness [1], since they are held accountable for learning outcomes and stakeholders’ satisfaction. One response to this challenge is the application of decision-support tools, including analytical and data mining (DM) techniques [2]. Such an approach is in line with the need of most universities to handle and make the best use of large repositories of data, which normally cover enrollment and registration, learning materials and resources, and course and student details [3]. According to [4], this data collection can be regarded as a goldmine, from which knowledge about students’ behavior, preferences and performance can be discovered.
With this potential recognized, educational data mining (EDM) has become a fast-growing interdisciplinary research field [5]. It is concerned with developing, researching, and applying computerized methods to detect patterns in large collections of educational data that would otherwise be hard or impossible to analyze [1]. The disclosed knowledge is highly useful for better understanding how students learn and how different settings affect their achievement. This can help to improve educational outcomes and to gain insights into various educational phenomena, with several applications of EDM being put forward in recent years [6]. Examples include the evaluation of student performance, course recommendation and personalized learning plans, and the identification of atypical learning patterns [7].
Specific to the understanding of student performance, several studies have shown that EDM can help to identify at-risk students. This allows universities to become more proactive in identifying and assisting those students. For instance, with the aim of improving student retention, Lin [8] has studied a variety of machine learning algorithms to develop predictive models based on incoming students’ data. These models are able to provide short-term accuracy for predicting which types of students would benefit from student retention programs on campus. To build similar models of performance and dropout, different techniques have been explored. These include Naive Bayes [9], k-means [10], and decision trees [11, 12]. In addition, recent research has also focused on understanding student groups and the corresponding policies [13], for instance predictive models to maximize student retention [14, 15] and an enrollment prediction system.
Student retention has become a common problem encountered by universities around the world, including Mae Fah Luang University (MFU) and others in Thailand. However, it has not drawn much attention among researchers in Thai agencies and neighboring countries, with only a handful of investigations being pursued in the past few years. Examples are the work conducted at King Mongkut University of Technology North Bangkok [16], and another for Prince of Songkla University [17]. Note that such a model is limited to predictive purposes, with the use of conventional classification algorithms [18]. Failing to improve or even sustain the retention rate would negatively impact students, parents, the university and society as a whole. On the other hand, success would bring several benefits, such as better careers for graduates, a higher university ranking, and more funding from both government agencies and the private sector. As suggested by [19], universities with high attrition or dropout rates may face a significant loss of tuition fees and potential alumni contributions. Note that a significant portion of student attrition occurs in the first year of university. According to [20], more than 50% of student attrition can be attributed to freshmen. Therefore, it is essential to identify vulnerable students who are prone to dropping out as early as possible. This allows institutions to progress better and faster towards their retention management goals.
Note that almost all the aforementioned attempts are constrained to the use of analytical algorithms only to generate a predictive model of students’ success and failure. As such, the prediction result is narrowed to the likely achievement of any student under examination, typically as either graduate or dropout. Unfortunately, a predictive method often fails to provide insights to understand factors and characteristics of those two student categories. In response, this review paper aims to boost the quality of analytical results by exploring the development of a descriptive model that can largely complement the predictive side, or even provide a unique and useful viewpoint hardly obtained before. One of the major approaches to deliver a desired descriptive model is data clustering, which is an unsupervised learning process for the exploration of data structural setting and properties. It is capable of revealing natural groups of objects of interest, especially for a new domain with minimal prior knowledge. As a result, clustering has been coupled with many real problems, including bioinformatics [21], medical and health informatics [22], psychological study [23], marketing research [24], customer relationship [25], and recommender systems [26]. Furthermore, the development of clustering for microarray gene expression data motivates a large number of contributions regarding both theoretical advancement and applications [27–29].
The rest of this paper is organized as follows. For the development of a descriptive model, a recent subspace clustering model is employed. Its basic assumptions and process are presented in the second section, together with its baseline technique. That section also provides details of the model evaluation, in which different quality metrics are made available to ensure the reliability of the clustering result. Following that, the third section illustrates a working example of descriptive model generation and interpretation, based on the case study of MFU. This discovery allows a more in-depth analysis, in which the factors significant to a particular group-wise character can be revealed. The review is concluded in the fourth section, with a discussion of future research directions.
Basis of cluster analysis
Principally, the core of cluster analysis is the clustering process, which divides data objects into groups or clusters such that objects in the same cluster are more similar to each other than to those belonging to different clusters [30]. Objects under examination are normally described in terms of object-specific measurements (e.g., attribute/feature values) or relative measurements (e.g., pairwise dissimilarity). Unlike supervised learning, to which classification belongs, clustering is ‘unsupervised’ and does not require class information. In the supervised setting, such information is typically obtained through manual tagging of category labels on data objects by a domain expert (or through the consensus of multiple experts). Given its potential, a large number of research studies focus on several aspects of cluster analysis; for instance, clustering algorithms and extensions for particular data types [31], dissimilarity (or distance) metrics [32], the optimal number of clusters [33], the relevance of data attributes per cluster [34], the evaluation of clustering results [35], and cluster ensembles [36]. This section sets the scene for the following section by emphasizing the clustering technique used for generating a descriptive model of student performance. In addition, a section on model evaluation is included to shed light on measuring the goodness of the obtained model.
Model generation
Clustering is regarded as an unsupervised learning approach, as the measurement of similarity is conducted without knowledge of class assignment. This knowledge-free scenario brings about a series of difficult decisions with respect to selecting an appropriate algorithm, similarity measure, criterion function, and initial parameter condition [37, 38]. For a given dataset \(X \in {R}^{n \times d}\), each \(x_i, i = 1 \ldots n\), corresponds to a sample or data point, which can be represented by a profile of d features, i.e., \(x_i = (x_{i1}, \ldots , x_{id})\). A clustering algorithm searches for the partition \(\pi = \{C_1, \ldots , C_{k}\}\) of samples \((x_1, \ldots , x_n)\) into k clusters, such that samples in the same cluster are more similar to each other than to those in the other clusters.
 1. k data points are first randomly selected as initial cluster centers.
 2. Repeat:
    a. Assign each data point to its closest cluster center. The Euclidean metric is commonly used to compute the distance between data points and centroids.
    b. The centroid of each cluster is updated as the mean of all current data points in that cluster.
 3. Until the termination criteria are met.
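The steps listed above match the standard k-means iteration, and can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `kmeans`, the `seed` and `max_iter` parameters, the stopping rule (no change in assignments) and the toy data are all assumptions made here for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial cluster centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each point to its closest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Step 3 (assumed criterion): stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Toy school-grade profiles (hypothetical values, not taken from the MFU data).
data = [(3.8, 3.9), (3.6, 3.7), (3.9, 3.5), (1.9, 2.1), (2.2, 1.8), (2.0, 2.0)]
labels, centers = kmeans(data, k=2)
```

Because the initial centers are drawn at random, different seeds may lead to different partitions, which is one reason why clustering results are usually summarized over several trials.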
Model evaluation

Davies-Bouldin (DB) makes use of a similarity measure \(R_{ij}\) between the clusters \(C_i\) and \(C_j\), which is defined upon a measure of dispersion (\(s_i\)) of a cluster \(C_i\) and a dissimilarity measure between two clusters (\(d_{ij}\)). According to [41], \(R_{ij}\) is formulated as
$$\begin{aligned} R_{ij} = \frac{s_i + s_j}{d_{ij}} , \end{aligned}$$ (11)
where \(d_{ij}\) and \(s_i\) can be estimated by the following equations. Note that \(v_x\) denotes the center of cluster \(C_x\) and \(|C_x|\) is the number of data points in cluster \(C_x\).
$$\begin{aligned} d_{ij} = d(v_i, v_j) ,\end{aligned}$$ (12)
$$\begin{aligned} s_i = \frac{1}{|C_i|} \> \sum \limits _{\forall x \in C_i} d(x, v_i) \end{aligned}$$ (13)
Following that, the DB index of a clustering \(\pi \) with k clusters is defined as
$$\begin{aligned} DB(\pi ) = \frac{1}{k} \> \sum \limits _{i = 1}^k R_i , \end{aligned}$$ (14)
where \(R_i = \max \nolimits _{j = 1 \ldots k, i \ne j} \> R_{ij}\). The DB index measures the average similarity between each cluster and its most similar one. As clusters should be compact and well separated, a lower DB index indicates a better data partition.
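As an illustration of Eqs. (11) to (14), the following sketch computes the DB index for a partition given as a list of clusters. It assumes that \(d(\cdot ,\cdot )\) is the Euclidean distance and that every cluster is non-empty with distinct centers; the helper names `centroid` and `davies_bouldin` are chosen here for the example.

```python
import math


def centroid(cluster):
    """Mean vector v_i of a non-empty list of points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def davies_bouldin(clusters):
    """DB index (Eqs. 11-14) for a list of clusters, each a list of points."""
    centers = [centroid(c) for c in clusters]
    # s_i: mean distance of the members of C_i to its center v_i (Eq. 13).
    s = [
        sum(math.dist(x, v) for x in c) / len(c)
        for c, v in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: largest R_ij = (s_i + s_j) / d(v_i, v_j) over j != i (Eqs. 11-12).
        total += max(
            (s[i] + s[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k  # Eq. 14
```

For two clusters {(0, 0), (0, 2)} and {(4, 0), (4, 2)}, each dispersion is 1 and the centers are 4 apart, so the index is (1 + 1) / 4 = 0.5; shrinking the clusters while keeping the centers fixed lowers the value.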

The Dunn index was introduced by [42]. Its purpose is to identify compact and well-separated clusters. For a given number of clusters k, the definition of the Dunn index is given by the following equation.
$$\begin{aligned} Dunn(\pi ) = \min _{i=1 \ldots k} \left( \min _{j=i+1 \ldots k} \left( \frac{d(C_i,C_j)}{\max _{z=1 \ldots k}(diam(C_z))} \right) \right) , \end{aligned}$$ (15)
where \(d(C_i, C_j)\) is the distance between two clusters \(C_i\) and \(C_j\), which can be defined as
$$\begin{aligned} d(C_i, C_j) = \min _{x \in C_i, y \in C_j} d(x,y) \end{aligned}$$ (16)
In addition, \(diam(C_i)\) is the diameter of a cluster \(C_i\), which is defined as follows:
$$\begin{aligned} diam(C_i) = \max _{x,y \in C_i} d(x,y) \end{aligned}$$ (17)
In a dataset containing compact and well-separated clusters, the distances between the clusters are expected to be large and the diameters of the clusters are expected to be small. Therefore, a large value of the Dunn index signifies compact and well-separated clusters.
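A direct reading of Eqs. (15) to (17) can be sketched as follows. Since the largest diameter is the same for every pair of clusters, the double minimum reduces to the smallest between-cluster distance divided by that diameter. The Euclidean distance and the function name `dunn` are assumptions of this sketch.

```python
import math
from itertools import combinations


def dunn(clusters):
    """Dunn index (Eqs. 15-17) for a list of clusters, each a list of points."""
    # Largest diameter: maximum pairwise distance inside any cluster (Eq. 17).
    max_diam = max(
        max((math.dist(x, y) for x, y in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    # Smallest distance between points of two different clusters (Eq. 16).
    min_sep = min(
        math.dist(x, y)
        for a, b in combinations(clusters, 2)
        for x in a
        for y in b
    )
    # Eq. 15; undefined (division by zero) if every cluster is a single point.
    return min_sep / max_diam
```

For the clusters {(0, 0), (0, 2)} and {(4, 0), (4, 2)}, the largest diameter is 2 and the smallest between-cluster distance is 4, giving a Dunn index of 2.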
Clustering approach to generate a descriptive model of student dropout
Problem definition

Problem Context1: this focuses on student cases before the start of first-year study at MFU. The data under examination covers pre-university academic capability, demographic and degree enrollment details. The outcome will be beneficial for university admission, with students being helped to choose an appropriate degree. This is based on the degree outcome suggested by the data mining model and the student’s preference. Of course, the level of student attrition might be reduced given this guidance. Furthermore, factors related to both desired and undesired performance can be exploited for an effective enrollment strategy.

Problem Context2: the focus shifts to the scenario observed after the first-year study. In addition to the aforementioned collection of data, first-year academic performance is also covered. The expected results reflect student stereotypes as they take on university courses. They can be used to identify at-risk freshmen and possible early assistive measures. Also, performance-associated factors can lead to effective degree/course planning.
Data acquisition and preparation

View1 (VW_STUDENT), students’ general data, for instance:
  STUDENTID (student’s identification number)
  LEVELID (student’s degree level, with ‘3’ specifying Bachelor degree)
  DEPARTMENTID (identification number of academic department)
  SEX (student’s gender)
  HOMEPROVINCEID (identification number of student’s home province)
  ADMITACADYEAR (year of student’s admission)
  ENTRYTYPE (code denoting type of admission)
    RQ (Regional Quota),
    DA (Direct Admission),
    ADA (Additional Direct Admission),
    CA (Conditional Admission, with school GPAX above 2.0)
  STUDENTSTATUS (code representing student’s status)
    status ‘40’ means student graduated,
    status ‘50’ means student resigned,
    status ‘61’ means student dropped out with GPAX less than 1.5,
    status ‘62’ means student dropped out with GPAX less than 1.8,
    status ‘63’ means student dropped out with GPAX less than 2.0
  FINISHDATE (date of termination: graduation or dropout)

View2 (VW_APPLICANT), students’ personal information and pre-university grading data:
  APPLICANTID (identification number of a university-entry applicant)
  GPAX (student’s school overall grade)
  GPA1 (student’s school grade, with respect to English subjects)
  GPA2 (student’s school grade, with respect to Mathematical subjects)
  GPA3 (student’s school grade, with respect to Science subjects)
  GPA4 (student’s school grade, with respect to General subjects)

View3 (VW_TRANSCRIPT), students’ academic performance:
  ACADYEAR (academic year in which student takes a specific course)
  SEMESTER (semester in which student takes a specific course)
  COURSECODE (code denoting academic course)
  GRADE (course grade, i.e., A, B+, B, C+, C, D+, D, F, S, U, or W)

View4 (VW_ENTRYTYPE), description of entry type:
  ENTRYTYPE (code denoting type of admission)
  ENTRYTYPEDES (description of entry type)

Description of investigated dataset, with Context1 and Context2 denoting the two problem contexts of before- and after-first-year prediction

| Feature | Data type | Context1 | Context2 | Description |
|---|---|---|---|---|
| Sex | Nominal | Applicable | Applicable | Student’s sex |
| Province | Nominal | Applicable | Applicable | Student’s home province |
| Type | Nominal | Applicable | Applicable | Type of university entry |
| Department | Nominal | Applicable | Applicable | Academic department |
| SGPAX | Numerical | Applicable | Applicable | School grade (Overall) |
| SGPA1 | Numerical | Applicable | Applicable | School grade (English) |
| SGPA2 | Numerical | Applicable | Applicable | School grade (Mathematics) |
| SGPA3 | Numerical | Applicable | Applicable | School grade (Science) |
| SGPA4 | Numerical | Applicable | Applicable | School grade (General) |
| GPAX | Numerical | n/a | Applicable | Student’s university grade |
| A ratio | Numerical | n/a | Applicable | Ratio of subjects with grade A |
| B+ ratio | Numerical | n/a | Applicable | Ratio of subjects with grade B+ |
| B ratio | Numerical | n/a | Applicable | Ratio of subjects with grade B |
| C+ ratio | Numerical | n/a | Applicable | Ratio of subjects with grade C+ |
| C ratio | Numerical | n/a | Applicable | Ratio of subjects with grade C |
| D+ ratio | Numerical | n/a | Applicable | Ratio of subjects with grade D+ |
| D ratio | Numerical | n/a | Applicable | Ratio of subjects with grade D |
| F ratio | Numerical | n/a | Applicable | Ratio of subjects with grade F |
| S ratio | Numerical | n/a | Applicable | Ratio of subjects with grade S |
| U ratio | Numerical | n/a | Applicable | Ratio of subjects with grade U |
| W ratio | Numerical | n/a | Applicable | Ratio of withdrawn subjects |
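The grade-ratio features in the table above can be derived from the GRADE values of a student’s transcript records. The sketch below shows one plausible way to do so; the function name `grade_ratios`, the choice of denominator (all course records of the student) and the sample transcript are assumptions made for illustration, since the exact computation is not stated here.

```python
from collections import Counter

# Grade symbols as listed for the GRADE field of VW_TRANSCRIPT.
GRADES = ["A", "B+", "B", "C+", "C", "D+", "D", "F", "S", "U", "W"]


def grade_ratios(grades):
    """Share of each grade symbol among one student's course records."""
    counts = Counter(grades)
    total = len(grades)
    # Assumption: the denominator is the number of all course records.
    return {f"{g} ratio": (counts[g] / total if total else 0.0) for g in GRADES}


# Hypothetical first-year transcript of a single student.
ratios = grade_ratios(["A", "B+", "B+", "C", "W"])
```

With this denominator, the eleven ratios of a student with at least one record sum to one, so each profile describes how the student’s courses are distributed over the grade symbols.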
Statistical details of numerical features

| Feature | Range | Max | Min | Mean |
|---|---|---|---|---|
| SGPAX | [0.00–4.00] | 3.96 | 1.98 | 3.06 |
| SGPA1 | [0.00–4.00] | 4.00 | 1.60 | 3.09 |
| SGPA2 | [0.00–4.00] | 4.00 | 0.75 | 2.62 |
| SGPA3 | [0.00–4.00] | 4.00 | 0.00 | 2.78 |
| SGPA4 | [0.00–4.00] | 4.00 | 0.00 | 3.17 |
| GPAX | [0.00–4.00] | 4.00 | 0.00 | 2.36 |
| A ratio | [0.00–1.00] | 1.00 | 0.00 | 0.12 |
| B+ ratio | [0.00–1.00] | 0.64 | 0.00 | 0.15 |
| B ratio | [0.00–1.00] | 0.67 | 0.00 | 0.17 |
| C+ ratio | [0.00–1.00] | 0.67 | 0.00 | 0.16 |
| C ratio | [0.00–1.00] | 1.00 | 0.00 | 0.14 |
| D+ ratio | [0.00–1.00] | 1.00 | 0.00 | 0.09 |
| D ratio | [0.00–1.00] | 1.00 | 0.00 | 0.08 |
| F ratio | [0.00–1.00] | 1.00 | 0.00 | 0.08 |
| S ratio | [0.00–1.00] | 1.00 | 0.00 | 0.63 |
| U ratio | [0.00–1.00] | 1.00 | 0.00 | 0.19 |
| W ratio | [0.00–1.00] | 0.67 | 0.00 | 0.02 |
Model generation and evaluation

The set of data used in this clustering phase includes only academic-performance variables:

Problem Context1: SGPAX, SGPA1, SGPA2, SGPA3 and SGPA4

Problem Context2: SGPAX, SGPA1, SGPA2, SGPA3, SGPA4, GPAX, A Ratio, B+ Ratio, B Ratio, C+ Ratio, C Ratio, D+ Ratio, D Ratio, F Ratio, S Ratio, U Ratio and W Ratio


The number of clusters (k) is determined by the consensus of quality indices, such as Davies-Bouldin (DB) and Dunn. First, clusterings of the investigated dataset are created using different values of \(k \in \{2, 3, \ldots , 10\}\). Then, the optimal k is taken as the one whose clustering has the best quality measures, which are summarized from 20 trials of each k-specific study.

Once the student clusters have been obtained, their stereotypes (in terms of academic performance profiles) can be derived for future reference. In addition, the value distribution of other features such as ‘Entry Type’ can be examined. This can reveal relations and trends specific to each of the disclosed student clusters, and hence inform strategies to tackle dropout or underachievement.
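The selection of k described above can be sketched as follows, given the DB and Dunn values recorded over repeated trials for each candidate k. How the consensus between the two indices is formed is not specified here, so the rule used in this sketch (average each index over the trials, rank the candidates per index, and pick the smallest rank sum) is only an assumption, as are the function name `select_k` and the sample values.

```python
from statistics import mean


def select_k(trials):
    """Pick k from {k: [(db, dunn), ...]} by summed rank over both indices."""
    ks = sorted(trials)
    avg_db = {k: mean(db for db, _ in trials[k]) for k in ks}
    avg_dunn = {k: mean(dn for _, dn in trials[k]) for k in ks}
    # Rank 0 is best: lowest mean DB, highest mean Dunn.
    db_rank = {k: r for r, k in enumerate(sorted(ks, key=lambda k: avg_db[k]))}
    dunn_rank = {k: r for r, k in enumerate(sorted(ks, key=lambda k: -avg_dunn[k]))}
    # Assumed consensus rule: smallest rank sum; ties go to the smaller k.
    return min(ks, key=lambda k: (db_rank[k] + dunn_rank[k], k))


# Hypothetical index values for three candidate k and two trials each.
scores = {
    2: [(0.5, 2.0), (0.6, 1.8)],
    3: [(0.9, 0.7), (1.0, 0.9)],
    4: [(1.2, 0.5), (1.1, 0.6)],
}
best_k = select_k(scores)
```

In this toy input, k = 2 has both the lowest mean DB and the highest mean Dunn, so it is selected. In practice, each list would hold the values from 20 trials and the keys would range over 2 to 10.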
Model interpretation
Conclusion
This review has presented the clustering approach to generating a descriptive model from educational data. The underlying methodology aims to discover knowledge, interesting patterns and relations that contribute to student dropout. This so-called retention problem has been recognized as one of the major difficulties commonly encountered by universities. Leaving it unsolved may negatively affect several parties, such as students, parents and the university, in terms of financial support and reputation. The paper opens with a brief review of the clustering technique that is used to analyze the desired data collection, and examples of quality measures for the assessment of analytical results. These form a collection of technical tools for those who are interested in EDM and data clustering in general. To consolidate the process of creating and interpreting a descriptive model using the aforementioned methods, an illustrative case study of MFU has been explored and discussed.
The working example demonstrates a sequence of processes matching those of the conventional data mining framework. This begins with problem definition and data acquisition, where the former is conducted in conjunction with staff from the Admission and Registrar divisions, while the latter is achieved by extracting relevant data from MFU MIS. The initial collection of student data is then preprocessed such that errors and missing values are resolved. In particular, it is arranged for the subsequent investigation using a descriptive type of data mining model. This is studied with respect to two problem contexts: (i) before starting the first year, where only demographic and school-academic details are available, and (ii) after the first year, where initial university performance is known in addition to those in (i).
Context-specific data is analyzed using a clustering procedure to reveal natural student groups. A filter approach to soft subspace clustering, namely RKM, has been exploited for this task. It reveals two clusters at the time of university admission; one corresponds to applicants with good school grades, while the other represents those with moderate to poor grading profiles. Likewise, two student clusters have been disclosed when applying the aforementioned technique to the set of data prepared for the second problem context. In a nutshell, students with a good school-academic background continue to do well in the first year. The majority of applicants with low school performance may face dropout after the first year, with some being able to adapt to the university academic system and survive. Also, the entry types implemented by MFU are not equally effective: RQ appears to be the most successful, while the others should be used with constraints. The developed models and research findings may be highly useful as a working guideline to formulate effective admission and consultation strategies. As a result, this can raise the level of student retention, hence optimizing tuition fees and government funding, student achievement, university reputation, and the satisfaction of all the parties involved.
To strengthen this line of research, a number of important directions for future work can be highlighted. As suggested by many research works on the subject of student dropout [1], family background, financial support and university-event participation may provide a complementary interpretation of student achievement. To some students, academic capability is a major barrier to success, while social and financial factors can be crucial to others. Unfortunately, these attributes may not have been properly recorded, thus prohibiting the corresponding investigation. However, through cooperation with the responsible divisions, a better understanding of non-academic motives towards student performance can be acquired once the aforementioned variables are included.
Declarations
Authors' contributions
NIO collected and preprocessed the data. NIO and TB designed the research as well as analyzed the result. NIO was the lead writer of the paper. Both authors read and approved the final manuscript.
Acknowledgements
Natthakan IamOn—main contributor.
Competing interests
The authors declare that they have no competing interests.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 1. Romero C, Ventura S (2010) Educational data mining: a review of the state-of-the-art. IEEE Trans Syst Man Cybern Part C 40:601–618
 2. Bala M, Ojha DB (2012) Study of applications of data mining techniques in education. Int J Res Sci Technol 1:1–10
 3. Koedinger K, Cunningham K, Skogsholm A, Leber B (2008) An open repository and analysis tools for fine-grained, longitudinal learner data. In: Proceedings of first international conference on educational data mining, pp 157–166
 4. Mostow J, Beck J (2006) Some useful tactics to modify, map and mine data from intelligent tutors. Nat Lang Eng 12:195–208
 5. Baepler P, Murdoch CJ (2010) Academic analytics and data mining in higher education. Int J Schol Teach Learn 4(2):1–9
 6. Romero C, Ventura S (2013) Data mining in education. Wiley Interdiscip Rev Data Min Knowl Discov 3(1):12–27
 7. Baker R, Yacef K (2009) The state of educational data mining in 2009: a review and future visions. J Educ Data Min 1(1):3–17
 8. Lin SH (2012) Data mining for student retention management. J Comput Sci Coll 27(4):92–99
 9. Kotsiantis S, Pierrakeas C, Pintelas P (2004) Prediction of student’s performance in distance learning using machine learning techniques. Appl Artif Intell 18(5):411–426
 10. Erdogan SZ, Timor M (2005) A data mining application in a student database. J Aeronaut Space Technol 2(2):53–57
 11. Sung-Hyuk C, Tappert C (2009) Constructing binary decision trees using genetic algorithms. J Pattern Recognition Res 1:1–13
 12. Kabra RR, Bichkar RS (2011) Performance prediction of engineering students using decision trees. Int J Comput Appl 36(11):8–12
 13. Antons C, Maltz E (2006) Expanding the role of institutional research at small private universities: a case study in enrollment management using data mining. New Dir Inst Res 131:69–81
 14. Ramaswami M, Bhaskaran R (2010) A CHAID based performance prediction model in educational data mining. Int J Comput Sci 7(1):10–18
 15. Yu C, Gangi SD, Jannasch-Pennell A, Kaprolet C (2010) A data mining approach for identifying predictors of student retention from sophomore to junior year. J Data Sci 8:307–325
 16. Subyam S (2009) Causes of dropout and program incompletion among undergraduate students from the Faculty of Engineering, King Mongkut University of Technology North Bangkok. In: Proceedings of 8th National Conference on Engineering Education
 17. Sittichai R (2012) Why are there dropouts among university students? Experiences in a Thai university. Int J Educ Dev 32:283–289
 18. Kongsakun K, Fung CC (2012) Neural network modeling for an intelligent recommendation system supporting SRM for Universities in Thailand. WSEAS Trans Comput 11(2):34–44
 19. Scott DM, Spielmans GI, Julka DC (2004) Predictors of academic achievement and retention among college freshmen: a longitudinal study. Coll Stud J 38(1):66–80
 20. Delen D (2011) Predicting student attrition with data mining methods. J Coll Stud Retent 13(1):17–35
 21. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 16(11):1370–1386
 22. He Q, Wang J, Zhang Y, Tang Y, Zhang Y (2009) Cluster analysis on symptoms and signs of traditional Chinese medicine in 815 patients with unstable angina. In: Proceedings of international conference on fuzzy systems and knowledge discovery, pp 435–439
 23. Henry DB, Tolan PH, Gorman-Smith D (2005) Cluster analysis in family psychology research. J Fam Psychol 19(1):121–132
 24. Sheppard AG (1996) The sequence of factor analysis and cluster analysis: differences in segmentation and dimensionality through the use of raw and factor scores. Tour Anal 1:49–57
 25. Wu RC, Chen RS, Chang CC, Chen JY (2005) Data mining application in customer relationship management of credit card business. In: Proceedings of international conference on computer software and applications, pp 39–40
 26. Kim K, Ahn H (2008) A recommender system using GA K-means clustering in an online shopping market. Expert Syst Appl 34:1200–1209
 27. Bredel M, Bredel C, Juric D, Harsh G, Vogel H, Recht L, Sikic B (2005) Functional network analysis reveals extended gliomagenesis pathway maps and three novel MYC-interacting genes in human gliomas. Cancer Res 65(19):8679–8689
 28. Kim E, Kim S, Ashlock D, Nam D (2009) MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinform 10:260
 29. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100(14):8418–8423
 30. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
 31. Ahmad A, Dey L (2007) A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl Eng 63(2):503–527
 32. Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the first Pacific Asia knowledge discovery and data mining conference, pp 21–34
 33. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7):0036
 34. Boongoen T, Shen Q (2010) Nearest-neighbour guided evaluation of data reliability and its applications. IEEE Trans Syst Man Cybern Part B 40(6):1622–1633
 35. Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850
 36. Iam-On N, Boongoen T, Garrett S (2010) LCE: a link-based cluster ensemble method for improved gene expression data analysis. Bioinformatics 26(12):1513–1519
 37. Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley-Interscience, New York, p 153
 38. Xue H, Chen S, Yang Q (2009) Discriminatively regularized least-squares classification. Pattern Recognit 42(1):93–104
 39. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
 40. Boongoen T, Shang C, Iam-On N, Shen Q (2011) Extending data reliability measure to a filter approach for soft subspace clustering. IEEE Trans Syst Man Cybern Part B 41(6):1705–1714
 41. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227
 42. Dunn JC (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104