## Combining multiple clusterings using evidence accumulation
##### Speaker: 梁廣廷
---
## Outline
* Abstract
* Introduction
* Problem Formulation
* Evidence Accumulation Clustering
* Figures of Merit for the Evaluation of Clustering Combination Results
* Combining Data Partitions Based on the K-Means Algorithm
* Experimental Results
* Conclusions
---
## Abstract
* clustering ensemble
* evidence accumulation (EAC)
* evaluation using mutual information
---
## Introduction - 1
* data clustering is an important but extremely difficult problem, with applications in:
* data mining
* information retrieval
* image segmentation
* machine learning
---
## Introduction - 2
* A large number of clustering algorithms
* single algorithm
* parametric density approaches
* prototype-based methods
* square-error clustering
* K-means
* clustering combination
---
## Introduction - 3
* evidence accumulation clustering
* combining data partitions
* clustering combination performance, evaluated with:
* single-link (SL)
* average-link (AL)
---
## Problem Formulation - 1

---
## Problem Formulation - 2

---
## Problem Formulation - 3
$$ P^{i} = \{C^{i}_{1}, C^{i}_{2}, \ldots, C^{i}_{k_{i}}\}, \qquad \sum^{k_{i}}_{j=1} n^{i}_{j} = n $$
* $C^{i}_{j}$: the $j$-th cluster of data partition $P^{i}$ (which has $k_{i}$ clusters); $n^{i}_{j}$: the number of patterns in $C^{i}_{j}$
---
## Problem Formulation - 4
$$ \mathbb{P} = \{P^{1}, P^{2}, \ldots, P^{N}\} $$
* $P^{*}$: the combined data partition, obtained from the clustering ensemble $\mathbb{P}$
---
## Problem Formulation - 5
* Consistency with the clustering ensemble $\mathbb{P}$
* Robustness to small variations in $\mathbb{P}$
* Goodness of fit with ground truth information (true cluster labels of patterns), if available.
---
## Evidence Accumulation Clustering - 1
> The idea of evidence accumulation clustering is to combine the results of multiple clusterings into a single data partition, by viewing each clustering result as an independent evidence of data organization.
---
## Evidence Accumulation Clustering - 2
1. How to collect evidence (i.e., generate the clustering ensemble)?
2. How to combine the evidence?
3. How to extract a consistent data partition from the combined evidence?
---
### Clustering ensembles
1. choice of data representation
2. choice of clustering algorithms or algorithmic parameters
---
#### Choice of data representation
1. employing different pre-processing and/or feature extraction mechanisms, which ultimately lead to different pattern representations (vectors, strings, graphs, etc.) or different feature spaces
2. exploring sub-spaces of the same data representation, such as using sub-sets of features
3. perturbing the data, such as in bootstrapping techniques (like bagging), or sampling approaches
---
#### Choice of clustering algorithms or algorithmic parameters
1. applying different clustering algorithms
2. using the same clustering algorithm with different parameters or initializations
3. exploring different dissimilarity measures for evaluating inter-pattern relationships, within a given clustering algorithm
---
### Combining Evidence: The Co-Association Matrix
* $N$ data partitions
* $n$ patterns
---
### The $n \times n$ co-association matrix
$$ C(i, j) = \frac{n_{ij}}{N} $$
* $n_{ij}$: the number of times the pattern pair $(i, j)$ is assigned to the same cluster among the $N$ partitions
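As a concrete illustration, a minimal NumPy sketch of this accumulation step (the function name `co_association` and the list-of-label-arrays input are assumptions, not the paper's code):

```python
import numpy as np

def co_association(partitions):
    """Build the n x n co-association matrix C(i, j) = n_ij / N,
    where n_ij counts how often patterns i and j fall in the same
    cluster across the N partitions of the ensemble."""
    N = len(partitions)
    n = len(partitions[0])
    C = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        # patterns i and j co-occur when their cluster labels match
        C += (labels[:, None] == labels[None, :])
    return C / N
```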
---

The 400 patterns in figure 2(a):
* outer ring (200 patterns)
* rectangular-shaped cluster (50 patterns)
* 2-D Gaussian cluster (150 patterns)
---
## Recovering Natural Clusters

* distances: $d(i, j) = 1 - C(i, j)$ (the co-association similarity converted into a dissimilarity)
* threshold: cut the SL/AL dendrogram built over these distances at a given level
* lifetimes: select the partition with the highest lifetime, i.e., the largest range of threshold values yielding the same partition (see the sketch below)
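A sketch of this recovery step using SciPy's hierarchical clustering; this is a simplified reading of the lifetime criterion, and `recover_partition` is a hypothetical name:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def recover_partition(C):
    """Single-link clustering on distances 1 - C; cut the dendrogram
    inside the largest gap between merge levels (highest lifetime)."""
    D = 1.0 - C
    np.fill_diagonal(D, 0.0)                 # zero self-distance
    Z = linkage(squareform(D, checks=False), method="single")
    heights = Z[:, 2]                        # merge distances, ascending
    gaps = np.diff(heights)                  # lifetime of each partition
    i = int(np.argmax(gaps))                 # most stable partition
    threshold = (heights[i] + heights[i + 1]) / 2.0
    return fcluster(Z, t=threshold, criterion="distance")
```

Passing `method="average"` instead gives the AL variant.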
---

* $P$-nearest-neighbor variant: accumulate co-associations only between each pattern and its $P$ nearest neighbors, reducing storage from $n \times n$ to $n \times P$ (a sketch follows)
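One possible reading of this variant, sketched under the assumption that `neighbor_idx` is a hypothetical precomputed $n \times P$ array holding each pattern's nearest-neighbor indices:

```python
import numpy as np

def co_association_knn(partitions, neighbor_idx):
    """Accumulate co-associations only between each pattern and its
    P nearest neighbors: an n x P matrix instead of n x n."""
    N = len(partitions)
    C = np.zeros(neighbor_idx.shape)
    for labels in partitions:
        labels = np.asarray(labels)
        # compare each pattern's label with the labels of its neighbors
        C += (labels[:, None] == labels[neighbor_idx])
    return C / N
```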
---
## Figures of Merit for the Evaluation of Clustering Combination Results
---
### mutual information
Mutual information measures the information shared between two partitions $P^a$ and $P^b$:
$$ I(P^a, P^b) = \sum^{k_a}_{i=1} \sum^{k_b}_{j=1} \frac{n_{ij}}{n} \log \frac{n \, n_{ij}}{n^{a}_{i} n^{b}_{j}} $$
* $n_{ij} = |C^{a}_{i} \cap C^{b}_{j}|$: the number of patterns shared by clusters $C^{a}_{i}$ and $C^{b}_{j}$
---
### normalized mutual information (NMI)
$$ NMI(P^a, P^b) = \frac{I(P^a, P^b)}{\sqrt{H(P^a) \, H(P^b)}}, \qquad H(P^a) = -\sum^{k_a}_{i=1} \frac{n^{a}_{i}}{n} \log \frac{n^{a}_{i}}{n} $$
* $NMI = 0$ for independent partitions; $NMI = 1$ for identical partitions
---
### average normalized mutual information
$$ ANMI(\mathbb{P}, \bar{P}) = \frac{1}{N} \sum^{N}_{i=1} NMI(\bar{P}, P^{i}) $$
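In practice, the NMI between two label vectors can be computed with scikit-learn; note that sklearn's default normalization (arithmetic mean of entropies) may differ from the paper's, so this is an approximate stand-in:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(ensemble, combined):
    """Average NMI between a combined partition and each ensemble member."""
    return float(np.mean([normalized_mutual_info_score(combined, p)
                          for p in ensemble]))
```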
---
### average agreement between partitions
* the ANMI measures the average agreement between a candidate combined partition $\bar{P}$ and the $N$ partitions in the ensemble $\mathbb{P}$
---
### Consistency with the clustering ensemble $\mathbb{P}$, in the sense of maximizing the objective function
$$ P^{*} = \arg\max_{\bar{P}} \, \sum^{N}_{i=1} NMI(\bar{P}, P^{i}) $$
---

* $N_c$: the number of clusters in $P^{i}$
---
## Combining Data Partitions Based on the K-Means Algorithm
* **Algorithm I (fixed k):** $N$ data partitions are produced by random initializations of the $k$ cluster centers
* **Algorithm II (random k):** the value of $k$ for each run of the K-means is chosen randomly within an interval $[k_{min}, k_{max}]$; $N$ data partitions are produced by random initializations of the cluster centers (both schemes are sketched below)
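A sketch of both generation schemes with scikit-learn's `KMeans`; the function name and defaults are assumptions, and `n_init=1` with `init="random"` mimics a single random initialization per run:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_ensemble(X, N=50, k=None, k_range=(10, 30), seed=0):
    """Algorithm I (fixed k): pass k explicitly.
    Algorithm II (random k): leave k=None and draw k from k_range."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(N):
        k_i = k if k is not None else int(rng.integers(k_range[0], k_range[1] + 1))
        km = KMeans(n_clusters=k_i, init="random", n_init=1,
                    random_state=int(rng.integers(2**31 - 1)))
        partitions.append(km.fit_predict(X))
    return partitions
```

Feeding these partitions to the `co_association` and `recover_partition` sketches above reproduces the full EAC pipeline.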
---
### The algorithms are evaluated using the following data sets
* random data in a 5-dimensional hyper-cube
* half-rings data set
* three concentric rings
* cigar data set
* Iris data set
---
### A. Evidence Accumulation Combination Results
---
### B. Optimality Criteria and Adequacy of the Highest Lifetime Partition Criterion

---
### C. On the Selection of Design Parameters

---
## Conclusions on Algorithms I and II
* Algorithm I (fixed $k$) should use a $k$ value above the minimum needed to split the natural clusters
* Algorithm II (random $k$) is more robust than Algorithm I; it is important to ensure that the range $[k_{min}, k_{max}]$ is not completely below that minimum $k$ value
---
## Experimental Results
---
### The following data sets
1. Complex image - the data set of figure 1; 739 patterns distributed in 8 clusters (the 10 points of the outer circle are considered one cluster)
2. Cigar data - 4 clusters (see figure 4)
3. Half-rings - 2 clusters, figure 6(b)
4. 3-rings - figure 6(c), three clusters
5. Iris data - three clusters
6. Wisconsin Breast-cancer - 683 patterns represented by 9 integer-valued attributes, with two class labels (benign and malignant); available at the UCI Machine Learning Repository
7. Optical digits - from a total of 3823 samples (each with 64 features) available at the UCI repository (we used a subset composed of the first 100 samples of all the digits)
8. Log Yeast - the logarithm of a yeast gene-expression data set
9. Std Yeast - the standardized version of the same yeast gene-expression data
10. Textures - 4,000 patterns in a 19-dimensional feature space, representing an image with 4 distinct textures distributed in a 2 × 2 grid
---
### Computational complexity
(1) Building the clustering ensemble using the K-means algorithm takes $O(nkN)$ time, with $n$ the number of samples, $k$ the number of clusters in each data partition, and $N$ the number of partitions in the clustering ensemble.
---
(2) Combining the data partitions: the proposed EAC combination method, associated with the SL or the AL algorithms, is $O(n^2 N)$.
Compared with the CSPA ($O(n^2 N k)$), HGPA ($O(n N k)$), and MCLA ($O(n k^2 N^2)$) combination methods, HGPA is the fastest. Except for the spectral clustering method ($O(n^3)$), the ensemble methods are computationally more expensive than the single-run clustering algorithms analyzed (K-means, SL, and AL).
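To make the asymptotic comparison concrete, a quick back-of-the-envelope evaluation; the problem sizes are illustrative, not taken from the paper:

```python
# Illustrative operation counts for n = 1000 samples, N = 50 partitions,
# k = 20 clusters per partition (values chosen for illustration only).
n, N, k = 1000, 50, 20
costs = {
    "K-means ensemble": n * k * N,        # O(nkN)
    "EAC (SL/AL)":      n**2 * N,         # O(n^2 N)
    "CSPA":             n**2 * N * k,     # O(n^2 N k)
    "HGPA":             n * N * k,        # O(n N k)
    "MCLA":             n * k**2 * N**2,  # O(n k^2 N^2)
}
for name, c in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} ~{c:.1e} ops")
```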
---
## Conclusions
* combining various clustering partitions of a given data set in order to obtain a partition that is better than the individual partitions
* the evidence accumulation technique maps the clustering ensemble into a new similarity measure between patterns, by accumulating pairwise pattern associations
* experimental results were compared with individual runs of well-known clustering algorithms, and also with other cluster ensemble combination methods
---
## Q & A
---
## Thank you! :sheep:
- email me