Publication View

Estimating the number of clusters in a dataset via the Gap statistic (2000)

Abstract
We propose a method (the "Gap statistic") for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. k-means or hierarchical), comparing the change in within-cluster dispersion to that expected under an appropriate reference null distribution. Some theory is developed for the proposal, and a simulation study shows that the Gap statistic usually outperforms other methods that have been proposed in the literature. We also briefly explore application of the same technique to the problem of estimating the number of linear principal components.

1 Introduction
Cluster analysis is an important tool for "unsupervised" learning: the problem of finding groups in data without the help of a response variable. A major challenge in cluster analysis is estimation of the optimal number of "clusters". Figure 1 (top right) shows a typical plot of an error measure W_k (the within-cluster dispersion defined below) for a clustering pr...
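The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and this record does not give the formulas, so the following are assumptions: the gap is taken as the mean of log W_k over reference datasets minus log W_k on the observed data, the reference null distribution is uniform over the bounding box of the data, W_k is the pooled within-cluster sum of squares, and the chosen k is the smallest one with Gap(k) >= Gap(k+1) - s_(k+1). The helper names (within_dispersion, kmeans, gap_statistic) and all default parameters are hypothetical, and a plain NumPy k-means stands in for "any clustering algorithm".

```python
import numpy as np


def within_dispersion(X, labels, k):
    """Pooled within-cluster sum of squared distances to each cluster mean (W_k)."""
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def kmeans(X, k, rng, n_init=10, n_iter=50):
    """Plain Lloyd's k-means with random restarts; returns the best labelling found."""
    best_w, best_labels = np.inf, None
    for _ in range(n_init):
        # Start from k distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        w = within_dispersion(X, labels, k)
        if w < best_w:
            best_w, best_labels = w, labels
    return best_labels


def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Return (estimated k, array of gap values for k = 1..k_max)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(within_dispersion(X, kmeans(X, k, rng), k))
        # Assumed reference null: uniform draws over the data's bounding box.
        ref_log_w = np.array([
            np.log(within_dispersion(R, kmeans(R, k, rng), k))
            for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(n_ref))
        ])
        gaps.append(ref_log_w.mean() - log_w)
        # Spread of the reference values, inflated for simulation error.
        sks.append(ref_log_w.std() * np.sqrt(1 + 1 / n_ref))
    gaps, sks = np.array(gaps), np.array(sks)
    # Assumed rule: smallest k with Gap(k) >= Gap(k+1) - s_(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k, gaps
    return k_max, gaps
```

Calling gap_statistic(X) on an (n, d) NumPy array returns the estimated number of clusters together with the gap values, which can be plotted against k. In practice the k-means helper could be swapped for any other clustering routine that yields cluster labels.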

Publication details
Download http://citeseer.ist.psu.edu/306526.html
Source http://www-stat.stanford.edu/~tibs/ftp/gap.ps
Publisher unknown
Contributors The Pennsylvania State University CiteSeer Archives
Repository CiteSeer (United States)
Keywords Robert Tibshirani, Guenther Walther, Estimating the number of clusters in a dataset via the Gap statistic
Language English
Relation oai:CiteSeerPSU:253483

Publications citing this publication (3)
Supervised gene clustering with penalized logistic regression (2003)
Identifying differentially expressed genes from microarray experiments via statistic synthesis (2004)