CSSS

CENTER FOR STATISTICS AND THE SOCIAL SCIENCES,
TALK ABSTRACTS

UW

Home
People
Departments
Seminars
Working
Papers
Student Seminar
Research
Abstracts
Seed Grants
Travel Grants
Undergrad Research Grants
Consulting
Courses
Ph.D. Tracks
Blalock Fellowship
Newsletters
Photos
Links
Conference Room/Equipment Reservation
Computing
Math Camp

Adrian E. Raftery

"Model-Based Clustering for Large Datasets Via Sampling"

Presented to the International Statistical Association Meeting, Berlin, August 2003

Today, data are generated at unprecedented speed. The growing size of data sets and data bases has increased the need for good clustering methods to capture and summarize the information. An example is the segmentation of multispectral images, where the objective is to group similar pixels, and to assess how many different groups there are. Typically, three to ten congruent images or bands containing complementary information are recorded, often containing tens of thousands of pixels per image. This places constraints on clustering techniques with respect to memory usage and computing time.

Many different clustering methods have been described since the 1950s. Model-based clustering is one of the more recent developments, and has shown good performance in a number of fields, including image analysis. As implemented in these applications, and in the MCLUST software (www.stat.washington.edu/mclust), model-based clustering consists of fitting a mixture of multivariate normal distributions to a data set by maximum likelihood using the EM algorithm, possibly with geometric constraints on the covariances matrices, and an additional component to allow for outliers or noise. Since the likelihood surface typically has many local maximna, initialization of the EM algorithm is a very important issue. Model-based hierarchical clustering has been found to provide good initializations.

Model-based hierarchical clustering generally requires storage and computing time at least proportional to the square of the dimension of the data, so that both space and time are limiting factors in its application to large data sets. Another problem is that when the size of the data set reaches a certain threshold, it is not possible to keep all of the required quantities in memory at the same time, forcing a dramatic and abrupt increase in necessary computational resources. This threshold varies with computer hardware and software, and data dimension, but at the current time it is typically on the order of several thousand objects.

Various approaches to the problem of clustering large data sets have been proposed, including initialization by clustering a sample of the data, and using an initial crude partitioning of the entire data set. The simplest and perhaps most widely applied approach is to apply the clustering method first to a small simple random sample from the data, and then apply the resulting estimated model to the full data set using discriminant analysis. The discriminant analysis can be carried out very easily in the model-based clustering framework by using a single E step. Unfortunately, this easily implemented strategy may lead to unstable segmentations when used in its simplest form.

We show that the appealingly simple approach based on clustering a sample of the data can be modified to give good and stable results, with two straightforward changes. For the image data sets we consider, we obtain good results by tentatively selecting several models based on the sample rather than just one, and by running several EM steps on the full data set rather than just one E step. We find that performance improves when the size of the sample is increased up to about 2,000, but that beyond that there is little gain. To reach this conclusion, we considered a range of sample sizes and several strategies of varying computational cost. Comparisons are based on three typical real-world data sets, and several realistic simulations.

This is joint work with Ron Wehrens and Lutgarde M.C. Buydens of the University of Nijmegen, and Chris Fraley of the University of Washington.



UW - CSSS: Wednesday, 17-Sep-2003 12:44:13 PDT Contact: Webmaster or CSSS