Chris Ding, Computer Science & Engineering Department, University of Texas at Arlington

(See the related tutorial on Principal Component Analysis and Matrix Factorizations for Learning.)

Tutorial given at ICML 2004 (International Conference on Machine Learning, July 2004, Banff, Alberta, Canada)

Spectral Clustering

Tutorial slides for Part I (pdf file)
Tutorial slides for Part II (pdf file)

Summary. Spectral methods have recently emerged as effective tools for data clustering, image segmentation, Web ranking analysis, and dimension reduction. They start with well-motivated objective functions whose optimization leads to eigenvectors with many clear and interesting algebraic properties. At the core of spectral clustering is the Laplacian of the graph adjacency (pairwise similarity) matrix, which evolved from spectral graph partitioning. This tutorial surveys recent advances after a brief account of the historical developments. Mathematical proofs will be outlined, and examples from gene expression and Internet newsgroups will be given to illustrate the ideas and results.
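To make the core idea concrete, here is a minimal sketch (in Python with NumPy, using a small hypothetical 6-node similarity matrix; illustrative only, not code from the tutorial) of how the eigenvectors of the graph Laplacian yield a 2-way clustering:

import numpy as np

# Hypothetical symmetric similarity matrix W for 6 points:
# {0,1,2} and {3,4,5} form two groups joined by a single edge (2-3).
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

# np.linalg.eigh returns eigenvalues in ascending order; the second
# eigenvector (the Fiedler vector) encodes the 2-way partition.
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)   # one label on {0,1,2}, the other on {3,4,5} (sign is arbitrary)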

Tutorial contents:

Part I. Basic Algorithms (90 min)

Part II. Advanced Topics (90 min)

Audience. Spectral clustering is not just an effective data clustering algorithm. It has a rich structure with interesting properties and deep connections to principal component analysis, Hopfield networks, and semidefinite programming. The intended audience includes both people interested in clustering and people interested in theoretical developments in unsupervised learning and dimension reduction.

Prerequisites. Basic matrix algebra at the level of G. Strang, Introduction to Linear Algebra, and G. Golub and C. Van Loan, Matrix Computations.

Brief Introduction. Spectral clustering has its origin in spectral graph partitioning (Donath & Hoffman, 1972; Fiedler, 1973), a popular algorithm in high performance computing (Pothen, Simon & Liou, 1990). This led to ratio-cut clustering (Hagen & Kahng, 1992; Chan, Schlag & Zien, 1994). These algorithms use eigenvectors of the Laplacian of the graph adjacency (pairwise similarity) matrix. Recent work on the normalized cut (Shi & Malik, 2000) uses eigenvectors of the generalized/normalized Laplacian (Chung, 1997) and has brought renewed interest to the topic. The approach has been extended to bipartite graphs for simultaneous clustering of the rows and columns of contingency tables such as word-document matrices (Zha et al., 2001; Dhillon, 2001). Many new properties have recently been proved, such as connections to random walks (Meila & Shi, 2001), self-aggregation (Ding et al., 2002), semidefinite relaxation (Xing & Jordan, 2003), and perturbation analysis (Ding et al., 2002).

Standard spectral clustering deals with 2-way clustering. Multi-way clustering methods have also been proposed (Ng, Jordan & Weiss, 2001; Ding et al., 2002; Xu & Shi, 2003), along with multi-way spectral relaxations and lower bounds (Gu et al., 2001). The widely used K-means clustering has recently been shown (Zha et al., 2001; Ding & He, 2004) to be directly related to PCA: (a) the solution for the cluster membership indicators in K-means clustering is given by the PCA components, i.e., the eigenvectors of the Gram (inner-product kernel) matrix; (b) the PCA subspace is identical to the subspace spanned by the K cluster centroids. Another popular use of eigenvectors is in webpage ranking algorithms such as PageRank (used in Google) and HITS (Kleinberg, 1999), where closed-form solutions can be obtained (Ding et al., 2001, 2002). Finally, efficient linear algebra software for computing eigenvectors is fully developed and freely available, which facilitates spectral clustering on large datasets.
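The K-means/PCA connection in (a) above can be illustrated with a small hedged sketch (Python/NumPy on synthetic data; a simplification of the cited results, not the papers' algorithm): for K = 2, thresholding the top eigenvector of the centered Gram matrix at zero recovers the two-cluster split.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated Gaussian clusters of 20 points each.
X = np.vstack([rng.normal(-3.0, 1.0, (20, 2)),
               rng.normal(+3.0, 1.0, (20, 2))])

Xc = X - X.mean(axis=0)          # center the data
G = Xc @ Xc.T                    # Gram (inner-product kernel) matrix

vals, vecs = np.linalg.eigh(G)   # eigenvalues in ascending order
v1 = vecs[:, -1]                 # top eigenvector = first PCA component scores

labels = (v1 > 0).astype(int)
print(labels)   # the first 20 points receive one label, the last 20 the other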
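Likewise, the ranking eigenvectors mentioned above are straightforward to compute in practice. Below is a minimal power-iteration sketch for a PageRank-style score on a hypothetical 4-page link graph (the damping factor 0.85 is the conventional choice; dangling pages are ignored for brevity):

import numpy as np

# A[i, j] = 1 if page j links to page i (hypothetical link graph).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)        # column-stochastic transition matrix
n, d = M.shape[0], 0.85

r = np.full(n, 1.0 / n)
for _ in range(100):         # power iteration toward the dominant eigenvector
    r = (1 - d) / n + d * M @ r
print(r / r.sum())           # PageRank scores, normalized to sum to 1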

Selected References

Presenter biography. Chris Ding is a staff computer scientist at Lawrence Berkeley National Laboratory. He started working on mesh/graph partitioning using spectral methods in 1995 and has since worked extensively on spectral clustering: min-max cut, spectral relaxation of multi-way cuts and bounds, extension to bipartite graphs, K-means relaxation, and perturbation analysis, as well as Latent Semantic Indexing in information retrieval and web ranking algorithms using spectral methods, with about 15 publications in this area. This tutorial grew out of his research experience in this area.