Topics
Semi-supervised NMF for Homogeneous Data Clustering
Problem Definition
- Semi-supervised clustering uses class labels or
pairwise constraints on data objects to guide
otherwise unsupervised clustering. It groups the
initially labeled data together with the unlabeled
data, refining the existing set of categories so
that they reflect the regularities of the whole
data set.
Algorithms
- Supervision is provided as two sets of pairwise
constraints on the data objects: must-link
constraints M and cannot-link constraints C. Each
pair (x_i, x_j) in M indicates that x_i and x_j
must belong to the same cluster; each pair
(x_i, x_j) in C indicates that the two objects
should belong to different clusters (see the figure
above). An illustrative encoding of M and C as
indicator matrices is sketched after this item.
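As a rough illustration only (the exact way SS-NMF folds the constraints
into its objective is given in the paper), the two constraint sets can be
stored as symmetric 0/1 indicator matrices; the helper name and the example
pairs below are hypothetical:

    import numpy as np

    def constraint_matrices(n, must_link, cannot_link):
        # Illustrative helper: encode the constraint sets M and C as
        # symmetric 0/1 indicator matrices over the n data objects.
        W_m = np.zeros((n, n))   # must-link indicator
        W_c = np.zeros((n, n))   # cannot-link indicator
        for i, j in must_link:
            W_m[i, j] = W_m[j, i] = 1.0
        for i, j in cannot_link:
            W_c[i, j] = W_c[j, i] = 1.0
        return W_m, W_c

    # Example: objects 0 and 1 must share a cluster; 0 and 4 must not.
    W_m, W_c = constraint_matrices(5, must_link=[(0, 1)], cannot_link=[(0, 4)])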
- Perform a symmetric non-negative tri-factorization
of the data similarity matrix A under these constraints,
using an iterative algorithm to minimize ||A - GSG^T||^2, where
G is a cluster indicator matrix and S is a cluster centroid matrix.
The correctness and convergence of the algorithm are
proved by showing that the solution satisfies the KKT
optimality conditions and that the iterative updates are
guaranteed to converge. A sketch of the unconstrained
multiplicative updates follows this item.
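Below is a minimal sketch for the unconstrained objective ||A - GSG^T||^2,
assuming standard multiplicative updates for symmetric non-negative
tri-factorization; the constrained SS-NMF updates in the paper additionally
incorporate the must-link and cannot-link information:

    import numpy as np

    def symmetric_tri_nmf(A, k, n_iter=200, eps=1e-9, seed=0):
        # Sketch: minimize ||A - G S G^T||^2 with G, S >= 0 by
        # multiplicative updates; A is a non-negative similarity matrix.
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        G = rng.random((n, k)) + eps      # cluster indicator matrix
        S = rng.random((k, k)) + eps      # cluster centroid matrix
        S = (S + S.T) / 2                 # keep S symmetric
        for _ in range(n_iter):
            AGS = A @ G @ S
            G *= np.sqrt(AGS / (G @ (G.T @ AGS) + eps))          # update G
            GtG = G.T @ G
            S *= np.sqrt((G.T @ A @ G) / (GtG @ S @ GtG + eps))  # update S
        return G, S, G.argmax(axis=1)     # hard labels from the largest entry per row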
Results
- In the following figures, (a) shows an artificial
toy data set consisting of two natural clusters; (b)
shows the data distribution in the SS-NMF subspace
spanned by the two column vectors of the indicator
matrix G, where the points from the two clusters line
up along the two axes; and (c) shows the data
distribution in the SS-SNC subspace of the first two
singular vectors, where there is no clear relationship
between the axes and the clusters.
- In the following table, we compare the algorithms on
several document datasets using AC (clustering accuracy)
values; a sketch of how AC is computed follows this item.
The performance of the first three methods is similar,
with NMF proving to be the best among the unsupervised
methods. However, the accuracy of NMF deteriorates
greatly, and it fails to produce meaningful results on
datasets with more than two clusters. On the other hand,
the superior performance of SS-NMF is evident across all
the datasets. We can see that, in general, a
semi-supervised method can greatly enhance document
clustering results by benefiting from the user-provided
knowledge. Moreover, SS-NMF generates significantly
better results by quickly learning from the few pairwise
constraints provided.
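Here AC denotes clustering accuracy: the fraction of objects correctly
grouped under the best one-to-one matching between predicted clusters and
ground-truth classes. A minimal sketch, assuming this standard
Hungarian-matching definition of AC:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(true_labels, pred_labels):
        # AC: best one-to-one matching between predicted cluster ids and
        # true class ids (Hungarian algorithm), divided by the data size.
        true_labels = np.asarray(true_labels)
        pred_labels = np.asarray(pred_labels)
        k = int(max(true_labels.max(), pred_labels.max())) + 1
        confusion = np.zeros((k, k), dtype=int)
        for t, p in zip(true_labels, pred_labels):
            confusion[p, t] += 1          # predicted cluster p vs. true class t
        row, col = linear_sum_assignment(-confusion)   # maximize matched counts
        return confusion[row, col].sum() / true_labels.size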
- The following sub-figures show the AC values
against an increasing percentage of available
pairwise constraints, for the algorithms on several
document datasets. On the whole, all three
algorithms perform better as the percentage of
pairwise constraints increases. Comparatively,
SS-NMF achieves better accuracy than the other two
algorithms even with the minimum percentage of
pairwise constraints.
- We plot the computational speed of SS-NMF against SS-KK and
SS-SNC in the following figure. As the number of data samples increases, SS-SNC
turns out to be the slowest of the three algorithms. SS-KK is the quickest, with
SS-NMF following closely. SS-NMF thus provides an efficient approach to
clustering.
Related Publications
- Yanhua Chen, Manjeet Rege, Ming Dong and Jing Hua,
"Non-negative Matrix Factorization for
Semi-supervised Data Clustering", Journal of
Knowledge and Information Systems (Springer), Vol.
17, No. 3, pp. 355 - 379, 2008. (invited as a best
paper of ICDM 07, conference version: "Incorporating
User provided Constraints into Document Clustering",
was published in Proc. of IEEE International
Conference on Data Mining, Omaha, NE, 2007 as a
Regular paper, acceptance rate 7.2%).
- Yanhua Chen, Manjeet Rege, Ming Dong and Farshad
Fotouhi, "Deriving Semantics for Image Clustering
from Accumulated User Feedbacks", Proc. of ACM
Multimedia, Germany, 2007 (acceptance rate: 24%).