Topics
Semi-supervised NMF for Heterogeneous Data Co-clustering
Problem Definition
- With the rapid growth of the Internet and computational technologies over the past decade, many data mining applications have advanced from the simple clustering of a single data type to the co-clustering of multiple, often highly heterogeneous, data types. Co-clustering can generally be divided into two types: pairwise co-clustering (i.e., the joint clustering of two heterogeneous data types) and high-order co-clustering (i.e., the joint clustering of more than two). For example, simultaneously clustering several related data types can reveal the interrelations of words, documents, and categories in a text corpus; of Web pages, search queries, and Web users in a Web search system; and of papers, keywords, authors, and conferences in a scientific publication domain. Through co-clustering, we can discover a hidden global structure in heterogeneous data that seamlessly integrates the multiple data types, providing a better picture of the underlying data distribution that is highly valuable in many real-world applications.
However, accurately co-clustering heterogeneous data without domain-dependent background information remains a challenging task. In our model, given a Star-structured Heterogeneous Relational Data (SHRD) set with a central data type X_c and l feature modalities X_1, ..., X_p, ..., X_l, the goal is to cluster X_c into k_c disjoint clusters while simultaneously clustering X_1 into k_1 disjoint clusters, ..., X_p into k_p disjoint clusters, ..., and X_l into k_l disjoint clusters.
Notice that SHRD provides a very good abstraction for many real-world data mining problems. For example, it can model words, documents, and categories in text mining, where the document is the central data type; authors, conferences, papers, and keywords in academic publications, where the paper is the central data type; and images and their color and texture features in image retrieval, where the image is the central data type.
Algorithms
- We first model SHRD using a set of relation matrices. That is, a matrix R^{(cp)} represents the relation between the central data type X_c and a feature modality X_p.
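As a concrete illustration, a SHRD set then reduces to one nonnegative relation matrix per feature modality, all sharing the central axis. The following sketch is hypothetical (shapes, values, and names are made up for illustration):

```python
import numpy as np

# Hypothetical SHRD set: 6 documents (central type X_c) with
# a word modality X_1 and a category modality X_2.
n_c, n_words, n_cats = 6, 8, 3
rng = np.random.default_rng(0)

# R^{(c1)}: document-word relation (e.g., term counts).
R_c1 = rng.integers(0, 5, size=(n_c, n_words)).astype(float)

# R^{(c2)}: document-category relation (e.g., co-occurrence weights).
R_c2 = rng.random((n_c, n_cats))

# One relation matrix per feature modality, all with n_c rows.
relations = {"word": R_c1, "category": R_c2}
```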
- The supervision is provided as two sets of pairwise constraints derived from the given labels on the central data type: must-link constraints M = {(x_i, x_j)} and cannot-link constraints C = {(x_i, x_j)}, where (x_i, x_j) in M implies that x_i and x_j are labeled as belonging to the same cluster, while (x_i, x_j) in C implies that x_i and x_j are labeled as belonging to different clusters. Note that this assumption reflects the fact that, in practice, constraints are much easier to specify on the central data type (e.g., documents in document-word co-clustering) than on the feature modalities (e.g., words).
The figure above shows a data triplet, the basic element of SHRD, with constraints on the central data type: the green edges indicate the must-link constraints M, the red edges denote the cannot-link constraints C, and the dotted line shows the optimal co-clustering result.
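To make the notation concrete, here is a minimal sketch of deriving M and C from a partially labeled central data type; the label values and indices are hypothetical:

```python
from itertools import combinations

def constraints_from_labels(labels):
    """Derive must-link (M) and cannot-link (C) pairs from labels
    given on a subset of the central data points.

    labels: dict mapping point index -> class label.
    """
    M, C = set(), set()
    for i, j in combinations(sorted(labels), 2):
        if labels[i] == labels[j]:
            M.add((i, j))   # same label => must-link
        else:
            C.add((i, j))   # different labels => cannot-link
    return M, C

# Labels known for four of the six central data points.
M, C = constraints_from_labels({0: "sports", 1: "sports",
                                2: "politics", 5: "politics"})
# M == {(0, 1), (2, 5)}; C == {(0, 2), (0, 5), (1, 2), (1, 5)}
```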
- Through distance metric learning and modality selection (i.e., the feature weights "alpha"), prior knowledge is integrated into co-clustering, pulling must-link central data points as close together as possible and pushing cannot-link central data points as far apart as possible.
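The exact objective of SS-NMF is not reproduced here; the sketch below shows one common way to realize this idea, building scatter matrices from the two constraint sets and solving a generalized eigenproblem (the speed comparison below notes that SS-NMF's metric learning indeed involves a generalized eigenproblem). The function name and the small regularizer are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def learn_metric(X, M, C, dim):
    """Find a projection that keeps must-link pairs close and
    cannot-link pairs far apart (one common formulation, not
    necessarily the paper's exact one).

    X: (n_c, d) central data points; M, C: constraint pair sets.
    """
    d = X.shape[1]
    S_M = np.zeros((d, d))  # scatter of must-link differences
    S_C = np.zeros((d, d))  # scatter of cannot-link differences
    for i, j in M:
        diff = (X[i] - X[j])[:, None]
        S_M += diff @ diff.T
    for i, j in C:
        diff = (X[i] - X[j])[:, None]
        S_C += diff @ diff.T
    # Maximize cannot-link scatter relative to must-link scatter:
    # solve S_C w = lambda (S_M + eps I) w, keep top eigenvectors.
    vals, vecs = eigh(S_C, S_M + 1e-6 * np.eye(d))
    return vecs[:, -dim:]
```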
- Using an iterative algorithm, we then perform tri-factorizations of the new data matrices, obtained with the learned distance metric, to infer the central data clusters while simultaneously deriving the clusters of the related feature modalities.
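For reference, here is a minimal sketch of plain nonnegative tri-factorization R ≈ F S G^T with the standard multiplicative updates; SS-NMF additionally works on the metric-transformed matrices and incorporates the constraints, which are omitted here:

```python
import numpy as np

def tri_factorize(R, k_rows, k_cols, n_iter=200, eps=1e-9):
    """Factor a nonnegative R (n x m) as R ~ F @ S @ G.T, where
    F (n x k_rows) indicates row clusters, G (m x k_cols) indicates
    column clusters, and S (k_rows x k_cols) links the two."""
    rng = np.random.default_rng(0)
    n, m = R.shape
    F = rng.random((n, k_rows))
    S = rng.random((k_rows, k_cols))
    G = rng.random((m, k_cols))
    for _ in range(n_iter):
        F *= (R @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (R.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ R @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

# Each row's cluster is the largest entry of the indicator matrix:
# F, S, G = tri_factorize(R_c1, k_rows=2, k_cols=3)
# doc_clusters, word_clusters = F.argmax(axis=1), G.argmax(axis=1)
```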
Results
- In the following table, we compare the algorithms on co-clustering nine document data sets using AC (accuracy) values. It shows the document co-clustering accuracy obtained by SRC, CMRF, NMF, SS-CMRF, and SS-NMF (the latter two with 15% constraints), along with the AC values averaged over all nine data sets. NMF outperforms the other unsupervised methods on six of the nine data sets. In general, SRC performs the worst among the three unsupervised methods; its accuracy on the data set HT7 is only 19%. The table also shows that the semi-supervised methods provide significantly better results than the corresponding unsupervised ones: the average AC of SS-CMRF increases 15% over CMRF, while SS-NMF gains up to 20% over NMF. SS-NMF achieves high clustering accuracy (over 80%) on five of the nine data sets, and its average AC of 72.43% is about 10% higher than that of SS-CMRF.
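AC here is clustering accuracy under the best one-to-one matching between predicted clusters and true classes. A generic way to compute it (not the authors' code) uses the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC: fraction of points correctly assigned after optimally
    matching predicted clusters to true classes. Assumes labels
    are integers 0..k-1 on both sides."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)  # maximize matches
    return count[rows, cols].sum() / len(y_true)

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```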
- The following figure shows the average AC values against an increasing percentage of pairwise constraints for SS-CMRF and SS-NMF. As more prior knowledge becomes available, the performance of both SS-CMRF and SS-NMF clearly improves. On average, SS-CMRF is consistently outperformed by SS-NMF across varying amounts of constraints.
- We report the accuracy of text categorization by SRC, CMRF, NMF, SS-CMRF, and SS-NMF in the left panel of the following table. In six of the nine text data sets, the AC value of SS-NMF ranks first or second, the exceptions being HT3, HT8, and HT9. In high-order co-clustering, we also obtain the clusters of words simultaneously with the clusters of documents and categories. However, for text representation there is no ground truth available to compute an AC value. Instead, we select the top 10 words based on mutual information for each word cluster associated with a category cluster and list them in the right panel of the table. These words can be used to represent the underlying "concept" of the corresponding category cluster.
- A very important result concerns modality selection. First, modality selection provides additional information on the relative importance of the various relations (e.g., "word" and "category") for grouping the central data type (e.g., "document"). Moreover, from a technical point of view, it acts like feature selection when computing the new relational data matrix. The following table lists the modality importance for the two relations, document-word and document-category, in SS-NMF with 1% constraints; a higher value indicates greater importance. Clearly, the significance of "word" and "category" differs considerably across data sets.
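One plausible way such weights enter the computation is to scale each relation before the matrices are combined (illustrative only, reusing the `relations` dict from the first sketch; the paper's exact construction is not reproduced):

```python
import numpy as np

# Hypothetical learned modality weights (alpha) for one data set:
alpha = {"word": 0.73, "category": 0.27}

# Scale each relation by its weight before concatenating along the
# feature axis to form a new relational data matrix for X_c.
R_new = np.hstack([alpha[name] * R for name, R in relations.items()])
```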
- We compare the computational speed of the three unsupervised approaches (SRC, CMRF, and NMF) and the two semi-supervised approaches (SS-CMRF and SS-NMF) in the following figures. In both cases, unsupervised NMF is the quickest of the five, as it uses an efficient iterative algorithm to compute the cluster indicator and cluster association matrices. SS-NMF ranks second as n_c increases, and is close to CMRF and SS-CMRF as n_p increases. The difference between SS-NMF and unsupervised NMF is mainly due to the additional computation required to learn the new distance metric, which involves solving a generalized eigenproblem. In the left figure, the computing time of SS-NMF is close to that of unsupervised NMF because both methods have complexity linear in n_c when n_p is fixed; on the other hand, as the right figure shows, the time for SS-NMF increases more quickly when n_c is fixed. The speed of CMRF and SS-CMRF lies between that of NMF and SRC, and their computing time increases quickly in both cases, since their complexity is cubic in n_c or n_p when the other is fixed. Finally, SRC is the slowest in both cases: even though it is completely unsupervised, it must solve a computationally more expensive constrained eigen-decomposition problem and requires additional post-processing (k-means) to infer the clusters. From these results, it is clear that SS-NMF provides an efficient approach to semi-supervised data co-clustering.
Related Publications
- Yanhua Chen, Lijun Wang, and Ming Dong, "Non-negative Matrix Factorization for Semi-supervised Data Co-clustering", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 22, No. 10, pp. 1459-1474, October 2010.
- Yanhua Chen, Lijun Wang, and Ming Dong, "Semi-supervised Document Clustering with Simultaneous Text Representation and Categorization", in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Springer-Verlag Lecture Notes in Artificial Intelligence (LNAI), Bled, Slovenia, 2009 (acceptance rate: 22%).
- Yanhua Chen, Ming Dong, and Wanggen Wan, "Image Co-clustering with Multi-modality Features and User Feedbacks", in Proceedings of the ACM International Conference on Multimedia (ACM MM), Beijing, China, 2009 (acceptance rate: 28%).