Problem Definition

With the fast growth of the Internet and computational technologies over
the past decade, many data mining applications have advanced swiftly
from the simple clustering of one data type to the co-clustering of multiple
data types, usually involving high heterogeneity. Generally, co-clustering can be divided into two types:
pairwise co-clustering (i.e., clustering of two types of heterogeneous data) and
high-order co-clustering (i.e., clustering of multiple types of heterogeneous data). For example, the
interrelations of words, documents, and categories in a text corpus;
Web pages, search queries, and Web users in a Web search system; and
papers, keywords, authors, and conferences in the scientific
publication domain can all be identified through the simultaneous
clustering of several related data types. Through
co-clustering, we are able to discover a hidden global structure in
heterogeneous data that seamlessly integrates multiple
data types, providing a better picture of the underlying
data distribution, which is highly valuable in many real-world applications.
However, accurately co-clustering heterogeneous data without domain-dependent
background information is still a challenging task. In our model, given a Star-structured Heterogeneous Relational Data (SHRD) set
with a central data type X_c and l feature
modalities X_1, ..., X_p, ..., X_l, the goal is to cluster the central data type
X_c into k_c disjoint clusters while simultaneously clustering
each feature modality X_p into k_p disjoint clusters (p = 1, ..., l).
Notice that SHRD provides a very good abstraction for many real-world data mining
problems. For example, it can be used to model words, documents, and
categories in text mining, where the document is the central data
type; authors, conferences, papers, and keywords in academic
publications, where the paper is the central data type;
and images, color features, and texture features in image retrieval, where the image is
the central data type.

Algorithms
We first model SHRD using a set of relation
matrices: a matrix R^{(cp)} represents the relation between the central data
type X_c and a feature modality X_p.
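As a concrete illustration, here is a toy SHRD set sketched in numpy. The sizes, names, and the choice of word counts and category indicators are ours, not from the paper; the point is only that one relation matrix per modality shares its row dimension with the central data type.

```python
import numpy as np

# Hypothetical toy SHRD: 4 documents (central type X_c) with two feature
# modalities -- 5 words (X_1) and 3 categories (X_2). Each relation matrix
# R^(cp) stores the association between central objects and feature objects.
rng = np.random.default_rng(0)
R_word = rng.poisson(2.0, size=(4, 5)).astype(float)  # R^(c1): doc-word counts
R_cat = np.eye(3)[[0, 0, 1, 2]]                       # R^(c2): doc-category memberships

# The SHRD set is simply the collection of relation matrices that share
# the same central dimension (rows = documents).
shrd = {"word": R_word, "category": R_cat}
```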
The supervision is provided
as two sets of pairwise constraints derived from the given labels on the
central data type: must-link constraints M = {(x_i, x_j)} and cannot-link constraints C = {(x_i, x_j)}, where
(x_i, x_j) in M implies that x_i and x_j are labeled as belonging to the
same cluster, while (x_i, x_j) in C implies that x_i and x_j are labeled as
belonging to different clusters. Note that this assumption reflects
the fact that, in practice, constraints are much easier to specify on the central data type
(e.g., documents in document-word co-clustering) than on the feature modalities (e.g., words).
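Deriving the two constraint sets from partial labels can be sketched as follows; the variable names and the toy labels are ours.

```python
from itertools import combinations

# Partial labels on the central data type (here: 3 labeled documents).
labels = {0: "sports", 1: "sports", 2: "politics"}

# Every labeled pair yields either a must-link or a cannot-link constraint.
must_link, cannot_link = set(), set()
for (i, li), (j, lj) in combinations(labels.items(), 2):
    if li == lj:
        must_link.add((i, j))      # (x_i, x_j) in M: same cluster
    else:
        cannot_link.add((i, j))    # (x_i, x_j) in C: different clusters
```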
The above figure shows a data triplet, the basic element of SHRD, with
constraints on the central data type. The green edges indicate the
must-link constraints M, while the red edges
denote the cannot-link constraints C. The dotted line
shows the optimal co-clustering result.
Through distance metric learning and modality selection (i.e., the feature weights "alpha"),
prior knowledge is integrated into co-clustering, drawing
must-link central data points as close together as possible and pushing cannot-link central data points as far apart as possible.
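The paper learns the metric and the modality weights by optimization; as a rough stand-in, the sketch below weights each modality by how strongly it separates cannot-link pairs relative to must-link pairs. This heuristic, and all names in it, are ours and are only meant to convey the role of the alpha weights.

```python
import numpy as np

# Illustrative heuristic (not the paper's exact optimization): score each
# modality by the ratio of average cannot-link to average must-link squared
# distance, then normalize the weights to sum to one.
def modality_weights(views, must_link, cannot_link):
    """views: list of (n_c, d_p) feature matrices, one per modality."""
    alphas = []
    for X in views:
        d_ml = np.mean([np.sum((X[i] - X[j]) ** 2) for i, j in must_link])
        d_cl = np.mean([np.sum((X[i] - X[j]) ** 2) for i, j in cannot_link])
        alphas.append(d_cl / (d_ml + 1e-12))  # large ratio => informative modality
    alphas = np.asarray(alphas)
    return alphas / alphas.sum()
```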
Using an iterative algorithm, we then perform
tri-factorizations of the new data matrices, obtained with the
learned distance metric, to infer the central data clusters while
simultaneously deriving the clusters of the related feature modalities.
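A generic nonnegative tri-factorization R ≈ G_c S G_p^T with standard multiplicative updates for the Frobenius loss can be sketched as below. This shows only the factorization step on a single relation matrix; the paper's full algorithm operates on the metric-transformed matrices of all modalities.

```python
import numpy as np

# Nonnegative matrix tri-factorization R ≈ Gc @ S @ Gp.T via standard
# multiplicative updates minimizing ||R - Gc S Gp^T||_F^2.
def tri_factorize(R, k_c, k_p, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    n_c, n_p = R.shape
    Gc = rng.random((n_c, k_c))
    S = rng.random((k_c, k_p))
    Gp = rng.random((n_p, k_p))
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        Gc *= (R @ Gp @ S.T) / (Gc @ S @ Gp.T @ Gp @ S.T + eps)
        Gp *= (R.T @ Gc @ S) / (Gp @ S.T @ Gc.T @ Gc @ S + eps)
        S *= (Gc.T @ R @ Gp) / (Gc.T @ Gc @ S @ Gp.T @ Gp + eps)
    # Rows of Gc act as cluster indicators for the central data type,
    # rows of Gp for the feature modality; S links the two cluster sets.
    return Gc, S, Gp

# Toy block-structured doc-word matrix: docs 0,1 share a topic, doc 2 differs.
R = np.array([[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 7, 3]], dtype=float)
Gc, S, Gp = tri_factorize(R, k_c=2, k_p=2)
doc_clusters = Gc.argmax(axis=1)
```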
Results
In the following table, we compare the algorithms on co-clustering
several document data sets using AC values. It shows the document co-clustering accuracy obtained by
SRC, CMRF, NMF, SSCMRF, and SSNMF (the latter two with 15% constraints). Averaged AC values over all nine data
sets are also reported. NMF outperforms the
other unsupervised methods on six out of nine data
sets. In general, SRC performs the worst among the
three unsupervised methods; specifically, its accuracy on the
data set HT7 is only 19%. The table also shows that the semi-supervised
methods provide significantly better results than the
corresponding unsupervised ones: the average AC of
SSCMRF increases by 15% over CMRF, while a gain of up to 20% is
achieved by SSNMF over NMF. We also observe that SSNMF
achieves high clustering accuracy (over 80%)
on five of the nine data sets. The average AC of SSNMF
is 72.43%, about 10% higher than that of SSCMRF.
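The AC metric compares predicted clusters to ground-truth classes after the best one-to-one relabeling. A minimal implementation (brute-force over permutations, which is fine for small cluster counts; the Hungarian algorithm would be used otherwise):

```python
import numpy as np
from itertools import permutations

# Clustering accuracy (AC): fraction of points whose predicted cluster,
# under the best one-to-one mapping of cluster ids to class ids, matches
# the ground-truth class.
def clustering_accuracy(true, pred):
    true, pred = np.asarray(true), np.asarray(pred)
    k = int(max(true.max(), pred.max())) + 1
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array(perm)[pred]  # relabel predictions
        best = max(best, float(np.mean(mapped == true)))
    return best
```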
The following figure shows
the average AC values against an
increasing percentage of pairwise constraints for SSCMRF
and SSNMF. Again, when more prior knowledge
is available, the performance of SSCMRF and SSNMF
clearly improves. On average,
SSCMRF is consistently outperformed by SSNMF across
varying amounts of constraints.

We report the accuracy
of text categorization by SRC, CMRF, NMF, SSCMRF,
and SSNMF in the left panel of the following table.
On six out of nine text data sets, the AC
value of SSNMF ranks either first or second, the
exceptions being the data sets HT3, HT8, and HT9.
In high-order co-clustering, we also obtain the clusters
of words simultaneously with the clusters of documents
and categories. However, for text representation, there
is no ground truth available to compute an AC value.
Instead, we select the top 10 words based on mutual
information for each word cluster associated with a
category cluster and list them in the right panel of the table.
These words can be used to represent the underlying
"concept" of the corresponding category cluster.

Another very important result concerns modality selection.
First, modality selection can provide additional
information on the relative importance of the various
relations (e.g., "word" and "category") for grouping the
central data type (e.g., "document"). Moreover, from a
technical point of view, it also acts like feature selection
when computing the new relational data matrix. The following table
lists the modality importance for the
two relations, document-word and document-category, in SSNMF
with 1% constraints. A higher value in the table
indicates greater importance. Clearly, the significance
of "word" and "category" differs considerably across
data sets.

We compare the computational speed of the three
unsupervised approaches (SRC, CMRF, and NMF) and the
two semi-supervised approaches (SSCMRF and SSNMF) in the following figures.
In both cases, unsupervised NMF is the quickest
of the five approaches, as it uses an efficient iterative
algorithm to compute the cluster indicator and
cluster association matrices. SSNMF ranks second as n_c
increases, while it is close to CMRF and SSCMRF as n_p
increases. The difference between SSNMF and unsupervised
NMF is mainly due to the additional computation
required to learn the new distance metric,
which involves solving a generalized eigenproblem.
In the following left figure, the computing
time of SSNMF is close to that of unsupervised NMF because
both methods have complexity linear in n_c when n_p
is fixed. On the other hand, as shown in the following right figure,
the time for SSNMF increases more quickly when
n_c is fixed. The speed of CMRF and SSCMRF
lies between that of NMF and SRC; the computing time
of these two algorithms increases quickly in both cases,
since their complexity is cubic in n_c or n_p when
the other is fixed. Moreover, we observe that SRC is
the slowest in both cases: even though SRC is completely
unsupervised, it must solve a computationally
more expensive constrained eigendecomposition problem
and requires additional postprocessing (k-means) to
infer the clusters. These results show that
SSNMF provides an efficient way to perform semi-supervised
data co-clustering.

Related Publications
Yanhua Chen, Lijun Wang, and Ming Dong,
"Non-negative Matrix Factorization for Semi-supervised
Data Co-clustering", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 22, No. 10, pp. 1459-1474, October 2010.
Yanhua Chen, Lijun Wang, and Ming Dong,
"Semi-supervised Document Clustering with Simultaneous Text Representation and Categorization",
in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD),
Springer-Verlag Lecture Notes in Artificial Intelligence (LNAI), Bled, Slovenia, 2009 (acceptance rate: 22%).
Yanhua Chen, Ming Dong, and Wanggen Wan,
"Image Co-clustering with Multi-modality Features and User Feedbacks",
in Proceedings of the ACM International Conference on Multimedia (ACM MM), Beijing, China, 2009 (acceptance rate: 28%).
