External Projects
SOFTWARE/CODE AND DATA
II. PhD Research Summary
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
The most common tasks in data mining include classification, clustering, visualization, and estimation.
External Projects
- CRI: IAD Acquisition of Research Infrastructure for Knowledge-enhanced Large-scale Learning of Multimodality Visual Data, NSF
- HyperEye: Susceptibility Weighted Imaging-based Informatics Tools for Brain Tumor Studies, State of Michigan, 21st Century Jobs Fund
- Collaborative Research: Integrated Modeling and Learning of Multimodality Data across Subjects for Brain Disorder Study, NSF
SOFTWARE/CODE AND DATA
- SS-NMF for Homogeneous Data Clustering Sample Code Download
- SS-NMF for Heterogeneous Data Clustering Sample Code Download
- Exemplar-based Visualization Data and Software Demo Download
Research topics:
- Semi-supervised NMF for Homogeneous Data Clustering
- Semi-supervised NMF for Heterogeneous Data Co-clustering
- Exemplar-based Visualization of Large Document Corpus
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating this domain knowledge can guide a clustering algorithm and consequently improve the quality of the clustering. We propose SS-NMF, a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF, users supervise the clustering through pairwise constraints on a few data objects, specifying whether they "must" or "cannot" be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the data similarity matrix to infer the clusters. Theoretically, we establish the correctness and convergence of SS-NMF and show that it provides a general framework for semi-supervised clustering. Through extensive experiments on publicly available data sets, we demonstrate the superior performance of SS-NMF for clustering. Further details of this work are available here.
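The idea of constraint-adjusted symmetric tri-factorization can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration rather than the published algorithm: the function names, the way constraints modify the similarity matrix (a fixed `reward` added for must-link pairs, cannot-link pairs clamped to zero), and the square-root-damped multiplicative updates for the factorization A ≈ G S Gᵀ.

```python
import math
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def apply_constraints(A, must_link, cannot_link, reward=1.0):
    """Return a copy of similarity matrix A adjusted by pairwise constraints.

    Assumed scheme: must-link pairs gain `reward`; cannot-link pairs are set
    to 0 so the matrix stays non-negative and symmetric.
    """
    A = [row[:] for row in A]
    for i, j in must_link:
        A[i][j] = A[j][i] = A[i][j] + reward
    for i, j in cannot_link:
        A[i][j] = A[j][i] = 0.0
    return A

def reconstruction_error(A, G, S):
    """Squared Frobenius error between A and G S G^T."""
    R = matmul(matmul(G, S), transpose(G))
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))

def ss_nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate symmetric A (n x n) by G S G^T with non-negative factors."""
    rng = random.Random(seed)
    n = len(A)
    G = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    S = [[0.0] * k for _ in range(k)]
    for a in range(k):                      # symmetric random start for S
        for b in range(a, k):
            S[a][b] = S[b][a] = rng.random() + 0.1
    for _ in range(iters):
        Gt = transpose(G)
        GtG = matmul(Gt, G)
        # S <- S * sqrt((G^T A G) / (G^T G S G^T G))
        num = matmul(matmul(Gt, A), G)
        den = matmul(matmul(GtG, S), GtG)
        S = [[S[a][b] * math.sqrt(num[a][b] / (den[a][b] + eps))
              for b in range(k)] for a in range(k)]
        # G <- G * sqrt((A G S) / (G S G^T G S))
        GS = matmul(G, S)
        num = matmul(A, GS)
        den = matmul(matmul(GS, GtG), S)
        G = [[G[i][a] * math.sqrt(num[i][a] / (den[i][a] + eps))
              for a in range(k)] for i in range(n)]
    return G, S

if __name__ == "__main__":
    # Toy similarity matrix: two groups of three objects each.
    A = [[1.0 if (i < 3) == (j < 3) else 0.2 for j in range(6)] for i in range(6)]
    A = apply_constraints(A, must_link=[(0, 1)], cannot_link=[(2, 3)])
    G, S = ss_nmf(A, k=2)
    # Assign each object to the cluster with the largest entry in its row of G.
    print([max(range(2), key=lambda c: row[c]) for row in G])
```

In this sketch each row of G acts as a soft cluster indicator, and S captures how strongly the clusters relate to one another; the constraints only enter through the adjusted similarity matrix.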
Co-clustering heterogeneous data has attracted extensive attention recently due to its high impact on various important applications, such as text mining, image retrieval, and bioinformatics. However, data co-clustering without any prior knowledge or background information is still a challenging problem. We propose a Semi-Supervised Non-negative Matrix Factorization (SS-NMF) framework for data co-clustering. Our method computes a new pairwise relational matrix by incorporating user-provided constraints through distance metric learning. Using an iterative algorithm, we perform tri-factorization of the new matrix to infer the clusters of the two data types. Through extensive experiments on publicly available data sets, we demonstrate the superior performance of SS-NMF for data co-clustering. Further details of this work are available here.
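The tri-factorization step can be sketched as follows, assuming the relational matrix R (for example, documents by words) has already been computed. The sketch deliberately omits the distance metric learning stage and uses standard Lee-Seung-style multiplicative updates for R ≈ F S Gᵀ; the function names and the update rules are illustrative assumptions, not the authors' released code.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def scale(M, num, den, eps):
    """Element-wise multiplicative update M * num / (den + eps)."""
    return [[m * a / (b + eps) for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(M, num, den)]

def error(R, F, S, G):
    """Squared Frobenius error between R and F S G^T."""
    approx = matmul(matmul(F, S), transpose(G))
    return sum((r - a) ** 2 for rr, ra in zip(R, approx) for r, a in zip(rr, ra))

def tri_factorize(R, k1, k2, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative R (n x m) by F S G^T.

    F (n x k1) holds row-cluster memberships, G (m x k2) holds
    column-cluster memberships, and S (k1 x k2) links the two.
    """
    rng = random.Random(seed)
    n, m = len(R), len(R[0])

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    F, S, G = rand(n, k1), rand(k1, k2), rand(m, k2)
    Rt = transpose(R)
    for _ in range(iters):
        # F <- F * (R G S^T) / (F S G^T G S^T)
        GSt = matmul(G, transpose(S))
        F = scale(F, matmul(R, GSt), matmul(F, matmul(transpose(GSt), GSt)), eps)
        # G <- G * (R^T F S) / (G S^T F^T F S)
        FS = matmul(F, S)
        G = scale(G, matmul(Rt, FS), matmul(G, matmul(transpose(FS), FS)), eps)
        # S <- S * (F^T R G) / (F^T F S G^T G)
        Ft, Gt = transpose(F), transpose(G)
        num = matmul(matmul(Ft, R), G)
        den = matmul(matmul(matmul(Ft, F), S), matmul(Gt, G))
        S = scale(S, num, den, eps)
    return F, S, G

if __name__ == "__main__":
    # Toy relational matrix: rows 0-1 use columns 0-1, rows 2-3 use columns 2-3.
    R = [[3.0, 2.0, 0.0, 0.0],
         [2.0, 3.0, 0.0, 0.0],
         [0.0, 0.0, 4.0, 1.0],
         [0.0, 0.0, 1.0, 4.0]]
    F, S, G = tri_factorize(R, k1=2, k2=2)
    print("row clusters:", [max(range(2), key=lambda c: row[c]) for row in F])
    print("column clusters:", [max(range(2), key=lambda c: row[c]) for row in G])
```

Reading the cluster of each row object from F and of each column object from G is what makes this a co-clustering: both data types are grouped in the same pass.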
With the rapid growth of the World Wide Web and electronic information services, large volumes of data, such as text corpora and bioinformatics data, are becoming available online at an incredible rate. By displaying the data in a logical layout (e.g., color graphs), data visualization offers a direct way to observe the data and to understand the relationships among them. In this work, we propose a novel technique, Exemplar-based Visualization (EV), to visualize extremely large data sets. Capitalizing on recent advances in matrix approximation and decomposition, EV presents a probabilistic multidimensional projection model in the low-rank data subspace with a sound objective function. The probability with which each data sample belongs to each class is obtained through iterative optimization and embedded in a low-dimensional space using parameter embedding. By selecting representative exemplars, we obtain a compact approximation of the data, which makes the visualization highly efficient and flexible. In addition, the selected exemplars neatly summarize the entire data set and greatly reduce the cognitive overload in the visualization, leading to an easier interpretation of large data sets. Empirically, we demonstrate the superior performance of EV through extensive experiments on publicly available data sets. Further details of this work are available here.
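To make the last two steps concrete, here is a deliberately simplified sketch of laying out samples from their class probabilities and picking exemplars. It is a stand-in, not EV itself: instead of parameter embedding it places each sample at the probability-weighted average of class anchors on a circle, and instead of EV's matrix-approximation-based selection it picks, per class, the sample with the highest membership probability. The probability matrix `P` and all function names are hypothetical.

```python
import math

def class_anchors(k, radius=1.0):
    """Place k class anchors evenly on a circle in 2D."""
    return [(radius * math.cos(2 * math.pi * c / k),
             radius * math.sin(2 * math.pi * c / k)) for c in range(k)]

def embed(P, anchors):
    """Map each sample to the probability-weighted mean of the class anchors.

    P is a list of rows; row i holds the class-membership probabilities of
    sample i and is assumed to sum to 1.
    """
    return [(sum(p * ax for p, (ax, _) in zip(row, anchors)),
             sum(p * ay for p, (_, ay) in zip(row, anchors))) for row in P]

def select_exemplars(P):
    """For each class, return the index of its most probable sample."""
    k = len(P[0])
    return [max(range(len(P)), key=lambda i: P[i][c]) for c in range(k)]

if __name__ == "__main__":
    # Hypothetical membership probabilities for four samples and two classes.
    P = [[0.90, 0.10],
         [0.60, 0.40],
         [0.20, 0.80],
         [0.05, 0.95]]
    points = embed(P, class_anchors(2))
    for idx in select_exemplars(P):
        print("exemplar", idx, "at", points[idx])
```

Plotting only the exemplar points, rather than every sample, is what keeps the display small for large data sets in this simplified picture.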