Yanhua Chen's Personal Homepage

EV DATA SET

10Pubmed Data Set Used in Exemplar-based Visualization (EV) Software

Summary

The 10Pubmed data set is a collection of approximately 15,500 medical documents, partitioned across 10 different diseases. It consists of abstracts published in the MEDLINE database from 2000 to 2008. The abstracts were retrieved by querying MEDLINE with the "MajorTopic" tag together with the disease-related MeSH terms. Common words and stop words were removed from all retrieved abstracts, and the remaining words were stemmed using Porter's suffix-stripping algorithm. Finally, a document-word matrix of size 15565 x 22437 and the corresponding word list of 22437 entries were built.
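
The pre-processing steps described above can be sketched in Python as follows. This is only a minimal illustration, not the code used to build 10Pubmed: the stop-word list, the whitespace tokenizer, and the function name build_doc_word_matrix are assumptions made for the example, and the stemmer is left as a pluggable function (identity by default) so that a Porter implementation such as nltk.stem.PorterStemmer().stem can be passed in.

```python
from collections import Counter

# Hypothetical stop-word list; the list actually used for 10Pubmed is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "to", "with"}


def build_doc_word_matrix(docs, stem=lambda w: w, stop_words=STOP_WORDS):
    """Return (matrix, word_list), where matrix[i][j] counts word_list[j] in docs[i].

    `stem` defaults to the identity function. Pass a Porter stemmer
    (for example nltk.stem.PorterStemmer().stem) to mirror the stemming step.
    """
    counts = []
    for text in docs:
        # Crude tokenizer (an assumption): lowercase, split on whitespace,
        # keep purely alphabetic tokens only.
        tokens = (t for t in text.lower().split() if t.isalpha())
        counts.append(Counter(stem(t) for t in tokens if t not in stop_words))

    # Vocabulary: every distinct term seen in any document, in sorted order.
    word_list = sorted(set().union(*counts))
    matrix = [[c.get(w, 0) for w in word_list] for c in counts]
    return matrix, word_list


docs = ["Gout is a form of arthritis", "Migraine and gout in adults"]
matrix, word_list = build_doc_word_matrix(docs)
print(word_list)  # ['adults', 'arthritis', 'form', 'gout', 'migraine']
print(matrix)     # [[0, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
```

Each row of the matrix is one document and each column is one entry of the word list, which is the same layout as the 15565 x 22437 matrix described above.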

Organization

The data is organized into 10 files, each corresponding to a different disease. The 10 diseases in 10Pubmed are:
Gout,
Chickenpox,
Raynaud Disease,
Jaundice,
Hepatitis A,
Hay Fever,
Kidney Calculi,
Age-related Macular Degeneration,
Migraine,
Otitis.

Data

The original data downloaded from MEDLINE is available here in the 10Pubmed.zip bundle.
You will need unzip to open it. Each subdirectory in the bundle holds the documents for one disease, and the documents within each subdirectory are indexed by number. The total number of documents is 15569. The Porter algorithm skips 4 of them during pre-processing, so the final total is 15565, and the Matlab version (below) represents 15565 documents. The number of documents for each disease is listed in the following table.

Diseases	Number of Documents
Gout	543
Chickenpox	732
Raynaud Disease	343
Jaundice	503
Hepatitis A	796
Hay Fever	1517
Kidney Calculi	1549
Age-related Macular Degeneration	3283
Migraine	3703
Otitis	2596
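
The per-disease counts in the table can be checked against the totals quoted above. The short Python sketch below simply copies the counts from the table (the variable names are illustrative) and sums them. The result is 15565, which matches the number of documents after pre-processing.

```python
# Document counts copied from the table above.
doc_counts = {
    "Gout": 543,
    "Chickenpox": 732,
    "Raynaud Disease": 343,
    "Jaundice": 503,
    "Hepatitis A": 796,
    "Hay Fever": 1517,
    "Kidney Calculi": 1549,
    "Age-related Macular Degeneration": 3283,
    "Migraine": 3703,
    "Otitis": 2596,
}

total = sum(doc_counts.values())
largest = max(doc_counts, key=doc_counts.get)

print(total)    # 15565
print(largest)  # Migraine
```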

Matlab Download

Below is a processed version of the 10Pubmed data set that is easy to read into Matlab, including:

docWordMat.mat
label.mat
wordList.mat
map.txt

docWordMat.mat contains the document-word matrix.
label.mat is simply a list of label IDs (i.e., 1-10).
wordList.mat contains the vocabulary for the indexed data. The line number corresponds to the index number of the word: the word on the first line is word #1, the word on the second line is word #2, and so on.
map.txt maps label IDs to label names.
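
The same files can probably also be read outside Matlab. The Python sketch below is one possible approach, assuming SciPy is installed and the files were saved in a MAT-file format that scipy.io.loadmat can read (it does not read v7.3 files). The variable names stored inside each .mat file are not documented here, so the sketch takes the first non-metadata variable from each file. The helper names first_variable and load_10pubmed are hypothetical.

```python
import os


def first_variable(mat_dict):
    """Return the first entry of a loadmat result that is not metadata.

    scipy.io.loadmat returns a dict that also holds '__header__',
    '__version__' and '__globals__'; those keys are skipped here.
    """
    for name, value in mat_dict.items():
        if not name.startswith("__"):
            return value
    raise KeyError("no data variable found in MAT-file")


def load_10pubmed(folder=".", loadmat=None):
    """Load the three .mat files from `folder` into a plain dict.

    `loadmat` can be overridden for testing; by default scipy.io.loadmat is used.
    """
    if loadmat is None:
        from scipy.io import loadmat  # requires SciPy

    files = [
        ("doc_word", "docWordMat.mat"),
        ("labels", "label.mat"),
        ("words", "wordList.mat"),
    ]
    return {
        key: first_variable(loadmat(os.path.join(folder, filename)))
        for key, filename in files
    }
```

Calling load_10pubmed("path/to/files") should then give the matrix, the labels, and the vocabulary under the keys "doc_word", "labels", and "words". If the files match the description above, the matrix would have 15565 rows and 22437 columns. map.txt is plain text and can be opened directly; its exact layout is not specified here, so it is not parsed in this sketch.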
