10Pubmed Data Set Used in Exemplar-based Visualization (EV) Software
Summary
The 10Pubmed data set is a collection of approximately 15,500 medical
documents, partitioned across 10 different diseases. It consists of
abstracts published in the MEDLINE database from 2000 to 2008. The
abstracts were retrieved by querying MEDLINE with the "MajorTopic" tag
together with the disease-related MeSH terms. Common words and stop
words were removed from the retrieved abstracts, and the remaining
words were stemmed using Porter's suffix-stripping algorithm. Finally,
a document-word matrix of size 15565 x 22437 and the corresponding
list of 22437 words were built.
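The preprocessing steps above can be illustrated with a minimal Python sketch. It is not the code used to build 10Pubmed: the stop list, the tokenizer, and the function names are assumptions made for illustration, and the stemming step is only a placeholder argument (identity by default) where a Porter stemmer would be plugged in.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the list actually used for 10Pubmed is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "to", "with", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_doc_word_matrix(docs, stem=lambda w: w, stop_words=STOP_WORDS):
    """Return (matrix, vocabulary) with one row per document.

    `stem` is a placeholder (identity by default): pass a Porter
    stemmer's stem function to approximate the described pipeline.
    """
    counts = []
    for text in docs:
        terms = [stem(t) for t in tokenize(text) if t not in stop_words]
        counts.append(Counter(terms))

    # Column j of the matrix corresponds to vocabulary[j].
    vocabulary = sorted(set().union(*counts))
    column = {word: j for j, word in enumerate(vocabulary)}

    matrix = [[0] * len(vocabulary) for _ in counts]
    for i, doc_counts in enumerate(counts):
        for word, n in doc_counts.items():
            matrix[i][column[word]] = n
    return matrix, vocabulary
```

If NLTK is installed, `build_doc_word_matrix(docs, stem=PorterStemmer().stem)` with `from nltk.stem import PorterStemmer` would add Porter stemming. A dense list of lists is only practical for small inputs; a matrix of size 15565 x 22437 would more sensibly be stored in a sparse format.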
The data is organized into 10 different files, each
corresponding to a different disease. The 10 diseases of 10Pubmed are:
Gout,
Chickenpox,
Raynaud Disease,
Jaundice,
Hepatitis A,
Hay Fever,
Kidney Calculi,
Age-related Macular Degeneration,
Migraine,
Otitis.
The original data downloaded from MEDLINE are available in the
10Pubmed.zip bundle. You will need unzip to open it. Each
subdirectory in the bundle holds the documents for one disease, and
each document within a subdirectory is indexed by number. The bundle
contains 15569 documents in total. The Porter stemming step skips 4
of them, so 15565 documents remain after pre-processing, and the
MATLAB version (below) represents these 15565 documents. The details
of the documents for each disease are listed in the following table.
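The number of documents per disease can also be recomputed from the bundle itself. The sketch below counts archive members per subdirectory; it assumes one file per document and one subdirectory per disease, and the function names are made up for illustration. The actual directory names inside 10Pubmed.zip are not stated here.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_documents(names):
    """Count files per parent directory from a list of archive member names."""
    counts = Counter()
    for name in names:
        if name.endswith("/"):  # directory entry, not a document
            continue
        path = PurePosixPath(name)
        if len(path.parts) >= 2:  # skip files sitting at the archive root
            counts[path.parent.name] += 1
    return counts


def count_documents_in_zip(zip_path):
    """Apply count_documents to every member of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return count_documents(archive.namelist())
```

If the layout matches these assumptions, `sum(count_documents_in_zip("10Pubmed.zip").values())` should equal the 15569 raw documents mentioned above.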
docWordMat.mat holds the document-word matrix.
label.mat is simply a list of label IDs (i.e., 1-10).
wordList.mat contains the vocabulary for the indexed data. The line number
corresponds to the index number of the word; that is, the word on the first
line is word #1, the word on the second line is word #2, etc.
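For readers working outside MATLAB, the three files could be read in Python roughly as follows. This is a sketch under assumptions: it relies on the third-party SciPy package, it assumes each file stores exactly one variable (the variable names inside the files are not documented here), and it assumes the files are in a format that `scipy.io.loadmat` can read (it does not read MATLAB v7.3 files). The helper names are hypothetical.

```python
import os


def user_variables(mat_dict):
    """Drop the metadata keys (e.g. '__header__') that scipy.io.loadmat adds."""
    return {k: v for k, v in mat_dict.items() if not k.startswith("__")}


def load_10pubmed(directory):
    """Load the three .mat files and return (doc_word, labels, words)."""
    from scipy.io import loadmat  # third-party; imported lazily

    loaded = []
    for name in ("docWordMat.mat", "label.mat", "wordList.mat"):
        variables = user_variables(loadmat(os.path.join(directory, name)))
        # Assumption: each file stores exactly one variable of interest.
        if len(variables) != 1:
            raise ValueError(f"{name}: expected 1 variable, found {sorted(variables)}")
        loaded.append(next(iter(variables.values())))

    doc_word, labels, words = loaded
    return doc_word, labels, words
```

If the matrix is stored with documents as rows, `doc_word.shape` should be `(15565, 22437)`. Note that MATLAB indices are 1-based while Python indices are 0-based, so word #1 would sit at position 0 once the word list is flattened.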