10Pubmed Data Set Used in Exemplar-based Visualization (EV) Software
- Summary
The 10Pubmed data set is a collection of approximately 15,500 medical
documents, partitioned across 10 different diseases. It consists of
abstracts published in the
MEDLINE database
from 2000 to 2008. The abstracts were retrieved by querying MEDLINE with the
"MajorTopic" tag together with the disease-related MeSH terms. Common and
stop words were removed from all retrieved abstracts, and the remaining
words were stemmed using
Porter's suffix-stripping algorithm. Finally, a
document-word matrix of size 15565 x 22437 and the corresponding
list of 22437 words were built.
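
The pre-processing steps above (stop-word removal, stemming, building the document-word matrix) can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for 10Pubmed: the stop-word list, the tokenization rule, the use of raw term counts, and the names `STOP_WORDS` and `build_doc_word_matrix` are all assumptions. The `stem` argument is a placeholder where an implementation of Porter's algorithm could be plugged in.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list used for 10Pubmed is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "with", "to"}


def build_doc_word_matrix(docs, stem=lambda word: word):
    """Return (matrix, vocabulary) for a list of raw text documents.

    `stem` stands in for a stemmer such as Porter's algorithm; the
    default leaves every word unchanged.
    """
    tokenized = []
    for text in docs:
        # Assumed tokenization: lowercase runs of ASCII letters.
        words = re.findall(r"[a-z]+", text.lower())
        tokenized.append([stem(w) for w in words if w not in STOP_WORDS])

    # Sorted vocabulary: list position i holds word number i + 1.
    vocabulary = sorted({w for doc in tokenized for w in doc})
    column = {w: j for j, w in enumerate(vocabulary)}

    # One row per document, one column per vocabulary word (raw counts).
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[column[word]] = count
        matrix.append(row)
    return matrix, vocabulary
```

Applied to the full collection, a procedure of this kind would yield one row per document and one column per distinct stemmed word, which matches the 15565 x 22437 shape described above.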
- Organization
The data is organized into 10 different files, each
corresponding to a different disease. The 10 diseases are:
Gout,
Chickenpox,
Raynaud Disease,
Jaundice,
Hepatitis A,
Hay Fever,
Kidney Calculi,
Age-related Macular Degeneration,
Migraine,
Otitis.
- Data
The original data downloaded from MEDLINE is available here as
10Pubmed.zip bundles.
You will need unzip to open them. Each
subdirectory in a bundle holds the documents for one disease, and
each document within a subdirectory is indexed by number. The
total number of documents is 15569. The Porter algorithm skips 4 of
them during pre-processing, so the final total is 15565 documents,
and the Matlab version (below) represents
15565 documents. The number of documents for each disease is listed in the
following table.
Disease | Number of Documents
Gout | 543
Chickenpox | 732
Raynaud Disease | 343
Jaundice | 503
Hepatitis A | 796
Hay Fever | 1517
Kidney Calculi | 1549
Age-related Macular Degeneration | 3283
Migraine | 3703
Otitis | 2596
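
As a quick consistency check, the per-disease counts in the table can be summed and compared with the 15565 documents reported after pre-processing. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# Per-disease document counts, transcribed from the table above.
counts = {
    "Gout": 543,
    "Chickenpox": 732,
    "Raynaud Disease": 343,
    "Jaundice": 503,
    "Hepatitis A": 796,
    "Hay Fever": 1517,
    "Kidney Calculi": 1549,
    "Age-related Macular Degeneration": 3283,
    "Migraine": 3703,
    "Otitis": 2596,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)

print(total)    # 15565
print(largest)  # Migraine
```

The sum matches the post-processing total, which suggests the table lists the counts after the 4 skipped documents were dropped.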
- Matlab Download
Below is a processed version of the 10Pubmed data set that is easy
to read into Matlab, including:
docWordMat.mat
label.mat
wordList.mat
map.txt
- docWordMat.mat contains the document-word matrix.
- label.mat is simply a list of label ids (i.e., 1-10).
- wordList.mat contains the vocabulary for the indexed data. The line number
corresponds to the index number of the word; that is, the word on the first line
is word #1, the word on the second line is word #2, etc.
- map.txt maps label ids to label names.
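
For readers working outside Matlab, the files described above could be handled along the following lines in Python. This is a sketch under stated assumptions: the layout of map.txt (an integer id, whitespace, then the label name on each line) is a guess, the helper names are hypothetical, and the variable names stored inside the .mat files are not documented here, so `load_mat_variables` just returns whatever variables it finds. It relies on SciPy's `scipy.io.loadmat`, which does not read MAT files saved in the v7.3 format.

```python
def parse_label_map(text):
    """Parse map.txt-style text into {label_id: label_name}.

    Assumes each non-empty line is '<id> <name>'; the real layout of
    map.txt may differ.
    """
    label_names = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label_id, name = line.split(maxsplit=1)
        label_names[int(label_id)] = name
    return label_names


def word_for_index(word_list, index):
    """Return word #index, numbering from 1 as described for wordList.mat."""
    if not 1 <= index <= len(word_list):
        raise IndexError(f"word index {index} out of range")
    return word_list[index - 1]


def load_mat_variables(path):
    """Load a .mat file and return its variables without loadmat metadata.

    Requires SciPy; the import is local so the helpers above work without it.
    """
    from scipy.io import loadmat

    data = loadmat(path)
    return {key: value for key, value in data.items() if not key.startswith("__")}
```

The explicit `index - 1` in `word_for_index` is there because the word numbering starts at 1, while Python lists start at 0.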