Journals publish a lot, and I had to decide what to include and what to leave out. The aim was to include all and only research articles, but this was harder than it looks.
The metadata that JSTOR provides includes a tag for article kind. I only included articles with the tag “research-article”, which does a reasonable job of getting rid of book reviews. But it turns out that the tag also covers a lot of things that are not really research articles. It functions in the JSTOR metadata as something of a generic article kind, one that applies if nothing else seems right. So I had to manually remove a number of articles.
I deleted all articles without a listed author. These were often editorials, corrections and the like.
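These first two filters can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names `kind` and `authors` are assumptions made for the example, not the actual shape of the JSTOR metadata, and the sample records are invented.

```python
# Invented sample records; "kind" and "authors" are assumed field names,
# not the real JSTOR metadata schema.
articles = [
    {"title": "A Sample Paper", "kind": "research-article", "authors": ["A. Author"]},
    {"title": "A Sample Correction", "kind": "research-article", "authors": []},
    {"title": "A Sample Review", "kind": "book-review", "authors": ["B. Reviewer"]},
]


def is_candidate(article):
    """Keep records tagged as research articles that list at least one author."""
    has_author = bool(article.get("authors"))
    return article.get("kind") == "research-article" and has_author


candidates = [a for a in articles if is_candidate(a)]
```

Only the first sample record survives: the second has no listed author, and the third carries a different tag.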
After that, I started working through various words in titles that indicated something was not actually a research article. So I deleted all articles whose title was exactly one of the following:
- Descriptive Notices
- Letter to Editor
The first four are clear enough. The last was mostly a problem for special issues, but there were enough special issues of one kind or another to make it worthwhile. Then I deleted any articles that had the following phrases anywhere in the title:
- Introductory Note
- Abstract of C
- Abstracts of C
- To the Editor
The last is the only one that really needs comment. All the articles I found with this in the title were reports on one or another philosophy congress, not genuine research articles. Maybe there was a political philosophy article that referenced the United States Congress in its title and should not have been excluded, but I didn’t see it.
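The two kinds of title filter differ in how they match: the first list is compared against the whole title, the second against any part of it. A minimal sketch, using only the entries shown above and assuming case-sensitive matching (the function and variable names are hypothetical):

```python
# Entries copied from the lists above; both lists may be incomplete here.
EXACT_TITLES = {"Descriptive Notices", "Letter to Editor"}
TITLE_PHRASES = ("Introductory Note", "Abstract of C", "Abstracts of C", "To the Editor")


def keep_title(title):
    """Return False if the title matches either exclusion rule.

    Assumes case-sensitive matching; the real comparison may have differed.
    """
    if title.strip() in EXACT_TITLES:
        return False  # whole-title match
    # Substring match: the phrase may appear anywhere in the title.
    return not any(phrase in title for phrase in TITLE_PHRASES)
```

Under these assumptions a title such as "Descriptive Notices" is dropped, while a longer title that merely contains those words is kept, because that entry is only on the exact-match list.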
Since text mining only works within a single language, I excluded all the articles whose listed language in the metadata was anything other than English. And I manually excluded, when I saw them, articles whose title was not in English and which appeared to be written in another language.
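The language step combines an automatic check with a hand-maintained exclusion list. The sketch below assumes a `language` field, a small set of possible encodings for English, and an `id` field for the manual list; all of these are illustrative guesses rather than the real metadata layout.

```python
# Assumed ways the metadata might encode English; not confirmed by the source.
ENGLISH_CODES = {"eng", "en", "english"}

# Hypothetical ids of articles excluded by hand after reading their titles.
manually_excluded_ids = {"a3"}

# Invented sample records.
records = [
    {"id": "a1", "title": "A Sample Paper", "language": "eng"},
    {"id": "a2", "title": "Ein Beispiel", "language": "ger"},
    {"id": "a3", "title": "Un exemple", "language": "eng"},  # mislabeled in metadata
]


def is_english(record):
    """True if the listed language is one of the assumed English codes."""
    return record.get("language", "").strip().lower() in ENGLISH_CODES


kept = [r for r in records if is_english(r) and r["id"] not in manually_excluded_ids]
```

The manual list exists because the metadata language tag is not always reliable, as with the third sample record.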
That left me with 32261 articles to work with.