1.3 Selecting the Words
The JSTOR data excludes a few stop words (like the and and), and words with one or two characters. It also takes nonletters to be word breaks. So doesn’t is split into doesn and t, and the second is rejected as too short. Hyphenated words are split as well. It turned out that this made est into a reasonably common word. But I didn’t want to include all the remaining words, for various reasons.
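The splitting rule just described can be illustrated with a minimal Python sketch. This is not JSTOR's code: the function name `tokenize`, the constant `MIN_LENGTH`, and the restriction to ASCII letters are all assumptions made for illustration, and the sketch leaves out JSTOR's own short stop word list.

```python
import re

MIN_LENGTH = 3  # assumption: words with one or two characters are dropped


def tokenize(text):
    """Split text on runs of non-letters and drop very short pieces."""
    # Anything that is not an ASCII letter counts as a word break here.
    # How JSTOR treats accented letters is not stated, so this is a guess.
    pieces = re.split(r"[^a-z]+", text.lower())
    return [p for p in pieces if len(p) >= MIN_LENGTH]


print(tokenize("It doesn't follow that well-known claims are true."))
# ['doesn', 'follow', 'that', 'well', 'known', 'claims', 'are', 'true']
```

Note how doesn survives while t is dropped, and how well-known becomes two separate words.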
It seems common in text mining to exclude a more expansive list of stop words than JSTOR leaves out. I was playing around with making my own list of stop words, but I decided it would be more objective to use the commonly used list from the tm package. They use the following list of stop words:
- i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, i’m, you’re, he’s, she’s, it’s, we’re, they’re, i’ve, you’ve, we’ve, they’ve, i’d, you’d, he’d, she’d, we’d, they’d, i’ll, you’ll, he’ll, she’ll, we’ll, they’ll, isn’t, aren’t, wasn’t, weren’t, hasn’t, haven’t, hadn’t, doesn’t, don’t, didn’t, won’t, wouldn’t, shan’t, shouldn’t, can’t, cannot, couldn’t, mustn’t, let’s, that’s, who’s, what’s, here’s, there’s, when’s, where’s, why’s, how’s, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very
I excluded all of these words from the analysis. The intuition here is that including them would mean that the analysis is more sensitive to stylistic tics than to content, and in practice that seemed to be right. The models did look more reflective of substance than style with the stop words excluded. In principle I’m not sure it was right to exclude all those quantifiers from the end of the list, but it doesn’t seem to have hurt the analysis. I’ll come back to this point at the end of the chapter, but it is possible I should have been more aggressive in filtering out stop words.
The stop words list from tm includes a lot of contractions. I wrote a small script to extract the parts of those contractions before the apostrophe, and excluded them too. The parts after the apostrophe were always one or two letters, so they were already excluded.
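A script of that kind might look something like the following Python sketch. It is an illustration under stated assumptions, not the script actually used: the function name `contraction_stems` is hypothetical, and the `contractions` list is only a small sample of the tm list.

```python
# A small sample of the contractions in the tm stop word list.
contractions = ["i'm", "you're", "doesn't", "won't", "let's", "they'll"]


def contraction_stems(words):
    """Return the part of each contraction that precedes the apostrophe."""
    stems = set()
    for word in words:
        # Accept both the straight and the typographic apostrophe.
        normalized = word.replace("\u2019", "'")
        if "'" in normalized:
            stems.add(normalized.split("'", 1)[0])
    return stems


print(sorted(contraction_stems(contractions)))
# ['doesn', 'i', 'let', 'they', 'won', 'you']
```

Stems such as doesn and won are not real words, which is why they need to be added to the exclusion list by hand rather than caught by an ordinary stop word list.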
I’ve also looked through the list of the five thousand most common words in the data set to see what shouldn’t be there, and the rest of this section comes from what was cut on the basis of that.
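Producing a list like that for inspection is straightforward. The following is a minimal Python sketch, assuming each article has already been reduced to a list of words; the function name `most_common_words` and its parameter `n` are hypothetical.

```python
from collections import Counter


def most_common_words(articles, n=5000):
    """Return the n most frequent words across all articles.

    `articles` is an iterable of word lists, one list per article.
    """
    totals = Counter()
    for tokens in articles:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


sample = [["mind", "body", "mind"], ["mind", "usepackage", "body"]]
print(most_common_words(sample, n=2))
```

The resulting list still has to be read by eye; the sketch only does the counting.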
In some cases, JSTOR’s source for the text was the LaTeX code for the article, so there was a lot of LaTeX junk in the text file. I’m sure I didn’t clean out all of this, but to clean out a lot of it, I deleted the following words.
- aastex, amsbsy, amsfonts, amsmath, amssymb, amsxtra, begin, cal, cyr, declaremathsize, declaretextfont, document, document class, empty, encodingdefault, end, fontenc, landscape, mayhem, mathrsfs, math strut, newcommand, normalfont, pagestyle, pifont, portland, renewcommand, rmdefault, selectfont, sfdefault, stmaryrd, textcomp, textcyr, usepackage, wmcyr, wncyss, xspace, documentclass, declaretextfontcommand, wncyr, declaremathsizes, mathrm, vert, mathstrut, hat, mathbf, thinspace, ldots, neg, bbb, ebc, cdot, boldsymbol, vec, langle, rangle, leq, infty, mathsf, vdash, boldmath, boldsymbol, cwmi, forall, mathrel, mbox, prfm, neq, anid
I’m a bit worried that excluding document meant I lost some signal about historical articles in the LaTeX noise. But this was unavoidable.
Also note that anid is not a LaTeX term, but it was worthwhile to exclude it here. Something about how the text recognition software JSTOR uses interacted with nineteenth- and early twentieth-century articles meant that several words, especially ‘and’, got coded as ‘anid’. But this was the OCR version of a typo, and best deleted. (There were a few more of these that were not in the five thousand most common words that on reflection I wish I’d cut too. But I don’t think they make a huge difference to the analysis given how rare they are.)
Somewhat reluctantly, I deleted a bunch of spelled-out names of Greek letters for the same reason; they were mostly from LaTeX code. This meant deleting the following words:
- alpha, beta, gamma, delta, omega, theta, lambda, rho, psi, phi, sigma
I’m sure this lost some signal. But there was so much LaTeX noise that it was unavoidable.
Next I deleted a few honorifics; in particular:
- prof, mrs, professor
These just seemed to mark the article as being old, not anything about the content of the article. I didn’t need to exclude mr or dr since they were already excluded as too short.
Although I was trying to exclude foreign-language articles, I also excluded a bunch of foreign words. One reason was that it was a check on whether I missed any foreign-language articles. Another was that if I didn’t do this, then articles that had extensive quotation from foreign languages would be seen by the model as being in their own distinctive topic merely in virtue of having non-English quotations. And that seemed wrong. So to fix it, I excluded these words:
- auch, aussi, autre, cette, diese, haben, leur, soit, toute, peut, noch, habe, wenn, einem, doch, durch, kann, comme, aber, mais, nur, wird, wie, sont, ich, dieser, oder, avec, une, werden, bien, sie, auf, einer, dans, dass, esta, nicht, entre, uns, ont, que, wir, nach, einen, como, esprit, seine, elles, fait, elle, eine, lui, selbst, aus, deux, vom, pensee, schon, zum, nin, propre, les, pour, espace, las, una, amour, sind, etre, ueber, biran, das, bei, qui, temps, mich, alcan, sich, ein, zur, idee, welt, philosophique, mir, vie, homme, ces, maupertuis, leipzig, als, essai, del, sens, hier, monde, und, histoire, soi, por, des, den, bachelard, logique, sans, meyerson, filosofia, bourgeois, sein, philosophie, ist, meiner, zeit, raison, tarde, begriff, los, theorie, dem, der, pas, revue, uber, veblen, mas, weil, ser, philosophische, psychologie, milieu, geschichte, sur, dire, ses, une, les, que, est, etc
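One way a word list like this can double as a check for missed foreign-language articles is to flag any article in which a large share of the words come from the list. The Python sketch below illustrates that idea only; it is not a description of the check actually run, and the names `foreign_share` and `flag_possible_foreign`, the sample word set, and the 0.2 threshold are all hypothetical.

```python
# A small sample of the foreign words listed above.
FOREIGN_WORDS = {"auch", "aussi", "cette", "nicht", "und", "que", "les", "der"}


def foreign_share(tokens, foreign_words=FOREIGN_WORDS):
    """Fraction of an article's words that appear in the foreign word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in foreign_words)
    return hits / len(tokens)


def flag_possible_foreign(articles, threshold=0.2):
    """Return ids of articles whose foreign share meets the threshold.

    `articles` maps an article id to its list of words. The threshold is an
    arbitrary illustrative value, not one taken from the text.
    """
    return [
        article_id
        for article_id, tokens in articles.items()
        if foreign_share(tokens) >= threshold
    ]
```

Articles flagged this way would still need to be looked at by hand, since an English article with long quotations could also score highly.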
Finally, I excluded a bunch of words that seemed to turn up primarily in bibliographies or in in-text citations. Including them seemed to just make the model more sensitive to the referencing style of the journal than to the content. Here the deletions really did cost some content, because some of the words were philosophically relevant. But I deleted them because they seemed to be turning up more often in bibliographies than in text:
- doi, proceedings, review, journal, press, compilation, editors, supplementary, quarterly, aristotelian, kegan, dordrecht, minnesota, reidel, edu, stanford, oxford, cambridge, basil, blackwell, thanks, cit, mit, eds, loc, york, university, nous, chicago, clarendon, edited
The surprising one there is compilation. But it most often appears because some journals have a footer saying “Journal compilation ©”.
Then, to speed up processing, I deleted from each article any word that appeared in it three times or fewer. This lost some content, but it sped up the processing a lot. Some of the steps I’ll describe below took several days of computing time. Without this restriction they would have taken several weeks. And I thought words that appear one to three times in an article shouldn’t be that significant for determining its content. Though as I’ll note below, this might have been too aggressive in retrospect.
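The frequency cutoff can be sketched in a few lines of Python. This is an illustration, assuming each article is available as a list of words; the function name `filter_rare_words` and the constant `MIN_COUNT` are hypothetical.

```python
from collections import Counter

MIN_COUNT = 4  # a word is kept only if it appears at least four times in the article


def filter_rare_words(tokens, min_count=MIN_COUNT):
    """Count the words in one article and drop those below the cutoff."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


article = ["mind"] * 5 + ["body"] * 3 + ["soul"]
print(filter_rare_words(article))
# {'mind': 5}
```

The cutoff is applied article by article, so a word that is common across the whole data set can still be dropped from any article where it appears only a few times.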