Now that I’ve written the whole methodology that I used, there are a few things I wish I’d done differently. This wish clearly isn’t strong enough to make me scrap the project and start again.8 But I hope that others will learn from what I’ve done, and to that end, I want to be up front about things I could have done differently.
First, I should have filtered even more words. There are four kinds of words I wish I’d been more aggressive about filtering out:
- There are some systematic OCR errors in the early articles. I caught anid, which appears over three thousand times. (It’s almost always meant to be and, I think.) But I missed aind, which appears about 1,500 times. And there are other less common words that are also OCR errors and should be filtered out.
- I caught a lot of LaTeX commands but somehow missed rightarrow, as well as a few rarer ones.
- If a word is hyphenated across a line break in the original journal, each half appears as a separate word in this data set. (At least it does if the data was generated by OCR.) I caught a few of the prefixes and suffixes that turn up for that reason, but missed ity, which ends up being a reasonably common word.
- And I caught a lot of words that almost always appear in bibliographies, headers, or footers but missed basil (which turns up on a table later) and noûs (though I caught nous).
In general, I could have been much more aggressive about filtering out words like these.
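A filtering pass of the kind described above can be sketched as follows. This is illustrative only: the word lists come from the examples in the text, the function names are my own, and the real pipeline would need much longer lists.

```python
# Sketch of a word-filtering pass for OCR'd journal articles.
# The specific word lists are illustrative, drawn from the examples above.

OCR_ERRORS = {"anid", "aind"}             # systematic misreadings of "and"
LATEX_FRAGMENTS = {"rightarrow"}          # stray LaTeX commands
HYPHEN_HALVES = {"ity"}                   # halves of words hyphenated across lines
FRONT_MATTER = {"basil", "noûs", "nous"}  # bibliography/header/footer words

DROP = OCR_ERRORS | LATEX_FRAGMENTS | HYPHEN_HALVES | FRONT_MATTER

def filter_tokens(tokens):
    """Remove known-bad tokens before topic modeling."""
    return [t for t in tokens if t.lower() not in DROP]
```

So, for instance, `filter_tokens(["truth", "anid", "beauty"])` keeps only the substantive words. The hard part, of course, is building the lists in the first place, which is exactly where the misses above came from.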
But second, I think it was a mistake to filter out all words that appear one to three times in an article. That rule makes perfect sense for long articles, and for some really long articles words that appear four or five times could be eliminated as well. But it’s too aggressive for short articles. I needed some kind of sliding scale, such as filtering out words that appear fewer than once per two thousand words of the article. It is important, I think, to filter out the words that appear just once, or else it’s easy to miss OCR errors and weird LaTeX code. But beyond that, the threshold should have scaled with article length.
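The sliding scale gestured at here might look something like the following. This is a sketch, and the exact thresholds are my reading of the rule in the text (always drop singletons; beyond that, require roughly one occurrence per two thousand words):

```python
import math

def min_count(doc_length):
    """Minimum occurrences a word needs in order to be kept.

    Always drop words appearing just once; beyond that, require
    roughly one occurrence per two thousand words of the article.
    """
    return max(2, math.ceil(doc_length / 2000))

def filter_rare(counts, doc_length):
    """Drop words whose count falls below the sliding threshold."""
    threshold = min_count(doc_length)
    return {w: c for w, c in counts.items() if c >= threshold}
```

On this rule a 4,000-word article only loses its singletons, while a 10,000-word article loses words appearing up to four times, which matches the intuition that the fixed one-to-three cutoff was right for long articles but too aggressive for short ones.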
The next three things are much more systematic, though also less clearly errors.
The third problem was that my model selection was too stepwise and not holistic enough. I found the best sixty-topic model I could. Then I increased the number of topics (eventually to ninety), holding fixed the seed number from the search through sixty-topic models, until the topics looked as good as they could get. Then I ran refinements on the result until further refinements looked like they were damaging the model. Then I split some of the topics up for categorization. What I didn’t do at any point was look back and ask, for example, how the other sixty-topic models would have looked if I had applied the same adjustments to them.
There was a reason for that. Each of those adjustments cost quite a lot of my time, and even more computer time. Doing the best I could at each step and then locking in the result made the process at least somewhat manageable. But I should have been (a) a bit more willing to revisit earlier decisions and (b) more forward looking when making each of those intermediate decisions. I was a bit forward looking at one point: one of my criteria for choosing between sixty-topic models was a preference for unwanted conflations over unwanted splits, because I knew I could fix conflations in various ways. But I should have been both more forward looking and more willing to take a step or two backward. And maybe I could have stuck much closer to sixty topics if I had.
The fourth problem was that I didn’t realize how bad a topic arguments would turn out to be. For the purposes of the kind of study I’m doing, it’s really important that the topics be topics in the ordinary sense, and not tools or methods. Now this is hard in philosophy, because philosophy is so methodologically self-conscious that there are articles that really are about all the tools and methods one might care about. But I wish I’d avoided having a topic about a tool. (I’ll come back in section 8.10 to a formal method for detecting these kinds of topics early in the process.)
The fifth problem, if it is a problem, is that I wasn’t more aggressive about expanding the list of stop words. This model has a topic on ordinary language philosophy. Actually, all the models I built had a topic like this (at least once they had fifteen or so topics). But the keywords characteristic of this topic are words that really could have been included on a stop words list. They are words like ask and try. And one side effect of this is that the model keeps estimating that a huge proportion of the articles in the data set are, at least in part, ordinary language philosophy articles.
Another way to put this is that the boundary between a stop word and a substantive word (in this context) is pretty vague. And given that ordinary language philosophy was a thing that happened and that affected how everyone (at least in the United Kingdom) was writing for a while, there is a good case for taking a very expansive understanding of what the stop words were.
The choice I made was to not lean on the scales at all, and just use the most common off-the-shelf list of stop words. And there was a good reason for that: I wanted the model to not simply replicate my prejudices. But I half-think I made the wrong call here, and that the model would be more useful if I had filtered out more “ordinary language”.
I did in fact scrap several versions of this when writing up the model revealed mistakes in the model building. This is the model that resulted from acting on the lessons of those mistakes.↩︎