So now I had a model, with sixty topics, that looked good but not quite right. And, by design, there was a natural way to fix the problems: just add topics. It turns out that if the seed number is kept the same and the model is given more topics to play with, it makes very few changes. Or, to be a bit more precise, it makes very few changes apart from permuting the numbers. So, if two models are built with the same seed, and the second has one more topic than the first, for the vast majority of topics in the first model, there will typically be a “matching” topic in the second model. And by “matching” here I mean that the correlation between the probabilities the models give to articles being in those topics is very, very high (above 0.99 or so). The matching topics won’t always have the same number, so it isn’t always easy to find them. But by simply looking at the correlations between any pair of topics (one from each model), they usually jumped out.
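The matching step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the two matrices are simulated stand-ins for the article-by-topic probabilities of a smaller and a larger model (with topicmodels they would presumably come from something like `posterior(model)$topics`), and the names `gamma_small`, `gamma_large`, and `permutation` are hypothetical.

```r
# Simulated stand-ins for the document-topic probabilities of two models
# (rows = articles, columns = topics). These are NOT real model output.
set.seed(1)
n_articles <- 200
gamma_small <- matrix(runif(n_articles * 4), ncol = 4)
gamma_small <- gamma_small / rowSums(gamma_small)

# The larger model: the same four topics under permuted numbers, with a
# little noise, plus one extra topic. (Rows are not renormalized here,
# which does not matter for the correlations.)
permutation <- c(3, 1, 4, 2)
gamma_large <- cbind(
  gamma_small[, permutation] + rnorm(n_articles * 4, sd = 0.001),
  runif(n_articles)
)

# Correlate every topic in the small model with every topic in the large one.
topic_cor <- cor(gamma_small, gamma_large)

# For each topic in the small model, pick its best match in the large model.
best_match <- apply(topic_cor, 1, which.max)
best_cor <- apply(topic_cor, 1, max)

matches <- data.frame(
  small_topic = seq_len(ncol(gamma_small)),
  large_topic = best_match,
  correlation = round(best_cor, 4),
  matched = best_cor > 0.99
)
print(matches)
```

Any topic in the larger model that is nobody's best match is a candidate "new" topic, which is the one worth inspecting by hand.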
That meant it was possible, every time a few topics were added, to simply look at the new topics and ask if they were improvements or not. In an earlier attempt at this project (one that was fatally undermined by not filtering out enough LaTeX and bibliographic words), this had led to a clear optimum arising around seventy topics. And that’s what I expected this time. But it didn’t happen.
Instead, what happened was that as I kept adding topics, it (a) kept finding relatively sensible new topics to add and (b) was not splitting up the topics I really hoped it would split. This was something of a disappointment: the project would have been more manageable for me if the model had found an optimum number of topics in the low seventies or lower. But it simply didn’t; by the standards I’d set before looking at the models, they just kept getting better as the number of topics got higher.
Eventually I settled on ninety topics. That was a bit more than I wanted, and I could have gone even higher. But it was starting to get a little more fine-grained than I wanted; I already had three distinct topics in philosophy of biology, for example. Still, the model runs in which I asked for ninety-six topics and then for one hundred topics weren’t clearly worse than the run with ninety topics (by the standards I’d set myself). So stopping here was somewhat arbitrary.
Once I had the ninety-topic model, it still wasn’t perfect. There were a few places where it looked like the model had put some things in very odd spots. Some of this remains in the finished product—the model bundles together some work on probability and coherence with historical work on Hume, and it puts one-half of the Freud papers with medical ethics and the other half of them with intention. But at this stage there were more of these overlaps than I liked.
I relied on one last feature of the topicmodels package. The algorithm doesn’t stop when it reaches an equilibrium; it stops when it sees insufficient progress toward equilibrium. One thing to do would be to refine what counts as “insufficient,” but I found this hard to control. A similar approach is to start not with a random distribution but with a finished model and then ask the algorithm to approach equilibrium from that starting point. It won’t go very far; the model was finished to start with. But it will end up with a model that the algorithm likes slightly better. (The model will, for example, have a lower perplexity score.) I’ll call the resulting model a refinement.
The refinement process takes a model as input and returns a model as output, so it can be iterated.7 And at this stage I had a clever thought. Since the refinement process improves the model, and it can be iterated, I should just iterate it as often as I can to get a better and better model. At the back of my mind, I had two worries at this point: one was that this was a bit like tightening a string, and if done too much the string will just snap. The other was that I had lost my mind and was fretting about mathematical models of large text libraries using half-baked metaphors concerning the physics of everyday objects.
Reader, it snapped.
After one hundred iterations, the model ended up making an interesting, and amusing, mistake.
One signature problem with the kind of text mining I’m doing is that it can’t tell the difference between a change of vocabulary that is the result of a change in subject matter, and a change of vocabulary that is the result of a change in verbal fashions. If these kinds of models are built with almost any parameter settings, a distinctive topic (or two) for ordinary language philosophy will turn up. Why? Because the language of the ordinary language philosophers was so distinctive. That’s not great, but it’s unavoidable. Ideally, that would be the only such topic. And one of the reasons I filtered out so many words was to avoid having more such topics.
But it turns out that there is another period with a somewhat distinctive vocabulary: the twenty-first century. It’s not as distinctive as midcentury British philosophy. And usually it isn’t distinctive enough to really confuse most of these models. But it is just distinctive enough that if refinements are run iteratively for, let’s say, four days while I’m away at a conference, the model will find this distinctive language. So after one hundred iterations, I ended up with a model whose last topic wasn’t a philosophical topic at all, but was characterized by the buzzwords of recent philosophy.
Still, it turns out the refinements weren’t all a bad idea. After fifteen refinements, the model had separated out some of the disjunctive categories I’d hoped it would and was only starting to get thrown by the weird language of very recent philosophy. That’s the model I ended up using—the one with seed 22031848, ninety topics, and fifteen iterations of the refinement process.
If you’re interested in doing this yourself, the magic code looks like
refinedlda <- LDA(all_dtm, k = 90, model = refinedlda, control = list(seed = 22031848, verbose = 1, initialize = "model"))
That is, refinedlda is an LDA that takes the DTM I started with, and has ninety topics, and is based on a model, where that model is refinedlda itself. If loops don’t scare you, you can simply loop this process to get as many iterations of refinement as you like. They took about forty-five minutes each to run when I did them.
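For readers who want to see the loop spelled out, here is a minimal sketch. It assumes the topicmodels package is installed, and it swaps in a tiny slice of the AssociatedPress example corpus that ships with that package in place of my all_dtm, with two topics and three refinements so that it runs in seconds rather than hours. The names `dtm`, `n_topics`, `n_refinements`, and `perplexities` are mine for illustration only.

```r
# Assumes topicmodels is installed. AssociatedPress is an example corpus
# bundled with the package, used here only as a stand-in for all_dtm.
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

dtm <- AssociatedPress[1:20, ]  # tiny subset so the sketch runs quickly
n_topics <- 2                   # the text uses 90
n_refinements <- 3              # the text settles on 15
seed <- 22031848

# Initial model, fitted from the package's default starting point.
refinedlda <- LDA(dtm, k = n_topics, control = list(seed = seed))
perplexities <- perplexity(refinedlda, dtm)

# Each pass restarts the fit from the previous model rather than from scratch.
for (i in seq_len(n_refinements)) {
  refinedlda <- LDA(dtm, k = n_topics, model = refinedlda,
                    control = list(seed = seed, initialize = "model"))
  perplexities <- c(perplexities, perplexity(refinedlda, dtm))
}

# One perplexity value for the initial model plus one per refinement.
print(perplexities)
```

Keeping the perplexity after each pass gives a cheap record of how far each refinement moved the model. It would not catch the failure described in the main text, though, since that model was also one the algorithm "liked better"; inspecting the topics by hand every few iterations is still needed.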