The first thing I tried was to look at the correlations between topics. That is, for any two topics, measure the correlation between the probability the model assigns to each article being in the first topic, and the probability it assigns to each article being in the second topic. If the topics are part of a common category, this should be reasonably high.
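As a rough illustration of the measure just described, here is a minimal sketch. It assumes the model's output is available as a matrix with one row per article and one column per topic, each cell holding the probability that the article is in that topic. The names `doc_topic`, `topic_correlations`, and `most_correlated_pairs` are hypothetical, and the toy data is randomly generated rather than taken from the actual model.

```python
import numpy as np


def topic_correlations(doc_topic):
    """Pearson correlation between every pair of topic columns.

    doc_topic: array of shape (n_articles, n_topics), where each row holds
    the probabilities the model assigns to one article across all topics.
    """
    # rowvar=False treats each column (topic) as a variable
    # and each row (article) as an observation.
    return np.corrcoef(doc_topic, rowvar=False)


def most_correlated_pairs(corr, k=3):
    """Return the k topic pairs with the highest correlation."""
    n = corr.shape[0]
    # Upper triangle only, excluding the diagonal, so each pair appears once.
    rows, cols = np.triu_indices(n, k=1)
    order = np.argsort(corr[rows, cols])[::-1][:k]
    return [(int(rows[i]), int(cols[i]), float(corr[rows[i], cols[i]]))
            for i in order]


# Toy stand-in for model output: 200 articles, 5 topics, rows summing to 1.
rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(alpha=[0.3] * 5, size=200)

corr = topic_correlations(doc_topic)
pairs = most_correlated_pairs(corr, k=3)
for first, second, r in pairs:
    print(f"topics {first} and {second}: r = {r:.3f}")
```

On real output, the pairs at the top of this list would be the candidates for a shared category. In this toy data the correlations tend to be negative, because the probabilities in each row sum to one and nothing links the topics.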
There are some interesting results from looking at the data this way, and I’ll talk about them more later. But it doesn’t work well as a way of generating categories. For one thing, there are too many false positives. For another, the approach fails just when it is most needed: when the task is to sort topics that are intuitively on the border between two categories. I did rely on correlations at one point below, but mostly it was a bad idea.
So then I tried simply categorizing the topics by hand. This got me a lot of the way there, but there were just too many hard cases for it to be reliable. That said, thinking about how to categorize the topics by hand led to two crucial realizations.
First, some of the topics seem so disjunctive that they don’t fit naturally into any category, but they do seem like they should be divisible in a way that makes them easier to classify.
Second, it’s important to not get too “realist” about what we’re trying to do here. It’s a bad idea to start with the question, “Is this topic really in category X or category Y?” That leads to the following mistake. Over a series of close calls, each topic in question is put in category X rather than Y. Even if every one of those placements is defensible, the conjunction of them is not.
The aim here is not to match some Platonic ideal of correct classification. The aim is to tell a story about what happened to philosophy over time. And if every close call gets decided the same way, that story won’t be any good.
In sporting terms, this is a case where what’s really needed are “make-up calls”. If the aim is to track how well philosophy of science, for example, was represented in these journals, then about half the close calls about whether to put a topic in the philosophy of science category should be resolved in favour of saying it is in that category. That principle is something I’ll come back to a few times in what follows.