1.6 Choosing between the Models
Even with the number of topics fixed, there are still a lot of ways that the model can vary. Building a model starts with a somewhat random assignment of words and articles to topics, followed by a series of steps (themselves each involving a degree of randomization) toward a local equilibrium. But there is a lot of path dependency in this process, as there always is in finding a local equilibrium.
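To make that path dependence concrete, here is a minimal sketch of what a single run looks like in code. It is written against Python’s gensim library, which is an assumption on my part rather than necessarily the software behind the book’s models, and the tiny `docs` list is just a placeholder for the real corpus of journal articles.

```python
# A minimal sketch of fitting a topic model, assuming a gensim-style workflow.
# The tiny `docs` list is a placeholder standing in for the real corpus of articles.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["friendship", "love", "virtue", "emotion"],
    ["intention", "shared", "cooperation", "agency"],
    ["kant", "duty", "maxim", "autonomy"],
]

dictionary = Dictionary(docs)                       # map each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts per article

# Training starts from a (pseudo-)random assignment of words to topics and then
# iterates toward a local equilibrium, so two runs that differ only in their
# random state can settle on quite different topic boundaries.
model_a = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10, random_state=1)
model_b = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10, random_state=2)

print(model_a.print_topics())
print(model_b.print_topics())
```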
Rather than walk through the mathematics of why this is so, I find it more helpful to think about what the model is trying to achieve and why it is such a hard thing to achieve. Let’s focus on just one subject matter in philosophy, friendship, and think about how it would be classified if one were trying to divide all of philosophy up into sixty to ninety topics.
It’s too small a subject to be its own topic. It’s best if the topics are roughly equal in size, and discussions that are primarily about friendship are, I’d guess, about 0.1 to 0.2 percent of the articles in these twelve journals. That’s an order of magnitude short of being its own topic. It has to be grouped in with neighboring subjects. But which ones? For some subjects, the problem is that there aren’t enough natural neighbors. This is why the models never quite know what to do with vagueness, or feminism, or Freud. But here the problem is that there are too many.
One natural thing to do is to group papers on friendship with papers on love and both of them with papers on other emotions or perhaps with papers on other reactive attitudes. That groups a nice set of papers about aspects of the mental lives of humans that are central to actually being human but not obviously well captured by simple belief-desire models.
Another natural thing to do is to group papers on friendship with papers on families, and perhaps include both of them in broader discussions of ways in which special connections to particular others should be accounted for in a good ethical theory. Again, this produces a reasonably nice set of papers, with the general theme of special connections to others.
Or yet another natural thing to do is to group papers on friendship with papers on cooperation. And while thinking about cooperation, the natural paper to center the topic around is Michael Bratman’s very highly cited paper “Shared Cooperative Activity”. From there, there are a few different ways one could go. One could expand the topic to Bratman’s work on intention more broadly and the literature it has spawned. Or one could expand it to include other work on group action, and perhaps even on group agency. (I teach that Bratman paper in a course on groups and choices, which is centered around game theory. Though I think getting from friendship to game theory in a single one of our sixty to ninety topics would be a step too far.)
Which of these is right? Well, I saw all of them when I ran the algorithm enough times. And they all seem like sensible choices to me. How should I choose which model to use when different models draw such different boundaries within the space of articles? A tempting thought is to see which one looks most like what one thinks philosophy really looks like, and choose that. But that would impose one’s prejudices on the model rather than letting the model teach us something about the discipline.
A better thing to do is to run the algorithm a bunch of times and find the output that most commonly appears. Intuitively, we’re looking for an equilibrium, and there’s something to be said for picking the equilibrium with the largest basin of attraction. This is more or less what I did, though there are two problems.
The first problem is that running the algorithm a bunch of times is easier said than done. On the computers I was using (pretty good personal computers), it took about eight hours to come up with a model with sixty topics. Running a bunch of them to find an average was a bit of work. The University of Michigan has a good unit for doing intensive computing jobs like this, but I kept feeling as though I was close enough to being done that running things on my own devices was less work than setting up an account there. (This ended up being a bad mistake.) But I could just leave them running overnight every night for a couple of weeks, and eventually I had sixteen sixty-topic models to average out.
The models are distinguished by their seed. This is a number that can be specified to initialize the random-number generator. Its intended use is to make it possible to replicate work like this that relies on randomization. But it also means that one can run a bunch of models, pick the one that seems most representative, and then make slight changes to it. And that’s what I ended up doing. The seeds I used at this stage were famous dates from the revolutions of 1848. And to get ahead of the story, the model the book is based around has seed value 22031848, the date of both the end of the Five Days of Milan and the start of the Venetian Revolution.6
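As a rough illustration of that workflow, and continuing the hypothetical gensim sketch above (the same `corpus` and `dictionary`), one can simply loop over a list of seeds and keep every resulting model. Only 22031848 is a seed actually mentioned here; the other values are invented placeholders, not the dates I actually used.

```python
# Continuing the earlier sketch: build one model per seed so each run is
# replicable and the runs can later be compared. Only 22031848 comes from the
# text; the other seed values are invented placeholders.
from gensim.models import LdaModel

seeds = [22031848, 13031848, 18031848]   # DDMMYYYY-style integers

models = {}
for seed in seeds:
    models[seed] = LdaModel(
        corpus,
        num_topics=60,       # sixty topics, as in the text (a real corpus is needed for this to be meaningful)
        id2word=dictionary,
        passes=10,
        random_state=seed,   # fixing the seed makes this particular run replicable
    )
```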
The second problem is that it isn’t obvious how to average the models. At one level, what the model produces is a giant probability function. And there is a lot of literature on how to merge probability functions into a single function or (more or less equivalently) how to find the most representative of a set of probability functions. But this literature assumes that the probability functions are defined over (more or less) the same possibility spaces. And that’s precisely what isn’t true here. When one of these models is built, what comes out is a giant probability function all right. But no two model runs give a function over the same space. Indeed, the most interesting thing about any model is what space it decides is most relevant. So the standard tools for merging probability functions don’t apply.
What I did instead was look for two things.
First, the model doesn’t just say, “This article goes in this topic.” It says that this article goes in this topic with probability p. Indeed, it gives nonzero probabilities to each article being in each topic. So the thing to look for in a model is, Which articles does it think have the highest probability of being in any given topic? That is, roughly speaking, Which articles does it think are the paradigms of the different topics it discovers? Then ask, across a range of models, How much does this model agree with the other models about which are the paradigm articles? So, for instance, find the ten articles with the highest probability of being in each of the sixty topics. And then ask, Out of the six hundred articles that this model thinks are the clearest instances of particular topics, how many also appear among the six hundred articles that other models think are the paradigms of a particular topic? So that was one thing I looked for: Which models had canonical articles that were also canonical articles in a lot of other models?
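Here is a rough sketch of that first comparison, again in the hypothetical gensim setup from the earlier snippets (it reuses the `models` dictionary and the `corpus`). The function name and the implementation details are mine; only the idea of taking the ten most probable articles per topic and comparing the resulting sets comes from the text.

```python
# A rough sketch of the paradigm-article comparison, reusing the hypothetical
# `models` and `corpus` defined in the earlier snippets.
import numpy as np

def paradigm_articles(model, corpus, per_topic=10):
    """Return the set of article indices that are among the `per_topic` most
    probable members of at least one topic."""
    n_docs, n_topics = len(corpus), model.num_topics
    doc_topic = np.zeros((n_docs, n_topics))
    for d, bow in enumerate(corpus):
        for topic, prob in model.get_document_topics(bow, minimum_probability=0.0):
            doc_topic[d, topic] = prob
    paradigms = set()
    for topic in range(n_topics):
        top_docs = np.argsort(doc_topic[:, topic])[::-1][:per_topic]
        paradigms.update(int(d) for d in top_docs)
    return paradigms

# How many of one model's ~600 paradigm articles are also paradigm articles
# (of some topic or other) in another model?
overlap = len(paradigm_articles(models[22031848], corpus) &
              paradigm_articles(models[13031848], corpus))
print(overlap)
```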
Second, the models don’t just give probabilistic judgments of an article being in a particular topic; they give probabilistic judgments of a word being in an article in that topic. So, the model might say that the probability of the word Kant turning up in an article in topic 25 is 0.1, while the probability of it turning up in most other topics is more like 0.001. That tells us that topic 25 is about Kant, but it also tells us that the model thinks that Kant is a keyword for a topic. Since some words will turn up frequently in a lot of topics no matter what, the focus here is not just on the raw probabilities (like the 0.1 above) but on the ratio between the probability of a word being in one topic and it being in others. That ratio determines how characteristic the word is of the topic. And again this trick can be used to find the six hundred characteristic words of a particular model and to ask how often those six hundred words are also characteristic words of the other models. There is a lot of overlap here—the vast majority of models have a topic where Aristotle is a characteristic word in this sense, for example. But there are also idiosyncrasies, and the models with the fewest idiosyncrasies seem like better bets for being more representative. So that was another thing I looked for: Which models had keywords that were also keywords in a lot of other models?
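And here is a sketch of the second comparison in the same hypothetical setup: score each word in each topic by the ratio of its in-topic probability to its average probability in the other topics, keep the top scorers, and see how far two models’ keyword sets overlap. Again the helper and its details (including the small smoothing constant) are mine, not the book’s code.

```python
# A sketch of the characteristic-word comparison, in the same hypothetical setup.
import numpy as np

def characteristic_words(model, per_topic=10):
    """Return the set of words that score highest, per topic, on the ratio of
    in-topic probability to average out-of-topic probability."""
    topic_word = model.get_topics()          # shape: (num_topics, vocabulary_size)
    keywords = set()
    for t in range(topic_word.shape[0]):
        elsewhere = np.delete(topic_word, t, axis=0).mean(axis=0)  # avg prob in other topics
        ratio = topic_word[t] / (elsewhere + 1e-12)                # avoid division by zero
        top_ids = np.argsort(ratio)[::-1][:per_topic]
        keywords.update(model.id2word[int(i)] for i in top_ids)
    return keywords

# How many of one model's characteristic words are also characteristic words
# of another model?
keyword_overlap = len(characteristic_words(models[22031848]) &
                      characteristic_words(models[13031848]))
print(keyword_overlap)
```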
The problem was that these two approaches (and a couple of variations of them that I tried) didn’t really pick out a unique model. They told me that three of the models were better than the others, but not really which of those three was best. I chose one in particular. Partially this was because I could convince myself it was a bit better on the two representativeness tests from the last two paragraphs, though honestly the other two would have done just as well, and partially it was because it did better on the four criteria from the previous section. But largely it was because the flaws it had all seemed to go one way: they were all flaws in which the model failed to make distinctions I felt it should be making. The other two models had a mix; they were missing some distinctions, but they also drew some needless ones. And I felt at the time that having all the errors go one way was a good thing. All I had to do now was run the same model with slightly more topics and I’d have a really good model. And that sort of worked, though it was more complicated than I’d hoped.
Why 1848 and not some other historical event? Well, I had originally been using dates from the French Revolution. But I made so many mistakes that I had to start again. In particular, I didn’t learn how many words I needed to filter out, and how many articles I needed to filter out, until I saw how much they were distorting the models. And by that stage I had so many files with names starting with 14071789 and the like that I needed a clean break. So 1848, with all its wins and all its losses, it was.