1 Methodology
The point of this chapter is to explain the choices I made in building the model that the book is based around. But to understand the choices that I made, it helps to know a little bit about what a Latent Dirichlet Allocation (LDA) model does.
The inputs to the model are some texts and a number. The model doesn’t care about the ordering of words in the texts, so really the input isn’t texts but a list of lists of ordered pairs. Each ordered pair is a word and a number. In the version I’m using, the outer list is a list of philosophy articles. And each element of that list is a list of words in that article, along with the number of times the word appears.
Along with that, you give the model a number. This is the number of topics that you want the model to divide the texts into. I’ll call this number \(t\) in this introduction. And intuitively there is a function \(T\) that maps articles into the \(t\) topics.
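To make those inputs concrete, here is a minimal sketch in Python; I'm using Python purely for illustration, and the words, counts, and numbers are all invented.

```python
# Each article is reduced to a bag of words: a list of (word, count)
# pairs, with the original word order thrown away.
corpus = [
    # Hypothetical counts for an article on Kant
    [("kant", 34), ("categorical", 12), ("imperative", 9), ("duty", 7)],
    # Hypothetical counts for an article on vagueness
    [("vague", 21), ("sorites", 11), ("borderline", 8), ("heap", 5)],
]

# In practice each word is usually replaced by an integer id, so an
# article becomes a list of (word_id, count) pairs.
vocabulary = {"kant": 0, "categorical": 1, "imperative": 2, "duty": 3,
              "vague": 4, "sorites": 5, "borderline": 6, "heap": 7}
corpus_ids = [[(vocabulary[w], c) for w, c in article] for article in corpus]

# The other input: the number of topics the model should divide the
# texts into (called t in the text).
t = 2
```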
What the model outputs is, for our purposes, a pair of probability functions: one for articles and one for words.
The probability function for articles gives, for each article \(a\) and topic number \(n \in \{1, \dots, t\}\), a probability for \(T(a) = n\); that is, it gives a probability that the article is in that topic. Notably, it doesn’t identify the topics with any more than numbers. I’m going to give names to the topics—this one is Kant; this one is composition and constitution, etc.—but the model doesn’t do that. For it, the topics really are just integers between 1 and \(t\).
The probability function for words gives, for each word \(w\) from any of the articles, and topic number \(n \in \{1, \dots, t\}\), the probability that a randomly chosen word from the articles in that topic is \(w\). So in the Kant topic, the probability that a randomly chosen word is Kant is about 0.14.
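As an illustration of those two outputs, here is a sketch using the Python library gensim; this is not the software behind the book's model, the corpus is a toy one, and the particular numbers it prints mean nothing.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy token lists standing in for real articles (entirely hypothetical).
articles = [
    ["kant", "kant", "categorical", "imperative", "duty"],
    ["vague", "sorites", "borderline", "heap", "vague"],
]

t = 2  # number of topics

dictionary = Dictionary(articles)
corpus = [dictionary.doc2bow(a) for a in articles]  # (word_id, count) pairs

lda = LdaModel(corpus, num_topics=t, id2word=dictionary, random_state=1)

# The probability function for articles: for article 0, a probability
# that T(a) = n for each topic number n.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))

# The probability function for words: for topic 0, the probability that
# a randomly chosen word from articles in that topic is w.
print(lda.show_topic(0, topn=5))
```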
A number like 0.14 feels absurdly high, but it makes sense for a couple of reasons. One is that to make the models compile in even semireasonable time, I filtered out a lot of words. What it's really saying is that the word Kant produces about 1/7 of the tokens that remain. The other is that what it's really giving you here is the probability that a random word in an article is Kant, conditional on the probability of that article being in the Kant topic being 1. And in fact the model is never that confident. Even for articles that might be considered clearly articles about Kant, the model is rarely more than 40 percent confident that that's what they are about. And this is for a good reason. Most articles about Kant in philosophy journals are, naturally enough, about Kantian philosophy. And any part of Kantian philosophy is, well, philosophy. So the model has a topic on beauty, and when it sees an article on Kantian aesthetics, it gives some probability to the correct classification of that article being the topic on beauty. So the word probabilities are quite abstract things: they are something like word frequencies in a certain kind of stereotyped article. What the model really wants to do is find \(t\) stereotypes such that each real article is a linear mixture of the stereotypes.
The way the model approaches this goal is by building two probability functions, checking how well they cohere, and recursively refining them in places where they don't cohere. One probability function is the probability, for each article, that it is in one or other of the topics (ninety of them, in the model I'll mostly be using). So it might say this article is 0.4 likely to be in topic 32, 0.3 likely to be in topic 68, and so on down to some vanishing probability that it is in topic 89. The other probability function is the probability, for each topic, of a given word appearing. So it might say that given the article is in topic 32, there is a 0.15 likelihood that a randomly selected word in the article is Kant, a 0.05 likelihood that a randomly selected word is ideal, and so on, down to a vanishingly small likelihood that the word is, say, Weatherson. Combining those functions, we get the probability, for each actual article, that a randomly selected word in it is Kant or ideal or Weatherson or any other word. And we can check that calculated probability against the actual frequency of each word in the article. I'll call the calculated probabilities the modeled frequencies, and say that the goal of the model is to have the modeled frequencies of words in articles match the actual frequencies of the words in the articles. A perfect match here is impossible to achieve; there aren't enough degrees of freedom. But the model can minimize the error, and it does so recursively.
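To spell out the combination step, here is a small worked sketch; the topic numbers and probabilities are invented, and the calculation just mixes each topic's word probabilities by the article's topic probabilities.

```python
# Hypothetical topic probabilities for one article: P(topic n | article).
article_topics = {32: 0.4, 68: 0.3, 12: 0.3}

# Hypothetical word probabilities within each topic: P(word w | topic n).
topic_words = {
    32: {"kant": 0.15, "ideal": 0.05},
    68: {"kant": 0.01, "ideal": 0.02},
    12: {"kant": 0.00, "ideal": 0.01},
}

def modeled_frequency(word):
    """Modeled frequency of a word in the article: the sum over topics of
    P(topic | article) * P(word | topic)."""
    return sum(p_topic * topic_words[n].get(word, 0.0)
               for n, p_topic in article_topics.items())

print(modeled_frequency("kant"))   # 0.4*0.15 + 0.3*0.01 + 0.3*0.00 = 0.063
print(modeled_frequency("ideal"))  # 0.4*0.05 + 0.3*0.02 + 0.3*0.01 = 0.029
```

Those modeled frequencies are what get checked against the actual word frequencies in each article, and the refinement step adjusts both probability functions to shrink the gap.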
The process involved is slow. I was able to build all the models I’ll discuss on personal computers, but it takes some processing time. The particular model I’m primarily using took about twenty hours to build, but I ran through many more hours than that building other models to compare it to.
And the process is very path dependent. The algorithm, like many algorithms, has the basic structure of picking a somewhat random starting point, then looking for a local equilibrium. The equilibrium you end up at is incredibly dependent on how you start, and somewhat dependent on how you travel.
The point of this chapter is to describe how I chose the inputs to the model I ended up using, and then how I set various parameters within the model. The parameters are primarily, in terms of the metaphor of the previous paragraph, the starting point of the search, and how long the search should run before we decide we are close enough to an equilibrium.
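To give a sense of where those parameters live in practice, here is a hedged sketch, again using gensim as a stand-in for whatever software one might use; the particular values are placeholders rather than the settings behind the model in this book.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in corpus; the real inputs are thousands of journal articles.
articles = [["kant", "duty", "imperative"], ["vague", "sorites", "heap"]]
dictionary = Dictionary(articles)
corpus = [dictionary.doc2bow(a) for a in articles]

lda = LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=2,      # t: how many topics to divide the texts into
    random_state=42,   # fixes the somewhat random starting point of the search
    passes=10,         # how many sweeps through the corpus to make
    iterations=50,     # how long to refine each document on each sweep
)
```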
The inputs are more complex. Very roughly, the inputs I used are the frequently occurring substantive words from research articles in twelve important philosophy journals. I’ll start by talking about how and why I selected the particular twelve journals that I did.