3.5 Raw Frequency of Articles

The problems of the previous section only get worse if we switch to raw frequencies.

A plot showing the importance of all topics over time on a single graph, as measured by raw frequency. The underlying data is in Table B.4. It is mostly a mess of dots that doesn't show very much, but what information can be gleaned by looking is described in the text below.

Figure 3.8: All ninety topics - Raw Frequency of Articles

The same data as above, but with each topic shown as a separate facet.

Figure 3.9: All ninety topics - Raw Frequency of Articles (faceted)

All this tells us is that there is a lot more diversity, and a lot more specialization, in journals in the last thirty years than there was 120 years ago. Everything else gets lost in the noise.

Things are only a little clearer in the graph that shows just the last seventy-five years.

Figure 3.10: All ninety topics - Raw Frequency of Articles (last seventy-five years)

The animation is a bit more revealing.

But what it reveals is primarily that these raw counts are very unstable. That’s because the measure they are built on is subject to severe tipping-point effects. Whether an article gets probability 0.26 for being in one category and probability 0.25 for being in another, or the other way around, really just depends on where the algorithm is stopped. (It’s bob-of-the-head stuff in horse racing terms.) But it makes all the difference to these raw counts, because the article is counted wholly under whichever category comes out narrowly ahead. This is why, contrary to most work I’ve seen that uses topic modeling, I’ve tried to deemphasize these raw counts in favor of the weighted counts.
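The tipping-point effect can be illustrated with a minimal Python sketch. It is not the code behind the graphs in this book. It assumes that a raw count credits each article entirely to its single most probable topic, and that a weighted count adds each article's probability for a topic to that topic's total. The function names, the nine-topic setup, and the two "runs" are hypothetical, chosen only to mirror the 0.26 versus 0.25 example above.

```python
def raw_counts(doc_topic_probs, n_topics):
    """Count each article once, under its single most probable topic."""
    counts = [0] * n_topics
    for probs in doc_topic_probs:
        top = max(range(n_topics), key=lambda t: probs[t])
        counts[top] += 1
    return counts


def weighted_counts(doc_topic_probs, n_topics):
    """Sum each article's topic probabilities into per-topic totals."""
    totals = [0.0] * n_topics
    for probs in doc_topic_probs:
        for t, p in enumerate(probs):
            totals[t] += p
    return totals


# Hypothetical data: one article, nine topics, and two model runs that
# were stopped at slightly different points. Only the top two
# probabilities swap; the other seven topics get 0.07 each.
N_TOPICS = 9
run_a = [[0.26, 0.25] + [0.07] * 7]
run_b = [[0.25, 0.26] + [0.07] * 7]

for name, run in (("run A", run_a), ("run B", run_b)):
    raw = raw_counts(run, N_TOPICS)
    weighted = weighted_counts(run, N_TOPICS)
    # Show only the first two topics, where the swap happens.
    print(name, "raw:", raw[:2], "weighted:", [round(w, 2) for w in weighted[:2]])
```

In this toy case the raw count moves the whole article from the first topic to the second, while each weighted total changes by only 0.01.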

The rest of the graphs look at what happens when we focus on pages rather than articles. I’m using articles as the basic unit of measure for everything else in this book, but it’s worth spending a little time seeing how things look when pages are the unit instead.