3.5 Raw Frequency of Articles
The problems of the previous section are only exacerbated if we switch to raw frequencies.
All this tells us is that there is a lot more diversity, and a lot more specialization, in journals in the last thirty years than there was 120 years ago. Everything else gets lost in the noise.
It’s only a little clearer in the graph that shows just the last seventy-five years.
The animation is a bit more revealing.
But what it reveals is primarily that these raw counts are very unstable. That’s because the measure they are built on is subject to severe tipping-point effects. Whether an article gets probability 0.26 for being in one category and probability 0.25 for being in another, or the other way around, really just depends on where the algorithm is stopped. (It’s bob-of-the-head stuff in horse racing terms.) But it makes all the difference to these raw counts. This is why I’ve tried, contrary to most work that I’ve seen that uses topic modeling, to deemphasize these raw counts in favor of the weighted counts.
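The tipping-point effect can be illustrated with a small sketch. It assumes, for illustration only, that a raw count credits each article entirely to its single highest-probability topic, while a weighted count sums each article's probabilities across topics. The function names and the four-topic toy probabilities are hypothetical, not taken from the model used in this book.

```python
def raw_counts(doc_topic_probs, n_topics):
    """Count each article once, under its single most probable topic."""
    counts = [0] * n_topics
    for probs in doc_topic_probs:
        top = max(range(n_topics), key=lambda k: probs[k])
        counts[top] += 1
    return counts


def weighted_counts(doc_topic_probs, n_topics):
    """Sum each article's topic probabilities, so near-ties are shared."""
    totals = [0.0] * n_topics
    for probs in doc_topic_probs:
        for k in range(n_topics):
            totals[k] += probs[k]
    return totals


# One hypothetical article, as it might look if the algorithm were
# stopped at two slightly different points (assumed values).
run_a = [[0.26, 0.25, 0.245, 0.245]]
run_b = [[0.25, 0.26, 0.245, 0.245]]

print(raw_counts(run_a, 4))  # [1, 0, 0, 0]
print(raw_counts(run_b, 4))  # [0, 1, 0, 0]
print(weighted_counts(run_a, 4))
print(weighted_counts(run_b, 4))
```

In this toy case the article moves wholesale from one topic to the other in the raw counts, while each weighted count shifts by only about 0.01.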
The rest of the graphs look at what happens when we focus on pages rather than articles. I’m using articles as the basic unit of measure for everything else in this book, but it’s worth spending a little time seeing how things look if we focus on pages instead.