Part 4: Word Frequencies

The Dataset

The data used in our exploration comes from a collection of NSF Research Awards Abstracts, consisting of 129,000 abstracts, collected by researchers at the University of California, Irvine. The data set is rather large, however (more than 500 MB), so some preprocessing is necessary before exploring our subset of interest with R.

We will be working with the bag-of-words part of the data set this time around. We'll keep the filename associated with each NSF organisation in the abstracts so that we can look up the word data for each one. This Perl script, which has been modified since we last used it, puts the year, organisation, grant amount, and filename on tab-separated lines.

We'll be using SQLite3 this time as well, which should make it easy to look up the word data for each file. The advantages of SQLite in general are that it's small, fast (especially compared to repeatedly reading a flat file into R), and stores everything in a single self-contained file rather than requiring an RDBMS server to query the data.

We'll create our database with the following commands (assuming we've already run our Perl script):

create table orgs_filenames (year, org, grant, filename);
.separator "\t"
.import orgs_filenames.tsv orgs_filenames
This table lets us look up the organisation and the filename of the abstract.
.separator " "
create table freqs (docid, wordid, freq);
.import "../data/bag\ of\ words/nsfabs_part1_out/docwords.txt" freqs
.import "../data/bag\ of\ words/nsfabs_part2_out/docwords.txt" freqs
.import "../data/bag\ of\ words/nsfabs_part3_out/docwords.txt" freqs
The freqs table allows us to look up the frequency of a word in each document by word id.
create table words (wordid, word);
.separator " "
.import ../data/words.txt words
Now we have the words and their ids, so we can translate word ids in the results back into actual words. Finally, we'll need to get the docid from the filename:
create table docids (docid, filename);
.separator "\t"
.import "../data/bag\ of\ words/nsfabs_part1_out/idnsfid.txt" docids
.import "../data/bag\ of\ words/nsfabs_part2_out/idnsfid.txt" docids
.import "../data/bag\ of\ words/nsfabs_part3_out/idnsfid.txt" docids

The database ends up being 400 MB, so I guess I won't bother uploading it.

Analysis

As a grad student in Information Science, I'm interested in what interesting words are used to request grant funding in the field. Zipf's law (or at least the many observations in linguistic corpora that follow the law) would lead us to believe that the most frequent words are function words like "the", "of", and "a". However, there may also be other content-free words, like "like". Meta-Wiki (of the Wikimedia Foundation) maintains a list of stop word lists, including the one used in MySQL. I imported it into SQLite so we could account for them when counting things up. It turns out that wasn't necessary, though, since the stop words have already been removed from the bag-of-words data! Oh boy.
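Had they not been removed, filtering them out would have been straightforward. Here's a rough sketch of what that could look like once we're connected from R (below); it assumes the stop word list was imported into a table called stopwords, which is just my guess at a name.

# hypothetical: exclude stop words by filtering against the imported list
# (assumes a table named stopwords with a single column called word)
dbGetQuery(conn, 'select word, sum(freq) as wc
                  from words, freqs
                  where words.wordid = freqs.wordid
                    and word not in (select word from stopwords)
                  group by word
                  order by wc desc;')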

Let's go ahead and connect the database to R.

library(RSQLite)
library(ggplot2)  # needed for the plots below
s <- dbDriver('SQLite')
conn <- dbConnect(s, dbname='words.db')
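It doesn't hurt to run a quick sanity check that the tables from earlier actually made it into the database; a minimal sketch:

# sanity check: list the tables and make sure the word frequencies were imported
dbListTables(conn)
dbGetQuery(conn, 'select count(*) from freqs;')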

According to the NSF site, the organisations of interest here are Information Integration and Informatics (III) and Information and Intelligent Systems (IIS), so we'll focus on those when querying and reading in the data.

wordcounts <- dbGetQuery(conn, '
  select word, sum(freq) as wc
  from words, freqs
  where words.wordid = freqs.wordid
    and docid in (select docid
                  from docids,
                       (select filename from orgs_filenames
                        where org in ("III", "IIS")) as information
                  where information.filename = docids.filename)
  group by word
  order by wc desc;')
Whew. You can create indexes (indices?) on the columns in the WHERE clauses if you'd like it to run faster.
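If you do want them, something along these lines should do it (the index names here are made up; the columns are the ones used in the joins and filters above):

# optional: index the columns used in the joins/filters so the query runs faster
dbExecute(conn, 'create index if not exists idx_freqs_wordid on freqs (wordid);')
dbExecute(conn, 'create index if not exists idx_freqs_docid on freqs (docid);')
dbExecute(conn, 'create index if not exists idx_docids_filename on docids (filename);')
dbExecute(conn, 'create index if not exists idx_orgs_org on orgs_filenames (org);')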

The scales for this data are pretty straightforward: the words are nominal, whereas the counts are ratio. We would, however, like to get another measure for each word: namely, its rank based on its frequency. This will be ordinal (one rank is greater than or less than another, but we won't know by how much). Luckily, the database query has already sorted the words by frequency, so we can simply assign a rank sequentially to each row.

wordcounts$rank <- seq(1, nrow(wordcounts))
ggplot(wordcounts, aes(wc, rank)) + geom_point()

Looks like a power law. Heh, I thought it was pretty cool that Prof. McQuaid called us on that in one of our classes. Since we're comparing rank to frequency, it would be (as expected) an instance of Zipf's law. We can be more sure with a log-log plot: if frequency really does fall off as a power of rank, then log frequency is linear in log rank, and the points should fall roughly on a straight line.

ggplot(wordcounts, aes(rank, wc)) + geom_point() + scale_x_log10() + scale_y_log10()

It's not perfect, but… errr… it looks close enough.

ggplot(wordcounts, aes(rank, wc)) + geom_point() + scale_x_log10() + scale_y_log10() + geom_smooth(method="lm")
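To put a number on the fit rather than eyeballing the fitted line, we could regress log frequency on log rank ourselves; under Zipf's law the slope should come out near -1. A minimal sketch:

# rough sketch: estimate the Zipf exponent by fitting a line in log-log space
fit <- lm(log10(wc) ~ log10(rank), data = wordcounts)
summary(fit)  # the slope is (the negative of) the estimated exponent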

Okay, that was disappointing. An unexpectedly bad fit. In any case, we ought to be able to look at the words on the plot as well, so we'll build up a layered plot with the 75 most popular words.

aestext <- aes(rank, wc, label = word, size = log(wc))
aespoint <- aes(rank, wc, color = rank, size = wc)
ggplot(wordcounts[wordcounts$rank <= 75, ]) +
  geom_point(aespoint, alpha = 0.2) +
  geom_text(aestext, angle = 45, alpha = 0.5) +
  scale_x_log10() + scale_y_log10()

Do note that the top ranked words here are "content-free" (as in uninteresting) words in academia, like "research", "data", and "information". But maybe not! Perhaps "data" and "information" are more interesting in IIS/III in comparison to the other fields. This would be something to investigate.
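One quick way to start would be to pull the counts for a single word broken down by organisation; a rough sketch (the word "data" here is just an example, and you'd want to normalise by each organisation's total word count before comparing):

# hypothetical follow-up: occurrences of the word "data" per NSF organisation
dbGetQuery(conn, 'select org, sum(freq) as wc
                  from freqs, words, docids, orgs_filenames
                  where freqs.wordid = words.wordid
                    and freqs.docid = docids.docid
                    and docids.filename = orgs_filenames.filename
                    and word = "data"
                  group by org
                  order by wc desc;')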

References