The data for this exploration come from the NSF Research Awards Abstracts collection: 129,000 abstracts, gathered by researchers at the University of California, Irvine. The data set is rather large, however (more than 500 MB), so some preprocessing is necessary before we can explore our subset of interest in R.
In this instance, we will be exploring the geographic data. We can think of some of the "top" research schools and where they're located. Do the actual data reflect our preconceptions?
I've edited our Perl script to extract the state as well as the year, NSF Organisation, and grant amount. The result is a tab-separated file.
We can now load up our data into R:
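A minimal sketch of that load, assuming the Perl script wrote a headerless file with the columns in the order state, year, NSF organisation, award amount. The function name `load_awards`, the column names, and the column order are all assumptions made for illustration.

```r
# Read the tab-separated extract into a data frame.
# Assumed layout (not confirmed by the script): no header row, and
# four columns in the order state, year, organisation, award.
load_awards <- function(path) {
  read.delim(
    path,
    header = FALSE,
    col.names = c("state", "year", "org", "award"),
    colClasses = c("character", "integer", "character", "numeric"),
    quote = "",               # treat quote characters as plain text
    stringsAsFactors = FALSE
  )
}
```

Something like `awards <- load_awards("nsf-awards.tsv")` (file name hypothetical) would then give a data frame to work with.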
The first question that comes to mind is, "Which states get the most grants?" To find this out, we can run the following command:
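One way to get such a ranking is to count the rows per state with `table()` and sort the counts. This is only a sketch; the `awards` data frame and its `state` column are assumed names.

```r
# Count awards per state and list the most represented states first.
# `awards` is assumed to be a data frame with a `state` column.
state_counts <- function(awards) {
  sort(table(awards$state), decreasing = TRUE)
}
```

Calling `head(state_counts(awards), 10)` would show the ten most represented entries.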
This shows us that California is easily the best represented, followed by New York and Massachusetts. And of course, some of the entries aren't actually states. But what about actual funding? We can use a box-and-whisker plot, which lets us see the median and the lower and upper quartiles at a glance. The State axis is categorical whereas the Award axis is continuous, which makes this geom a good fit.
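A sketch of such a plot with ggplot2, assuming a data frame with a `state` column and a numeric `award` column (both names are assumptions). The line inside each box drawn by `geom_boxplot()` is the median, and the box edges are the lower and upper quartiles.

```r
library(ggplot2)

# Box-and-whisker plot of award amount by state.
# `awards` is assumed to have a categorical `state` column
# and a numeric `award` column.
plot_awards_by_state <- function(awards) {
  ggplot(awards, aes(x = state, y = award)) +
    geom_boxplot() +
    labs(x = "State", y = "Award")
}
```

Printing the result, for example `print(plot_awards_by_state(awards))`, draws the plot.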
The box-and-whisker plot doesn't seem to help that much, since most of the awards are concentrated at the lower end of the award spectrum. We could make the award axis logarithmic…
Do remember to remove zeroes before making a log plot, since the logarithm of zero is undefined.
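The two steps can be sketched together: drop the non-positive awards, then put the award axis on a log scale with `scale_y_log10()`. The helper names and the `state` and `award` column names are assumptions for illustration.

```r
library(ggplot2)

# Keep only strictly positive awards: log10(0) is -Inf and cannot
# be placed on a logarithmic axis. Missing amounts are dropped too.
positive_awards <- function(awards) {
  awards[!is.na(awards$award) & awards$award > 0, , drop = FALSE]
}

# Same box-and-whisker plot as before, with a log10 award axis.
plot_awards_log <- function(awards) {
  ggplot(positive_awards(awards), aes(x = state, y = award)) +
    geom_boxplot() +
    scale_y_log10() +
    labs(x = "State", y = "Award (log scale)")
}
```

Filtering explicitly, rather than letting the plotting library discard infinite values with a warning, keeps it clear which rows were left out.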
I must say, I found this surprising, even though it perhaps shouldn't be. The states are all roughly, on average, within the same order of magnitude. There are definitely some obvious outliers, though: CA, CO, VA, and others have some awards above 1x10^8, and some states have gotten very low funding, including a few awards at $1 (seen in the source data).
Visually speaking, anyway, there doesn't seem to be any overwhelming relationship between location and funding. Still, if I wanted to go somewhere to do research, I might want to pick a place with a good track record of getting funding. Presumably this shouldn't be a factor, but it seems reasonable, no? Regardless, the states with the highest mean awards include CA, IL, MA, WA, and WI, which is roughly what I would have expected.
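A ranking by mean award can be sketched with `tapply()`, again assuming a data frame with `state` and `award` columns (assumed names). Because the awards are so skewed, the mean is pulled up by the large outliers, so it is worth reading this alongside the box plots.

```r
# Mean award per state, largest first.
# `awards` is assumed to have `state` and numeric `award` columns.
mean_award_by_state <- function(awards) {
  means <- tapply(awards$award, awards$state, mean, na.rm = TRUE)
  sort(means, decreasing = TRUE)
}
```

`head(mean_award_by_state(awards), 5)` would list the five states with the highest means.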