Ryan Burton

Part 2: Exploring Grant Funding with Quick Plots

The Dataset

The data used in our exploration comes from the NSF Research Awards Abstracts collection, which consists of 129,000 abstracts gathered by researchers at the University of California, Irvine. The data set is rather large, however (more than 500 MB), so some preprocessing is necessary before exploring our subset of interest with R.

Perl comes in handy for this task. The following Perl script will produce a tab-separated file containing the start year for a grant, the NSF organisation associated with the grant, and the grant total per award.

	use strict;
	use warnings;

	my $data_root = "../data/abstracts";

	open(my $out, '>', "amounts.tsv") or die "Cannot open amounts.tsv: $!";

	foreach my $part (`ls $data_root`) {
	    chomp $part;
	    foreach my $year_dir (`ls $data_root/$part`) {
	        my $year = "";
	        chomp $year_dir;
	        # Take the digits after the underscore in the directory name as the year
	        if ($year_dir =~ /\w+_(\d+)/) {
	            $year = $1;
	        }
	        foreach my $abstract_file (`find $data_root/$part/$year_dir -name "*.txt" -print`) {
	            chomp $abstract_file;
	            my $in;
	            unless (open($in, '<', $abstract_file)) {
	                warn "Skipping $abstract_file: $!";
	                next;
	            }
	            my $org = "";
	            my $grant = "";
	            while (my $line = <$in>) {
	                chomp $line;
	                if ($line =~ /NSF Org\s+:\s+(\w+)/) {
	                    $org = $1;
	                } elsif ($line =~ /Total Amt\.\s+:\s+\$(\d+)/) {
	                    $grant = $1;
	                }
	            }
	            close($in);
	            # One row per award: year, NSF organisation, total amount
	            print $out "$year\t$org\t$grant\n";
	        }
	    }
	}

	close($out);

The data can then be loaded into R with the following command:

ga <- read.table("amounts.tsv", sep = "\t", col.names = c("Year", "NSF Organisation", "Grant Amount"))
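Before plotting, it can be worth checking what R actually loaded. The following is a minimal sketch using a tiny inline stand-in for amounts.tsv; the three rows are invented for illustration and are not taken from the real data set. It shows that read.table turns the spaces in the column names into dots (which is why the later plots refer to NSF.Organisation and Grant.Amount), and that an award with no parsed amount ends up as NA.

```r
# Toy stand-in for amounts.tsv (invented values, NOT from the real data set).
# The last row has an empty amount, as the Perl script would write it
# when no "Total Amt." line was found.
sample_tsv <- "1990\tATM\t50000\n1991\tAST\t120000\n1991\tPHY\t\n"

ga_toy <- read.table(text = sample_tsv, sep = "\t",
                     col.names = c("Year", "NSF Organisation", "Grant Amount"))

# Column names after loading: spaces have been replaced by dots
print(names(ga_toy))

# Column types and a preview of the values
str(ga_toy)

# Number of awards for which no amount was parsed
n_missing <- sum(is.na(ga_toy$Grant.Amount))
print(n_missing)
```

On the real file, the same checks can be run on ga directly.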

Analysis

The first order of business is getting an overview of how the grants were allocated by organisational unit. One way to do this is by plotting a point for every grant corresponding to the organisation and the amount. We can jitter the points to slightly mitigate the impact of overlaps, and make the points translucent so that we can see where the data are clustered:

library(ggplot2)
qplot(y = NSF.Organisation, x = Grant.Amount, data = ga, colour = NSF.Organisation, geom = "jitter", alpha = I(1/2))

Unfortunately, a box-and-whisker plot would be no more helpful than this.

As we might reasonably expect, most grants fall below $1 million. Some, however, reach into the hundreds of millions. The Division for Atmospheric Sciences ("ATM") appears to have the largest single grant in the data set, at more than $400 million. The Astronomical Sciences ("AST") and Physical Sciences ("PHY") also secure high-valued grants. I haven't been able to find an exhaustive list of all the NSF organisations; some, if not all, could potentially be extracted from the data.
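Both observations can be checked numerically as well as visually. The sketch below, in base R, pulls the distinct organisation codes out of the data and finds the largest single grant per organisation. The toy data frame is invented for illustration; on the real data the same calls would be applied to ga, assuming its columns are named as in the plot above.

```r
# Toy data (invented values); on the real data, use ga instead of ga_toy
ga_toy <- data.frame(
  NSF.Organisation = c("ATM", "ATM", "AST", "PHY", "PHY"),
  Grant.Amount     = c(50000, 2000000, 120000, 75000, 900000),
  stringsAsFactors = FALSE
)

# Distinct organisation codes that appear in the data
org_codes <- sort(unique(ga_toy$NSF.Organisation))
print(org_codes)

# Largest single grant per organisation, biggest first
max_by_org <- tapply(ga_toy$Grant.Amount, ga_toy$NSF.Organisation,
                     max, na.rm = TRUE)
max_by_org <- sort(max_by_org, decreasing = TRUE)
print(max_by_org)
```

This only recovers the codes, of course; mapping them to full organisation names would still need an outside source.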

We can also look at the trend of grant totals over the years:

qplot(Year, weight = Grant.Amount, data = ga, geom = "bar", main = "Funding by Year", ylab = "Grant Amount")

There seems to be an upward trend in total grant funding. The paltry 2003 total might be due to incomplete data for that year at the time the collection was made.
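In the bar chart, the weight argument makes each bar the sum of Grant.Amount for that year rather than a count of rows. As a rough sketch of the same calculation in base R, using invented toy values rather than the real data, the yearly totals can be computed with aggregate:

```r
# Toy data (invented values); on the real data, use ga instead of ga_toy
ga_toy <- data.frame(
  Year         = c(1990, 1990, 1991, 1991, 1991),
  Grant.Amount = c(100, 200, 50, NA, 300)
)

# Sum of grant amounts per year. The formula interface of aggregate
# drops rows with a missing amount by default.
totals <- aggregate(Grant.Amount ~ Year, data = ga_toy, FUN = sum)
print(totals)
```

Having the totals as numbers makes it easier to see how far a suspicious year sits below its neighbours.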

Actually, how does this correspond to the number of grants per year?

qplot(Year, data = ga, geom = "bar", main = "Number of Grants per Year", ylab = "Grant Count")

The number of awards granted seems stable from year to year. Could research be getting more expensive?
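One way to probe that question is to look at the typical grant size per year instead of the total. The sketch below uses invented toy values and base R only. It computes the count, mean and median grant per year; the median is included on the assumption that a handful of very large awards could pull the mean upwards on the real data.

```r
# Toy data (invented values); on the real data, use ga instead of ga_toy
ga_toy <- data.frame(
  Year         = c(1990, 1990, 1991, 1991),
  Grant.Amount = c(100, 200, 150, 450)
)

# Number of grants per year
grant_count <- table(ga_toy$Year)

# Mean and median grant amount per year
mean_grant   <- tapply(ga_toy$Grant.Amount, ga_toy$Year, mean, na.rm = TRUE)
median_grant <- tapply(ga_toy$Grant.Amount, ga_toy$Year, median, na.rm = TRUE)

print(grant_count)
print(mean_grant)
print(median_grant)
```

If the counts stay flat while the mean or median climbs, that would be consistent with individual grants getting larger, although it would not say why.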

We would like to break these out by organisation, but with over sixty organisations the plot quickly becomes overwhelming. We will address this, among other things, in the future.
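As one possible direction, and only a sketch, the organisations could be reduced to the few with the largest total funding, with everything else lumped into a single "Other" group. The column name Org.Group, the cut-off top_n and the toy values are all assumptions made for illustration.

```r
# Toy data (invented values); on the real data, use ga instead of ga_toy
ga_toy <- data.frame(
  NSF.Organisation = c("ATM", "ATM", "AST", "PHY", "DMS", "CHE"),
  Grant.Amount     = c(500, 400, 300, 200, 20, 10),
  stringsAsFactors = FALSE
)

# Total funding per organisation
total_by_org <- tapply(ga_toy$Grant.Amount, ga_toy$NSF.Organisation,
                       sum, na.rm = TRUE)

# Keep the top_n organisations by total funding (hypothetical cut-off)
top_n    <- 2
top_orgs <- names(sort(total_by_org, decreasing = TRUE))[seq_len(top_n)]

# Hypothetical grouping column: top organisations keep their code,
# everything else becomes "Other"
ga_toy$Org.Group <- ifelse(ga_toy$NSF.Organisation %in% top_orgs,
                           ga_toy$NSF.Organisation, "Other")

print(table(ga_toy$Org.Group))
```

The grouped column could then take the place of the organisation column in the plots above, which would keep the number of colours or panels manageable.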

References

http://had.co.nz/ggplot2/stat_summary.html
Wickham, H. ggplot2: Elegant Graphics for Data Analysis