The collection of NSF Research Awards Abstracts consists of 129,000 abstracts and an index-friendly bag of words characterising the data. The data was assembled by Michael J. Pazzani and Amnon Meyers at the School of Computer Science at the University of California, Irvine.
To get a feel for the data, let's look at a sample abstract:

    Title : Postdoctoral Research Fellowship in Plant Biology
    Type : Award
    NSF Org : DBI
    Latest Amendment Date : May 23, 1996
    File : a9404025
    Award Number: 9404025
    Award Instr.: Fellowship
    Prgm Manager: THOMAS QUARLES
                  DBI DIV OF BIOLOGICAL INFRASTRUCTURE
                  BIO DIRECT FOR BIOLOGICAL SCIENCES
    Start Date : August 1, 1994
    Expires : August 31, 1997 (Estimated)
    Expected Total Amt. : $97200 (Estimated)
    Investigator: John E. Fowler (Principal Investigator current)
    Sponsor : Fellowships Arlington, VA 22230 / -
    NSF Program : 1101 SPECIAL PROJECTS
    Fld Applictn: 0201000 Agriculture 61 Life Science Biological
    Program Ref : 9179,SMET,
    Abstract : 9404025 Fowler This is an individual award for the Postdoctoral Research Fellowship in Plant Biology. The applicant received his Ph.D. from University of California at Berkeley. His Ph.D. research was in developmental genetics in maize. He now plans to change his field of research and carry out his postdoctoral fellowship research in the laboratory of Dr. Ralph Quatrano at University of North Carolina. The proposed research entitled "Mechanism of establishment of cellular polarity in the brown alga, Fucus" will expand the scientific experience of this Fellow into basic cell biology.

The main draw of this data set is the breadth of analyses possible: over the next few weeks, I plan to investigate the distribution of grant awards by discipline and sub-discipline, the changes over the decade, the grant amounts, the durations of these grants, and analysis of the abstracts themselves.
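Most of those analyses start with pulling individual fields (organisation, dates, amount) out of each record. Here is a minimal sketch of that step, written in Python rather than the Perl and R used elsewhere in this post. It assumes the raw files put each `Field : value` pair on its own line; `parse_header`, `parse_amount` and `parse_date` are hypothetical helper names, and the sample string holds only a few fields from the abstract above.

```python
import re
from datetime import datetime

# Assumption: one "Field : value" pair per line in the raw file.
SAMPLE = """\
Title : Postdoctoral Research Fellowship in Plant Biology
Type : Award
NSF Org : DBI
Start Date : August 1, 1994
Expires : August 31, 1997 (Estimated)
Expected Total Amt. : $97200 (Estimated)
"""

# Field name (letters, spaces, dots), then a colon, then the value.
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z .]*?)\s*:\s*(.*)$")


def parse_header(text):
    """Return a dict mapping field names to their raw string values."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def parse_amount(raw):
    """Extract the dollar figure from a value like '$97200 (Estimated)'."""
    match = re.search(r"\$(\d+)", raw)
    return int(match.group(1)) if match else None


def parse_date(raw):
    """Parse 'August 31, 1997 (Estimated)' into a date, ignoring the suffix."""
    match = re.match(r"([A-Za-z]+ \d{1,2}, \d{4})", raw)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%B %d, %Y").date()


header = parse_header(SAMPLE)
amount = parse_amount(header["Expected Total Amt."])
start = parse_date(header["Start Date"])
end = parse_date(header["Expires"])
duration_days = (end - start).days

print(header["NSF Org"], amount, start, end, duration_days)
```

Keeping the raw strings in the dict and converting them in separate helpers means a malformed amount or date in one record does not stop the other fields from being read.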
Given the sheer size of the data set, it might be interesting to get an overview of a portion of it by looking at the proportions of grants by NSF organisation (such as Biological Sciences or Computer & Information Science & Engineering) in 1990. The first step involves a Perl script that counts up the number of files for each NSF organisation as they are listed on NSF's Web site. It should be noted that the name of the Computer & Information Science & Engineering branch has changed since 1990: the "Information" was a relatively recent addition, which is why the script matches both CISE and CSE. The script, which follows, uses some Unix tools and regex trickery:

    use strict;
    use warnings;

    # Map each grep pattern (directorate abbreviation) to its full name.
    my %nsf_orgs = (
        BIO          => "Directorate for Biological Sciences",
        "(CISE|CSE)" => "Directorate for Computer & Information Science & Engineering",
        EHR          => "Directorate for Education & Human Resources",
        ENG          => "Directorate for Engineering",
        GEO          => "Directorate for Geosciences",
        MPS          => "Directorate for Mathematical & Physical Sciences",
        SBE          => "Directorate for Social, Behavioral & Economic Sciences"
    );

    open(my $out, ">", "proportions.tsv") or die "Cannot open proportions.tsv: $!";
    for my $org_abbrev (sort keys %nsf_orgs) {
        # Count the files under the current directory that mention the abbreviation.
        my $total = `grep -REl " $org_abbrev " . | wc -l`;
        $total =~ s/\s+//g;    # strip all whitespace, including the trailing newline
        print $out "$nsf_orgs{$org_abbrev}\t$total\n";
    }
    close($out);

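For anyone without `grep` and `wc` to hand, the same count can be sketched without shelling out. This is a rough Python equivalent, not the script used for the charts in this post: `count_files_by_org` is a hypothetical name, the short labels stand in for the full directorate names, and the two demo files are made-up placeholders rather than real NSF records.

```python
import re
import tempfile
from pathlib import Path

# Abbreviation patterns copied from the Perl script, keyed to short labels.
ORG_PATTERNS = {
    "BIO": "BIO",
    "CISE": "(CISE|CSE)",
    "EHR": "EHR",
    "ENG": "ENG",
    "GEO": "GEO",
    "MPS": "MPS",
    "SBE": "SBE",
}


def count_files_by_org(root, patterns=ORG_PATTERNS):
    """Count files under root whose text contains ' ABBREV ' for each org."""
    # The surrounding spaces mirror the quoted pattern passed to grep.
    compiled = {label: re.compile(f" {pat} ") for label, pat in patterns.items()}
    counts = {label: 0 for label in patterns}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        for label, regex in compiled.items():
            if regex.search(text):
                counts[label] += 1
    return counts


# Demo on two made-up files (placeholders, not real NSF records).
with tempfile.TemporaryDirectory() as tmp:
    awards = Path(tmp) / "awards"
    awards.mkdir()
    (awards / "a0000001.txt").write_text(
        "Prgm Manager: SOMEONE\n  BIO DIRECT FOR BIOLOGICAL SCIENCES\n"
    )
    (awards / "a0000002.txt").write_text(
        "Prgm Manager: SOMEONE\n  CSE DIRECT FOR COMPUTER SCIENCE\n"
    )
    demo_counts = count_files_by_org(tmp)

print(demo_counts)
```

As with the grep version, a file that mentions two directorates is counted once for each, so the totals need not add up to the number of files.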
To visualise these proportions, let's make a pie chart using R and ggplot2.
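
    library(ggplot2)

    # proportions.tsv has no header row: V1 is the directorate name, V2 the file count.
    prop <- read.delim("proportions.tsv", sep = "\t", header = FALSE)

    # A pie chart is a stacked bar drawn in polar coordinates.
    # stat = "identity" plots the counts in V2 as given instead of counting rows.
    ggplot(prop, aes(x = factor(1), y = V2, fill = V1)) +
      geom_bar(stat = "identity", width = 1) +
      coord_polar(theta = "y")
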
Before showing the chart, however, we can actually do a comparison to 2002… we have the technology (since we just built it). The 2002 data may be more interesting than the seemingly sparse 2003 data. Moving the script to the directory holding the 2002 data and repeating the process gives us the following pies:
We can see how, for instance, the CISE and Engineering portions grow, while the Education & Human Resources portion shrinks. It would seem, however, that the shares of the other disciplines have remained fairly static. Further analysis is possible (and is planned): there may be underlying trends over time that this simple visualisation is unable to capture.
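Eyeballing two pies only goes so far, so one way to put numbers on the comparison is to turn each year's counts into percentage shares and subtract. The sketch below is in Python and assumes two TSV files in the `name<TAB>count` layout written by the Perl script. The function names are hypothetical, and the counts in the demo are made-up placeholders, not results from the NSF data.

```python
import csv
import tempfile
from pathlib import Path


def read_counts(path):
    """Read 'name<TAB>count' rows into a dict of integer counts."""
    with open(path, newline="") as handle:
        return {row[0]: int(row[1]) for row in csv.reader(handle, delimiter="\t") if row}


def shares(counts):
    """Convert raw counts into percentage shares of the total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def share_changes(before, after):
    """Percentage-point change in share for every organisation in either year."""
    b, a = shares(before), shares(after)
    return {name: a.get(name, 0.0) - b.get(name, 0.0) for name in sorted(set(b) | set(a))}


# Demo with placeholder counts (NOT real NSF figures).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "proportions_first.tsv"
    second = Path(tmp) / "proportions_second.tsv"
    first.write_text("ORG_A\t50\nORG_B\t50\n")
    second.write_text("ORG_A\t75\nORG_B\t25\n")
    changes = share_changes(read_counts(first), read_counts(second))

for name, delta in changes.items():
    print(f"{name}\t{delta:+.1f} percentage points")
```

Working in percentage points rather than raw counts keeps the comparison fair when the two years contain very different numbers of files.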