Ryan Burton

Part 1: Introduction and Diving In

The Dataset

The collection of NSF Research Awards Abstracts consists of 129,000 abstracts and an index-friendly bag of words characterising the data. The data was assembled by Michael J. Pazzani and Amnon Meyers at the School of Computer Science at the University of California, Irvine.

To get a feel for the data, let's look at a sample abstract:

Title : Postdoctoral Research Fellowship in Plant Biology
Type : Award
NSF Org : DBI
Latest
Amendment
Date : May 23, 1996
File : a9404025

Award Number: 9404025
Award Instr.: Fellowship
Prgm Manager: THOMAS QUARLES
DBI DIV OF BIOLOGICAL INFRASTRUCTURE
BIO DIRECT FOR BIOLOGICAL SCIENCES
Start Date : August 1, 1994
Expires : August 31, 1997 (Estimated)
Expected
Total Amt. : $97200 (Estimated)
Investigator: John E. Fowler (Principal Investigator current)
Sponsor : Fellowships

Arlington, VA 22230 / -

NSF Program : 1101 SPECIAL PROJECTS
Fld Applictn: 0201000 Agriculture
61 Life Science Biological
Program Ref : 9179,SMET,
Abstract :
9404025 Fowler This is an individual award for the Postdoctoral Research
Fellowship in Plant Biology. The applicant received his Ph.D. from University
of California at Berkeley. His Ph.D. research was in developmental genetics in
maize. He now plans to change his field of research and carry out his
postdoctoral fellowship research in the laboratory of Dr. Ralph Quatrano at
University of North Carolina. The proposed research entitled "Mechanism of
establishment of cellular polarity in the brown alga, Fucus" will expand the
scientific experience of this Fellow into basic cell biology.

The main draw of this data set is the breadth of analyses it allows. Over the next few weeks, I plan to investigate the distribution of grant awards by discipline and sub-discipline, how that distribution changes over the decade, the grant amounts, the durations of the grants, and the text of the abstracts themselves.
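
As a first step towards the amount and duration analyses, here is a minimal Python sketch that pulls those fields out of one record. It assumes the field labels look like the sample above ("Start Date", "Expires", "Total Amt.") and that dates are written as an English month name, day and year. The function names are my own, and the layout of other records in the collection may differ.

```python
import re
from datetime import datetime

# Matches dates such as "August 1, 1994" (extra spaces are tolerated).
DATE_RE = r"([A-Z][a-z]+)\s+(\d{1,2}),\s*(\d{4})"


def _field_date(record, label):
    """Return the date that follows `label :` in the record, or None."""
    match = re.search(re.escape(label) + r"\s*:\s*" + DATE_RE, record)
    if match is None:
        return None
    month, day, year = match.groups()
    # %B expects an English month name under the default C locale.
    return datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date()


def parse_award(record):
    """Extract the amount (whole dollars) and the start/expiry dates."""
    amount = re.search(r"Total Amt\.\s*:\s*\$(\d+)", record)
    start = _field_date(record, "Start Date")
    expires = _field_date(record, "Expires")
    return {
        "amount": int(amount.group(1)) if amount else None,
        "start": start,
        "expires": expires,
        # Duration in days, only when both dates were found.
        "duration_days": (expires - start).days if start and expires else None,
    }
```

Calling `parse_award` on the text of the sample record should give an amount of 97200 and a duration of a little over three years. Fields that cannot be found come back as `None` rather than raising an error.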

Analysis

Because of the sheer size of the data set, a good starting point is an overview of one portion of it: the proportions of grants awarded in 1990 by each NSF organisation (such as Biological Sciences or Computer & Information Science & Engineering). The first step is a Perl script that counts the number of files for each NSF organisation, using the organisations listed on NSF's Web site. Note that the name of the Computer & Information Science & Engineering directorate has changed since 1990 (the "Information" is a relatively recent addition), which is why the script matches both CISE and CSE. The script, which follows, uses some Unix tools and regex trickery:

	use strict;
	use warnings;

	# Map each organisation abbreviation (used as a grep -E pattern) to its full name.
	my %nsf_orgs = (
	    BIO => "Directorate for Biological Sciences",
	    "(CISE|CSE)" => "Directorate for Computer & Information Science & Engineering",
	    EHR => "Directorate for Education & Human Resources",
	    ENG => "Directorate for Engineering",
	    GEO => "Directorate for Geosciences",
	    MPS => "Directorate for Mathematical & Physical Sciences",
	    SBE => "Directorate for Social, Behavioral & Economic Sciences"
	);

	open(my $out, '>', 'proportions.tsv') or die "Cannot open proportions.tsv: $!";
	for my $org_abbrev (sort keys %nsf_orgs) {
	    # Count the files under the current directory that contain the abbreviation.
	    my $total = `grep -REl "      $org_abbrev  " . | wc -l`;
	    $total =~ s/\s+//g;    # strip all whitespace, including the trailing newline
	    print $out "$nsf_orgs{$org_abbrev}\t$total\n";
	}
	close($out);
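
For readers who would rather not shell out to grep, the same count can be sketched in Python. This is only an approximation of the script above: it assumes one award per file, and it looks for an abbreviation at the start of an indented line followed by more text, which is looser than the fixed-width grep pattern. The names `ORGS`, `ALIASES` and `count_files_by_org` are mine.

```python
import os
import re

# Abbreviations taken from the Perl hash; CSE is the older name for CISE.
ORGS = ("BIO", "CISE", "CSE", "EHR", "ENG", "GEO", "MPS", "SBE")
ALIASES = {"CSE": "CISE"}

# Assumption: the abbreviation starts an indented line and is followed by
# whitespace and more text, e.g. "      BIO  DIRECT FOR BIOLOGICAL SCIENCES".
ORG_LINE = re.compile(r"^\s+(" + "|".join(ORGS) + r")\s+\S", re.MULTILINE)


def count_files_by_org(root):
    """Count the files under `root` that mention each directorate."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # latin-1 decodes any byte sequence, so odd characters cannot fail.
            with open(path, encoding="latin-1") as handle:
                text = handle.read()
            # A file counts once per directorate, like `grep -l`.
            found = {ALIASES.get(org, org) for org in ORG_LINE.findall(text)}
            for org in found:
                counts[org] = counts.get(org, 0) + 1
    return counts
```

Running `count_files_by_org(".")` in a year's directory returns a dictionary keyed by abbreviation. Directorates with no matching files are simply absent from the result.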

To visualise these proportions, let's make a pie chart using R and ggplot2.

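		library(ggplot2)
		# The file has no header row, so the columns are named V1 (directorate) and V2 (count).
		prop <- read.delim("proportions.tsv", sep = "\t", header = FALSE)
		# A pie is a single stacked bar in polar coordinates; stat = "identity" uses V2 as the bar height.
		# Each "+" must end its line so that R keeps reading the same expression.
		ggplot(prop, aes(x = factor(1), y = V2, fill = V1)) +
			geom_bar(stat = "identity", width = 1) +
			coord_polar(theta = "y")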

Before showing the chart, however, we can also do a comparison with 2002: we have the technology (since we just built it). The 2002 data may be more interesting than the seemingly sparse 2003 data. Moving the script to the directory holding the 2002 data and repeating the process gives the following pies:

We can see how, for instance, the CISE and Engineering portions grow, while the Education & Human Resources portion shrinks. The shares of the other disciplines, however, seem to have remained fairly static. Further analysis is possible (and is planned), since there may be underlying trends over time that this simple visualisation is unable to capture.
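
Pie slices are hard to compare by eye, so one way to put numbers on these shifts is to compute each directorate's share in both years and take the difference in percentage points. The Python sketch below assumes two TSV files in the format written by the Perl script (name, tab, count). The function names are my own.

```python
import csv


def read_counts(path):
    """Read a two-column TSV (directorate name, file count) into a dict."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        return {row[0]: int(row[1]) for row in rows if row}


def shares(counts):
    """Convert raw counts into fractions of the yearly total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grants counted")
    return {name: count / total for name, count in counts.items()}


def share_change(before, after):
    """Percentage-point change in each directorate's share between two years."""
    old, new = shares(before), shares(after)
    return {
        # A directorate missing from one year is treated as a 0% share.
        name: round(100 * (new.get(name, 0.0) - old.get(name, 0.0)), 1)
        for name in sorted(set(old) | set(new))
    }
```

With the 1990 and 2002 outputs saved under separate names (say `proportions-1990.tsv` and `proportions-2002.tsv`, which are hypothetical file names), `share_change(read_counts(...), read_counts(...))` gives a positive number for a directorate whose share grew and a negative one for a share that shrank.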

References

http://had.co.nz/ggplot2/coord_polar.html