Part 1: Introduction and Diving In

The Dataset

The collection of NSF Research Awards Abstracts consists of 129,000 abstracts and an index-friendly bag of words characterising the data. The data was assembled by Michael J. Pazzani and Amnon Meyers at the School of Computer Science at the University of California, Irvine.

To get a feel for the data, let's look at a sample abstract:

	Title       : Postdoctoral Research Fellowship in Plant Biology
	Type        : Award
	NSF Org     : DBI 
	Latest
	Amendment
	Date        : May 23,  1996       
	File        : a9404025

	Award Number: 9404025
	Award Instr.: Fellowship                                   
	Prgm Manager: THOMAS QUARLES                          
		      DBI  DIV OF BIOLOGICAL INFRASTRUCTURE        
		      BIO  DIRECT FOR BIOLOGICAL SCIENCES          
	Start Date  : August 1,  1994     
	Expires     : August 31,  1997     (Estimated)
	Expected
	Total Amt.  : $97200              (Estimated)
	Investigator: John E. Fowler   (Principal Investigator current)
	Sponsor     : Fellowships

		      Arlington, VA  22230    /   -

	NSF Program : 1101      SPECIAL PROJECTS
	Fld Applictn: 0201000   Agriculture                             
	              61        Life Science Biological                 
	Program Ref : 9179,SMET,
	Abstract    :
	                9404025  Fowler  This is an individual award for the Postdoctoral Research
	              Fellowship in Plant Biology.  The applicant received his Ph.D. from University
	              of California at Berkeley.  His Ph.D. research was in developmental genetics in
	              maize.  He now plans to change his field of research and carry out his
	              postdoctoral fellowship research in the laboratory of Dr. Ralph Quatrano at
	              University of North Carolina.  The proposed research entitled "Mechanism of
	              establishment of cellular polarity in the brown alga, Fucus" will expand the
	              scientific experience of this Fellow into basic cell biology.

The main draw of this data set is the breadth of analyses it allows. Over the next few weeks, I plan to investigate the distribution of grant awards by discipline and sub-discipline, the changes over the decade, the grant amounts, the durations of these grants, and the text of the abstracts themselves.
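
Several of these analyses need individual header fields (amounts, dates, organisation) pulled out of each record. The following is a minimal sketch of one way to do that, assuming every header line has the "Label : value" shape seen in the sample and that the header ends at the "Abstract" label. The name parse_award_header and the trimmed-down sample text are illustrative only, not part of the data set's tooling.

```perl
use strict;
use warnings;

# Parse the "Label : value" header lines of one award record into a hash.
# Assumption: labels never contain a colon, and the header ends at the
# "Abstract" label, as in the sample record.
sub parse_award_header {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        # Label, optional padding, a colon, then the value.
        next unless $line =~ /^\s*([A-Za-z][A-Za-z. ]*?)\s*:\s*(.*?)\s*$/;
        my ($label, $value) = ($1, $2);
        last if $label eq 'Abstract';
        $field{$label} = $value;
    }
    return \%field;
}

# A shortened copy of the sample record's header.
my $sample = <<'END';
Title       : Postdoctoral Research Fellowship in Plant Biology
NSF Org     : DBI
Start Date  : August 1,  1994
Expires     : August 31,  1997     (Estimated)
Expected
Total Amt.  : $97200              (Estimated)
Abstract    :
END

my $award = parse_award_header($sample);

# Keep only the digits after the "$", dropping the "(Estimated)" note.
my ($amount) = $award->{'Total Amt.'} =~ /\$(\d+)/;
print "$award->{'NSF Org'}\t$amount\n";
```

Labels that are split across lines, such as "Latest / Amendment / Date", end up under their last word only ("Date"), which is good enough for a first pass but worth remembering when reading the resulting hash.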

Analysis

Because of the data set's sheer size, it makes sense to start with an overview of a portion of it: the proportions of grants by NSF organisation (the Biological Sciences, Computer & Information Science & Engineering, and so on) in 1990. The first step is a Perl script that counts the number of files for each NSF organisation, as they are listed on NSF's Web site. Note that the name of the Computer & Information Science & Engineering branch has changed since 1990: the "Information" was a relatively recent addition, so the script matches both abbreviations, CISE and CSE. The script, which follows, uses some Unix tools and regex trickery:

	use strict;
	use warnings;

	# Keys are extended regexes matched against the directorate line of each file.
	my %nsf_orgs = (
	    BIO => "Directorate for Biological Sciences",
	    "(CISE|CSE)" => "Directorate for Computer & Information Science & Engineering",
	    EHR => "Directorate for Education & Human Resources",
	    ENG => "Directorate for Engineering",
	    GEO => "Directorate for Geosciences",
	    MPS => "Directorate for Mathematical & Physical Sciences",
	    SBE => "Directorate for Social, Behavioral & Economic Sciences"
	);

	open(my $out, '>', 'proportions.tsv') or die "Cannot open proportions.tsv: $!";
	for my $org_abbrev (sort keys %nsf_orgs) {
	    # Count the files under the current directory that mention this organisation.
	    my $total = `grep -REl \"      $org_abbrev  \" . | wc -l`;
	    $total =~ s/\s+//g;    # strip wc's padding and trailing newline
	    print $out "$nsf_orgs{$org_abbrev}\t$total\n";
	}
	close($out);
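
The same count can be done inside Perl, without shelling out to grep. This is only a sketch under a few assumptions: the records are already loaded as strings, the directorate line has the "six spaces, abbreviation, two spaces" shape that the grep pattern relies on, and the second demo record is made up for illustration. The names @org_patterns and tally_orgs are hypothetical.

```perl
use strict;
use warnings;

# Abbreviation patterns, mirroring the keys of %nsf_orgs in the script above.
my @org_patterns = ('BIO', '(CISE|CSE)', 'EHR', 'ENG', 'GEO', 'MPS', 'SBE');

# Count how many records mention each organisation.
# A record is counted at most once per pattern, like `grep -l`.
sub tally_orgs {
    my (@records) = @_;
    my %count = map { $_ => 0 } @org_patterns;
    for my $text (@records) {
        for my $pattern (@org_patterns) {
            # Same regex as the grep call: six spaces, abbreviation, two spaces.
            $count{$pattern}++ if $text =~ / {6}(?:$pattern) {2}/;
        }
    }
    return \%count;
}

my @records = (
    # Lines taken from the sample abstract.
    "      DBI  DIV OF BIOLOGICAL INFRASTRUCTURE\n"
  . "      BIO  DIRECT FOR BIOLOGICAL SCIENCES\n",
    # Made-up line, only to exercise the CISE/CSE alternation.
    "      CSE  (made-up directorate line for illustration)\n",
);

my $counts = tally_orgs(@records);
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```

To run this over a directory, each file would first have to be read into a string (for example with File::Find and a slurp), which is the part that grep -R handles for free in the original script.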
	

To visualise these proportions, let's make a pie chart using R and ggplot2.

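		library(ggplot2)
		# No header row, so the columns are named V1 (organisation) and V2 (count).
		prop <- read.delim("proportions.tsv", sep = "\t", header = FALSE)
		# A pie chart is a single stacked bar drawn in polar coordinates.
		ggplot(prop, aes(x = factor(1), y = V2, fill = V1)) +
			geom_bar(stat = "identity", width = 1) +
			coord_polar(theta = "y")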
	

Before showing the chart, however, we can also draw a comparison with 2002: we have the technology (since we just built it). The 2002 data may be more interesting than the seemingly sparse 2003 data. Moving the script to the directory holding the 2002 data and repeating the process gives the following pies:

We can see how, for instance, the CISE and Engineering portions grow, while the Education & Human Resources portion shrinks. The distribution of the other disciplines, however, seems to have remained fairly static. Further analysis is possible (and is planned): there may be underlying trends over time that this simple visualisation is unable to capture.
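
Differences between two pies are easier to judge as numbers. As a rough sketch, raw counts for two years can be converted to percentage shares and then subtracted. The counts below are placeholders chosen for easy arithmetic, not values from the data set, and the names shares, %year_a and %year_b are hypothetical.

```perl
use strict;
use warnings;

# Convert raw counts into percentage shares of the total.
sub shares {
    my ($count) = @_;
    my $total = 0;
    $total += $_ for values %$count;
    return {} unless $total;    # avoid dividing by zero on empty input
    return { map { $_ => 100 * $count->{$_} / $total } keys %$count };
}

# Placeholder counts, NOT taken from the data set.
my %year_a = (BIO => 10, ENG => 30,  EHR => 60);
my %year_b = (BIO => 40, ENG => 120, EHR => 40);

my $share_a = shares(\%year_a);
my $share_b = shares(\%year_b);

# One row per organisation: share in each year, then the change in points.
for my $org (sort keys %year_a) {
    printf "%s\t%.1f%%\t%.1f%%\t%+.1f\n",
        $org, $share_a->{$org}, $share_b->{$org},
        $share_b->{$org} - $share_a->{$org};
}
```

Working in shares rather than raw counts matters here because the total number of awards differs between years, so a larger count does not by itself mean a larger slice.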

References