Part 3: Analysis with Factors

The Dataset

The data used in our exploration comes from a collection of NSF Research Awards Abstracts consisting of 129,000 abstracts, collected by researchers at the University of California, Irvine. The full data set is large (more than 500 MB), so some preprocessing is necessary before exploring our subset of interest with R.

As in the last part, Perl is a good fit for extracting the data until I decide to import it all into a database. A Perl script produces a tab-separated file containing the start year of each grant, the NSF organisation associated with the grant, the total amount per award, and the grant's start and end dates. The data can then be loaded into R with the following command:

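# stringsAsFactors = TRUE keeps text columns as factors, as in the output below (no longer the default since R 4.0.0)
g <- read.table("amounts_dates.tsv", sep = "\t", stringsAsFactors = TRUE, col.names = c("Year", "NSF Organisation", "Grant Amount", "Start Date", "End Date"))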
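
Note that the column names passed in contain spaces, yet the data frame ends up with dotted names such as NSF.Organisation. That is read.table() making the names syntactically valid (its check.names argument defaults to TRUE). A minimal sketch, using two rows reconstructed from the head() output further down rather than the real file (the names txt and toy_g are made up for this example):

```r
# Two tab-separated rows in the same layout as amounts_dates.tsv
# (reconstructed for illustration, not read from the real file)
txt <- paste(
  "1990\tDEB\t179720\tJune 1, 1990\tNovember 30, 1992",
  "1990\tMCB\t300000\tJune 1, 1990\tMay 31, 1994",
  sep = "\n"
)

toy_g <- read.table(text = txt, sep = "\t",
                    col.names = c("Year", "NSF Organisation", "Grant Amount",
                                  "Start Date", "End Date"))

# Spaces in the supplied names are replaced with dots
print(names(toy_g))
```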

Analysis

Can we get a rough idea of how long each NSF organisation's grants run? We could turn the start and end dates into an interval in weeks and take the average of that.

The dates are currently factors, which isn't what we want. They should be Date values.

str(g)
'data.frame':   132372 obs. of  5 variables:
 $ Year            : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ NSF.Organisation: Factor w/ 71 levels "","ACI","ANI",..: 19 48 25 23 57 10 57 57 57 57 ...
 $ Grant.Amount    : int  179720 300000 188574 225024 463490 53277 3842340 14546493 2916509 50000 ...
 $ Start.Date      : Factor w/ 1257 levels "","April 1, 1990",..: 668 668 548 1122 690 668 441 441 441 263 ...
 $ End.Date        : Factor w/ 726 levels "","April 1, 1991",..: 581 521 169 220 582 581 399 460 399 518 ...
g$Start.Date <- as.Date(as.character(g$Start.Date), format='%B %d, %Y')
g$End.Date <- as.Date(as.character(g$End.Date), format='%B %d, %Y')
str(g)
'data.frame':   132372 obs. of  5 variables:
 $ Year            : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ NSF.Organisation: Factor w/ 71 levels "","ACI","ANI",..: 19 48 25 23 57 10 57 57 57 57 ...
 $ Grant.Amount    : int  179720 300000 188574 225024 463490 53277 3842340 14546493 2916509 50000 ...
 $ Start.Date      :Class 'Date'  num [1:132372] 7456 7456 7486 7548 7470 ...
 $ End.Date        :Class 'Date'  num [1:132372] 8369 8916 8765 8459 8734 ...
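
To see what the conversion is doing, here is a small sketch on a made-up factor (the vector d is hypothetical). A factor stores integer level codes, which is what the numbers in the first str() output are; going through as.character() recovers the labels so they can be parsed. The %B format expects a full month name in the current locale, so this assumes an English locale.

```r
# Toy factor of dates in the same "Month day, Year" style (made-up values)
d <- factor(c("June 1, 1990", "April 1, 1991", "June 1, 1990"))

# The underlying storage is integer codes into the sorted levels
codes <- as.integer(d)

# Recover the labels, then parse them as dates
parsed <- as.Date(as.character(d), format = "%B %d, %Y")

print(levels(d))
print(codes)
print(parsed)
```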

Let's turn the start/end dates into an actual interval.

g$Duration <- with(g, difftime(End.Date, Start.Date, units='weeks'))
head(g)
  Year NSF.Organisation Grant.Amount Start.Date   End.Date       Duration
1 1990              DEB       179720 1990-06-01 1992-11-30 130.4286 weeks
2 1990              MCB       300000 1990-06-01 1994-05-31 208.5714 weeks
3 1990              DMS       188574 1990-07-01 1993-12-31 182.7143 weeks
4 1990              DMI       225024 1990-09-01 1993-02-28 130.1429 weeks
5 1990              OCE       463490 1990-06-15 1993-11-30 180.5714 weeks
6 1990              CCR        53277 1990-06-01 1992-11-30 130.4286 weeks
g$Duration <- as.numeric(g$Duration)
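
As a quick sanity check on the first row, the arithmetic can be redone by hand: the two dates are 913 days apart, and 913 / 7 is about 130.4286 weeks. A minimal sketch (the variable names are just for this example):

```r
start <- as.Date("1990-06-01")
end   <- as.Date("1992-11-30")

# Subtracting two Dates gives a difftime in days
days <- as.numeric(end - start)

# The same interval expressed in weeks, then stripped of its units
weeks <- as.numeric(difftime(end, start, units = "weeks"))

print(days)
print(round(weeks, 4))
```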

Now that we have this duration, can we turn it into a factor and examine the grant amount by year? We'll make a dot plot with the colour corresponding to the duration. Let's cut() it into three levels: "Short", "Medium" and "Long".

g$Duration <- cut(g$Duration, breaks=3, labels=c("Short", "Medium", "Long"), ordered_result=TRUE)
gf <- na.omit(g) # There was one observation with no data except the year

The resulting plot wasn't very helpful. Let's look at the raw numbers.

table(gf$Duration)
 Short Medium   Long 
132369      1      1 

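It's completely lopsided! Maybe if we remove the "outliers" and recalculate…

gf2 <- gf[gf$Duration=="Short",]
# Assumed step: recalculate the numeric duration for the subset and cut() it again
gf2$Duration <- cut(as.numeric(with(gf2, difftime(End.Date, Start.Date, units='weeks'))), breaks=3, labels=c("Short", "Medium", "Long"), ordered_result=TRUE)
table(gf2$Duration)
 Short Medium   Long 
132363      5      1 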

Still skewed. Maybe this tells us something. Even re-cutting with 5 breaks gives a heavily skewed distribution:

Very Short      Short     Medium       Long  Very Long 
    132363          1          4          0          1 
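
The skew makes sense once you remember that cut() with a single number for breaks divides the range of the data into equal-width intervals, so a handful of extreme durations stretch the range and push almost everything into the first bin. Here is a minimal sketch on made-up numbers (the vector x is hypothetical). Breaking at quantiles is shown as one possible alternative, not as what was done above.

```r
# Made-up durations in weeks: mostly modest, with one extreme value
x <- c(50, 52, 100, 104, 130, 156, 180, 208, 5000)
lab <- c("Short", "Medium", "Long")

# Equal-width bins over the whole range: the outlier dominates
equal_width <- cut(x, breaks = 3, labels = lab, ordered_result = TRUE)

# Alternative (an assumption for illustration): break at the terciles
# so that each bin holds roughly a third of the observations
q <- quantile(x, probs = c(0, 1/3, 2/3, 1))
by_quantile <- cut(x, breaks = q, labels = lab,
                   include.lowest = TRUE, ordered_result = TRUE)

print(table(equal_width))
print(table(by_quantile))
```

With equal-width bins, eight of the nine toy values land in "Short"; with tercile breaks they split three ways. Whether quantile bins are meaningful for the grant data is a separate question.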

In any case, let's look at what we came here for: mean duration. We'll recalculate the duration once more (since I turned it into a factor) and leave it as a numeric interval in weeks. We've done that before, so I'll omit that part. Then we can calculate the average per organisation with ddply() from the plyr package, make a factor out of the result, and see what happens.

library(plyr) # provides ddply() and numcolwise()
m <- ddply(gf2[c(2,6)], .(NSF.Organisation), numcolwise(mean))
m$df <- cut(m$Duration, breaks=5, labels=c("Very Short", "Short", "Medium", "Long", "Very Long"), ordered_result=TRUE)
table(m$df)
Very Short      Short     Medium       Long  Very Long 
        13         13         32         10          2 
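
If plyr isn't installed, the same per-group mean can be sketched in base R with aggregate(). The toy data frame below is made up, and this is an equivalent for illustration rather than the code used above.

```r
# Made-up durations (in weeks) for three organisations
toy <- data.frame(
  NSF.Organisation = c("DEB", "DEB", "MCB", "DMS", "DMS", "DMS"),
  Duration = c(130, 150, 208, 180, 190, 200)
)

# One row per organisation, holding the mean duration of its grants
m_toy <- aggregate(Duration ~ NSF.Organisation, data = toy, FUN = mean)

print(m_toy)
```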

Much better, though there are undoubtedly problems with this approach. Now let's see what a dot plot would look like.

References