2 Descriptive Statistics

2.1 Calculating group means

One of the most basic exploratory tasks with any data set involves computing the mean, variance, and other descriptive statistics. A previous section has already demonstrated how to obtain many of these statistics from a data set, using the summary(), mean(), and sd() functions. However these functions were used in the context of an entire data set or column from a data set; in most cases it will be more informative to calculate these statistics for groups of data, such as experimental treatments. The tapply() function can be used for this purpose. Summary statistics for the corn irrigation data set were calculated previously using the summary() function. The tapply() function allows us to calculate similar descriptive statistics for groups of data, most commonly for treatments, but also for other logical groups, such as by experimental sites, or years. The tapply() function requires three arguments:

  • The data column you want to summarize (response variable)
  • The data column you wish to group the data by (treatment or grouping variable)
  • The function you want to calculate (mean, standard deviation, maximum, etc.)

The example below will calculate mean yield for each irrigation block (Full irrigation or Limited irrigation).

corn.dat <- read.csv("http://rstats4ag.org/data/irrigcorn.csv")
tapply(corn.dat$Yield.BuA, corn.dat$Irrig, mean)
##     Full  Limited 
## 181.1667 163.2708

To specify multiple grouping variables, we use the list() function within the tapply() function:

tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), mean)
##           Full  Limited
## 23000 171.1667 147.3333
## 33000 191.1667 179.2083

2.2 Calculating other group statistics with tapply

We can use the same syntax to calculate nearly any statistic for the groups by replacing the mean argument with another function, such as max, min, sd etc.

tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), median)
##       Full Limited
## 23000  171   152.5
## 33000  191   182.5

You can also specify other functions or operations. For example, there is no default function in R to calculate the CV, or coefficient of variation. This statistic is commonly used in crop sciences. The CV is simply the standard deviation divided by the mean, and so we can calculate the CV using by storing the output from tapply() into an object that we can call upon later.

popirr.sd   <- tapply(corn.dat$Yield.BuA, 
                      list(corn.dat$Population.A, corn.dat$Irrig), sd)
popirr.mean <- tapply(corn.dat$Yield.BuA, 
                      list(corn.dat$Population.A, corn.dat$Irrig), mean)
popirr.cv<-popirr.sd/popirr.mean
round(popirr.cv, 3)
##        Full Limited
## 23000 0.075   0.102
## 33000 0.052   0.098

The round() function simply rounds the result to 3 decimal places to make the output more readable. Alternatively, if the CV is a statistic we will use regularly, it may be worthwhile to write our own function to save keystrokes later on. We can do this using function().

cv <- function(x) {
  xm  <- mean(x)
  xsd <- sd(x)
  xcv <- xsd/xm
  round(xcv, 4)
}

Or, more efficiently:

cv <- function(x) {
  xcv <- round(sd(x) / mean(x), 4)
}
tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), cv)
##         Full Limited
## 23000 0.0750  0.1021
## 33000 0.0523  0.0976

This approach is desirable if the CV will be used repeatedly. Once the function cv() has been defined, you can call upon it repeatedly with very few keystrokes.

2.3 Using dplyr to summarize data

In addition to using tapply() from base stats R package, there is an excellent package for summarizing data called dplyr. We can get similar information from the corn data set using the dplyr package using the following syntax (explained in more detail below).

library(dplyr)
corn.summary <- corn.dat %>%
  group_by(Population.A, Irrig) %>%
  summarize(N        = length(Yield.BuA),
            AvgYield = round(mean(Yield.BuA),1),
            CV       = cv(Yield.BuA))
corn.summary            
## # A tibble: 4 x 5
## # Groups:   Population.A [2]
##   Population.A Irrig       N AvgYield     CV
##          <int> <fct>   <int>    <dbl>  <dbl>
## 1        23000 Full       24     171. 0.075 
## 2        23000 Limited    24     147. 0.102 
## 3        33000 Full       24     191. 0.0523
## 4        33000 Limited    24     179. 0.0976

We will explain and demonstrate some of the basic uses of dplyr using herbicide resistant weed data from Canada. The following data set was assembled using data from weedscience.org.

resistance<-read.csv("http://rstats4ag.org/data/CanadaResistance2.csv")
resistance<-na.omit(resistance)
head(resistance)
##   Province ID          SciName Year                                     MOA
## 1  Alberta  1  Stellaria media 1988                    ALS inhibitors (B/2)
## 2  Alberta  2  Kochia scoparia 1989                    ALS inhibitors (B/2)
## 3  Alberta  3      Avena fatua 1989 Multiple Resistance: 2 Sites of Action 
## 6  Alberta  4  Setaria viridis 1989           Microtubule inhibitors (K1/3)
## 7  Alberta  5      Avena fatua 1991                 ACCase inhibitors (A/1)
## 8  Alberta  6 Sinapis arvensis 1993                    ALS inhibitors (B/2)

In the data file, there are instances of multiple resistance (resistance to more than one mode of action). The second line of the code above (na.omit()) removes all lines in the data set that contain NA (which is the default value in R for missing data). This is necessary for this particular example so that instances of multiple resistance are not counted multiple times. We may want to use the information from those rows later on, however, and therefore this method is preferable to deleting the rows from the raw data file.

One of the first things we might like to do is simply count the number of herbicide resistant weed cases by mode of action. We could do this rather quickly using the tapply() function discussed in the previous section: tapply(resistance$ID, resistance$MOA, length). In the code below, we will use the dplyr package to get some interesting information from this data set. First, we can simply count the number of herbicide resistant weeds for each mode of action in the data set. Initially it seems like much more typing compared to using tapply(), but later on it will become clear why this is a potentially more powerful method.

library(dplyr)
bymoa<-group_by(resistance, MOA)
moatotal<-summarize(bymoa, total=length(ID))
print(moatotal)

In the code above, we are loading the dplyr package, then using two functions from that package. The group_by() function will provide the information on how the data are grouped. In this case, we want to group the data by MOA. We can then use the the summarize() function to calculate means, counts, sums, or any other operation on the data. Above, we used the length() function to count the unique instances of the ID variable for each mode of action. Storing each step with the <- operator can be a little cumbersome, especially if we will not need to use the intermediary steps again. The dplyr package uses the %>% operator to string together several steps. The code below will result in the same summary as above, with fewer keystrokes, and without storing the intermediary grouping as a separate object. The first line provides the name of the data set (resistance) followed by the %>% operator. This tells R to expect a function on the next line from the dplyr package. The second line groups the resistance data set by MOA, and the third line summarizes the data by calculating the length of the ID variable for each level of the grouping factor. Effectively, this counts each instance of resistance. The last line prints out the resulting data frame.

moatotal <- resistance %>%
  group_by(MOA) %>%
  summarize(total=length(ID))
print(moatotal)
## # A tibble: 13 x 2
##    MOA                                       total
##    <fct>                                     <int>
##  1 ACCase inhibitors (A/1)                       9
##  2 ALS inhibitors (B/2)                         42
##  3 EPSP synthase inhibitors (G/9)                2
##  4 Lipid Inhibitors (thiocarbamates) (N/8)       2
##  5 Microtubule inhibitors (K1/3)                 3
##  6 "Multiple Resistance: 2 Sites of Action "    12
##  7 "Multiple Resistance: 3 Sites of Action "     3
##  8 "Multiple Resistance: 4 Sites of Action "     1
##  9 Photosystem II inhibitors (C1/5)             12
## 10 PSI Electron Diverter (D/22)                  3
## 11 PSII inhibitors (Nitriles) (C3/6)             2
## 12 PSII inhibitor (Ureas and amides) (C2/7)      3
## 13 Synthetic Auxins (O/4)                        3

The real power and elegance of the dplyr package becomes apparent once we begin using multiple groupings, and creating new variables. Below, we generate totals for each mode of action again, but also by year of listing by using the group_by() and sumarize() functions.

totals<-resistance %>%
  group_by(Year, MOA) %>%
  summarize(total = length(ID)) 
head(totals)
## # A tibble: 6 x 3
## # Groups:   Year [6]
##    Year MOA                              total
##   <int> <fct>                            <int>
## 1  1957 Synthetic Auxins (O/4)               1
## 2  1973 Photosystem II inhibitors (C1/5)     1
## 3  1976 Photosystem II inhibitors (C1/5)     2
## 4  1977 Photosystem II inhibitors (C1/5)     3
## 5  1980 Photosystem II inhibitors (C1/5)     1
## 6  1981 Photosystem II inhibitors (C1/5)     3

We can then create a new variable using the mutate() function; we will add a cumulative total of herbicide resistance cases over time, while retaining information about mode of action by first ungrouping the data, then using mutate to calculate the cumulative total using the cumsum() from the default stats package.

totals <- resistance %>%
  group_by(Year, MOA) %>%
  summarize(total = length(ID)) %>%
  ungroup() %>%
  mutate(cumulative = cumsum(total))

head(totals)
## # A tibble: 6 x 4
##    Year MOA                              total cumulative
##   <int> <fct>                            <int>      <int>
## 1  1957 Synthetic Auxins (O/4)               1          1
## 2  1973 Photosystem II inhibitors (C1/5)     1          2
## 3  1976 Photosystem II inhibitors (C1/5)     2          4
## 4  1977 Photosystem II inhibitors (C1/5)     3          7
## 5  1980 Photosystem II inhibitors (C1/5)     1          8
## 6  1981 Photosystem II inhibitors (C1/5)     3         11
tail(totals)
## # A tibble: 6 x 4
##    Year MOA                                       total cumulative
##   <int> <fct>                                     <int>      <int>
## 1  2010 EPSP synthase inhibitors (G/9)                1         88
## 2  2011 ACCase inhibitors (A/1)                       1         89
## 3  2011 ALS inhibitors (B/2)                          2         91
## 4  2011 "Multiple Resistance: 2 Sites of Action "     2         93
## 5  2012 ALS inhibitors (B/2)                          1         94
## 6  2012 "Multiple Resistance: 2 Sites of Action "     3         97

Based on the final row of data, there are 97 unique cases of herbicide resistance in this data set. And now, we can use the new data frame generated by dplyr to make a plot of cumulative herbicide resistance cases over time, coded by herbicide mode of action. The specifics of plotting are discussed in the Data Visualization section.

par(bg=gray(0.95), mar=c(4.5,4.5,0.8,0.8), mgp=c(2,.7,0))
plot(cumulative~Year, col=as.numeric(MOA), pch=as.numeric(MOA), 
     data=totals, ylab="Herbicide Resistant Weed Cases in Canada")
legend("topleft", legend=unique(totals$MOA),
       col=as.numeric(unique(totals$MOA)), 
       pch=as.numeric(unique(totals$MOA)))
Cumulative number of herbicide resistant weed cases in Canada over time.

Figure 2.1: Cumulative number of herbicide resistant weed cases in Canada over time.

We can also use dplyr to look at other grouping, including by province or weed species, and then use the subset() function to print out only one group. In the below example, we will calculate totals for each province by weed species, but print out only the cases of herbicide resistant weeds from Saskatchewan.

provspec <- resistance %>%
  group_by(Province, SciName) %>%
  summarize(total=length(ID))
subset(provspec, Province=="Saskatchewan")
## # A tibble: 12 x 3
## # Groups:   Province [1]
##    Province     SciName                 total
##    <fct>        <fct>                   <int>
##  1 Saskatchewan Amaranthus retroflexus      1
##  2 Saskatchewan Avena fatua                 4
##  3 Saskatchewan Capsella bursa-pastoris     1
##  4 Saskatchewan Chenopodium album           1
##  5 Saskatchewan Galium spurium              1
##  6 Saskatchewan Kochia scoparia             2
##  7 Saskatchewan Lolium persicum             1
##  8 Saskatchewan Salsola tragus              1
##  9 Saskatchewan Setaria viridis             3
## 10 Saskatchewan Sinapis arvensis            1
## 11 Saskatchewan Stellaria media             1
## 12 Saskatchewan Thlaspi arvense             1