## Calculating group means

One of the most basic exploratory tasks with any data set involves computing the mean, variance, and other descriptive statistics. A previous section has already demonstrated how to obtain many of these statistics from a data set, using the `summary()`

, `mean()`

, and `sd()`

functions. However these functions were used in the context of an entire data set or column from a data set; in most cases it will be more informative to calculate these statistics for groups of data, such as experimental treatments. The `tapply()`

function can be used for this purpose. Summary statistics for the corn irrigation data set were calculated previously using the `summary()`

function. The `tapply()`

function allows us to calculate similar descriptive statistics for groups of data, most commonly for treatments, but also for other logical groups, such as by experimental sites, or years. The `tapply()`

function requires three arguments:

- The data column you want to summarize (response variable)
- The data column you wish to group the data by (treatment or grouping variable)
- The function you want to calculate (mean, standard deviation, maximum, etc.)

The example below will calculate mean yield for each irrigation block (Full irrigation or Limited irrigation).

```
corn.dat <- read.csv("http://rstats4ag.org/data/irrigcorn.csv")
tapply(corn.dat$Yield.BuA, corn.dat$Irrig, mean)
```

```
## Full Limited
## 181.2 163.3
```

To specify multiple grouping variables, we use the `list()`

function within the `tapply()`

function:

```
tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), mean)
```

```
## Full Limited
## 23000 171.2 147.3
## 33000 191.2 179.2
```

## Calculating other group statistics with tapply

We can use the same syntax to calculate nearly any statistic for the groups by replacing the `mean`

argument with another function, such as `max`

, `min`

, `sd`

etc.

```
tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), median)
```

```
## Full Limited
## 23000 171 152.5
## 33000 191 182.5
```

You can also specify other functions or operations. For example, there is no default function in `R`

to calculate the CV, or coefficient of variation. This statistic is commonly used in crop sciences. The CV is simply the standard deviation divided by the mean, and so we can calculate the CV using by storing the output from `tapply()`

into an object that we can call upon later.

```
popirr.sd <- tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), sd)
popirr.mean <- tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), mean)
popirr.cv<-popirr.sd/popirr.mean
round(popirr.cv, 3)
```

```
## Full Limited
## 23000 0.075 0.102
## 33000 0.052 0.098
```

The `round()`

function simply rounds the result to 3 decimal places to make the output more readable. Alternatively, if the CV is a statistic we will use regularly, it may be worthwhile to write our own function to save keystrokes later on. We can do this using `function()`

.

```
cv <- function(x) {
xm <- mean(x)
xsd <- sd(x)
xcv <- xsd/xm
round(xcv, 4)
}
```

Or, more efficiently:

```
cv <- function(x) {
xcv <- round(sd(x) / mean(x), 4)
}
tapply(corn.dat$Yield.BuA, list(corn.dat$Population.A, corn.dat$Irrig), cv)
```

```
## Full Limited
## 23000 0.0750 0.1021
## 33000 0.0523 0.0976
```

This approach is desirable if the CV will be used repeatedly. Once the function `cv()`

has been defined, you can call upon it repeatedly with very few keystrokes.

## Using dplyr to summarize data

In addition to using `tapply()`

from base stats R package, there is an excellent package for summarizing data called `dplyr`

. We can get similar information from the corn data set using the `dplyr`

package using the following syntax (explained in more detail below).

```
library(dplyr)
corn.summary <- corn.dat %.%
group_by(Population.A, Irrig) %.%
summarize(N = length(Yield.BuA),
AvgYield = round(mean(Yield.BuA),1),
CV = cv(Yield.BuA))
corn.summary
```

```
## Source: local data frame [4 x 5]
## Groups: Population.A
##
## Population.A Irrig N AvgYield CV
## 1 23000 Full 24 171.2 0.0750
## 2 23000 Limited 24 147.3 0.1021
## 3 33000 Full 24 191.2 0.0523
## 4 33000 Limited 24 179.2 0.0976
```

We will explain and demonstrate some of the basic uses of `dplyr`

using herbicide resistant weed data from Canada. The following data set was assembled using data from weedscience.org.

```
resistance<-read.csv("http://rstats4ag.org/data/CanadaResistance2.csv")
resistance<-na.omit(resistance)
head(resistance)
```

```
## Province ID SciName Year MOA
## 1 Alberta 1 Stellaria media 1988 ALS inhibitors (B/2)
## 2 Alberta 2 Kochia scoparia 1989 ALS inhibitors (B/2)
## 3 Alberta 3 Avena fatua 1989 Multiple Resistance: 2 Sites of Action
## 6 Alberta 4 Setaria viridis 1989 Microtubule inhibitors (K1/3)
## 7 Alberta 5 Avena fatua 1991 ACCase inhibitors (A/1)
## 8 Alberta 6 Sinapis arvensis 1993 ALS inhibitors (B/2)
```

In the data file, there are instances of multiple resistance (resistance to more than one mode of action).

The second line of the code above (`na.omit()`

) removes all lines in the data set that contain `NA`

(which is the default value in R for missing data). This is necessary for this particular example so that instances of multiple resistance are not counted multiple times. We may want to use the information from those rows later on, however, and therefore this method is preferable to deleting the rows from the raw data file.

One of the first things we might like to do is simply count the number of herbicide resistant weed cases by mode of action. We could do this rather quickly using the `tapply()`

function discussed in the previous section: `tapply(resistance$ID, resistance$MOA, length)`

. In the code below, we will use the `dplyr`

package to get some interesting information from this data set. First, we can simply count the number of herbicide resistant weeds for each mode of action in the data set. Initially it seems like much more typing compared to using `tapply()`

, but later on it will become clear why this is a potentially more powerful method.

```
library(dplyr)
bymoa<-group_by(resistance, MOA)
moatotal<-summarize(bymoa, total=length(ID))
print(moatotal)
```

In the code above, we are loading the dplyr package, then using two functions from that package. The `group_by()`

function will provide the information on how the data are grouped. In this case, we want to group the data by MOA. We can then use the the `summarize()`

function to calculate means, counts, sums, or any other operation on the data. Above, we used the `length()`

function to count the unique instances of the ID variable for each mode of action. Storing each step with the `<-`

operator can be a little cumbersome, especially if we will not need to use the intermediary steps again. The `dplyr`

package uses the `%.%`

operator to string together several steps. The code below will result in the same summary as above, with fewer keystrokes, and without storing the intermediary grouping as a separate object. The first line provides the name of the data set (`resistance`

) followed by the `%.%`

operator. This tells `R`

to expect a function on the next line from the dplyr package. The second line groups the resistance data set by MOA, and the third line summarizes the data by calculating the length of the ID variable for each level of the grouping factor. Effectively, this counts each instance of resistance. The last line prints out the resulting data frame.

```
moatotal <- resistance %.%
group_by(MOA) %.%
summarize(total=length(ID))
print(moatotal)
```

```
## Source: local data frame [13 x 2]
##
## MOA total
## 1 ACCase inhibitors (A/1) 9
## 2 ALS inhibitors (B/2) 42
## 3 EPSP synthase inhibitors (G/9) 2
## 4 Lipid Inhibitors (thiocarbamates) (N/8) 2
## 5 Microtubule inhibitors (K1/3) 3
## 6 Multiple Resistance: 2 Sites of Action 12
## 7 Multiple Resistance: 3 Sites of Action 3
## 8 Multiple Resistance: 4 Sites of Action 1
## 9 Photosystem II inhibitors (C1/5) 12
## 10 PSI Electron Diverter (D/22) 3
## 11 PSII inhibitors (Nitriles) (C3/6) 2
## 12 PSII inhibitor (Ureas and amides) (C2/7) 3
## 13 Synthetic Auxins (O/4) 3
```

The real power and elegance of the `dplyr`

package becomes apparent once we begin using multiple groupings, and creating new variables. Below, we generate totals for each mode of action again, but also by year of listing:

```
totals<-resistance %.%
group_by(Year, MOA) %.%
summarize(total=length(ID))
head(totals)
```

```
## Source: local data frame [6 x 3]
## Groups: Year
##
## Year MOA total
## 1 1957 Synthetic Auxins (O/4) 1
## 2 1973 Photosystem II inhibitors (C1/5) 1
## 3 1976 Photosystem II inhibitors (C1/5) 2
## 4 1977 Photosystem II inhibitors (C1/5) 3
## 5 1980 Photosystem II inhibitors (C1/5) 1
## 6 1981 Photosystem II inhibitors (C1/5) 3
```

We can then treat the resulting data.frame as any other, and create a new variable, which will keep a cumulative total of herbicide resistance cases over time, while retaining information about mode of action. A cumulative total can be calculated using the `cumsum()`

from the default stats package.

```
totals$cumulative<-cumsum(totals$total)
head(totals)
```

```
## Source: local data frame [6 x 4]
## Groups: Year
##
## Year MOA total cumulative
## 1 1957 Synthetic Auxins (O/4) 1 1
## 2 1973 Photosystem II inhibitors (C1/5) 1 2
## 3 1976 Photosystem II inhibitors (C1/5) 2 4
## 4 1977 Photosystem II inhibitors (C1/5) 3 7
## 5 1980 Photosystem II inhibitors (C1/5) 1 8
## 6 1981 Photosystem II inhibitors (C1/5) 3 11
```

```
tail(totals)
```

```
## Source: local data frame [6 x 4]
## Groups: Year
##
## Year MOA total cumulative
## 56 2010 EPSP synthase inhibitors (G/9) 1 88
## 57 2011 ACCase inhibitors (A/1) 1 89
## 58 2011 ALS inhibitors (B/2) 2 91
## 59 2011 Multiple Resistance: 2 Sites of Action 2 93
## 60 2012 ALS inhibitors (B/2) 1 94
## 61 2012 Multiple Resistance: 2 Sites of Action 3 97
```

Based on the final row of data, there are 97 unique cases of herbicide resistance in this data set. And now, we can use the new data frame generated by `dplyr`

to make a plot of cumulative herbicide resistance cases over time, coded by herbicide mode of action. The specifics of plotting are discussed in the Data Visualization section.

```
par(bg=gray(0.95), mar=c(4.5,4.5,0.8,0.8))
plot(cumulative~Year, col=as.numeric(MOA), pch=as.numeric(MOA),
data=totals, ylab="Herbicide Resistant Weed Cases in Canada")
legend("topleft", legend=unique(totals$MOA),
col=as.numeric(unique(totals$MOA)),
pch=as.numeric(unique(totals$MOA)))
```

We can also use `dplyr`

to look at other grouping, including by province or weed species, and then use the `subset()`

function to print out only one group. In the below example, we will calculate totals for each province by weed species, but print out only the cases of herbicide resistant weeds from Saskatchewan.

```
provspec <- resistance %.%
group_by(Province, SciName) %.%
summarize(total=length(ID))
subset(provspec, Province=="Saskatchewan")
```

```
## Source: local data frame [12 x 3]
## Groups:
##
## Province SciName total
## 51 Saskatchewan Amaranthus retroflexus 1
## 52 Saskatchewan Avena fatua 4
## 53 Saskatchewan Capsella bursa-pastoris 1
## 54 Saskatchewan Chenopodium album 1
## 55 Saskatchewan Galium spurium 1
## 56 Saskatchewan Kochia scoparia 2
## 57 Saskatchewan Lolium persicum 1
## 58 Saskatchewan Salsola tragus 1
## 59 Saskatchewan Setaria viridis 3
## 60 Saskatchewan Sinapis arvensis 1
## 61 Saskatchewan Stellaria media 1
## 62 Saskatchewan Thlaspi arvense 1
```