## One-Way Analysis of Variance

Some of the most often used experimental designs in weed science include completely randomized designs (CRD) and randomized complete block designs (RCBD). These experiments are typically analyzed using analysis of variance (ANOVA). The one-way ANOVA is used to determine the effect of a single factor (with at least three levels) on a response variable. Where only two levels of a single factor are of interest, the `t.test()`

function will be more appropriate. There are several ways to conduct an ANOVA in the base R package. The `aov()`

function requires a response variable and the explanatory variable separated with the `~`

symbol. It is important when using the `aov()`

function that your data are balanced, with no missing values. For data sets with missing values or unbalanced designs, please see the section on mixed effects models.

The “FlumiBeans” data set is from a herbicide trial conducted in Wyoming during three different years (2009 through 2011). Five different herbicide treatments were applied, plus nontreated and handweeded controls for a total of seven treatments. The study was a randomized complete block design (RCBD) each year. There were 3 replicates in 2009, but 4 replicates in subsequent years. The response variable in this data set is dry bean (*Phaseolus vulgaris*) density 4 weeks after planting expressend in plants per acre.

```
read.csv("http://rstats4ag.org/data/FlumiBeans.csv")->bean.dat
bean.dat$population.4wk <- bean.dat$population.4wk * 2.47 # Convert to SI units (plants/ha)
summary(bean.dat)
```

```
## year treatment block
## Min. :2009 EPTC + ethafluralin :11 Min. :1.00
## 1st Qu.:2009 flumioxazin + ethafluralin :11 1st Qu.:1.00
## Median :2010 flumioxazin + pendimethalin:11 Median :2.00
## Mean :2010 flumioxazin + trifluralin :11 Mean :2.36
## 3rd Qu.:2011 Handweeded :11 3rd Qu.:3.00
## Max. :2011 imazamox + bentazon :11 Max. :4.00
## Nontreated :11
## population.4wk
## Min. : 17216
## 1st Qu.: 51648
## Median : 94683
## Mean : 87836
## 3rd Qu.:114052
## Max. :163541
##
```

It is important to note that some of our factor variables have been expressed in the .csv file as numbers (e.g. 'year' and 'block'), and consequently, R has recognized these factors as numeric variables. It is important for our ANOVA that these variables be included as factors, and not as numeric variables. We can tell R to view these variables as factors by re-naming them using the `as.factor()`

function.

```
bean.dat$year<-as.factor(bean.dat$year)
bean.dat$block<-as.factor(bean.dat$block)
summary(bean.dat)
```

```
## year treatment block population.4wk
## 2009:21 EPTC + ethafluralin :11 1:21 Min. : 17216
## 2010:28 flumioxazin + ethafluralin :11 2:21 1st Qu.: 51648
## 2011:28 flumioxazin + pendimethalin:11 3:21 Median : 94683
## flumioxazin + trifluralin :11 4:14 Mean : 87836
## Handweeded :11 3rd Qu.:114052
## imazamox + bentazon :11 Max. :163541
## Nontreated :11
```

Notice the difference for the 'year' and 'block' variables in the `summary()`

results before and after renaming. For this analysis, we want to look at a One-Way analysis of variance, and so we will focus on the data from only one year. We can subset the data by year, and call this subsetted data `bean.09`

.

```
bean.09<-subset(bean.dat, year==2009)
```

We will then use the `aov()`

function to generate the One-way ANOVA to see if herbicide treatment had an effect on dry bean population (expressed in plants per hectare) in 2009.

```
aov(population.4wk ~ block+treatment, data=bean.09)->b09.aov
```

### Checking ANOVA assumptions

ANOVA assumes (1) errors are normally distributed, (2) variances are homogeneous, and (3) observations are independent of each other. ANOVA is fairly robust to violations of the normality assumption, and thus it typically requires a rather severe violation before remedial measures are warranted. However, it is still wise to check the normality assumption prior to interpreting the ANOVA results. Three methods are provided to check normality of the response variable of interest: the Shapiro-Wilks test, a stem and leaf plot, and a Normal Q-Q plot. The model residuals are automatically stored within the ANOVA model produced with the `aov()`

function, and thus we can call upon the model to view and test the residuals for normality. We can get the residuals by using the `resid()`

function, or by appending `$resid`

onto the model fit. `resid(b09.aov)`

and `b09.aov$resid`

will provide the same result.

```
shapiro.test(b09.aov$resid)
```

```
##
## Shapiro-Wilk normality test
##
## data: b09.aov$resid
## W = 0.9718, p-value = 0.7726
```

```
stem(b09.aov$resid)
```

```
##
## The decimal point is 4 digit(s) to the right of the |
##
## -2 | 690
## -0 | 88607431
## 0 | 222599
## 2 | 2787
```

```
qqnorm(b09.aov$resid)
qqline(b09.aov$resid)
```

It appears that the residuals in this case are at least close to normally distributed. The most common next step if residuals are not normally distributed would be to attempt a transformation. However, it is worth noting that there are other, more modern methods to deal with non-normality that should be strongly considered.

The next assumption to test is homogeneity of variance. Formal tests for homogeneity of variance include the Bartlett (parametric) and Fligner-Killeen (non-parametric) tests. These tests can be used in conjunction with box plots to determine the level of heterogeneity of variance between treatments.

```
bartlett.test(bean.09$population.4wk, bean.09$treatment)
```

```
##
## Bartlett test of homogeneity of variances
##
## data: bean.09$population.4wk and bean.09$treatment
## Bartlett's K-squared = 3.787, df = 6, p-value = 0.7055
```

```
fligner.test(bean.09$population.4wk, bean.09$treatment)
```

```
##
## Fligner-Killeen test of homogeneity of variances
##
## data: bean.09$population.4wk and bean.09$treatment
## Fligner-Killeen:med chi-squared = 3.249, df = 6, p-value = 0.777
```

Formal tests do not indicate a problem (p>0.7), and box plots indicate that things are reasonable similar. Perhaps the handweeded and EPTC + ethalfluralin treatments have less variability than the others, but differences are not too dramatic, especially considering there were only 3 replicates per treatment in 2009.

```
par(mar=c(4.2,12,0.5,0.5))
boxplot(bean.09$population.4wk ~ bean.09$treatment, horizontal=T,
las=1, xlab="Bean plants per hectare")
```

Simply plotting the aov fit in R [e.g. `plot(b09.aov)`

] will provide a series of diagnostic plots as well. The first two plots are the residuals vs fitted values, and the Normal Q-Q plot, which can be plotted at the same time with the following code:

```
par(mar=c(4.2,4.2,2.2,0.5), mfrow=c(1,2))
plot(b09.aov, which=1:2)
```

Since neither the normality nor homogeneity of variance assumptions were dramatically violated, and ANOVA is fairly robust to violations of these assumptions, it seems safe to proceed with interpretation of the ANOVA results. The third assumption (independence of observations) is not something we will test statistically; it is up to the researcher to ensure that they are collecting data in a way that does not violate this assumption.

Finally, after the assumptions have been checked, and remedial measures taken if necessary, we will want to know whether there is a significant effect of herbicide treatment on dry bean population density. Once again, we can use the `summary()`

function along with the stored model to obtain this information. The resulting ANOVA table indicates that we have pretty strong evidence that there are differences in dry bean plants per acre due to herbicide treatment (p = 0.019).

```
summary(b09.aov)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## block 2 5.57e+08 2.79e+08 0.43 0.659
## treatment 6 1.57e+10 2.62e+09 4.05 0.019 *
## Residuals 12 7.74e+09 6.45e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

### Mean Separation - agricolae package

The `agricolae`

package has several functions for mean separation, including the `LSD.test()`

function that will compare means using Fisher's LSD. The LSD.test computes a lot of useful information that can be extracted for later use. You can use the `str(bean.lsd)`

function to learn more about what is available. Below, we will look at the `$group`

section which gives the treatment means and mean separation, and the `$statistics`

section which provides the LSD value (among other things).

```
library(agricolae)
bean.lsd <- LSD.test(b09.aov, trt="treatment")
bean.lsd$group
```

```
## trt means M
## 1 imazamox + bentazon 123373 a
## 2 EPTC + ethafluralin 120504 a
## 3 Nontreated 114766 a
## 4 flumioxazin + ethafluralin 94683 ab
## 5 Handweeded 86075 abc
## 6 flumioxazin + trifluralin 60252 bc
## 7 flumioxazin + pendimethalin 48775 c
```

```
bean.lsd$statistics
```

```
## Mean CV MSerror LSD
## 92632 27.42 6.45e+08 45182
```

To get standard deviations and confidence limits, we could type `bean.lsd$means`

. The `agricolae`

package can also calculate Tukey's HSD using the `HSD.test()`

function. The syntax above is the same, simply changing `LSD.test`

to `HSD.test`

.

### Mean Separation - multcomp package

The `multcomp`

package is also useful for separating means, and has the advantage that it can be used for linear, nonlinear, and mixed effects models.

```
library(multcomp)
b09.tukey <- summary(glht(b09.aov, linfct=mcp(treatment = "Tukey")))
b09.tukey # prints out all pairwise comparisons
cld(b09.tukey) # prints out the compact letter display (cld)
```

## Multi-Factor ANOVA

We'll now look at the second and third year of the study, beginning by subsetting the data to include only these years.

```
bean2.dat<-subset(bean.dat, year==2010 | year==2011)
summary(bean2.dat)
```

```
## year treatment block population.4wk
## 2009: 0 EPTC + ethafluralin :8 1:14 Min. : 17216
## 2010:28 flumioxazin + ethafluralin :8 2:14 1st Qu.: 51645
## 2011:28 flumioxazin + pendimethalin:8 3:14 Median : 95761
## flumioxazin + trifluralin :8 4:14 Mean : 86038
## Handweeded :8 3rd Qu.:115665
## imazamox + bentazon :8 Max. :163541
## Nontreated :8
```

We'll now incorporate year as a fixed effect in the analysis. We're going to exclude the 2009 data because it does not have the same number of replicates as the other two years of data. The `aov()`

function is not appropriate for unbalanced data. We will look at how to best handle this in a later section using mixed effects. We're also going to incorporate the blocking design into the model using the `Error()`

argument. Blocking was used in each year of the field study (in a randomized complete block design). Because the trial was located in a different field each year, there is no reason to think that block 1 would result in a similar impact from one year to the next. Therefore, the effect of block is nested within year. We can specify this structure in the `aov()`

function by using the `/`

operator.

```
bean2.aov<-aov(population.4wk ~ year*treatment + Error(year/block), data=bean2.dat)
summary(bean2.aov)
```

```
##
## Error: year
## Df Sum Sq Mean Sq
## year 1 5.46e+09 5.46e+09
##
## Error: year:block
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 6 1.36e+09 2.27e+08
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 6 6.59e+10 1.10e+10 51.92 2.9e-16 ***
## year:treatment 6 5.42e+09 9.04e+08 4.27 0.0024 **
## Residuals 36 7.61e+09 2.11e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Adding the Error term into the model results in 3 separate sections of the ANOVA table, corresponding to the three separate error terms: year, year:block, and within. In this case no F or P-values are calculated for the year effect. We could do that manually by dividing the Mean Square for year by the Mean Square for year:block, but it is clear just by looking at the numbers that the F value will be quite high (greater than 23, in fact). However, if we want `aov()`

to calculate the P-value for us, we can specify the year*block error term by creating a new variable that contains the combined information. We will do this below using the `with()`

function.

```
bean2.dat$yrblock <- with(bean2.dat, factor(year):factor(block))
bean2.aov2<-aov(population.4wk ~ year*treatment + Error(yrblock), data=bean2.dat)
summary(bean2.aov2)
```

```
##
## Error: yrblock
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 5.46e+09 5.46e+09 24 0.0027 **
## Residuals 6 1.36e+09 2.27e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 6 6.59e+10 1.10e+10 51.92 2.9e-16 ***
## year:treatment 6 5.42e+09 9.04e+08 4.27 0.0024 **
## Residuals 36 7.61e+09 2.11e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The Mean Square for year and the year:block interaction are the same as before, as expected, but now the F and P-values are provided. However, this is not as important in this case, as there is a significant year by treatment interaction effect, making the main effect of year less interesting.

### Mean Separation

Due to the different error terms, getting mean separation from the `LSD.test()`

function in this case is not as straightforward as with only one factor. It can still be done by specifying the response, treatment, error degrees of freedom, and mean square error manually.

```
bean2.lsd<-LSD.test(bean2.dat$population.4wk,
with(bean2.dat, year:treatment),
36, 2.11e+08)
bean2.lsd$groups
```

```
## trt means M
## 1 2010:imazamox + bentazon 142023 a
## 2 2010:EPTC + ethafluralin 139871 a
## 3 2010:Handweeded 124808 ab
## 4 2010:Nontreated 124808 ab
## 5 2011:Handweeded 104369 bc
## 6 2011:Nontreated 97375 c
## 7 2011:imazamox + bentazon 95761 c
## 8 2011:EPTC + ethafluralin 94148 c
## 9 2011:flumioxazin + trifluralin 57027 d
## 10 2010:flumioxazin + trifluralin 51645 de
## 11 2010:flumioxazin + pendimethalin 49493 de
## 12 2011:flumioxazin + pendimethalin 48957 de
## 13 2010:flumioxazin + ethafluralin 38734 de
## 14 2011:flumioxazin + ethafluralin 35507 e
```

It seems there is general trend for treatments containing flumioxazin to have fewer bean plants. But having the different years mixed together makes it difficult to see, so an alternative approach would be to simply calculate the means using `tapply()`

and print out the LSD value.

```
round(tapply(bean2.dat$population.4wk,
list(bean2.dat$treatment, bean2.dat$year), mean), -2)
```

```
## 2009 2010 2011
## EPTC + ethafluralin NA 139900 94100
## flumioxazin + ethafluralin NA 38700 35500
## flumioxazin + pendimethalin NA 49500 49000
## flumioxazin + trifluralin NA 51600 57000
## Handweeded NA 124800 104400
## imazamox + bentazon NA 142000 95800
## Nontreated NA 124800 97400
```

```
bean2.lsd$statistics
```

```
## Mean CV MSerror LSD
## 86038 16.88 2.11e+08 20831
```

Looking at the data in this way makes it appear that the year interaction was due to a higher bean population in 2010 compared to 2011 for most treatments. However, the flumioxazin treatments had similar bean populations in both years, much reduced compared to the nontreated or handweeded control treatments. The LSD value of 20800 allows us to compare the same treatment between years, or compare between treatments within a year.