Linear Regression

Simple Linear Regression

Linear regression in R is very similar to analysis of variance. In fact, the mathematics behind simple linear regression and a one-way analysis of variance are basically the same. The main difference is that we use ANOVA when our treatments are unstructured (say, comparing 5 different pesticides or fertilizers), and we use regression when we want to evaluate structured treatments (say, evaluating 5 different rates of the same pesticide or fertilizer). The main difference is the use of the lm() function for linear regression instead of the aov() function for ANOVA. However, it should be noted that the lm() function can also be used to conduct ANOVA. The primary advantage of aov() for ANOVA is the ability to incorporate multi-strata error terms (like for split-plot designs) easily.

To illustrate linear regression in R, we will use data from (Kniss et al. 2012), quantifying the sugar yield of sugarbeet in response to volunteer corn density. The response variable is sucrose production in pounds/acre (LbsSucA), and the independent variable is volunteer corn density in plants per foot of row. The lm() function syntax (for the purpose of regression) asks for a formula relating the response variable (Y) to independent variable(s), e.g. lm(Y ~ X).

beets.dat <- read.csv("")
##    Loc Density LbsSucA
## 1 WY09    0.00   10106
## 2 WY09    0.03    8639
## 3 WY09    0.05    7752
## 4 WY09    0.08    5718
## 5 WY09    0.11    7954
## 6 WY09    0.15    6012
density.lm <- lm(LbsSucA ~ Density, data = beets.dat)

Model Diagnostics

The assumptions can be checked in the same manner as with the fitted aov object (see ANOVA section for full details). Quick diagnostic plots can be obtained by plotting the fitted lm object.

plot(density.lm, which=1:2)

plot of chunk unnamed-chunk-2

We can also conduct a lack-of-fit test to be sure a linear model is appropriate. The simplest way to do this is to fit a linear model that removes the linear structure of the independent variable (treating it as a factor variable). The models can then be compared using the anova() function.

density.lm2 <- lm(LbsSucA ~ as.factor(Density), data = beets.dat)
anova(density.lm, density.lm2)
## Analysis of Variance Table
## Model 1: LbsSucA ~ Density
## Model 2: LbsSucA ~ as.factor(Density)
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)
## 1     22 42066424                         
## 2     18 40058981  4   2007444 0.23   0.92

When 2 linear models are provided to the anova() function, the models will be compared using an F-test. This is only appropriate when the models are nested; that is, one model is a reduced form of the other. The highly non-significant p-value above (p = 0.92) indicates that removing the structured nature of the density variable does not increase the model fit. Therefore, we will proceed with the regression analysis.

Model Interpretation

Using the anova() function on a a single fitted lm object produces output similar to the aov() function. The ANOVA table indicates strong evidence that volunteer corn density has an effect on the sugar production, as expected. To get the regression equation, we use the summary() function on the fitted lm model:

## Analysis of Variance Table
## Response: LbsSucA
##           Df   Sum Sq  Mean Sq F value Pr(>F)   
## Density    1 25572276 25572276    13.4 0.0014 **
## Residuals 22 42066424  1912110                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## lm(formula = LbsSucA ~ Density, data = beets.dat)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2086.0 -1084.0    23.9   726.2  3007.7 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8678        486   17.87  1.4e-14 ***
## Density       -20645       5645   -3.66   0.0014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1380 on 22 degrees of freedom
## Multiple R-squared:  0.378,  Adjusted R-squared:  0.35 
## F-statistic: 13.4 on 1 and 22 DF,  p-value: 0.00139

The resulting output provides estimates for the intercept (8678), and slope (-20645) of the regression line under the heading “Coefficients.” The summary() function also prints standard errors, t-statistics, and P-values, and fit statistics such as the R2^ for these estimates are also provided. The regression line for this example would be:

Y = 8678 - 20645*X

To plot the data along with the fitted line:

plot(beets.dat$LbsSucA ~ beets.dat$Density, 
     ylab="Sugar yield (lbs/A)", xlab="Volunteer corn density (plants/ft row)",
     main="Lingle, WY 2009", ylim=c(0,10000))
  abline(density.lm) # Add the regression line
# to add the regression equation into the plot:
  int<-round(summary(density.lm)$coefficients[[1]]) # get intercept
  sl<- round(summary(density.lm)$coefficients[[2]]) # get slope
  reg.eq<-paste("Y =", int, sl, "* X") # create text regression equation
  legend("bottomleft",reg.eq, bty="n")

plot of chunk unnamed-chunk-5

The sugar yield was decreased by 20,600 lbs/A for each volunteer corn plant per foot of sugarbeet row; this seems unintuitive since the Y-intercept was only 8,678 lbs/A. The maximum volunteer corn density used in the study was only 0.15 plants per foot of sugarbeet row, so the units of the slope are not directly applicable to the data observed in the study. Although the linear regression model is appropriate here, it is extremely important not to extrapolate the linear model results to volunteer corn densities outside the observed data range (densities greater than 0.15 plants per foot of row). This can be illustrated by changing the Y and X limits on the figure. If extrapolation is absolutely necessary, a nonlinear model would be more appropriate (although still not ideal). The ideal situation would be to repeat the study with weed densities that caused crop yield to drop close to the minimum yield.

plot of chunk unnamed-chunk-6

Leave a Reply

Your email address will not be published. Required fields are marked *