Introduction

What is this site?

“Statistics, like all other disciplines, is dynamic; it is not a fixed set of unchanging rules passed down from Fisher and company.” -W. Stroup

There are many excellent books and websites available to students and practitioners who would like to learn the R statistical language. Even so, there are few resources that specifically address the type of designed experiments common in the plant and weed science disciplines. Statistics and the R language both evolve over time, and it is not uncommon for a book about R or statistics to become out of date within only a few years. We therefore decided to develop this material for the web rather than as a textbook, so that it could be more easily kept up to date.

This website is not meant to be a complete reference for all the capabilities of the R language, nor should it be used as a substitute for consultation with a well-trained statistician. This site will not cover many of the underlying statistical concepts for the examples provided, and as such, this is certainly not a standalone statistics resource. The purpose of this site is simply to provide information and examples on how to use the R language to analyze statistical designs that are commonly used in agricultural experiments.

A majority of agricultural research uses a small subset of experimental designs, and thus, there is a high probability that the examples presented here will provide a framework for analysis of many agricultural experiments. It is important, though, that the researcher understands their own data and the experimental designs that were employed in the research so that these examples are not used inappropriately. As of this writing, the examples presented here are heavily focused on agronomic and weed science experiments, as that is the primary expertise of the authors. We welcome additional contributions from related disciplines to broaden the scope and usefulness of this site.

To date, a majority of agricultural researchers have been trained to analyze data using SAS (and to a lesser extent, SPSS). SAS is an excellent tool for statistical analysis, and shares at least one characteristic with the R language: they can both be very difficult to learn. This is particularly true for researchers and graduate students without a programming background. Each software package has unique syntax and conventions; therefore, researchers who have invested a large amount of time (sometimes an entire career) learning SAS will often find themselves frustrated when trying to do even simple tasks using the R language. It is important to keep in mind that it may take a similar time investment to learn R as was required to learn SAS. Although the statistical concepts are the same, the language used to obtain the desired analysis and output can be dramatically different between the two programs. Many researchers who have made the switch from SAS to R will agree that when used properly, R is an extremely efficient and elegant tool for analyzing experimental data.

Apart from the differences in the structure of the SAS and R languages, there is another important difference: SAS is a commercial product, whereas R is an open source language. Although the underlying code for SAS is not in the public domain, history tells us that the SAS Institute does a good job; SAS was in fact developed for the analysis of agricultural experiments in the first place. The contribution of the SAS Institute to agricultural statistics cannot be overstated.

R, on the other hand, is an open source language in which all of the code is publicly available and can be checked by anyone with the inclination to do so. There are many capable statisticians who develop the language and add-on packages, but anyone with the ability to code can contribute functions to R. To rephrase: R is written by statisticians and practitioners, and is meant to be used by statisticians and practitioners. An analogy to everyday life could be that with SAS or other commercial programs a king chooses the menu, and you hope the chef is a good cook. With R you are given all the ingredients and must put together a good menu from the bits and pieces yourself. In both cases, it is quite possible to have either a delightful meal or an unpleasant evening.

While there are many texts already available for learning R, they are typically aimed broadly at statisticians, or targeted at a specific discipline ranging from ecology to the social sciences. This website is primarily focused on providing data and code examples for analyzing the most common experimental designs used by agronomists and weed scientists, and thus it will hopefully be useful to students and practitioners as they attempt to learn how to use a new statistical analysis environment. In keeping with the philosophy of learning by example, we do not go into much detail about the R functions or the statistical theory behind each function. In fact, we think that you learned your mother tongue by listening to adults and repeating what they said, without prior knowledge of the grammar and syntax. The same applies to R: see, do, repeat, and gradually understand the grammar and syntax.

Last but not least, a great many documents and books on R are freely available on the R website and elsewhere on the web. Because R is open source, the language changes quickly, and this makes it difficult to find up-to-date commercial books. Consequently, we will not recommend any, but suggest this listing on the R homepage as a starting point for more information.

Downloading and Installing R

Instructions for downloading and installing R vary depending on the operating system, and can be found at the homepage for the R project. There is a wealth of information on installing R on the web, so we suggest searching Google or YouTube if you are having trouble or would like a step-by-step guide.
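
Once R is installed, one quick way to confirm that the installation works is to start R and ask it which version is running. The two commands below are part of base R; the output differs between machines, so none is shown here.

```r
# Print the version of R that is currently running
R.version.string

# List the R version, the operating system, and any packages currently loaded
sessionInfo()
```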

R packages - CRAN

One of the most notable benefits of the R language is the fact that many add-on packages have been written by other statisticians. If there is a common need for some type of analysis or calculation of a common statistic, chances are good that someone has already written code to automate the process. In many cases, this code is contributed to the R user community as an add-on package that can be freely downloaded and used by others. There are currently over 5,000 contributed packages available for free download from the Comprehensive R Archive Network (CRAN). The procedure for installing contributed packages also differs between operating systems. Instructions for installing contributed packages in R can be found in the online R documentation.

The following packages are used for some of the examples on this site, and some will need to be installed to run all of the code provided. Some of the packages below are installed by default with the base R installation, but others will need to be installed afterwards.

  • dplyr
  • lattice
  • gplots
  • ggplot2
  • agricolae
  • Hmisc
  • nlme
  • lme4
  • lsmeans
  • multcomp
  • drc
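
As a minimal sketch of how the packages above might be installed from within R, the code below checks which of them are missing and installs only those. It assumes the CRAN names gplots and ggplot2 for the two plotting packages, an internet connection, and a CRAN mirror that still carries every package listed. The object names pkgs and missing.pkgs are ours.

```r
# Packages used in the examples on this site
# (gplots and ggplot2 are assumed to be the CRAN names of the plotting packages)
pkgs <- c("dplyr", "lattice", "gplots", "ggplot2", "agricolae", "Hmisc",
          "nlme", "lme4", "lsmeans", "multcomp", "drc")

# Keep only the packages that are not yet installed on this computer
missing.pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install the missing packages; this step is skipped when R is not run interactively
if (interactive() && length(missing.pkgs) > 0) {
  install.packages(missing.pkgs)
}

# An installed package must be loaded once in every new R session before use
library(lattice)
```

Installing a package is needed only once per computer, whereas library() must be called in every new R session that uses the package.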

RStudio

We suggest installing RStudio as a useful and consistent interface for R. The default appearance of R differs greatly between Windows, Mac, and Linux operating systems. RStudio is available for all three platforms and provides several useful features in addition to a consistent interface.

Conventions

Several typographical conventions will be used throughout this text. References to R packages (such as the drc package) will be highlighted. Functions will be highlighted similarly, but followed by a pair of parentheses, for example, aov(). Example code that is meant to be typed or copied directly into the R console will be enclosed in a shaded box in a monospace font. Where appropriate, code will be followed by the resulting output, preceded by “##”.

Input.statement
## [1] "This is the output"

Basics of using R

At its most basic, R can be used as a calculator. You can enter any mathematical operation, and R will solve it for you.

1+1
2*10
2.3*10^4

As with any good scientific calculator, R has the capability to store results as an object to be called upon later. This functionality will be used extensively as you learn to use R efficiently. To build on the previous example, the results of the three mathematical operations above will be stored as objects named “a”, “b”, and “c”:

1+1->a
2*10->b
2.3*10^4->c

The less than (or greater than) and minus symbols (“<” and “-”) are used together to create an arrow for storage of a result. The direction of the arrow indicates the direction of the storage operation. The statements a<-1+1 and 1+1->a will produce identical results. These objects can then be called upon later just by typing the name of the object; that is, typing the letter c into the R console will print the information stored as the object named c. It is important to keep in mind that R is case sensitive, so a and A are recognized as separate objects in R.

c
## [1] 23000
a + b + c
## [1] 23022
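
To illustrate the points above, the short example below stores the same result with both arrow directions and then shows that capitalization matters. The object names total, total2, and Total are made up for this illustration.

```r
total <- 1 + 1       # leftward arrow: the result is stored in total
1 + 1 -> total2      # rightward arrow: the same result is stored in total2
identical(total, total2)
## [1] TRUE

Total <- 10          # capital T: a separate object from total
total == Total
## [1] FALSE
```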

This same method can (and often will) be used to store far more complex forms of information than the result of a mathematical expression. Two objects that are commonly used to store data are vectors and data frames. Vectors can be thought of as a list of information, whereas a data frame more closely resembles a spreadsheet or table. Vectors can be created using the concatenate function (abbreviated with just the first letter c). The data.frame() function will produce a data frame. In the following example two vectors (x and y) are created, and then assembled into a single data frame, which is stored under the name fake.data.

x<-c(1:10)
y<-c(10,12,13,15,16,18,19,21,23,24)
fake.data<-data.frame(x,y)

The vector x was created using 1:10; the colon in this context is shorthand to generate all consecutive integers from 1 to 10. Notice that we can name a stored object with a single letter or with a string of letters. The period is recognized by R as just another textual character in this case; therefore, naming the object fake.data is no different than if we had chosen to name the object fake_data or fakeData. To see the full data frame, simply type the object name into the R console. Additionally, we can get various information about the vectors in the data frame by using other functions such as summary() and colMeans().

fake.data
##     x  y
## 1   1 10
## 2   2 12
## 3   3 13
## 4   4 15
## 5   5 16
## 6   6 18
## 7   7 19
## 8   8 21
## 9   9 23
## 10 10 24
summary(fake.data)
##        x               y       
##  Min.   : 1.00   Min.   :10.0  
##  1st Qu.: 3.25   1st Qu.:13.5  
##  Median : 5.50   Median :17.0  
##  Mean   : 5.50   Mean   :17.1  
##  3rd Qu.: 7.75   3rd Qu.:20.5  
##  Max.   :10.00   Max.   :24.0
colMeans(fake.data)
##    x    y 
##  5.5 17.1

The summary() function provides min, max, mean, median, and 1st and 3rd quartiles of each vector in the data.frame. The colMeans() function returns the mean of each column in the data frame. It is also possible to apply functions to only one column within the data frame. This is accomplished by specifying the data frame, then the column, separated by a $. The examples below will provide the summary, mean, and standard deviation of only the y vector (or second column) of the data frame.

summary(fake.data$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    13.5    17.0    17.1    20.5    24.0
mean(fake.data$y)
## [1] 17.1
sd(fake.data$y)
## [1] 4.725
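
The $ is not the only way to reach part of a data frame. As a brief sketch using only base R functions, the example below re-creates fake.data and then extracts a column, a row, and a single value with square brackets, written as [row, column].

```r
# Re-create the example data frame so this block can be run on its own
x <- c(1:10)
y <- c(10, 12, 13, 15, 16, 18, 19, 21, 23, 24)
fake.data <- data.frame(x, y)

nrow(fake.data)          # number of rows
## [1] 10
names(fake.data)         # column names
## [1] "x" "y"
fake.data[, "y"]         # the y column, the same as fake.data$y
##  [1] 10 12 13 15 16 18 19 21 23 24
fake.data[3, ]           # the third row, all columns
##   x  y
## 3 3 13
fake.data[3, "y"]        # the value of y in the third row
## [1] 13
```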

How to get help?

Help files

If you know a function, say lm(), but do not know or remember the argument syntax, you can type ?lm.

If you are looking for a function, say drm(), but are not sure which package it is in, you can type ??drm. This searches the help files of all installed packages.

When you get the search results, you usually land at the top of the help page. It starts out with a usage section, showing lm(...) with a lot of arguments, written mainly for people who already know lm(). If you are new in the business, we recommend you scroll down to the examples at the bottom of the help page to see whether some of them fit your purpose. Unfortunately, some help pages are better than others, and for the developer one of the most tedious jobs is to write help pages that are understandable to people other than already seasoned R users.
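
The commands below collect a few ways of reaching the help system for lm(), which is part of the stats package that is loaded with R. They are shown as a sketch; the help text itself is long and is not reproduced here.

```r
?lm               # open the help page for lm(); the same as help("lm")
args(lm)          # print only the argument list of lm()
example("lm")     # run the examples found at the bottom of the help page
??drm             # search the help pages of all installed packages for "drm"
```

Note that ??drm can only find drm() if the package that contains it (drc) has already been installed.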

Online help

If the installed help files are not enough, there are several locations to search on the web. However, the language name R sometimes makes it difficult to find the most relevant information. The site Rseek searches only sites with relevant R information, and can be helpful for both the novice and the seasoned user.

Google is sometimes your best friend. For example, searching for “How to do regression with R” will bring tens of thousands of hits, some better than others. There are also numerous YouTube clips in various languages.

R is so dynamic that, except for very general functions, our experience is that books on specific topics will not always provide code that still works even a couple of years after publication. We say a dynamic R environment is good, but…

If you wish to have some material, you can look into the contributed documentation; there are documents in various languages. But again, if the documents are rather old, there might be R functions that do not work properly.

One Comment

  1. I found this site very interesting as I am a statistician working on the pesticide industry, mainly with R. I have many questions, (for example, the confidence interval calculation for EC10 in dose-response analysis), that cannot be answered by my colleagues who are mostly chemists or biologists. I hope it can be discussed here in the forum or by personal communications.
