1 Introduction

1.1 How to get help?

Learning a new language can be frustrating - especially when you’re not even sure how to begin troubleshooting the problem you’re having. It is important to realize there are a variety of ways to solve problems in R.

1.1.1 Help files

If you know the name of a function, but do not know or remember how to use it you can bring up the documentation for that function by typing a question mark in front of the function: ?lm(). This will work for any function in a currently loaded package. You can do a broader search by typeing two question marks. For example, if you know the name of a function, but are not sure which package you need to load to see it: ??drm().

The help pages for most R functions start out with usage, saying lm(...) which details the arguments available in that function. These are mainly written for people who know the function (or at the very least are familiar with R documentation) to begin with. If you are new to R we recommend you scroll down to the examples at the bottom of the help page to see if you can use some of the examples that fit your purpose. Unfortunately, some help pages are better than others, and for the developer one of the most tedious jobs is to write help pages in an understandable way for others than already seasoned R users.

1.1.2 Online help

Nearly every question related to using R has been already asked, and in many cases, has an answer online. So if the installed help files aren’t sufficient, there are several locations to search on the web. A simple Google is sometimesall you’ll need. For example, searching How to do regression with R will bring tens of thousands of hits, some better than others. There are also numeous YouTube clips in various languages for many common analyses.

One difficulty, however, is that the language name R sometimes makes it difficult to find the most relevant information. If Google fails for this reason, the site Rseek, searches only sites with relevant R information, and can be helpful for both the novice and the seasoned user.

There are many hard-copy texts that you can purchase, and many are of very high quality. However, R is so dynamic that books on specific topics can often contain code that is not functioning even just a couple of years after publication. We recommend caution when purchasing books, and also hesitate to recommend specific titles.

1.2 Downloading and Installing R

Instructions for downloading and installing R vary depending on the operating system, and can be found at the homepage for the R project. There is a wealth of information on installing R on the web, so we suggest searching Google or YouTube if you are having trouble or would like a step-by-step guide.

1.2.1 R packages - CRAN

One of the most notable benefits of the R language is the fact that many add-on packages have been written by other statisticians. If there is a common need for some type of analysis or calculation of a common statistic, chances are good that someone has already written code to automate the process. In many cases, this code is contributed to the R user community as an add-on package that can be freely downloaded and used by others. There are currently over 5,000 contributed packages available for free download from the Comprehensive R Archive Network (CRAN). The procedure for installing contributed packages also differs between operating systems. Instructions for installing contributed packages in R can be found in the online R documentation.

The following packages are used for some of the examples in this text, and some will need to be installed to run all of the code provided. Some of the packages below are installed by default with the base R installation, but others will need to be installed afterwards. When a package needs to be loaded, we have tried to include the coe to load the library within the relevant chapter.

  • dplyr
  • lattice
  • ggplot2
  • agricolae
  • Hmisc
  • nlme
  • lme4
  • emmeans
  • multcomp
  • drc

1.2.2 RStudio

We suggest installing RStudio as a useful and consistent interface for R. The default appearance of R differs greatly between Windows, Mac, and Linux operating systems. RStudio is available for all 3 platforms and provides several useful features in addition to a consistent interface. One of the great benefits ot RStudio is the ability to easily use R Markdown - a version of the markdown markup language that can help greatly in conducting reproducible research and creating print and web reports. This website, for example, was created entirely in RStudio using the R Markdown language.

1.3 Conventions

Several typographical conventions will be used throughout this text. References to R packages (such as the drc package) will be highlighted. Functions will be highlighted similarly, but followed by open parentheses, for example, aov(). Example code that is meant to be typed or copied directly into the R console will be enclosed in a shaded box in a monospace font. Where appropriate, code will be followed by the resulting output, preceded by “##”.

Input.statement
## [1] "This is the output"

1.4 Basics of using R

At its most basic, R can be used as a calculator. You can enter any mathematical operation, and R will solve it for you.

1+1
2*10
2.3 * 10^4

As with any good scientific calculator, R has the capability to store results as an object to be called upon later. This functionality will be used extensively as you learn to use R efficiently. To build on the previous example, the results of the three mathematical operations above will be stored as objects named “a”, “b”, and “c”:

1 + 1 -> a
2 * 10 -> b
2.3 * 10^4 -> c

The less than (or greater than) and minus symbols (“<” and “-”) are used together to create an arrow for storage of a result. The direction of the arrow indicates the direction of the storage operation. The statements a <- 1 + 1 and 1 + 1 -> a will produce identical results. These objects can then be called upon later just by typing the name of the object; that is, typing the letter c into the R console will print the information stored as the object named c. It is important to keep in mind that R is case sensitive, so a and A are recognized as separate objects in R.

c
## [1] 23000
a + b + c
## [1] 23022

This same method can (and often will) be used to store far more complex forms of information than the result of a mathematical expression. Two objects that are commonly used to store data are vectors and data frames. Vectors can be thought of as a list of information, whereas a data frame more closely resembles a spreadsheet or table. Vectors can be created using the concatenate function (abbreviated with just the first letter c). The data.frame() function will produce a data frame. In the following example two vectors (x and y) are created, and then assembled into a single data frame, which is stored under the name fake.data.

x <- c(1:10)
y <- c(10,12,13,15,16,18,19,21,23,24)
fake.data <- data.frame(x, y)

The vector x was created using 1:10; the colon in this context is shorthand to generate all consecutive integers between 1 and 10. Notice that we can name a stored object as a single letter or as a string of letters. The period is recognized by R as just another textual character in this case, therefore naming the object fake.data would be no different than if we had chosen to name the object fake_data or fakeData. To see the full data frame, simply type the object name into the R console. Additionally, we can get various information about the vectors in the data frame by using other functions such as summary(), and colMeans().

fake.data
##     x  y
## 1   1 10
## 2   2 12
## 3   3 13
## 4   4 15
## 5   5 16
## 6   6 18
## 7   7 19
## 8   8 21
## 9   9 23
## 10 10 24
summary(fake.data)
##        x               y       
##  Min.   : 1.00   Min.   :10.0  
##  1st Qu.: 3.25   1st Qu.:13.5  
##  Median : 5.50   Median :17.0  
##  Mean   : 5.50   Mean   :17.1  
##  3rd Qu.: 7.75   3rd Qu.:20.5  
##  Max.   :10.00   Max.   :24.0
colMeans(fake.data)
##    x    y 
##  5.5 17.1

The summary() function provides min, max, mean, median, and 1st and 3rd quartiles of each vector in the data.frame. The colMeans() function returns the mean of each column in the data frame. It is also possible to apply functions to only one column within the data frame. This is accomplished by specifying the data frame, then the column, separated by a $. The examples below will provide the summary, mean, and standard deviation of only the y vector (or second column) of the data frame.

summary(fake.data$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    13.5    17.0    17.1    20.5    24.0
mean(fake.data$y)
## [1] 17.1
sd(fake.data$y)
## [1] 4.72464

1.5 Loading Data from External Sources

R can read data from many sources and file types. However, the goal here is not to provide a comprehensive list of available methods for reading data into R. For all data sets used in this book, we will use the read.csv() or read.table() functions. The read.csv() function is designed to read comma separated value (.csv) files. Comma separated value files can be read by almost any spreadsheet program (including MS Excel, Lotus 1-2-3, OpenOffice.org) or text editor, and thus it is a common file type that almost any researcher should be able to read and save. Most research software and database software will export data to a .csv format, including the commonly used ARM from Gylling Data Management, Inc. The read.csv() syntax is rather simple. If the csv file is in your current working directory, you can simply type read.csv("filname.csv"). It is also possible to specify the full path to the file, but this will differ depending on the operating system (Windows, Mac, Linux). All of the data sets used in this text can be downloaded here as a zip file: http://rstats4ag.org/data/Rstats4ag_data.zip. But for simplicity, we have provided the code in each chapter to read the csv files directly from the web by specifying the URL in the read.csv() function. For example:

corn.dat <- read.csv("http://rstats4ag.org/data/irrigcorn.csv")

The default behavior of the read.csv() function considers periods (.) as a decimal point, and commas (,) as the symbol used to separate values within a row. Depending on geography, though, this may not be a standard format. In many areas of Europe and Lating America, the decimal seperator is a comma (,) and the semicolon (;) is the variable separator. The read.csv() can still be used with alternate formats by specifying the sep and dec arguments for separator and decimal symbols, respectively.

newdata <- read.csv("filename.csv", sep=";", dec=",")

Alternatively, the read.csv2() function can be used, where the semicolon and comma are the default seperator and decimal symbols, respectively.

corn2.dat <- read.csv2("http://rstats4ag.org/data/irrigcorn.csv")
_Illustration of the difference between two common forms of comma separated value (csv) files that can be read using the `read.csv()` and `read.csv2()` functions._

Figure 1.1: Illustration of the difference between two common forms of comma separated value (csv) files that can be read using the read.csv() and read.csv2() functions.

Experience tells us that this is unfortunately not a trivial matter, because in some instances the two csv systems are mixed. For other file formats, the read.table() function provides additional flexibility. You can learn more about the options available by typing ?read.table.

If no error or warning messages appear, then R has presumably read the irrigcorn.csv file, and stored the data in an object called corn.dat. To view the first few lines of the imported data file to ensure that the data was read successfully, use the head() function; to produce a summary of the data columns use the summary() function.

head(corn.dat)
##   Variety Maturity Irrig Population.A Population.ha Block Yield.BuA
## 1       1       92  Full        23000         56810     1       202
## 2       1       92  Full        23000         56810     2       163
## 3       1       92  Full        23000         56810     3       186
## 4       1       92  Full        23000         56810     4       178
## 5       2       86  Full        23000         56810     1       177
## 6       2       86  Full        23000         56810     2       176
##   Yield.tonha
## 1       12.68
## 2       10.23
## 3       11.68
## 4       11.17
## 5       11.11
## 6       11.05
summary(corn.dat)
##     Variety       Maturity         Irrig     Population.A  
##  Min.   :1.0   Min.   :76.00   Full   :48   Min.   :23000  
##  1st Qu.:2.0   1st Qu.:80.00   Limited:48   1st Qu.:23000  
##  Median :3.5   Median :85.00                Median :28000  
##  Mean   :3.5   Mean   :84.33                Mean   :28000  
##  3rd Qu.:5.0   3rd Qu.:88.00                3rd Qu.:33000  
##  Max.   :6.0   Max.   :92.00                Max.   :33000  
##  Population.ha       Block        Yield.BuA      Yield.tonha   
##  Min.   :56810   Min.   :1.00   Min.   :117.0   Min.   : 7.34  
##  1st Qu.:56810   1st Qu.:1.75   1st Qu.:160.5   1st Qu.:10.08  
##  Median :69160   Median :2.50   Median :173.0   Median :10.86  
##  Mean   :69160   Mean   :2.50   Mean   :172.2   Mean   :10.81  
##  3rd Qu.:81510   3rd Qu.:3.25   3rd Qu.:189.2   3rd Qu.:11.88  
##  Max.   :81510   Max.   :4.00   Max.   :206.0   Max.   :12.93

The head() function does tell you that the data looked as if they came in right. However, you cannot see if a column of numbers really is numbers or just characters. It could happen if the csv file is not properly cleansed for strange characters, e.g. or ; or even letters. Sometime it is difficult to figure out what is wrong with a variable and even when you presumably have a column of pure numbers, you still get them converted to characters in the reading process. The summary() function will indicate whether the data is being viewed as numeric/integer or something else. One other useful function is str(), which provides information on the structure of the R object. For data frames, the str() function will tell us the format for each column in the data frame. In the corn.dat data frame, we can see there are 96 rows and 8 columns, most of the columns contain interger (int) data, the ‘Irrig’ variable is a factor with 2 levels, and the last column is a numeric (num) variable.

str(corn.dat)
## 'data.frame':    96 obs. of  8 variables:
##  $ Variety      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ Maturity     : int  92 92 92 92 86 86 86 86 84 84 ...
##  $ Irrig        : Factor w/ 2 levels "Full","Limited": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population.A : int  23000 23000 23000 23000 23000 23000 23000 23000 23000 23000 ...
##  $ Population.ha: int  56810 56810 56810 56810 56810 56810 56810 56810 56810 56810 ...
##  $ Block        : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ Yield.BuA    : int  202 163 186 178 177 176 179 168 172 150 ...
##  $ Yield.tonha  : num  12.7 10.2 11.7 11.2 11.1 ...

In some cases, we may have a variable coded as an integer that we would like R to recognize as a factor variable. In the corn data set, the ‘Variety’ variable is numbered, but there is no numeric order to the varieties. To convert this column to a factor variable, we can use the as.factor() function, and store the result as the same name as the original variable.

corn.dat$Variety <- as.factor(corn.dat$Variety)
str(corn.dat)
## 'data.frame':    96 obs. of  8 variables:
##  $ Variety      : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Maturity     : int  92 92 92 92 86 86 86 86 84 84 ...
##  $ Irrig        : Factor w/ 2 levels "Full","Limited": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population.A : int  23000 23000 23000 23000 23000 23000 23000 23000 23000 23000 ...
##  $ Population.ha: int  56810 56810 56810 56810 56810 56810 56810 56810 56810 56810 ...
##  $ Block        : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ Yield.BuA    : int  202 163 186 178 177 176 179 168 172 150 ...
##  $ Yield.tonha  : num  12.7 10.2 11.7 11.2 11.1 ...

Sometimes, particularly when exporting data from Excel, numeric data is recognized by R as text. This is sometimes due to spaces in the data file, but sometimes the origin of the problem is difficult to find (especially with large data files). In these cases, one trick that can be tried is to convert the data to character, then back to numeric with the following code: as.numeric(as.character(data$var)).