Getting Data Into R

Loading Data from External Sources

R can read data from many sources and file types. However, the goal here is not to provide a comprehensive list of available methods for reading data into R. For all data sets used in this book, we will use the read.csv() or read.table() functions. The read.csv() function is designed to read comma-separated value (.csv) files. These files can be opened by almost any spreadsheet program (including MS Excel, Lotus 1-2-3, OpenOffice.org) or text editor, so it is a common file type that almost any researcher should be able to read and save. Most research and database software will export data to .csv format, including the commonly used ARM from Gylling Data Management, Inc.

The read.csv() syntax is rather simple. If the csv file is in your current working directory, you can simply type read.csv("filename.csv"). It is also possible to specify the full path to the file, but the form of the path will differ depending on the operating system (Windows, Mac, Linux). For nearly all data sets used on this website, we will read the csv files directly from the web by specifying the URL.

corn.dat <- read.csv("http://rstats4ag.org/data/irrigcorn.csv")
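
For files stored locally, the two approaches mentioned above (full path versus working directory) can be sketched as follows. This is only an illustration: the file name demo_yield.csv and the small data frame are made up for the example, and a temporary folder stands in for wherever your own data files live.

```r
# Write a tiny, made-up data set to a temporary folder (hypothetical file name)
demo.dir  <- tempdir()
demo.file <- file.path(demo.dir, "demo_yield.csv")
demo.dat  <- data.frame(Plot = 1:3, Yield = c(10.2, 11.7, 12.1))
write.csv(demo.dat, demo.file, row.names = FALSE)

# Option 1: give the full path; file.path() joins the pieces with a
# separator that R accepts on Windows, Mac, and Linux
full.dat <- read.csv(demo.file)

# Option 2: make the folder the working directory, then use the bare file name
old.wd <- setwd(demo.dir)   # setwd() returns the previous working directory
wd.dat <- read.csv("demo_yield.csv")
setwd(old.wd)               # restore the previous working directory

# Both routes should give the same data frame
identical(full.dat, wd.dat)
```

Using file.path() rather than typing the path by hand avoids having to remember which slash each operating system expects.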

By default, the read.csv() function treats the period (.) as the decimal point and the comma (,) as the symbol that separates values within a row. Depending on geography, though, this may not be the standard format. In many areas of Europe and Latin America, the decimal separator is a comma (,) and the semicolon (;) is the variable separator. The read.csv() function can still be used with alternate formats by specifying the sep and dec arguments for the separator and decimal symbols, respectively.

newdata <- read.csv("filename.csv", sep=";", dec=",")

Alternatively, for files in this format, the read.csv2() function can be used, where the semicolon and comma are the default separator and decimal symbols, respectively.

corn2.dat <- read.csv2("http://rstats4ag.org/data/irrigcorn.csv")
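
To see how the two conventions behave, here is a small sketch that reads made-up data from a text string (via the text argument) instead of a file. The data values and object names are invented for illustration only. It also shows what tends to happen when the function and the file layout do not match.

```r
# Made-up data in the semicolon/decimal-comma layout
euro.txt <- "Plot;Yield\n1;10,2\n2;11,7\n3;12,1\n"

# These two calls are meant to be equivalent
euro.a <- read.csv(text = euro.txt, sep = ";", dec = ",")
euro.b <- read.csv2(text = euro.txt)
identical(euro.a, euro.b)
str(euro.a)

# The same data in the comma/decimal-point layout, read with read.csv2()
us.txt <- "Plot,Yield\n1,10.2\n2,11.7\n3,12.1\n"
wrong  <- read.csv2(text = us.txt)

# No semicolons are found, so each row lands in a single column
ncol(wrong)
```

A quick check of the number of columns right after import is therefore a cheap way to catch a mismatch between the file format and the function used to read it.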

Experience tells us that this is unfortunately not a trivial matter, because in some instances the two csv systems are mixed. For other file formats, the read.table() function provides additional flexibility. You can learn more about the options available by typing ?read.table.
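
As one illustration of that flexibility, the sketch below reads tab-delimited text with read.table(). The data are made up, and the arguments shown are only a few of those documented in ?read.table.

```r
# Made-up tab-delimited data with a header row and one missing value
tab.txt <- "Plot\tIrrig\tYield\n1\tFull\t10.2\n2\tLimited\tNA\n3\tFull\t12.1\n"

tab.dat <- read.table(text = tab.txt,
                      header = TRUE,      # first line holds the column names
                      sep = "\t",         # values are separated by tabs
                      dec = ".",          # decimal symbol
                      na.strings = "NA")  # how missing values are coded
str(tab.dat)
```

Unlike read.csv(), read.table() does not assume a header row by default, so header = TRUE has to be stated explicitly.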

If no error or warning messages appear, then R has presumably read the irrigcorn.csv file, and stored the data in an object called corn.dat. To view the first few lines of the imported data file to ensure that the data was read successfully, use the head() function; to produce a summary of the data columns use the summary() function.

head(corn.dat)
##   Variety Maturity Irrig Population.A Population.ha Block Yield.BuA
## 1       1       92  Full        23000         56810     1       202
## 2       1       92  Full        23000         56810     2       163
## 3       1       92  Full        23000         56810     3       186
## 4       1       92  Full        23000         56810     4       178
## 5       2       86  Full        23000         56810     1       177
## 6       2       86  Full        23000         56810     2       176
##   Yield.tonha
## 1       12.68
## 2       10.23
## 3       11.68
## 4       11.17
## 5       11.11
## 6       11.05
summary(corn.dat)
##     Variety       Maturity        Irrig     Population.A   Population.ha  
##  Min.   :1.0   Min.   :76.0   Full   :48   Min.   :23000   Min.   :56810  
##  1st Qu.:2.0   1st Qu.:80.0   Limited:48   1st Qu.:23000   1st Qu.:56810  
##  Median :3.5   Median :85.0                Median :28000   Median :69160  
##  Mean   :3.5   Mean   :84.3                Mean   :28000   Mean   :69160  
##  3rd Qu.:5.0   3rd Qu.:88.0                3rd Qu.:33000   3rd Qu.:81510  
##  Max.   :6.0   Max.   :92.0                Max.   :33000   Max.   :81510  
##      Block        Yield.BuA    Yield.tonha   
##  Min.   :1.00   Min.   :117   Min.   : 7.34  
##  1st Qu.:1.75   1st Qu.:160   1st Qu.:10.08  
##  Median :2.50   Median :173   Median :10.86  
##  Mean   :2.50   Mean   :172   Mean   :10.81  
##  3rd Qu.:3.25   3rd Qu.:189   3rd Qu.:11.88  
##  Max.   :4.00   Max.   :206   Max.   :12.93

The head() function shows whether the data appear to have been read correctly. However, it does not show whether a column of numbers is really stored as numbers or as characters. This can happen if the csv file has not been properly cleaned of stray characters, e.g. a ; or even letters. Sometimes it is difficult to figure out what is wrong with a variable, and even a column that appears to contain only numbers may be converted to characters when it is read. The summary() function will indicate whether the data are being treated as numeric/integer or something else. Another useful function is str(), which provides information on the structure of an R object. For data frames, str() reports the type of each column. In the corn.dat data frame, we can see there are 96 rows and 8 columns; most of the columns contain integer (int) data, the 'Irrig' variable is a factor with 2 levels, and the last column is a numeric (num) variable.

str(corn.dat)
## 'data.frame':    96 obs. of  8 variables:
##  $ Variety      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ Maturity     : int  92 92 92 92 86 86 86 86 84 84 ...
##  $ Irrig        : Factor w/ 2 levels "Full","Limited": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population.A : int  23000 23000 23000 23000 23000 23000 23000 23000 23000 23000 ...
##  $ Population.ha: int  56810 56810 56810 56810 56810 56810 56810 56810 56810 56810 ...
##  $ Block        : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ Yield.BuA    : int  202 163 186 178 177 176 179 168 172 150 ...
##  $ Yield.tonha  : num  12.7 10.2 11.7 11.2 11.1 ...
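
A compact way to run the same check on every column is to ask for the class of each one. The sketch below uses a made-up data set in which one yield value carries a stray letter. Note that since R 4.0.0, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is given; the argument is set here so that the result matches the factor-based output shown above.

```r
# Made-up data where one yield was typed with a stray letter ("11.7x")
bad.txt <- "Plot,Irrig,Yield\n1,Full,10.2\n2,Limited,11.7x\n3,Full,12.1\n"
bad.dat <- read.csv(text = bad.txt, stringsAsFactors = TRUE)

# One class per column; Yield is not numeric because of the stray "x"
sapply(bad.dat, class)

# TRUE for columns stored as numbers (integer or double), FALSE otherwise
sapply(bad.dat, is.numeric)
```

A column that should hold measurements but reports FALSE here is a sign that at least one entry in the file is not a clean number.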

In some cases, we may have a variable coded as an integer that we would like R to recognize as a factor variable. In the corn data set, the 'Variety' variable is numbered, but there is no numeric order to the varieties. To convert this column to a factor variable, we can use the as.factor() function and store the result under the same name as the original variable.

corn.dat$Variety <- as.factor(corn.dat$Variety)
str(corn.dat)
## 'data.frame':    96 obs. of  8 variables:
##  $ Variety      : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Maturity     : int  92 92 92 92 86 86 86 86 84 84 ...
##  $ Irrig        : Factor w/ 2 levels "Full","Limited": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population.A : int  23000 23000 23000 23000 23000 23000 23000 23000 23000 23000 ...
##  $ Population.ha: int  56810 56810 56810 56810 56810 56810 56810 56810 56810 56810 ...
##  $ Block        : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ Yield.BuA    : int  202 163 186 178 177 176 179 168 172 150 ...
##  $ Yield.tonha  : num  12.7 10.2 11.7 11.2 11.1 ...
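
The practical effect of the conversion can be seen in a small sketch with made-up variety codes: as numbers, R summarizes them with means and quartiles, while as a factor it counts the observations in each category.

```r
# Made-up variety codes, two plots per variety
variety <- c(1, 1, 2, 2, 3, 3)
summary(variety)            # treated as numbers: min, mean, quartiles, max

variety.f <- as.factor(variety)
summary(variety.f)          # treated as categories: number of plots per level
levels(variety.f)           # the category labels, stored as text
```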

Sometimes, particularly when exporting data from Excel, numeric data is recognized by R as text. This is sometimes due to spaces in the data file, but sometimes the origin of the problem is difficult to find (especially with large data files). In these cases, one trick that can be tried is to convert the data to character, then back to numeric with the following code: as.numeric(as.character(data$var)).
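
The reason for the intermediate as.character() step is easiest to see on a small made-up example. Here var is a hypothetical column that was read as a factor because one entry has a trailing space and another is not a number at all.

```r
# Made-up column that ended up as a factor instead of numeric
var <- factor(c("12.5", "10.1 ", "n/a", "11.8"))

# Applying as.numeric() directly returns the internal level codes,
# not the values that were typed in the file
as.numeric(var)

# Going through character first recovers the numbers; entries that cannot
# be converted become NA (suppressWarnings() hides the coercion warning)
fixed <- suppressWarnings(as.numeric(as.character(var)))
fixed

# The NA positions point back to the offending entries in the original data
which(is.na(fixed))
as.character(var)[is.na(fixed)]
```

Looking at the entries that turned into NA is often the quickest way to find the source of the problem in a large file.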
