15 Visualization
Pictures and graphs may say more than 1000 words. After collection of data, the hard work of analyzing them according to the experimental design plans begins. But before you dig too much into statistics and jump to conclusions on the basis of the statistical tests, you should actually look at the data graphically. You must try to illustrate the data in such a way that it is informative not only to you, but also to potential readers of the project or the paper you wish to publish. In a way you could look at the visualization as the interface between science and art. One of the best examples of this merger of science and art is the campaign of Napoleon in Russia 1812-1813.
Minard’s work (Figure 15.1) has been called one of the best graphs ever produced in thematic cartography and statistical graphics. In short, it shows the size of the Army at the beginning of the attack and at the final retreat, some of the more important battles and the temperature during the retreat from Moscow. It contains so many dimensions and so much information and history that usually fill hundreds of pages in history books. Our work in weed science is rarely that exciting for the general public; although our discipline can be an esoteric at times, it plays a great role in the food production and sustainability of the farming and environment in the world.
15.1 Base R Graphics
Visualization of data is an important step to get a first impression of the quality of data and to see if some observations stand out from the rest. It can also help us observe whether the data show systematic relationships. Base R graphics are a quick and extensible way to do both this preliminary data inspection, but also for creating publication quality figures for manuscripts and presentations. We will begin this chapter with some basic examples of how to customize base R graphics to create high-quality figures.
15.1.1 Scatterplots
One of the most common figures is a scatter plot or x-y plot - typically used to see the relationship between an independent variable (x-axis) and a response variable (y-axis), although often used to show other relationships as well. Without much work, we can get a very simple scatterplot suitable for inspecting the data in the cars
data set that is part of the base R
installation (Figure 15.2).
Code to produce Figure 15.2:
plot(dist ~ speed, data = cars, main="Recorded in the 1920's")
While Figure 15.2 is adequate for preliminary inspection of one’s own data, it is woefully insufficient for submission to a journal. We can improve the figure greatly by making some simple additions to the basic plot code. To create Figure 15.3, the par()
function is called to change some of the default graphing parameters. In this case we adjust the margins so that the bottom and left margins are set to 3.2, the top and right margins are set to 0.5. We have removed the main title (since that information would typically be contained in a figure caption according to most journal instructions), and also adjusted the axis labels, by setting them closer to the axis compared to default values using mgp
. The mgp
argument requires 3 numbers, which set the distance from the axis for the axis title, axis labels, and axis line, respectively. The defaults for mgp
are 3, 1, 0 - so in this code we’ve removed some of the axis white space by moving the labels closer to the axis. We have also added axis labels and a main title, changed the plotting character (pch=19
, which is a closed circle), and added some transparency to the points using the rgb()
function and the alpha
argument. Finally, we have, adjusted the x-axis to include 0, and removed the box around the figure by setting bty="n"
. Setting bty="l"
will leave the bottom and left lines of the box, and bty="o"
will plot the entire box (the default).
Code to produce Figure 15.3:
par(mar=c(3.2,3.2,0.5,.5), mgp=c(2,.7,0))
plot(dist ~ speed, data = cars,
xlab="Speed of car (mph)",
ylab="Distance required to stop (ft)",
pch=19, col=rgb(red=.3, blue=.3, green=.3, alpha=0.7),
bty="n", xlim=c(0,25))
15.1.2 Box and Whisker Plots
If we have some data where the “treatment” is not based upon a continuous x variable, then we can illustrate the variation with a boxplot. The data for this example InsectSprays
are also included with the basic R installation.
Code to produce Figure 15.4:
par(mfrow = c(1,2), mar=c(3.2,3.2,.5,.5), mgp=c(2,.7,0))
plot(count ~ spray, data = InsectSprays)
legend("topleft","(a)", bty="n")
plot(count ~ as.numeric(spray), data = InsectSprays, pch=19)
legend("topleft","(b)", bty="n")
Figure 15.4 illustrates how the default plot()
function works in R; if the x-axis is based on factors (A through F) you automatically get a boxplot (Figure 15.4a), where the median and the quartiles are shown, as well as an indication of extreme observations. On the other hand, if we convert the factors A through F to numerical values of x by using function as.numeric()
, we get an ordinary continuous x variable (Figure 15.4b). Factor A becomes 1 and factor B becomes 2 and so on. In this particular data we can change the names (e.g., F is becoming A and A becoming F) of the various sprays and it will just change the order on the x-axis. In doing so it could have a great influence on perceived relationship.
It is possible to combine these two panels in Figure 15.4 into a single panel, to more concisely show a high level of detail on the variance within and among treatments. To do so, we modify the code for Figure 15.4 by using points()
instead of plot()
for the second plot so that the points are added to the first plot instead of creating a new plotting space; jittering the points, or offsetting them slightly, in the ‘x’ direction so that counts that are similar or the same can be more easily seen; adding some transparency to the points so that we can see if multiple points are on top of one another; and by removing the outlier points from the initial boxplot since all points will be added later.
Code to produce Figure 15.5:
par(mfrow = c(1,1), mar=c(3.2,3.2,.5,.5), mgp=c(2,.7,0))
plot(count ~ spray, data = InsectSprays, pch=NA)
points(count ~ jitter(as.numeric(spray)), data = InsectSprays,
pch=19, col=rgb(.1,.4,.8, alpha=0.7))
For data with continuous x we can also directly call the boxplot()
function rather than relying on R
to correctly identify the independent variable type. We can also use some additional arguments within the boxplot()
function to modify the look of our boxplot (Figure 15.6).
Code to produce Figure 15.6:
par(mfrow = c(1,1), mar=c(3.2,8.2,.5,.5), mgp=c(2,.7,0))
boxplot(count ~ as.numeric(spray), data = InsectSprays, horizontal=TRUE, pch=NA,
names=c("Antibug","Bug Kill","Clear","Destroy","Endgame","Fuhgeddaboudit"),
las=1, xlab="Insect counts", ylab="")
points(jitter(as.numeric(spray)) ~ count, data = InsectSprays,
pch=19, col=rgb(.1,.4,.8, alpha=0.7))
15.1.3 More than Two Variables
One of the classical examples in statistics are the Fisher’s or Anderson’s iris data that give measurements of the sepal length and width and petal length and width, respectively, for each of three species of Iris. The pairs()
function is a powerful (and quick) way to show the pairwise relationships between many variables at the same time.
Code to produce Figure 15.7:
head(iris,3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
par(mar=c(2.2,2.2,2.2,2.2), mgp=c(2,.7,0))
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
It is obvious from Figure 15.7 that the relationships sometimes seem to be separated into two groups. However, knowing that the data are separated into three Iris species, we could once more use as.numeric()
to identify which data belong to which species, and make each species a different color (Figure 15.8).
Code to produce Figure 15.8:
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris,
col = as.numeric(iris$Species))
The pairs()
function has even more power when used together with a function called panel.cor()
that can be found at the end of the ?pairs
help page. If you find outliers in data and wish to omit them, then by using the original panel function you must omit the whole observation. Here, the function panel.cor has been improved by using the argument (use = "pairwise.complete.obs"
).
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = "pairwise.complete.obs"))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt, sep = "")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
For the average R user (and even the authors of this chapter), this looks very complicated and it probably is; but by running it before you run the pairs()
you get a fantastic overview of relationships among variables. In soils and plant sciences you often have numerous variables, i.e. elements or chemicals or morphological attributes. Before launching the statistics weaponry it would be nice to see the relationships between these variables, as we will show using the Nitrate data set.
Nitrate <- read.table("http://rstats4ag.org/data/nitrate.txt", header=T)
head(Nitrate,3)
## age no3n pH Na K
## 1 0 0.13 4.91 4.36 0.14
## 2 0 0.15 5.00 3.91 0.21
## 3 0 0.05 5.37 3.98 0.16
We execute the panel.cor()
function and call it within the pairs()
function to see the relationships between all variables in the data set. Note that the correlations are absolute.
Code to produce Figure 15.9:
pairs(Nitrate, lower.panel = panel.smooth, upper.panel = panel.cor)
It is clear from Figure 15.9 that K has virtually no correlation with the other variables. This is solely due to an observation with a rather high K concentration, seen in the last panel row. In a case like this, it would certainly be wise to determine what is going on with that particular observation - measurement or data entry errors could be potential explainers. For now, we will exclude that observation to look at the overall trends. To do so, we can simply find the observation with the largest K-value:
which.max(Nitrate[["K"]])
## [1] 10
The maximum value is in the 10th observation. Secondly, we can change this putative outlier to a missing value (expressed as ‘NA’ in R
). To do this, we define a new data set (Nitrate1
) in order to not change the original data we have. We will then select the value of K that matches the maximum value of K (the 10th observation), and replace that value with ‘NA’:
Nitrate1<-Nitrate
Nitrate1$K[Nitrate1$K==max(Nitrate1$K)] <- NA # This is missing value in R
We can then re-run the pairs()
function on the new data set to see how removal of the outlier changes the relationship between K and the other variables.
Code to produce Figure 15.10:
pairs(Nitrate1, lower.panel = panel.smooth, upper.panel = panel.cor)
A field experiment, with mechanical control of weeds in cereals, was done by measuring the effect on leaf cover as a function of number of harrow runs. The harrowing was done either along the crop rows or across the crop rows at a certain stage of cereal development.
leafCover<-read.csv("http://rstats4ag.org/data/LeafCover.csv")
head(leafCover,3)
## Leaf Intensity Direction
## 1 0.18623052 0 Along
## 2 0.12696533 1 Along
## 3 0.09006582 2 Along
If we have more than one relationship within a data set, we can separate them as done below. There is a classification with the factors levels: Across and Along. In Figure 15.11 you can see how you can identify the observation within harrowing Along and Across
Code to produce Figure 15.11:
par(mar=c(3.2,3.2,.5,.5), mgp=c(2,.7,0))
plot(Leaf~Intensity, data = leafCover, bty="l",
pch = as.numeric(Direction),
xlab="Harrow intensity", ylab="Leaf area")
legend("bottomleft", levels(leafCover$Direction), pch=c(1,2))
Obviously, there is a relationship, and this will be properly addressed in the ANCOVA chapter. There are also other plotting facilities that can be used the xyplot()
in the package,lattice, which comes with basic R.
Code to produce Figure 15.12
library(lattice)
xyplot(Leaf ~ Intensity | Direction, data = leafCover)
15.2 ggplot2
Within the R community there is an array of versatile packages to visualize data, we have barely scratched the surface showing a few examples we felt may be of immediate interest. The numbers of graphic interfaces are numerous and increasing, some are easier to use than others, and sometimes it takes a rather large amount of time to get the hang of it. One package that has received a great deal of (much-deserved) praise is ggplot2
. The ‘gg’ branding is “the grammar of graphics.”
We will not spend too much time discussing the syntax of ggplot2
, but we will provide several examples of plots created in this package to illustrate its powerful and elegant capabilities. We can begin by using the same leafCover
data in Figures 15.11 and 15.12. In the ‘gg’ way of doing things, we first provide the data set and the primary variables of interest, then continue adding on additional layers to the plot, separated by a +
.
Code to produce Figure15.13
library(ggplot2)
ggplot(leafCover, aes(x=Intensity, y=Leaf)) +
geom_point(aes(color=Direction, shape=Direction)) +
stat_smooth(aes(color=Direction, linetype=Direction), method="lm", se=TRUE) +
ylab("Leaf area") + xlab("Harrow intensity") +
theme_classic() + theme(legend.position="right")
One of the strengths of the ggplot2
package is the ability to use linear or local regression models using the stat_smooth()
function, as shown in Figure 15.13. In this example, the linear model fit is drawn in the same colors as the points, and a gray shaded area is added that represents the standard error. All of these options can be customized (for example, by changing to se=FALSE
to remove the error shading).
To create a barplot with confidence intervals from a one way analysis of variance, ANOVA, we can again use the InsectSprays
data set, and start by generating the linear model. The next step is to turn the relevant results from the ANOVA into a data.frame, for use in plotting, then passing that information along to the geom_bar()
and geom_errorbar()
functions within ggplot
.
m1<-lm(count~spray-1,data=InsectSprays)
#Note the -1 presents the parameteres as sheer means not as differences
coef(m1) #the means
## sprayA sprayB sprayC sprayD sprayE sprayF
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
confint(m1) #the conficence intervals
## 2.5 % 97.5 %
## sprayA 12.2395786 16.760421
## sprayB 13.0729119 17.593755
## sprayC -0.1770881 4.343755
## sprayD 2.6562453 7.177088
## sprayE 1.2395786 5.760421
## sprayF 14.4062453 18.927088
d.frame.m1<-data.frame(cbind(coef(m1),confint(m1)),pest=rownames(confint(m1))) #Combining the two
head(d.frame.m1,2)
## V1 X2.5.. X97.5.. pest
## sprayA 14.50000 12.23958 16.76042 sprayA
## sprayB 15.33333 13.07291 17.59375 sprayB
The results are provided in Figure 15.14. When doing an ANOVA to separate treatment effect it is not correct to give the standard error of the replicates within a treatment. It is based upon too few observations and essentially is a waste of ink. The only correct way is to use the ANOVA standard error. The trained weed and crop scientists will know this rather simple use of ANOVA results is not obeyed in a large proportion of papers with tables that give the individual treatments standard errors. But because this redundant presentation of data is generally accepted in most journals, whatever their impact factors, it should be discouraged.
Code to produce Figure15.14
ggplot(d.frame.m1, aes(x=pest, y=V1)) +
geom_bar(stat="identity", fill="lightgreen",
colour="black") +
geom_errorbar(aes(ymin=pmax(X2.5..,0),
ymax=X97.5..), width=0.5) +
xlab("Pesticide Spray")+
ylab("Insect Counts")
Paired barplots are another (more complicated, but often useful) way of arranging barplots. This is relevant for some factorial or split-plot designs. For example using data from a study looking at five crop species each grown in two different row orientations (east-west vs north-south), it makes sense to group the data by crops to look for consistent trends of row orientation (Figure 15.15).
Code to produce Figure15.15
CropRowOrientation<-read.csv("http://rstats4ag.org/data/CropRowOrientation.csv")
head(CropRowOrientation)
## Crop Row Yield
## 1 Barley East-west 1149
## 2 Wheat East-west 1195
## 3 Canola East-west 626
## 4 Field pea East-west 500
## 5 Lupine East-west 493
## 6 Barley North-south 856
c <- ggplot(CropRowOrientation)
c + aes(x = Crop, y = Yield, fill = Row, color = Row) +
geom_bar(stat="identity", width=.5, position="dodge")