Data Visualization & Plotting

Visualization

Pictures and graphs may say more than 1000 words. After collection of data, the hard work of analyzing them according to the experimental design plans begins. But before you dig too much into statistics and jump to conclusions on the basis of the statistical tests, you should actually look at the data graphically. You must try to illustrate the data in such a way that it is informative not only to you, but also to potential readers of the project or the paper you wish to publish. In a way you could look at the visualization as the interface between science and art. One of the best examples of this merger of science and art is the campaign of Napoleon in Russia 1812-1813.

image from Joseph Minard (1781-1870)

Figure 1. Charles Joseph Minard (1781-1870) is most widely known for a single work, his illustration of the fate of Napoleon' s Grand Army in Russia
http://www.datavis.ca/gallery/re-minard.php

The illustration above is called the best graphs ever produced in thematic cartography and statistical graphics. In short, it shows the size of the Army at the beginning of the attack and at the final retreat, some of the more important battles and the temperature during the retreat from Moscow. It contains so many dimensions and so much information and history that usually fill hundreds of pages in history books.

Our work in weed science will probably never get that exciting for the general public, our discipline is an esoteric one, which, however, plays a great role in the food production and sustainability of the farming and environment in the world.

Visualization of data is helpful to get a first impression of the quality of data and to see if some observations stand out from the rest. Last but no least it helps us observe if the data show a systematic relationship between dependent variable, y, and independent variable, x.The example below comes with basic R. Notice the use of ~ in R. The = is often used in other statistical programmes.

plot(dist ~ speed, data = cars, 
     ylab="Speed of Cars(mph)", xlab="Distances taken to Stop (ft)",
     main="Recorded in the 1920s.")

plot of chunk unnamed-chunk-1

Figure 2. The speed of cars and the distances required for the car to stop. Note that the data were recorded in the 1920s. The tilde ~ can be difficult to find at you keyboard, depending on you language, but it is the defined ASCII code 126 = ~ ( Tilde ; swung dash ). In Windows you can press the ALT key activate Num. Lock and press 126.

If we have some data where the “treatment” is not based upon a continuous x variable, then we can illustrate the variation with a boxplot. These data also come with basic R.

par(mfrow = c(1,2))
head(InsectSprays)
##   count spray
## 1    10     A
## 2     7     A
## 3    20     A
## 4    14     A
## 5    14     A
## 6    12     A
plot(count ~ spray, data = InsectSprays)
plot(count ~ as.numeric(spray), data = InsectSprays)

plot of chunk unnamed-chunk-2

Figure 3. Counts of insects in agricultural experimental units treated with different insecticides. Note that as.numeric() function turns the letters into numerical values.

Figure 3 illustrates how an ordinary plot() works in R; if the x-axis is based on factors (A-F) you get automatically a boxplot, where the median and the quartiles are shown; and it gives indication of extreme observations. On the other hand if we convert the factors to numerical x by using the argument as.numeric() we get an ordinary continuous x variable. Factor A becomes 1 and factor B becomes 2 and so on and so forth. In this particular data we can change the names (e.g., F is becoming A and A becoming F) of the various sprays and it will just change the order on the x-axis. If doing so it would have a great influence on perceived relationship.

For data with continuous x we can also use the boxplot()function

boxplot(count ~ as.numeric(spray), data = InsectSprays)

plot of chunk unnamed-chunk-3

Figure 4. The counts of insects in agricultural experimental units treated with different insecticides.The names of the various insecticides have been converted to continuous numbers.

A field experiment, with mechanical control of weeds in cereals, was done by measuring the effect on leaf cover as a function of number of harrow runs. The harrowing was done either along the crop rows or across the crop rows at a certain stage of cereal development.

leafCover<-read.csv("http://rstats4ag.org/data/LeafCover.csv")
head(leafCover,3)
##         Leaf Intensity Direction
## 1 0.18623052         0     Along
## 2 0.12696533         1     Along
## 3 0.09006582         2     Along

If we have more than one relationship within a data set, we can separate them as done below. There is a classification with the factors levels: Across and Along. In Figure 5 you can see how you can identify the observation within harrowing Along and Across

plot(Leaf~Intensity, data = leafCover,
     pch = as.numeric(Direction))
legend("topright", levels(leafCover$Direction), pch=c(1,2))

plot of chunk unnamed-chunk-5

Figure 5. The leaf area of weeds and crop in relation of the intensity of harrowing either across or along the crop rows. Note how we changed Across and Along to numerical values by using the function as.numeric() in order to get different symbols

Obviously, there is a relationship, and this will be properly addressed in the ANCOVA chapter. There are also other plotting facilities that can be used the xyplot() in the package,lattice, which comes with basic R.

library(lattice)
xyplot(Leaf ~ Intensity | Direction, data = leafCover)

plot of chunk unnamed-chunk-6

Figure 6. The individual plots for each classification variable. Notice that the variable after the | defines the classification.

One of the classical examples in statistics are the Fisher's or Anderson's iris data that give the measurements in centimeters of the sepal length and width and petal length and width, respectively, for each of three species of Iris.There is a neat way to show relationship between the various measurements on the same entity by using the pairs() function. The pairs() function is an extremely strong way to illustrate relationships between variable.

head(iris,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

plot of chunk unnamed-chunk-7

Figure 7. Sepal and petal measurements on 50 flowers from each of three species of Iris: setosa, versicolor, and virginica.

It is obvious from Figure 7 above that the relationships sometimes seem to be separated into two groups. However, knowing that the data are separated into three Iris species, we could once more use as.numeric() to identify which data belong to which species.

pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris,
     col = as.numeric(iris$Species))

plot of chunk unnamed-chunk-8

Figure 8. By using the argument col = as.numeric(iris$Species) the colors separate the various species; there are obvious differences in the relationships among species.

The pairs() function has even more power than seen above. Together with a function called panel.cor() that can be found in the examples in pairs() at the end of the help page (write ?pairs and go to the examples at the bottom of the help page). However, if you find outlayers in data and wish to omit them then by using the original panel function you must omit the whole observations.Here, the function panel.cor has been improved by using the argument (use = "pairwise.complete.obs").

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
 r <- abs(cor(x, y, use = "pairwise.complete.obs"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

For the average R user and the author of this chapter, it looks very complicated and it probably is; but by running it before you run the pairs() you get a fantastic overview of relationships among variables.

In soils and plant science you often have measured numerous variable, i.e. elements or chemicals or morphological attributes. Before launching the statistics weaponry it would be nice to see the relationships between these variables. Such as in Figures 9 and 10 below.

Nitrate <- read.table("http://rstats4ag.org/data/nitrate.txt", header=T)
head(Nitrate,3)
##   age no3n   pH   Na    K
## 1   0 0.13 4.91 4.36 0.14
## 2   0 0.15 5.00 3.91 0.21
## 3   0 0.05 5.37 3.98 0.16

We execute the panel.cor() function and call it within the pairs() function as shown below and get a fantastic overview of the relationships and their correlations. Note that the correlations are absolute.

pairs(Nitrate, lower.panel = panel.smooth, upper.panel = panel.cor)

plot of chunk unnamed-chunk-11

Figure 9. The lower part of the panel shows the relationships among variables of some soil samples. The upper part shows the absolute correlation coefficients the sizes of which are based upon the magnitude of the correlation.

It is clear from the plot above that for K has virtually no correlations. This is solely due to an observation with a rather high K concentration, seen in the last panel row in Figure 9.

Firstly, we will find the observation with the largest K-value

OBS<-which.max(Nitrate[["K"]])
OBS
## [1] 10

The value is in the 10th observation

Secondly, we wish to change this putative outlayer to a missing value and it is done by

Nitrate1<-Nitrate
Nitrate1[10,5]<-NA #This is missing value in R

In order not to change the original data we have defined a new data set

pairs(Nitrate1, lower.panel = panel.smooth, upper.panel = panel.cor)

plot of chunk unnamed-chunk-14

Figure 10. K has now gotten a higher correlation because the concentration in th 10th observation has been set to missing value. Notice the other correlations have not changed, only K.

Within the R community there is an array of versatile packages to illustate data, but in order to keep the options down, we will only show those that could be of immediate interest. The numbers of graphic interfaces are numerous and increasing, some are easier to use than others, and sometimes it takes a rather large amount of time to get the hang of it. One of those are the package ggplot2, the branding is “the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts”.The example in Figure 11 is based on ggplot2.
It is a package with an entirely different concept and it is an elegant alternative to the base graphics and lattice plotting systems. We will only show few examples to whet you interest for more, which you can get at the URL below

http://ggplot2.org/

we use a barplot and confidence intervals from a one way analysis of variance, ANOVA.

m1<-lm(count~spray-1,data=InsectSprays)
#Note the -1 presents the parameteres as sheer means not as differences 

The next step is to turn the relevant results from the ANOVA into a data.frame, which is the only data format ggplot2 accepts

coef(m1)#the means
##    sprayA    sprayB    sprayC    sprayD    sprayE    sprayF 
## 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
confint(m1)#teh conficence intervals
##             2.5 %    97.5 %
## sprayA 12.2395786 16.760421
## sprayB 13.0729119 17.593755
## sprayC -0.1770881  4.343755
## sprayD  2.6562453  7.177088
## sprayE  1.2395786  5.760421
## sprayF 14.4062453 18.927088
d.frame.m1<-data.frame(cbind(coef(m1),confint(m1)),pest=rownames(confint(m1))) #Combining the two
head(d.frame.m1,2)
##              V1   X2.5..  X97.5..   pest
## sprayA 14.50000 12.23958 16.76042 sprayA
## sprayB 15.33333 13.07291 17.59375 sprayB

Figure 11 illustrates the results. When doing an ANOVA to separate treatment effect it is not correct to give the standard error of the replicates within a treatment. It is based upon too few observations and essential is waste of ink. The only correct way is to use the powerful ANOVA standard error. The trained weed and crop scientists will know this rather simple use of ANOVA results is not obeyed in a large proportion of papers with tables that give the individual treatments standard errors. But because this redundant presentation of data is generally accepted in most journals, whatever their impact factors, it should be discouraged. But perhaps this authors is tilting against windmill?

library(ggplot2)
ggplot(d.frame.m1, aes(x=pest, y=V1)) +
  geom_bar(stat="identity", fill="lightgreen",
           colour="black") +
  geom_errorbar(aes(ymin=pmax(X2.5..,0), 
                    ymax=X97.5..), width=0.5) +
  xlab("Pesticide Sprays")+
  ylab("Counts of Insects")

plot of chunk unnamed-chunk-17

Figure 11. InsectSpray data in a barplot with 95 % confidence intervals. There are numerous way to improve the plot for publication and interested students should consult http://ggplot2.org/

Another more complicated way of arranging barplots is shown in Figure 12

CropRowOrientation<-read.csv("http://rstats4ag.org/data/Crop%20Row%20Orientation.csv")
head(CropRowOrientation)
##          Crop          Row Yield
## 1      Barley    East-west  1149
## 2       Wheat    East-west  1195
## 3      Canola    East-west   626
## 4   Field pea    East-west   500
## 5      Lupine    East-west   493
## 6      Barley  North-south   856
library(ggplot2)
c <- ggplot(CropRowOrientation)
c + aes(x = Crop, y = Yield, fill = Row, color = Row) +
      geom_bar(stat="identity", width=.5, position="dodge")

plot of chunk unnamed-chunk-18

Figure 12. The effect on yields of the orientation of direction of sowing for five crops.The illustration is based on means not raw data, so there are no error bars

2 Comments

  1. In the example given to do bar plots with standard deviation, it takes too long to try to understand the code to copy and apply it, if we don't know the "SprayMean" file to see how to consider variables.
    Referring to:
    "library(dplyr)
    SprayMean <- InsectSprays %.%
    group_by(spray) %.%"

    It's difficult to see how to apply it without the original data.

    The ggplot2 is possible to understand and apply because the data set is given and one can analyze it and understand. But then we cannot add standard errors and/or standard deviations. In the example given in the "R" help page, we cannot apply it with the example given since the data set created with vectors make it difficult to apply when we have own set of data. For people that is not into programming and hardly understands why a matrix is different from the original set of data one has, it comes a little bit frustrating to try to apply R for graphics.

  2. Dear Gabiela, sorry to be so late but I did not expect questions so far because it is not yet being published. However, I have corrected the webpage. I now use the lm() and extract the important means and confidence intervals. furthermore I have changed the barplot to a qqplot2 to get a better look;-)
    If you have already solved the problem then this might come too late.
    If you want the Confidence intervals in a plot like the last one then we have to look how to get it:-)

Leave a Reply

Your email address will not be published. Required fields are marked *