Analysis of Variance (ANOVA) is a statistical method for testing whether the means of two or more groups differ. It tests a general (omnibus) hypothesis about the group means rather than specific differences between particular pairs of means.

Load packages

library(tidyverse)
library(agricolae)

Read input data

irisdata <- read_csv("irisdata.csv")
## Parsed with column specification:
## cols(
##   Sepal.Length = col_double(),
##   Sepal.Width = col_double(),
##   Petal.Length = col_double(),
##   Petal.Width = col_double(),
##   Species = col_character()
## )

Inspect input data

str(irisdata)
## Classes 'tbl_df', 'tbl' and 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ Sepal.Length: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Sepal.Width : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Petal.Length: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Petal.Width : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Species     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Convert into factors

read_csv() read Species as a character column (chr in the output above), not as a factor, so we convert it to a factor before fitting the model:

irisdata$Species <- as.factor(irisdata$Species)
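
A quick way to confirm that the conversion worked is sketched below. To keep the example self-contained it does not read irisdata.csv; instead it assumes that R's built-in iris data set, with Species turned back into a character column, is a reasonable stand-in for the file.

```r
# Stand-in for irisdata.csv (assumption): the built-in iris data,
# with Species stored as character to mimic what read_csv() returns.
irisdata <- iris
irisdata$Species <- as.character(irisdata$Species)

# Same conversion as in the tutorial
irisdata$Species <- as.factor(irisdata$Species)

is.factor(irisdata$Species)  # TRUE once the column is a factor
levels(irisdata$Species)     # the group labels that aov() will compare
table(irisdata$Species)      # number of observations in each group
```

The levels are the groups whose means the ANOVA compares, so it is worth checking that there are exactly as many as you expect.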

Fit a model and view residuals

ANOVA tests the omnibus null hypothesis that all three population means are equal. Here, our question is: does sepal length differ between species? After fitting the model, we plot its diagnostic plots to check the residuals.

model1 <- aov(Sepal.Length ~ Species, data = irisdata)
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(model1)
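
The diagnostic plots can be supplemented with formal tests. The sketch below is one possible approach, not part of the original analysis: it applies base R's shapiro.test() to the residuals (normality) and bartlett.test() to the groups (equal variances). It assumes the built-in iris data set as a stand-in for irisdata.csv.

```r
# Stand-in for irisdata.csv (assumption): the built-in iris data
irisdata <- iris
model1 <- aov(Sepal.Length ~ Species, data = irisdata)

# Shapiro-Wilk test on the residuals; null hypothesis: residuals are normal
shapiro_res <- shapiro.test(residuals(model1))

# Bartlett test; null hypothesis: the group variances are equal
bartlett_res <- bartlett.test(Sepal.Length ~ Species, data = irisdata)

shapiro_res$p.value
bartlett_res$p.value
```

A small p-value in either test suggests that the corresponding assumption may not hold. These tests are sensitive to sample size, so they are best read alongside the plots rather than instead of them.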

Get summary of model

summary(model1)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
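
To see where the numbers in this table come from, the F value can be rebuilt by hand as the ratio of the between-group mean square to the within-group (residual) mean square. The sketch below assumes the built-in iris data set as a stand-in for irisdata.csv; the variable names are illustrative only.

```r
# Stand-in for irisdata.csv (assumption): the built-in iris data
irisdata <- iris

grand_mean  <- mean(irisdata$Sepal.Length)
group_means <- tapply(irisdata$Sepal.Length, irisdata$Species, mean)
group_sizes <- tapply(irisdata$Sepal.Length, irisdata$Species, length)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: spread of each observation around its own group mean
ss_within <- sum((irisdata$Sepal.Length -
                  ave(irisdata$Sepal.Length, irisdata$Species))^2)

df_between <- length(group_means) - 1               # number of groups minus 1
df_within  <- nrow(irisdata) - length(group_means)  # observations minus groups

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(ss_between = ss_between, ss_within = ss_within, F = f_value, p = p_value)
```

If the CSV holds the same values as the built-in data, these quantities should agree with the Sum Sq, Df and F value columns of the summary table.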

The ANOVA table shows that there is a statistically significant difference in sepal length between the species (p < 2e-16). However, ANOVA alone cannot tell us which species differ significantly from which others.

To find this out, we need to perform a post-hoc test like Tukey HSD, which is discussed in the next post.
