Load packages and data:

library(tidyverse)
irisdata <- read_csv("irisdata.csv")
## Parsed with column specification:
## cols(
##   Sepal.Length = col_double(),
##   Sepal.Width = col_double(),
##   Petal.Length = col_double(),
##   Petal.Width = col_double(),
##   Species = col_character()
## )

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr supports this through the group_by() function, which is often used together with summarize() to collapse each group into a single-row summary. group_by() takes as arguments the names of the columns that contain the categorical variables you want to group by. For example, to compute the mean Sepal.Length by Species:

irisdata %>% 
  group_by(Species) %>% 
  summarize(Mean_Length = mean(Sepal.Length))
## # A tibble: 3 x 2
##   Species    Mean_Length
##   <chr>            <dbl>
## 1 setosa            5.01
## 2 versicolor        5.94
## 3 virginica         6.59
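
summarize() is not limited to one statistic per group. The following is a minimal sketch of computing several summaries in one call. It assumes the built-in iris data frame as a stand-in for irisdata.csv so that it runs without the file; the column names SD_Length and N are illustrative choices, not part of the original tutorial.

```r
library(dplyr)

# Assumption: R's built-in `iris` data frame stands in for irisdata.csv
iris_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    Mean_Length = mean(Sepal.Length),  # mean per group
    SD_Length   = sd(Sepal.Length),    # standard deviation per group
    N           = n()                  # number of rows per group
  )

iris_summary
```

Each argument to summarize() becomes one column of the result, and there is still one row per group.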

Tallying

To know the number of observations found for each factor or combination of factors, we can use tally(). For example, if we wanted to group by Species and find the number of rows of data for each Species, we would do:

irisdata %>% 
  group_by(Species) %>% 
  tally()
## # A tibble: 3 x 2
##   Species        n
##   <chr>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50

Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category.
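
As a hedged sketch, dplyr's count() can be used as a shorthand for the group_by() plus tally() pattern, and it also accepts several columns to count combinations. The example again assumes the built-in iris data frame; the Long_Sepal column and its 5.8 cutoff are hypothetical, added only so there is a second grouping variable.

```r
library(dplyr)

# Assumption: built-in `iris` stands in for irisdata.csv
by_tally <- iris %>%
  group_by(Species) %>%
  tally()

# count(Species) groups by Species and tallies in one step
by_count <- iris %>%
  count(Species)

# Hypothetical second variable, to count a combination of factors
by_combo <- iris %>%
  mutate(Long_Sepal = Sepal.Length > 5.8) %>%
  count(Species, Long_Sepal)

by_combo
```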

Dealing with missing data

Missing data is coded as NA in R. To remove rows with missing values, we can insert a filter() into the chain. is.na() is a function that determines whether a value is NA, and the ! operator negates the result. So, to keep every row where Sepal.Length is not NA, we can do the following:

irisdata %>% 
  filter(!is.na(Sepal.Length)) %>% 
  mutate(Sepal.Length_mm = Sepal.Length*10) %>% 
  select(Species, Sepal.Length, Sepal.Length_mm) %>% 
  head()
## # A tibble: 6 x 3
##   Species Sepal.Length Sepal.Length_mm
##   <chr>          <dbl>           <dbl>
## 1 setosa          5.10            51.0
## 2 setosa          4.90            49.0
## 3 setosa          4.70            47.0
## 4 setosa          4.60            46.0
## 5 setosa          5.00            50.0
## 6 setosa          5.40            54.0
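
To see what the filter actually does, it helps to try it on data that contains a missing value. This is a small sketch using a hypothetical toy data frame (the values are made up for illustration), since the example above shows no visible change when every row is complete.

```r
library(dplyr)

# Hypothetical toy data with one missing measurement
toy <- data.frame(
  Species      = c("setosa", "setosa", "virginica"),
  Sepal.Length = c(5.1, NA, 6.3)
)

# is.na() flags the missing entry; ! flips the flags
is.na(toy$Sepal.Length)
!is.na(toy$Sepal.Length)

# Keep only the rows where Sepal.Length is present
complete <- toy %>%
  filter(!is.na(Sepal.Length))

# The mean of a column containing NA is itself NA,
# while the mean of the filtered column is an ordinary number
mean(toy$Sepal.Length)
mean(complete$Sepal.Length)
```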

Exporting data

We can use the write_csv() function to export data. First, create a new data frame to export:

setosadata <- irisdata %>%
  filter(Species == "setosa") %>%
  select(Species, Sepal.Length, Sepal.Width)

Now, export this new data frame. The output directory must already exist, because write_csv() does not create it:

write_csv(setosadata, "output/setosadata.csv")
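
A minimal end-to-end sketch of the export step follows. It assumes the built-in iris data frame in place of irisdata.csv and writes into a temporary directory (via tempdir()) rather than the tutorial's output folder, so it can be run without touching your project files. The names out_dir, out_file, setosa_example and round_trip are illustrative.

```r
library(dplyr)
library(readr)

# Assumption: write into a temporary folder instead of the project's output/
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Assumption: built-in `iris` stands in for irisdata.csv
setosa_example <- iris %>%
  filter(Species == "setosa") %>%
  select(Species, Sepal.Length, Sepal.Width)

# File location is passed by position as the second argument
out_file <- file.path(out_dir, "setosadata.csv")
write_csv(setosa_example, out_file)

# Read the file back to confirm the export worked
round_trip <- read_csv(out_file)
nrow(round_trip)
```

Reading the file back in is a quick way to check that the row count and column names survived the export.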
