dplyr is a package for making tabular data manipulation easier. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.
Load packages and data:
library(tidyverse)
irisdata <- read_csv("irisdata.csv")
## Parsed with column specification:
## cols(
## Sepal.Length = col_double(),
## Sepal.Width = col_double(),
## Petal.Length = col_double(),
## Petal.Width = col_double(),
## Species = col_character()
## )
Inspect the data
str(irisdata)
Selecting columns
To select columns of a data frame, use select(). The first argument to this function is the data frame (irisdata), and the subsequent arguments are the columns to keep.
select(irisdata, Species, Sepal.Length, Sepal.Width)
## # A tibble: 150 x 3
## Species Sepal.Length Sepal.Width
## <chr> <dbl> <dbl>
## 1 setosa 5.10 3.50
## 2 setosa 4.90 3.00
## 3 setosa 4.70 3.20
## 4 setosa 4.60 3.10
## 5 setosa 5.00 3.60
## 6 setosa 5.40 3.90
## 7 setosa 4.60 3.40
## 8 setosa 5.00 3.40
## 9 setosa 4.40 2.90
## 10 setosa 4.90 3.10
## # ... with 140 more rows
Selecting rows
To choose rows based on a specific criteria, use filter():
filter(irisdata, Species=="Setosa")
## Warning: package 'bindrcpp' was built under R version 3.3.3
## # A tibble: 0 x 5
## # ... with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## # Petal.Length <dbl>, Petal.Width <dbl>, Species <chr>
What if you want to select and filter at the same time?
There are three ways to do this: use intermediate steps, nested functions, or pipes. The most convenient way to do this is by using pipe %>%.
Enter Pipe
Pipes (%>%) take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac to get the pipe function.
Select only Sepal Length & Sepal Width of Setosa species
irisdata %>%
filter(Species=="Setosa") %>%
select(Species, Sepal.Length, Sepal.Width)
## # A tibble: 0 x 3
## # ... with 3 variables: Species <chr>, Sepal.Length <dbl>,
## # Sepal.Width <dbl>
In the above code, we use the pipe to send the irisdata dataset first through filter() to keep rows where Species was Setosa, then through select() to keep only the Species, Sepal.Length, and Sepal.Width. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more. It is easire to read the pipe like the word then. For instance, in the above example, we took the data frame irisdata, then we filtered for rows with Species==”Setosa”, then we selected columns Species, Sepal.Length, Sepal.Width.
Create new objects
If we want to create a new object with this smaller version of the data, we can assign it a new name:
setosadata <- irisdata %>%
filter(Species=="Setosa") %>%
select(Species, Sepal.Length, Sepal.Width)
setosadata
## # A tibble: 0 x 3
## # ... with 3 variables: Species <chr>, Sepal.Length <dbl>,
## # Sepal.Width <dbl>
Note that the final data frame is the leftmost part of this expression.
Create new columns
If we want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns, we’ll use mutate().
To create a new column of Sepal.Length in mm:
irisdata %>%
mutate(Sepal.Length_mm= Sepal.Length*10) %>% head()
## # A tibble: 6 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.10 3.50 1.40 0.200 setosa
## 2 4.90 3.00 1.40 0.200 setosa
## 3 4.70 3.20 1.30 0.200 setosa
## 4 4.60 3.10 1.50 0.200 setosa
## 5 5.00 3.60 1.40 0.200 setosa
## 6 5.40 3.90 1.70 0.400 setosa
## # ... with 1 more variable: Sepal.Length_mm <dbl>
<– Click here to go to the previous tutorial Click here to go to the next tutorial –>