Intro to R Part 3:

Data Wrangling

Andy Lyons
October 14, 2022

https://ucanr-igis.github.io/IntroR_Oct22/

Warm-Up Quiz

Today’s Outline

Overview of data wrangling
What is tidy data?
dplyr
Exercise 5
joining tables
reshaping tables
Exercise 6

Data Wrangling: What do we mean?

Whatever is needed to get your data frame ready
for the function(s) you want to use for analysis and visualization.

AKA data munging, manipulation, transformation, etc.

Often includes one or more of:

dropping columns
renaming columns
changing the order of columns
creating new columns with an expression
filtering rows
sorting rows
going from ‘long’ to ‘wide’ formats
joining data frames based on a common field
merging data frames together
splitting tables
aggregating rows into groups

What is “tidy data”?

R functions like tidy data!

Are these data ‘tidy’?

Better Ways to Import Data

how to interpret the file format
skip rows

a fixed number of rows
rows that start with a specific character (i.e., comments)

whether the file contain a header
option to rename columns
option to specify the data type for each column

Data Wrangling with `dplyr`

An alternative (usually better) way to wrangle data frames than base R.

Part of the tidyverse.

Best way to familiarize yourself - explore the cheat sheet:

Most Popular `dplyr` Functions

Wrangling

subset rows	filter(), slice()
order rows	arrange()
pick column(s)	select(), pull()
add new columns	mutate()
join data frames	left_join(), right_join(), inner_join()

Table summaries

summary stats for individual columns	summarize() with: n(), mean(), median(), sum(), first(), sd(), IQR(), etc.
group rows (i.e., change the unit of analysis)	group_by()

Chaining dplyr functions

Most dplyr functions take a tibble, and return a tibble.

This makes them very pipe friendly.

Example

Look at the storms tibble:

library(dplyr)
dim(storms)

## [1] 11859    13

head(storms)

## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status categ…¹  wind press…² tropi…³
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>   <int>   <int>   <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1         25    1013      NA
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1         25    1013      NA
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1         25    1013      NA
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1         25    1013      NA
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1         25    1012      NA
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1         25    1012      NA
## # … with 1 more variable: hurricane_force_diameter <int>, and abbreviated
## #   variable names ¹category, ²pressure, ³tropicalstorm_force_diameter

What month has the largest number of strong (category 3 or higher) storms?

We can answer this question in one easy-to-read expression:

storms %>% 
  select(name, year, month, category) %>%     ## select the columns we need
  filter(category >= 3) %>%                   ## filter those rows where category >= 3
  group_by(month) %>%                         ## group the rows by month
  summarize(num_strong_hurricanes = n())      ## for each group, report the count

## # A tibble: 5 × 2
##   month num_strong_hurricanes
##   <dbl>                 <int>
## 1     7                    18
## 2     8                   183
## 3     9                   524
## 4    10                   166
## 5    11                    40

Observe that with dplyr functions you generally don’t have to put column names in quotes

Many more examples in the exercise!

R Notebooks

R Notebooks are written in “R Markdown”, which combines text and R code.

Exercise 5: Wrangle the Penguins

Exercises 5 will be R Notebook!

We’ll be looking at the Palmer Penguins dataset.

Exercise 5 Topics

Importing messy data from Excel
Subsetting columns
Calculating columns
Sorting and subsetting rows
Dealing with missing values
Creating summaries

Break!

Joining Tables

To join two data frames based on a common field, you can use:

left_join(x, y, by)

where x and y are data frames, and by is the name of a column they have in common.

If there is only one column in common, and if it has the same name in both data frames, you can omit the by argument.

If the common column is named differently in the two data frames, you can deal with that by passing a named vector as the by argument. See below.

To illustrate a table join, we’ll first import a csv with some fake data about the genetics of different iris species:

# Create a data frame with additional info about the three IRIS species
iris_genetics <- data.frame(Species=c("setosa", "versicolor", "virginica"),
                          num_genes = c(42000, 41000, 43000),
                          prp_alles_recessive = c(0.8, 0.76, 0.65))

iris_genetics

##      Species num_genes prp_alles_recessive
## 1     setosa     42000                0.80
## 2 versicolor     41000                0.76
## 3  virginica     43000                0.65

We can join these additional columns to the iris data frame with left_join():

iris %>% 
  left_join(iris_genetics, by = "Species") %>% 
  slice(1:10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species num_genes
## 1           5.1         3.5          1.4         0.2  setosa     42000
## 2           4.9         3.0          1.4         0.2  setosa     42000
## 3           4.7         3.2          1.3         0.2  setosa     42000
## 4           4.6         3.1          1.5         0.2  setosa     42000
## 5           5.0         3.6          1.4         0.2  setosa     42000
## 6           5.4         3.9          1.7         0.4  setosa     42000
## 7           4.6         3.4          1.4         0.3  setosa     42000
## 8           5.0         3.4          1.5         0.2  setosa     42000
## 9           4.4         2.9          1.4         0.2  setosa     42000
## 10          4.9         3.1          1.5         0.1  setosa     42000
##    prp_alles_recessive
## 1                  0.8
## 2                  0.8
## 3                  0.8
## 4                  0.8
## 5                  0.8
## 6                  0.8
## 7                  0.8
## 8                  0.8
## 9                  0.8
## 10                 0.8

If you need to join tables on multiple columns, add additional column names to the by argument.

Join columns must be the same data type (i.e., both numeric or both character).

There are several variants of left_join(), the most common being right_join() and inner_join(). See help for details.

Joining Tables When the Column Name is Different

If the join column is named differently in the two tables, you can pass a named character vector as the by argument. A named vector is a vector whose elements have been assigned names. You can construct a named vector with c().

For example if the join column was named ‘SpeciesName’ in x, and just ‘Species’ in y, your expression would be:

left_join(x, y, by = c("SpeciesName" = "Species"))

Reshaping Data

Reshaping data includes:

turning rows into columns (aka pivot tables, cross tab query)
turning columns into rows

The go-to Tidyverse package for reshaping data frames is tidyr

The two most common tidyr functions:

pivot_longer()

pivot_wider()