Intro to R Part 3:

Data Wrangling


Andy Lyons
October 14, 2022

https://ucanr-igis.github.io/IntroR_Oct22/



Warm-Up Quiz


Today’s Outline

Data Wrangling: What do we mean?

Whatever is needed to get your data frame ready
for the function(s) you want to use for analysis and visualization.

AKA data munging, manipulation, transformation, etc.

Often includes one or more of:


What is “tidy data”?

R functions like tidy data!

Are these data ‘tidy’?

Better Ways to Import Data

  • a fixed number of rows
  • rows that start with a specific character (i.e., comments)


Data Wrangling with dplyr

An alternative (usually better) way to wrangle data frames than base R.

Part of the tidyverse.

Best way to familiarize yourself - explore the cheat sheet:


Example

Look at the storms tibble:

library(dplyr)
dim(storms)
## [1] 11859    13
head(storms)
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status categ…¹  wind press…² tropi…³
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>   <int>   <int>   <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1         25    1013      NA
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1         25    1013      NA
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1         25    1013      NA
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1         25    1013      NA
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1         25    1012      NA
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1         25    1012      NA
## # … with 1 more variable: hurricane_force_diameter <int>, and abbreviated
## #   variable names ¹​category, ²​pressure, ³​tropicalstorm_force_diameter


What month has the largest number of strong (category 3 or higher) storms?

We can answer this question in one easy-to-read expression:

storms %>% 
  select(name, year, month, category) %>%     ## select the columns we need
  filter(category >= 3) %>%                   ## filter those rows where category >= 3
  group_by(month) %>%                         ## group the rows by month
  summarize(num_strong_hurricanes = n())      ## for each group, report the count
## # A tibble: 5 × 2
##   month num_strong_hurricanes
##   <dbl>                 <int>
## 1     7                    18
## 2     8                   183
## 3     9                   524
## 4    10                   166
## 5    11                    40

Observe that with dplyr functions you generally don’t have to put column names in quotes

Many more examples in the exercise!


R Notebooks

R Notebooks are written in “R Markdown”, which combines text and R code.

Exercise 5: Wrangle the Penguins

Exercises 5 will be R Notebook!

We’ll be looking at the Palmer Penguins dataset.

Exercise 5 Topics

  1. Importing messy data from Excel
  2. Subsetting columns
  3. Calculating columns
  4. Sorting and subsetting rows
  5. Dealing with missing values
  6. Creating summaries

Break!


Joining Tables

To join two data frames based on a common field, you can use:

left_join(x, y, by)

where x and y are data frames, and by is the name of a column they have in common.

If there is only one column in common, and if it has the same name in both data frames, you can omit the by argument.

If the common column is named differently in the two data frames, you can deal with that by passing a named vector as the by argument. See below.

To illustrate a table join, we’ll first import a csv with some fake data about the genetics of different iris species:

# Create a data frame with additional info about the three IRIS species
iris_genetics <- data.frame(Species=c("setosa", "versicolor", "virginica"),
                          num_genes = c(42000, 41000, 43000),
                          prp_alles_recessive = c(0.8, 0.76, 0.65))

iris_genetics
##      Species num_genes prp_alles_recessive
## 1     setosa     42000                0.80
## 2 versicolor     41000                0.76
## 3  virginica     43000                0.65

We can join these additional columns to the iris data frame with left_join():

iris %>% 
  left_join(iris_genetics, by = "Species") %>% 
  slice(1:10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species num_genes
## 1           5.1         3.5          1.4         0.2  setosa     42000
## 2           4.9         3.0          1.4         0.2  setosa     42000
## 3           4.7         3.2          1.3         0.2  setosa     42000
## 4           4.6         3.1          1.5         0.2  setosa     42000
## 5           5.0         3.6          1.4         0.2  setosa     42000
## 6           5.4         3.9          1.7         0.4  setosa     42000
## 7           4.6         3.4          1.4         0.3  setosa     42000
## 8           5.0         3.4          1.5         0.2  setosa     42000
## 9           4.4         2.9          1.4         0.2  setosa     42000
## 10          4.9         3.1          1.5         0.1  setosa     42000
##    prp_alles_recessive
## 1                  0.8
## 2                  0.8
## 3                  0.8
## 4                  0.8
## 5                  0.8
## 6                  0.8
## 7                  0.8
## 8                  0.8
## 9                  0.8
## 10                 0.8


If you need to join tables on multiple columns, add additional column names to the by argument.

Join columns must be the same data type (i.e., both numeric or both character).

There are several variants of left_join(), the most common being right_join() and inner_join(). See help for details.

Joining Tables When the Column Name is Different

If the join column is named differently in the two tables, you can pass a named character vector as the by argument. A named vector is a vector whose elements have been assigned names. You can construct a named vector with c().

For example if the join column was named ‘SpeciesName’ in x, and just ‘Species’ in y, your expression would be:

left_join(x, y, by = c("SpeciesName" = "Species"))

Reshaping Data

Reshaping data includes:

The go-to Tidyverse package for reshaping data frames is tidyr


The two most common tidyr functions:

pivot_longer()

pivot_wider()


END!