Andy
Lyons
October 14, 2022
Whatever is needed to get your data frame ready
for the function(s) you want to use for analysis and visualization.
AKA data munging, manipulation, transformation, etc.
Often includes one or more of:
R functions like tidy data!
dplyr
An alternative (usually better) way to wrangle data frames than base R.
Part of the tidyverse.
Best way to familiarize yourself - explore the cheat sheet:
dplyr
Functionssubset rows | filter(), slice() |
order rows | arrange() |
pick column(s) | select(), pull() |
add new columns | mutate() |
join data frames | left_join(), right_join(), inner_join() |
summary stats for individual columns |
summarize() with: n(), mean(), median(), sum(), first(), sd(), IQR(), etc. |
group rows (i.e., change the unit of analysis) | group_by() |
Most dplyr
functions take a tibble, and
return a tibble.
This makes them very pipe friendly.
Look at the storms tibble:
## [1] 11859 13
## # A tibble: 6 × 13
## name year month day hour lat long status categ…¹ wind press…² tropi…³
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013 NA
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013 NA
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013 NA
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013 NA
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012 NA
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012 NA
## # … with 1 more variable: hurricane_force_diameter <int>, and abbreviated
## # variable names ¹category, ²pressure, ³tropicalstorm_force_diameter
What month has the largest number of strong (category 3 or higher) storms?
We can answer this question in one easy-to-read expression:
storms %>%
select(name, year, month, category) %>% ## select the columns we need
filter(category >= 3) %>% ## filter those rows where category >= 3
group_by(month) %>% ## group the rows by month
summarize(num_strong_hurricanes = n()) ## for each group, report the count
## # A tibble: 5 × 2
## month num_strong_hurricanes
## <dbl> <int>
## 1 7 18
## 2 8 183
## 3 9 524
## 4 10 166
## 5 11 40
Observe that with dplyr functions you generally don’t have to put column names in quotes
Many more examples in the exercise!
R Notebooks are written in “R Markdown”, which combines text and R code.
Exercises 5 will be R Notebook!
We’ll be looking at the Palmer Penguins dataset.
Exercise 5 Topics
To join two data frames based on a common field, you can use:
left_join(x, y, by)
where x and y are data frames, and by is the name of a column they have in common.
If there is only one column in common, and if it has the same name in both data frames, you can omit the by argument.
If the common column is named differently in the two data frames, you can deal with that by passing a named vector as the by argument. See below.
To illustrate a table join, we’ll first import a csv with some fake data about the genetics of different iris species:
# Create a data frame with additional info about the three IRIS species
iris_genetics <- data.frame(Species=c("setosa", "versicolor", "virginica"),
num_genes = c(42000, 41000, 43000),
prp_alles_recessive = c(0.8, 0.76, 0.65))
iris_genetics
## Species num_genes prp_alles_recessive
## 1 setosa 42000 0.80
## 2 versicolor 41000 0.76
## 3 virginica 43000 0.65
We can join these additional columns to the iris data frame with
left_join()
:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species num_genes
## 1 5.1 3.5 1.4 0.2 setosa 42000
## 2 4.9 3.0 1.4 0.2 setosa 42000
## 3 4.7 3.2 1.3 0.2 setosa 42000
## 4 4.6 3.1 1.5 0.2 setosa 42000
## 5 5.0 3.6 1.4 0.2 setosa 42000
## 6 5.4 3.9 1.7 0.4 setosa 42000
## 7 4.6 3.4 1.4 0.3 setosa 42000
## 8 5.0 3.4 1.5 0.2 setosa 42000
## 9 4.4 2.9 1.4 0.2 setosa 42000
## 10 4.9 3.1 1.5 0.1 setosa 42000
## prp_alles_recessive
## 1 0.8
## 2 0.8
## 3 0.8
## 4 0.8
## 5 0.8
## 6 0.8
## 7 0.8
## 8 0.8
## 9 0.8
## 10 0.8
If you need to join tables on multiple columns, add additional column
names to the by
argument.
Join columns must be the same data type (i.e., both numeric or both character).
There are several variants of left_join()
, the most
common being right_join()
and
inner_join()
. See help for details.
If the join column is named differently in the two tables, you can
pass a named character vector as the by
argument. A named
vector is a vector whose elements have been assigned names. You can
construct a named vector with c()
.
For example if the join column was named ‘SpeciesName’ in x, and just ‘Species’ in y, your expression would be:
left_join(x, y, by = c("SpeciesName" = "Species"))
Reshaping data includes:
The go-to Tidyverse package for reshaping data frames is tidyr
The two most common tidyr
functions:
pivot_longer()
pivot_wider()