Although your data may be in a spreadsheet or CSV file, it may not be in an ideal format. See this registration list:
Here we see a bunch of stuff before the data, the column “names” are unwieldy, there are several extra columns we don’t need, missing values, etc.
You can tackle many of these issues when you import the data. First, we construct the file name and verify it exists:
reg_list_fn <- "./data/SurveyResults_36100.xlsx"
file.exists(reg_list_fn)
[1] TRUE
To import Excel files, we’ll go to read_xlsx()
from the
readxl package. This function has a lot of options, so it’s
worth reading the help page:
library(readxl)
reg_list_tbl <- read_xlsx(path = reg_list_fn,
sheet = "Survey Output",
skip = 3,
col_names = c("user", "reg_date", "fname", "lname",
"email", "title", "org",
"cosponsor", "allergies"),
col_types = c("text", "text", "text", "text", "text", "text",
"numeric", "text", "text"))
reg_list_tbl
Likewise, the readr package provides super-charged functions for importing and exporting rectangular data in text formats.
Take Home Messages
Data wrangling starts when you import
You’re not limited to just using read.csv()
dplyr
methodsLoad dplyr:
library(dplyr)
We can proactively prevent function name clashes with the
conflicted
package:
## Load the conflicted package
library(conflicted)
# Set conflict preferences
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
conflict_prefer("arrange", "dplyr", quiet = TRUE)
We begin by looking at the Palmer Penguins data:
library(palmerpenguins)
head(penguins)
dplyr provides glimpse()
which is an alternative way
preview a tibble:
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad~
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, ~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.0, 20.7, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194, 174, 18~
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, ~
$ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female, male, male, female, female, ~
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Suppose we want to compare how bill length and flipper length look for each species. We could start by selecting just those columns:
penguins |>
select(species, bill_length_mm, flipper_length_mm) |>
head()
You can also rename columns with
select()
(dplyr also has a rename()
function):
penguins |>
select(species, bill = bill_length_mm, flipper = flipper_length_mm) |>
head()
If you want to extract the values from a single column and get the
results back as a vector, use pull()
(similar to the
$
operator):
penguins |> pull(species) |> table()
Adelie Chinstrap Gentoo
152 68 124
To add a column, we can use mutate()
.
Let’s create a new column with the ratio of the bill to flipper
length:
peng_billflip_tbl <- penguins |>
select(species, bill_length_mm, flipper_length_mm) |>
mutate(bill_flip_ratio = bill_length_mm / flipper_length_mm)
head(peng_billflip_tbl)
mutate()
can also be used to modify existing columns,
for example we could round bill_flip_ratio to 3 decimal places.
peng_billflip_tbl <- peng_billflip_tbl |>
mutate(bill_flip_ratio = round(bill_flip_ratio, 3))
head(peng_billflip_tbl)
Add a column called bill_flip_sum
which represents the
sum of the bill and flipper length.
Hint: use mutate()
## Your answer here
You can sort a table using arrange()
,
passing the column(s) you want to sort by. The default sort order is
ascending, but you can change that to descending by wrapping the column
name in desc()
.
Sort by flipper size (smallest to largest):
peng_billflip_tbl |>
arrange(flipper_length_mm) |>
head()
Largest to smallest:
peng_billflip_tbl |>
arrange(desc(flipper_length_mm)) |>
head()
The two functions most commonly used to subset rows include
filter()
and slice()
.
use slice()
to subset based on indices
use filter()
to subset based on a logical
expressions
Logical expressions you can use in filter()
include:
Suppose we only wanted to work with observations of Gentoo penguins:
gentoo_billflip_tbl <- peng_billflip_tbl |>
filter(species == "Gentoo")
glimpse(gentoo_billflip_tbl)
Rows: 124
Columns: 4
$ species <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Ge~
$ bill_length_mm <dbl> 46.1, 50.0, 48.7, 50.0, 47.6, 46.5, 45.4, 46.7, 43.3, 46.8, 40.9, 49.0, 45.5, 48.4, 45.8, 49.3, 42.0, 49.2~
$ flipper_length_mm <int> 211, 230, 210, 218, 215, 210, 211, 219, 209, 215, 214, 216, 214, 213, 210, 217, 210, 221, 209, 222, 218, 2~
$ bill_flip_ratio <dbl> 0.218, 0.217, 0.232, 0.229, 0.221, 0.221, 0.215, 0.213, 0.207, 0.218, 0.191, 0.227, 0.213, 0.227, 0.218, 0~
If you want to filter on two criteria, add a second expression to
filter()
:
peng_billflip_tbl |>
filter(species == "Gentoo", flipper_length_mm > 220) |>
select(species, flipper_length_mm) |>
glimpse()
Rows: 35
Columns: 2
$ species <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Ge~
$ flipper_length_mm <int> 230, 221, 222, 222, 230, 225, 222, 225, 225, 224, 221, 231, 230, 229, 223, 221, 221, 230, 223, 221, 224, 2~
Suppose we wanted to identify the top 5 Gentoo observations with the
largest bill_flip_ratio value. We can do that using a combination of
arrange()
and slice()
:
gentoo_billflip_tbl |>
arrange(desc(bill_flip_ratio)) |>
slice(1:5)
We could also use top_n()
:
gentoo_billflip_tbl |>
top_n(5, bill_flip_ratio) |>
arrange(desc(bill_flip_ratio))
Write an expression that pulls out just the males on Torgersen Island. HOw many are there?
Tip: For challenge questions 2 & 3, mix and match the following functions:
select()
, mutate()
, pull()
,
arrange()
, filter()
, slice()
## Your answer here
Ans. 23
COmpute the ratio of body mass to bill length and add it as a new column. Return just the species, sex, and this ratio.
# Your answer here
That’s it! Remember to save your work to render a copy of your Notebook in HTML