Data Wrangling with R Part II

October 17, 2025
Andy Lyons

https://ucanr-igis.github.io/DataWranglingR/

Workshop Goals

Gain a better understanding of the fundamentals of data wrangling
Be able to find the packages and functions that can do what you need
Grow your library of working code recipes
Be better equipped to troubleshoot your code
Come out slightly higher on the learning curve!

Review of Part I

Tidy Data

each variable is saved in its own column
each observation is saved in its own row
each ‘type’ of observation is saved in its own table

Smarter ways to import data

specialized import packages like:

readr
readxl
googlesheets4
haven

these packages have import functions with arguments to:

select which columns to import
rename columns
define column types
specify how missing values are encoded
skip rows
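
A minimal sketch of several of these arguments in one `readr::read_csv()` call. The CSV text, the column names, and the `-99` missing-value code are all made up for illustration, and `I()` (readr 2.0 or later) marks the string as literal data rather than a file path.

```r
library(readr)

# Hypothetical file contents: a title row, a header row, then three records
csv_text <- I("Survey export 2025\nid,site,temp_c,notes\n1,A,21.5,ok\n2,B,-99,check\n3,C,19.0,\n")

obs_tbl <- read_csv(
  csv_text,
  skip = 1,                                # skip the title row above the header
  col_select = c(id, site, temp_c),        # import only these columns
  col_types = cols(
    id = col_integer(),
    temp_c = col_double(),
    .default = col_character()
  ),                                       # define column types up front
  na = c("", "NA", "-99")                  # strings to read as missing
)

obs_tbl
```

Declaring `col_types` explicitly avoids relying on readr's guesses, which can change when the first rows of a file change.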

Row manipulations: `dplyr`

filter(), slice()
arrange()
slice_min(), slice_max()
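
A short sketch of these row verbs on a made-up table (`scores_tbl` and its values are hypothetical):

```r
library(dplyr)

scores_tbl <- tibble(
  student = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  score   = c(88, 92, 75, 92, 60)
)

passing_tbl <- scores_tbl |> filter(score >= 70)       # keep rows that meet a condition
first_two   <- scores_tbl |> slice(1:2)                # keep rows by position
sorted_tbl  <- scores_tbl |> arrange(desc(score))      # sort, highest score first
top_tbl     <- scores_tbl |> slice_max(score, n = 1)   # ties are kept by default
low_tbl     <- scores_tbl |> slice_min(score, n = 1)   # row with the lowest score
```

Because `slice_max()` keeps ties by default, `top_tbl` has two rows here (Ben and Dee); pass `with_ties = FALSE` if you need exactly `n` rows.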

Column manipulations: `dplyr`

rename(), rename_with()
select(), pull()
mutate()

Useful functions to use inside `mutate()`

min_rank(), dense_rank(), row_number()
if_else(), case_when()
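
A sketch of these helpers inside a single `mutate()` call. The table and the grade cut-offs are hypothetical, and `.default` in `case_when()` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

scores_tbl <- tibble(
  student = c("Ana", "Ben", "Cy", "Dee"),
  score   = c(88, 92, 75, 92)
)

ranked_tbl <- scores_tbl |>
  mutate(
    rank_min   = min_rank(desc(score)),    # ties share a rank, next rank is skipped: 3, 1, 4, 1
    rank_dense = dense_rank(desc(score)),  # ties share a rank, no gaps: 2, 1, 3, 1
    row_id     = row_number(),             # position in the table: 1, 2, 3, 4
    passed     = if_else(score >= 80, "pass", "fail"),
    grade      = case_when(
      score >= 90 ~ "A",
      score >= 80 ~ "B",
      .default = "C or below"
    )
  )
```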

Text manipulations: `stringr`

str_to_lower()
str_replace_all()
str_split_i()
str_trim()
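
These four functions chain together naturally. A sketch on made-up species names (`str_split_i()` assumes stringr 1.5.0 or later):

```r
library(stringr)

raw_names <- c("  Quercus Lobata ", "Pinus-ponderosa", "Acer  macrophyllum")

clean_names <- raw_names |>
  str_trim() |>                    # drop leading and trailing whitespace
  str_to_lower() |>                # standardize case
  str_replace_all("[- ]+", "_")    # runs of hyphens or spaces become one underscore

# Take the first piece of each name after splitting on the underscore
genus <- str_split_i(clean_names, "_", 1)
```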

Split columns

mutate() + str_split_i() + str_trim()
tidyr::separate_wider_delim()
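
Both approaches side by side, as a sketch on a hypothetical `site` column (`separate_wider_delim()` assumes tidyr 1.3.0 or later):

```r
library(dplyr)
library(stringr)
library(tidyr)

sites_tbl <- tibble(site = c("Davis, CA", "Reno , NV"))

# Approach 1: pull out one piece at a time, trimming stray spaces
split1_tbl <- sites_tbl |>
  mutate(
    city  = str_trim(str_split_i(site, ",", 1)),
    state = str_trim(str_split_i(site, ",", 2))
  )

# Approach 2: split into named columns in one step, then trim
split2_tbl <- sites_tbl |>
  separate_wider_delim(site, delim = ",", names = c("city", "state")) |>
  mutate(across(c(city, state), str_trim))
```

Approach 1 keeps the original `site` column; approach 2 drops it by default.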

Join tables based on matching column(s): `dplyr`

left_join()
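
A sketch with two made-up tables that share a `student_id` column. Every row of the left table is kept, and columns from the right table are `NA` where no match exists.

```r
library(dplyr)

quiz_tbl   <- tibble(student_id = c(1, 2, 3), score = c(88, 92, 75))
roster_tbl <- tibble(student_id = c(1, 2), name = c("Ana", "Ben"))

joined_tbl <- quiz_tbl |>
  left_join(roster_tbl, by = "student_id")   # student 3 has no roster entry, so name is NA
```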

Group rows: `dplyr`

group_by()
summarize()
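
A sketch of the group-then-summarize pattern on hypothetical quiz scores:

```r
library(dplyr)

quiz_tbl <- tibble(
  section = c("A", "A", "B", "B", "B"),
  score   = c(80, 90, 70, NA, 90)
)

summary_tbl <- quiz_tbl |>
  group_by(section) |>
  summarize(
    n_students = n(),                        # rows per group, including the NA row
    mean_score = mean(score, na.rm = TRUE)   # ignore missing scores
  )
```

The result has one row per section.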

ChatGPT

Why do I have to learn this?

Can’t ChatGPT take care of it for me?

Maybe

Let's explore that, shall we...

What Are Your Goals for Using GenAI?

To get working code quickly
To improve my code
To figure out my workflow
As a really nice search engine; to look things up
To learn R

GenAI tools can do all of these things.

But how you use GenAI tools depends on your goals.

Learning R with GenAI

https://www.tidyverse.org/blog/2025/04/learn-tidyverse-ai/

Highlights

ChatGPT will give you code.
It may or may not work.
It may or may not be good code (often not).
You should always cross-check with the documentation.

“Use code completion tools sparingly if you’re a new user.”

Suggestions for Using GenAI to Learn R

ChatGPT is good for getting starter code, discovering functions and arguments, solving simple, well-defined tasks, etc.
Break tasks into smaller steps.
If your goal is to learn, ask a series of questions as though you were in Office Hours.
Be wary of code you don’t understand. And always test it.
The more you understand the fundamentals of R, the better you can evaluate and learn from GenAI.

What are your best practices for using GenAI tools?

Reshaping Data

Which direction?

Option 1. Turn columns into rows (wide-to-long)

Option 2. Turn rows into columns (long-to-wide, aka pivot table or cross-tab query)

`pivot_longer()`

# Stack the three year columns into Year / Cases pairs (requires tidyr)
cases_long_tbl <- cases_tbl |>
  pivot_longer(
    cols = c(`2011`, `2012`, `2013`),
    names_to = "Year",
    values_to = "Cases"
  )

`pivot_wider()`

# Spread the Year values back out into one column per year (requires tidyr)
case_wide_tbl <- cases_long_tbl |>
  pivot_wider(
    id_cols = Country,
    names_from = Year,
    values_from = Cases
  )
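
A self-contained round trip of both pivots, as a sketch. The countries and case counts below are made up, and the object names with a `demo_` prefix are hypothetical.

```r
library(dplyr)
library(tidyr)

demo_cases_tbl <- tibble(
  Country = c("FR", "DE"),
  `2011`  = c(7000, 5800),
  `2012`  = c(6900, 6000),
  `2013`  = c(7000, 6200)
)

# Wide to long: one row per Country-Year combination
demo_long_tbl <- demo_cases_tbl |>
  pivot_longer(
    cols = c(`2011`, `2012`, `2013`),
    names_to = "Year",
    values_to = "Cases"
  )

# Long to wide: back to one row per Country
demo_wide_tbl <- demo_long_tbl |>
  pivot_wider(
    id_cols = Country,
    names_from = Year,
    values_from = Cases
  )
```

Note that `Year` comes out of `pivot_longer()` as a character column, because it was built from column names.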

More Complex Pivots

For more info and examples, see the Pivoting Vignette in the tidyr package.
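
One example of a more complex pivot, sketched on hypothetical data: when each column name encodes two things (here a quiz number and a measurement), `names_sep` splits the name, and the special `".value"` entry turns the second piece into separate value columns.

```r
library(dplyr)
library(tidyr)

quiz_wide_tbl <- tibble(
  student    = c("Ana", "Ben"),
  q1_score   = c(8, 9),
  q1_minutes = c(12, 15),
  q2_score   = c(7, 10),
  q2_minutes = c(10, 14)
)

quiz_long_tbl <- quiz_wide_tbl |>
  pivot_longer(
    cols = -student,
    names_to = c("quiz", ".value"),   # first piece -> quiz column, second piece -> column names
    names_sep = "_"
  )
```

The result has one row per student and quiz, with `score` and `minutes` as separate columns.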

Dealing with Missing Data

Question 1: Do I have missing values?

How are missing values encoded?

-99
""
NA
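
If sentinel codes such as `-99` or `""` slipped through the import, one way to convert them to `NA` is `dplyr::na_if()`. This function is not listed on the slides, so treat this as a suggested sketch on made-up data.

```r
library(dplyr)

obs_tbl <- tibble(
  temp = c(21.5, -99, 19),
  note = c("ok", "", NA)
)

clean_tbl <- obs_tbl |>
  mutate(
    temp = na_if(temp, -99),   # numeric sentinel -> NA
    note = na_if(note, "")     # empty string -> NA
  )
```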

Question 2: How many missing values do I have?

summary()

is.na(), !is.na()
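
A sketch of counting missing values, for one column and for every column at once (the table is hypothetical):

```r
library(dplyr)

obs_tbl <- tibble(
  temp = c(21.5, NA, 19),
  rain = c(NA, NA, 3)
)

summary(obs_tbl$temp)                  # the printout includes an NA's count
n_missing_temp <- sum(is.na(obs_tbl$temp))

# Missing values per column
na_counts_tbl <- obs_tbl |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Rows with no missing value in either column
complete_tbl <- obs_tbl |>
  filter(!is.na(temp), !is.na(rain))
```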

Question 3: How do you want to handle missing data?

Option 1: Keep the observations but ignore missing values in column summaries

mean(x, na.rm = TRUE)

Option 2: Throw away the entire observation

tidyr::drop_na()
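
Options 1 and 2 compared on a made-up quiz table:

```r
library(tidyr)

quiz_tbl <- tibble(
  student = c("Ana", "Ben", "Cy"),
  quiz1   = c(8, NA, 9),
  quiz2   = c(7, 10, NA)
)

# Option 1: keep every row, skip NAs when summarizing
quiz1_mean <- mean(quiz_tbl$quiz1, na.rm = TRUE)

# Option 2: drop rows with an NA in any column ...
complete_tbl <- quiz_tbl |> drop_na()

# ... or only rows with an NA in the columns you name
has_quiz1_tbl <- quiz_tbl |> drop_na(quiz1)
```

Only Ana survives the unrestricted `drop_na()`, which shows how quickly rows disappear when several columns have gaps.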

Option 3: Fill up / down

tidyr::fill()
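
A sketch of filling down, which suits tables where a label appears only on the first row of each block (the data are hypothetical):

```r
library(tidyr)

readings_tbl <- tibble(
  site    = c("A", NA, NA, "B", NA),
  reading = 1:5
)

filled_tbl <- readings_tbl |>
  fill(site, .direction = "down")   # carry the last non-missing value downward
```

Use `.direction = "up"` to fill in the other direction.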

Option 4: Replace missing values with a constant

common replacement choices: 0, mean, median, etc.

mutate( new_col = if_else(…) )

tidyr::replace_na()
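
A sketch of both techniques on made-up scores. Whether 0 or the mean is the right replacement depends on what a missing value means in your data.

```r
library(dplyr)
library(tidyr)

quiz_tbl <- tibble(
  student = c("Ana", "Ben", "Cy"),
  quiz1   = c(8, NA, 10)
)

# Replace NA with a fixed constant
zero_tbl <- quiz_tbl |>
  mutate(quiz1_filled = replace_na(quiz1, 0))

# Replace NA with the column mean, computed from the non-missing values
mean_tbl <- quiz_tbl |>
  mutate(quiz1_filled = if_else(is.na(quiz1), mean(quiz1, na.rm = TRUE), quiz1))
```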

Option 5: Replace missing values with another column

mutate( new_col = if_else(…) )

mutate( new_col = coalesce(…) )
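
Both forms sketched on hypothetical contact columns. `coalesce()` returns the first non-missing value across its arguments, so it is the shorter choice when there are several fallback columns.

```r
library(dplyr)

contact_tbl <- tibble(
  work_email = c("a@x.org", NA, NA),
  home_email = c("a@home.net", "b@home.net", NA)
)

email_tbl <- contact_tbl |>
  mutate(
    email_ifelse   = if_else(is.na(work_email), home_email, work_email),
    email_coalesce = coalesce(work_email, home_email)
  )
```

If every source column is missing, the result is still `NA`, as in the third row.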

Option 6: Impute from values in other rows

repeat last value
linear interpolation
spline interpolation
etc.

See the imputeTS or zoo packages.
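
A sketch of the three approaches using zoo on a made-up numeric series. It assumes the values are evenly spaced and in order.

```r
library(zoo)

x <- c(2, NA, NA, 8, NA, 11)

locf_x   <- na.locf(x)     # repeat last value: 2, 2, 2, 8, 8, 11
linear_x <- na.approx(x)   # linear interpolation: 2, 4, 6, 8, 9.5, 11
spline_x <- na.spline(x)   # cubic spline interpolation through the known points
```

By default `na.locf()` and `na.approx()` drop leading NAs they cannot fill, so check the length of the result before assigning it back to a column.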

Exercise 3

Warm up:

resample a character column using values from a related table
divide data into training and validation

New stuff:

Reshape the data (wide to long)
Visualize and summarize all the quizzes
Compute student averages
Work with NAs

This exercise will be done in a Quarto Notebook!

https://posit.cloud/content/11191104

Next-Up

Mutating within Groups

Splitting Tables

Splitting Variable Width Columns

Working with Dates

Exercise 4

Data wrangling operations:

mutate within groups
split a table to anonymize the data
split variable width columns
pivot from long-to-wide
add the date each quiz was given
summarize the data using dates and date parts