Data Wrangling Exercise 3

Resample columns, Pivot Wide-to-Long, Manage Missing Values

In this exercise, we’ll continue to work with the pop-quiz data using the following techniques:

resampling character values using a related table
pivoting from wide-to-long
global summaries and filters
three ways to manage missing values

Setup

Load the packages we’ll be using:

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(conflicted)
conflicts_prefer(
  dplyr::select,
  dplyr::filter,
  dplyr::arrange
)

We’ll continue to work with the pop-quiz data we saw in the first exercise:

ss_fn <- here::here("./exercises/data/student_scores.tsv")
ss_nn_tbl <- readr::read_tsv(file = ss_fn, show_col_types = FALSE) |> 
  rename_all(str_to_lower) |>                       
  rename_all(~ str_replace_all(., " ", "_"))
head(ss_nn_tbl)

Resample the `year` Column

Let’s look at the values of the year column.

ss_nn_tbl |> 
  group_by(year) |> 
  summarize(num_rows = n())

Suppose we want to replace these codes with their corresponding labels (i.e., ‘sophomore’, ‘junior’, ‘senior’). An easy way to add labels for codes or abbreviations is:

Import (or create) a data frame containing the codes and their labels
Join the data frames together based on the matching codes

Step 1: Create a tibble with the codes and their labels:

year_codes_tbl <- tribble(
  ~year, ~year_lbl,
  "FRS", "Freshman",
  "SPH", "Sophomore",
  "JNR", "Junior",
  "SNR", "Senior"
)
year_codes_tbl

Step 2: Join the labels table to the data:

ss_year_lbl_tbl <- ss_nn_tbl |> 
  left_join(year_codes_tbl, by = "year") |> 
  select(name, year, year_lbl)

ss_year_lbl_tbl |> slice(1:20)

Randomly Divide the Data into Training and Validation Sets

It’s very common in research to divide data into ‘training’ and ‘validation’ sets. The ‘training’ data is use to train a regression model, after which the validation set is used to assess the accuracy of the model.

Here, we create a column called ‘set’, and populate it such that ~80% of the rows are tagged for training, and 20% for validation.

ss_trainvalid_tbl <- ss_nn_tbl |> 
  mutate(set = sample(c("training", "validation"), 
                      size = nrow(ss_nn_tbl),
                      replace = TRUE, 
                      prob = c(0.8, 0.2))) |> 
  relocate(set, .after = name) 

head(ss_trainvalid_tbl)

## See how close we got to 80/20
table(ss_trainvalid_tbl$set)


  training validation 
        76         24

CHALLENGE

Modify the ‘set’ column such that all the students in Discussion Sections 1-3 are used for model fitting, and Discussion Section 4 is used as the validation dataset.

## Modify the 'set' column such that all the students in Discussion Sections 1-3
## are used for model fitting, and Discussion Section 4 is used as the
## validation dataset.

ss_nn_tbl |> 
  mutate(set = if_else(section == "Section 4",
                       "validation",
                       "training")) |> 
  select(name, section, set)

Reshape the Data

In Part I, we only worked with quiz_01. But there are 17 other quizzes!

Wouldn’t it be nice if we could:

compare the distribution of scores across all quizzes
compute for each student the average of all quiz scores (i.e., for their final grade)

All of the above would be a lot easier to do if we make these data tidy!

In this case, to make the data tidy we have to transform it from its current wide format to a long format. We can do with this with tidyr::pivot_longer() with the following arguments:

cols - a tidyselect expression identifying the columns we want turn into new rows
names_to - the name of a new column where the old columns names should go (because those columns are going to disappear)
values_to - the name of a new column for the values of old columns

ss_long_tbl <- ss_nn_tbl |> 
  pivot_longer(
    cols = starts_with("quiz"),
    names_to = "quiz",
    values_to = "score"
  )

See what we got:

dim(ss_long_tbl)

[1] 1800    9

slice(ss_long_tbl, 1:20)

Visualize and Summarise all of the Quizes

Now that the data are tidy, we can make a box-and-whiskers plots for all of the quizzes:

ggplot(ss_long_tbl, aes(y = score, x = quiz)) +
  geom_boxplot() +
  labs(title = "Quiz Scores")

Warning: Removed 259 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Something’s off with Quiz 7! Let’s remove that one from our dataset.

ss_cln_tbl <- ss_long_tbl |> 
  filter(quiz != "quiz_07")
dim(ss_cln_tbl)

[1] 1700    9

Now view the box plots again:

ggplot(ss_cln_tbl, aes(y = score, x = quiz)) +
  geom_boxplot() +
  labs(title = "Quiz Scores")

Warning: Removed 247 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Much better!

Compute Average Scores

Next, let’s compute the average score per student

ss_cln_tbl |> 
  group_by(name) |> 
  summarize(avg_score = mean(score, na.rm = TRUE))

CHALLENGE

Compute the overall average (all quizzes, all students) for each treatment group.

## Compute the overall average (all quizzes, all students) for each treatment group

ss_cln_tbl |> 
  group_by(treatment) |> 
  summarize(avg_score = mean(score, na.rm = TRUE))

Work with Missing Values

Let’s begin by summarizing the number and proportion of missing values per student:

ss_stdnt_na_tbl <- ss_cln_tbl |> 
  group_by(name) |> 
  summarize(num_missing = sum(is.na(score)),
            prp_missing = num_missing / n()) 

head(ss_stdnt_na_tbl)

It will be easier to see the distribution as a histogram:

ss_stdnt_na_tbl |> 
  ggplot(aes(x = prp_missing)) +
  geom_histogram() +
  labs(title = "Proportion of Missing Quiz Scores per Student")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

CHALLENGE

How many students completed all the quizzes?

## How many students completed all the quizzes?

ss_cln_tbl |> 
  group_by(name) |> 
  summarize(num_missing = sum(is.na(score))) |> 
  filter(num_missing == 0)

Which quiz had the most missing scores?

## Which quiz had the most missing scores?
  
ss_cln_tbl |> 
  group_by(quiz) |> 
  summarize(num_missing = sum(is.na(score))) |> 
  arrange(desc(num_missing))

Drop NAs

One option is simply to drop any row where the score is NA. We can do this with tidyr::drop_na().

ss_nomissing_tbl <- ss_cln_tbl |> drop_na(score)

## Compare the number of rows before and after dropping the NAs
nrow(ss_cln_tbl); nrow(ss_nomissing_tbl)

[1] 1700

[1] 1453

Replace NAs with 0

We can easily replace NAs with a constant like 0 using tidyr::replace_na().

ss_cln_tbl |> replace_na(list(score = 0)) |> slice(1:20)

CHALLENGE

How does replacing NA’s with 0’s affect the overall class average?

## How does replacing NA's with 0's affect the overall class average?

ss_cln_tbl |> pull(score) |> mean(na.rm = TRUE)

[1] 84.15692

ss_cln_tbl |> replace_na(list(score = 0)) |> pull(score) |> mean(na.rm = TRUE)

[1] 71.92941

Replace NA’s with Students’ Average Score

Another option might be to replace the NAs with the average of all the other student’s quiz scores. We can do this by:

computing students’ average scores in a separate table
joining the average score table to the data
replacing NAs with the average score

Step 1: Construct a table of each student’s overall average scores on the quizzes they took.

ss_std_avg_tbl <- ss_cln_tbl |> 
  group_by(name) |> 
  summarize(overall_avg = mean(score, na.rm = TRUE)) 

head(ss_std_avg_tbl)

Step 2: Next, join the overall average table to the data:

ss_scores_plus_avg_tbl <- ss_cln_tbl |> 
  left_join(ss_std_avg_tbl, by = "name") |> 
  select(name, quiz, score, overall_avg)

ss_scores_plus_avg_tbl |> slice(1:20)

Step 3. Replace the NAs with the students’ average using mutate():

ss_na2avg_tbl <- ss_scores_plus_avg_tbl |> 
  mutate(score = if_else(is.na(score), overall_avg, score))

ss_na2avg_tbl |> slice(1:40)

How does this change the overall average?

ss_cln_tbl |> pull(score) |> mean(na.rm = TRUE)

[1] 84.15692

ss_na2avg_tbl |> pull(score) |> mean(na.rm = TRUE)

[1] 82.93143

CHALLENGE (Homework)

Drop students who missed more than 40% of the quizzes.

## Drop students who missed more than 40% of the quizzes.

## Step 1: Compute the proportion of quizzes missed in a separate table:

ss_prp_na_tbl <- ss_cln_tbl |> 
  group_by(name) |> 
  summarize(prp_na = sum(is.na(score)) / n()) 

## Step 2: Join the proportion of missed quizzes to the data, and use
## as a filter

ss_no_big_na_tbl <- ss_cln_tbl |> 
  left_join(ss_prp_na_tbl, by = "name") |> 
  filter(prp_na <= 0.4)

## How many rows were dropped?
dim(ss_cln_tbl); dim(ss_no_big_na_tbl)

[1] 1700    9

[1] 1547   10

What do these data say about the theory that coming to class supports learning?

## Your answer here

DONE!

Remember to render your Quarto document so you have a pretty HTML file to keep for future reference.

Setup

Resample the year Column

Randomly Divide the Data into Training and Validation Sets

CHALLENGE

Reshape the Data

Visualize and Summarise all of the Quizes

Compute Average Scores

CHALLENGE

Work with Missing Values

CHALLENGE

Drop NAs

Replace NAs with 0

CHALLENGE

Replace NA’s with Students’ Average Score

CHALLENGE (Homework)

DONE!

Resample the `year` Column