Data Wrangling Exercise 3

Resample columns, Pivot Wide-to-Long, Manage Missing Values

In this exercise, we’ll continue to work with the pop-quiz data using the following techniques:

Setup

Load the packages we’ll be using:

We’ll continue to work with the pop-quiz data we saw in the first exercise:

ss_fn <- here::here("./exercises/data/student_scores.tsv")
ss_nn_tbl <- readr::read_tsv(file = ss_fn, show_col_types = FALSE) |> 
  rename_all(str_to_lower) |>                       
  rename_all(~ str_replace_all(., " ", "_"))
head(ss_nn_tbl)

Resample the year Column

Let’s look at the values of the year column.

ss_nn_tbl |> 
  group_by(year) |> 
  summarize(num_rows = n())


Suppose we want to replace these codes with their corresponding labels (i.e., ‘sophomore’, ‘junior’, ‘senior’). An easy way to add labels for codes or abbreviations is:

  1. Import (or create) a data frame containing the codes and their labels
  2. Join the data frames together based on the matching codes

Step 1: Create a tibble with the codes and their labels:

year_codes_tbl <- tribble(
  ~year, ~year_lbl,
  "FRS", "Freshman",
  "SPH", "Sophomore",
  "JNR", "Junior",
  "SNR", "Senior"
)
year_codes_tbl

Step 2: Join the labels table to the data:

ss_year_lbl_tbl <- ss_nn_tbl |> 
  left_join(year_codes_tbl, by = "year") |> 
  select(name, year, year_lbl)

ss_year_lbl_tbl |> slice(1:20)


Randomly Divide the Data into Training and Validation Sets

It’s very common in research to divide data into ‘training’ and ‘validation’ sets. The ‘training’ data is use to train a regression model, after which the validation set is used to assess the accuracy of the model.

Here, we create a column called ‘set’, and populate it such that ~80% of the rows are tagged for training, and 20% for validation.

ss_trainvalid_tbl <- ss_nn_tbl |> 
  mutate(set = sample(c("training", "validation"), 
                      size = nrow(ss_nn_tbl),
                      replace = TRUE, 
                      prob = c(0.8, 0.2))) |> 
  relocate(set, .after = name) 

head(ss_trainvalid_tbl)
## See how close we got to 80/20
table(ss_trainvalid_tbl$set)

  training validation 
        82         18 


CHALLENGE

  1. Modify the ‘set’ column such that all the students in Discussion Sections 1-3 are used for model fitting, and Discussion Section 4 is used as the validation dataset.
## Your answer here


Reshape the Data

In Part I, we only worked with quiz_01. But there are 17 other quizzes!

Wouldn’t it be nice if we could:

  • compare the distribution of scores across all quizzes
  • compute for each student the average of all quiz scores (i.e., for their final grade)

All of the above would be a lot easier to do if we make these data tidy!

In this case, to make the data tidy we have to transform it from its current wide format to a long format. We can do with this with tidyr::pivot_longer() with the following arguments:

  • cols - a tidyselect expression identifying the columns we want turn into new rows
  • names_to - the name of a new column where the old columns names should go (because those columns are going to disappear)
  • values_to - the name of a new column for the values of old columns
ss_long_tbl <- ss_nn_tbl |> 
  pivot_longer(
    cols = starts_with("quiz"),
    names_to = "quiz",
    values_to = "score"
  )

See what we got:

dim(ss_long_tbl)
[1] 1800    9
slice(ss_long_tbl, 1:20)


Visualize and Summarise all of the Quizes

Now that the data are tidy, we can make a box-and-whiskers plots for all of the quizzes:

ggplot(ss_long_tbl, aes(y = score, x = quiz)) +
  geom_boxplot() +
  labs(title = "Quiz Scores")
Warning: Removed 259 rows containing non-finite outside the scale range
(`stat_boxplot()`).


Something’s off with Quiz 7! Let’s remove that one from our dataset.

ss_cln_tbl <- ss_long_tbl |> 
  filter(quiz != "quiz_07")
dim(ss_cln_tbl)
[1] 1700    9

Now view the box plots again:

ggplot(ss_cln_tbl, aes(y = score, x = quiz)) +
  geom_boxplot() +
  labs(title = "Quiz Scores")
Warning: Removed 247 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Much better!

Compute Average Scores

Next, let’s compute the average score per student

ss_cln_tbl |> 
  group_by(name) |> 
  summarize(avg_score = mean(score, na.rm = TRUE))


CHALLENGE

  1. Compute the overall average (all quizzes, all students) for each treatment group.
## Your answer here

Work with Missing Values

Let’s begin by summarizing the number and proportion of missing values per student:

ss_stdnt_na_tbl <- ss_cln_tbl |> 
  group_by(name) |> 
  summarize(num_missing = sum(is.na(score)),
            prp_missing = num_missing / n()) 

head(ss_stdnt_na_tbl)

It will be easier to see the distribution as a histogram:

ss_stdnt_na_tbl |> 
  ggplot(aes(x = prp_missing)) +
  geom_histogram() +
  labs(title = "Proportion of Missing Quiz Scores per Student")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

CHALLENGE

  1. How many students completed all the quizzes?
## Your answer here


  1. Which quiz had the most missing scores?
## Your answer here

Drop NAs

One option is simply to drop any row where the score is NA. We can do this with tidyr::drop_na().

ss_nomissing_tbl <- ss_cln_tbl |> drop_na(score)

## Compare the number of rows before and after dropping the NAs
nrow(ss_cln_tbl); nrow(ss_nomissing_tbl)
[1] 1700
[1] 1453


Replace NAs with 0

We can easily replace NAs with a constant like 0 using tidyr::replace_na().

ss_cln_tbl |> replace_na(list(score = 0)) |> slice(1:20)

CHALLENGE

  1. How does replacing NA’s with 0’s affect the overall class average?
## Your answer here

Replace NA’s with Students’ Average Score

Another option might be to replace the NAs with the average of all the other student’s quiz scores. We can do this by:

  • computing students’ average scores in a separate table
  • joining the average score table to the data
  • replacing NAs with the average score

Step 1: Construct a table of each student’s overall average scores on the quizzes they took.

ss_std_avg_tbl <- ss_cln_tbl |> 
  group_by(name) |> 
  summarize(overall_avg = mean(score, na.rm = TRUE)) 

head(ss_std_avg_tbl)


Step 2: Next, join the overall average table to the data:

ss_scores_plus_avg_tbl <- ss_cln_tbl |> 
  left_join(ss_std_avg_tbl, by = "name") |> 
  select(name, quiz, score, overall_avg)

ss_scores_plus_avg_tbl |> slice(1:20)


Step 3. Replace the NAs with the students’ average using mutate():

ss_na2avg_tbl <- ss_scores_plus_avg_tbl |> 
  mutate(score = if_else(is.na(score), overall_avg, score))

ss_na2avg_tbl |> slice(1:40)


How does this change the overall average?

ss_cln_tbl |> pull(score) |> mean(na.rm = TRUE)
[1] 84.15692
ss_na2avg_tbl |> pull(score) |> mean(na.rm = TRUE)
[1] 82.93143


CHALLENGE (Homework)

  1. Drop students who missed more than 40% of the quizzes.
## Your answer here


  1. What do these data say about the theory that coming to class supports learning?
## Your answer here


DONE!

Remember to render your Quarto document so you have a pretty HTML file to keep for future reference.