In this notebook, we’ll practice working with data frames.
R comes with several sample data frames, for example mtcars.
mtcars
Wondering where the mtcars data came from? Just like functions, sample datasets usually have their own help pages. Run ?mtcars to see where this one came from.
CSV (comma-separated values) is a common format for tabular data. You can import a csv file using base R with read.csv().
Import ca_breweries.csv from the data directory:
csv_fn <- "./data/ca_breweries.csv"
file.exists(csv_fn)
[1] TRUE
breweries_df <- read.csv(csv_fn)
head(breweries_df)
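If ca_breweries.csv is not at hand, the same import steps can be tried on a throwaway file. The sketch below is only illustrative: the toy data frame and its column names (name, city, year_opened) are made up for this example and do not come from the breweries data.

```r
# Hypothetical toy data, not the actual breweries dataset
toy_df <- data.frame(
  name = c("Alpha", "Beta", "Gamma"),
  city = c("Davis", "Chico", "Napa"),
  year_opened = c(1999, 2005, 2012)
)

# Write the toy data to a temporary csv file
tmp_fn <- tempfile(fileext = ".csv")
write.csv(toy_df, tmp_fn, row.names = FALSE)
file.exists(tmp_fn)

# Read it back in, just like the breweries file
toy_in_df <- read.csv(tmp_fn)
head(toy_in_df)
```

Setting row.names = FALSE keeps write.csv() from writing the row names as an extra column, which read.csv() would otherwise import as a column of its own.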
You can view the number of rows and columns of a data frame with nrow() and ncol():
nrow(breweries_df)
[1] 311
ncol(mtcars)
[1] 11
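If you want both numbers at once, base R also has dim(). A short sketch using mtcars:

```r
# dim() returns the number of rows and columns together
dim(mtcars)

# The two elements match nrow() and ncol()
dim(mtcars)[1] == nrow(mtcars)
dim(mtcars)[2] == ncol(mtcars)
```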
You can view the names of the columns in a data frame with names():
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
The tibble package has a nice function called glimpse() that will show you the names, column types, and first few values for each column in a concise format:
tibble::glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30~
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 275.8, 472.0, 460.0~
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 150, 24~
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.00, 3.23, 4.08, 4.~
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3.780, 5.250, 5.424~
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 18.00, 17.98, 17.82~
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2
You can grab a single column using the $ operator.
Extract the values in the mpg column:
mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3
[25] 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
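Double square brackets are another way to pull out a single column, and can be handy when the column name is stored in a variable. A small sketch (the variable names mpg_vals, col_name, and hp_vals are just for illustration):

```r
# Double square brackets with a column name return the same vector as $
mpg_vals <- mtcars[["mpg"]]
identical(mpg_vals, mtcars$mpg)

# The column name can come from a variable, which $ does not allow
col_name <- "hp"
hp_vals <- mtcars[[col_name]]
head(hp_vals)
```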
Compute the average mpg of vehicles in mtcars.
# Your answer here
Answer the following questions about the quakes data frame, which has data about some earthquakes (see the range() function):
# Your answer here
You can filter (subset) rows and columns using square bracket notation. Example:
my_df[rows-expression, cols-expression]
To view the first 5 rows of quakes, we pass a vector of integers as the rows expression:
quakes[1:5, ]
NOTE: You can omit the rows-expression or cols-expression, but you still need a comma inside the square brackets.
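To see why the comma matters, compare the following expressions on mtcars. This is a sketch for illustration; the names first_two_cols and no_comma are made up here.

```r
# Rows only: first 3 rows, all columns
mtcars[1:3, ]

# Columns only: all rows, first 2 columns
first_two_cols <- mtcars[, 1:2]
dim(first_two_cols)

# Without a comma, a single index is treated as a column selection
no_comma <- mtcars[1:2]
dim(no_comma)
```

Leaving out the comma does not produce an error, so it is easy to get columns back when you meant to ask for rows.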
View every 5th row in quakes:
quakes[ c(5, 10, 15, 20, 25), ]
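The expression above lists only the first few multiples of 5 by hand. One way to cover the whole data frame is to generate the row numbers with seq(). A sketch, with every_5th_idx and quakes_every_5th as illustrative names:

```r
# Row numbers 5, 10, 15, ... up to the last multiple of 5
every_5th_idx <- seq(from = 5, to = nrow(quakes), by = 5)
head(every_5th_idx)

# Use the generated row numbers as the rows expression
quakes_every_5th <- quakes[every_5th_idx, ]
nrow(quakes_every_5th)
```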
To return rows that meet a certain condition, rows can be an expression that returns TRUE/FALSE values:
## Quakes whose magnitude was >= 5.9
quakes[ quakes$mag >= 5.9, ]
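It can help to look at the condition on its own before using it to filter. The sketch below uses mtcars rather than quakes, and the threshold of 200 hp is an arbitrary choice for illustration:

```r
# The condition by itself is a logical vector, one value per row
is_powerful <- mtcars$hp > 200
head(is_powerful)

# Using it as the rows expression keeps only the TRUE rows
powerful_df <- mtcars[is_powerful, ]
nrow(powerful_df)

# Conditions can be combined with & (and) or | (or)
mtcars[mtcars$hp > 200 & mtcars$am == 1, ]
```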
How many earthquakes were detected by 100 or more stations?
# Your answer here
What was the largest earthquake on record?
# Your answer here
You can also use the rows expression to sort the rows. The key to this is using order(), which returns the indices that would put the elements of a vector in sorted order:
x <- c(50, 20, 70, 40, 90)
x
[1] 50 20 70 40 90
order(x)
[1] 2 4 1 3 5
To sort rows in a data frame, we simply pass a vector of integers in the desired order:
quakes[ order(quakes$mag), ]
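By default order() sorts from smallest to largest. To sort from largest to smallest, you can add decreasing = TRUE. A short sketch using the same x vector as above:

```r
x <- c(50, 20, 70, 40, 90)

# Indices from the largest value to the smallest
order(x, decreasing = TRUE)
# [1] 5 3 1 4 2

# Indexing x by order(x) gives the same result as sort(x)
x[order(x)]

# The same idea applied to a data frame: highest mpg first
head(mtcars[order(mtcars$mpg, decreasing = TRUE), ])
```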
The cols-expression can be a vector of integers (corresponding to the column numbers you want returned), or a character vector containing column names. You can also use the cols-expression to reorder the columns.
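For example, both expressions below should return the wt and mpg columns of mtcars, in that order, for the first 3 rows. This is a sketch; by_name and by_position are illustrative names.

```r
# Columns by name, in the order listed
by_name <- mtcars[1:3, c("wt", "mpg")]
by_name

# The same columns by position (wt is column 6, mpg is column 1)
by_position <- mtcars[1:3, c(6, 1)]
by_position
```

Selecting by name is usually easier to read and keeps working if the column positions change.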
Write an expression that will return the longitude and latitude columns only (in that order) for the biggest 10 earthquakes (by magnitude).
mag_topten_idx <- order(quakes$mag, decreasing = TRUE)[1:10]
quakes[mag_topten_idx, c("long", "lat")]
Using the mtcars data frame, compute the average mpg for 4, 6, and 8 cylinder vehicles.
# Your answer here