This notebook will demonstrate two ways of dealing with larger volumes of data:
A functional definition of “large data” might include:
To manage large downloads and avoid downloading data twice, caladaptR provides ca_getvals_db()
. ca_getvals_db()
is very similar to ca_getvals_tbl()
, but saves the data into a SQLite database file on your computer as it comes in. Before making an API request, it checks to see if the data have already been downloaded, and if so skips it. If you get disconnected during a download, you can run the command again and it’ll just pick up where it left off. ca_getvals_db()
returns a ‘remote tibble’, which functions very similar to a regular ‘in-memory’ tibble however it points to the SQLite database file on disk.
For additional info on working with SQLite databases, see the Article on making Large Queries.
In this example we’ll download daily temperature data for 16 Congressional Districts in the LA region.
Load caladaptR and the other package we’re going to need. (If you haven’t installed these yet, see this setup script).
library(caladaptr)
library(units)
library(ggplot2)
library(dplyr)
library(lubridate)
library(sf)
library(tidyr)
library(tmap)
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
The LA Congressional district boundaries are saved in the ‘data’ folder as a geopackage:
cdist_la_fn <- "./data/cdistricts_la.gpkg"
file.exists(cdist_la_fn)
[1] TRUE
cdist_la_sf <- st_read(cdist_la_fn)
Reading layer `cdistricts_la' from data source
`D:\GitHub\cal-adapt\caladaptr-res\docs\workshops\ca_intro_feb22\notebooks\data\cdistricts_la.gpkg'
using driver `GPKG'
Simple feature collection with 15 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -118.9449 ymin: 33.70403 xmax: -117.3529 ymax: 34.8233
Geodetic CRS: WGS 84
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(cdist_la_sf) + tm_borders()
Pro Tip:
Here we use ca_loc_sf()
as the location function for an API request for 30-years of modeled daily temperature data (minimum and maximum) for 4 GCMs and 2 RCPs:
cdist_la_cap <- ca_loc_sf(loc = cdist_la_sf, idfld = "geoid") %>%
ca_gcm(gcms[1:4]) %>%
ca_scenario(c("rcp45", "rcp85")) %>%
ca_period("day") %>%
ca_years(start = 2070, end = 2099) %>%
ca_cvar(c("tasmin", "tasmax")) %>%
ca_options(spatial_ag = "mean")
cdist_la_cap
Cal-Adapt API Request
Location(s):
Simple Feature MULTIPOLYGON (15 feature(s))
ID field: geoid
Variable(s): tasmin, tasmax
Temporal aggregration period(s): day
GCM(s): HadGEM2-ES, CNRM-CM5, CanESM2, MIROC5
Scenario(s): rcp45, rcp85
Dates: 2070-01-01 to 2099-12-31
Options:
spatial ag: mean
Do the standard plotting and preflight checks:
plot(cdist_la_cap, locagrid = TRUE)
cdist_la_cap %>% ca_preflight()
General issues
- none found
Issues for querying values
- none found
Issues for downloading rasters
- none found
To copy downloaded data into a database, use ca_getvals_db()
instead of ca_getvals_tbl()
. ca_getvals_db()
has two arguments which are mandatory:
db_fn
- the file name of a SQLite database (will be created if it doesn’t exist)
db_tbl
- the name of a table inside the database where the values should be saved
my_database_fn <- "./data/cdist_la_temp_data.sqlite"
cdist_la_rtbl <- cdist_la_cap %>%
ca_getvals_db(db_fn = my_database_fn,
db_tbl = "temp_data",
quiet = FALSE)
|
| | 0%
|
|= | 1%
|
|== | 2%
|
|=== | 3%
|
|==== | 4%
|
|===== | 5%
|
|====== | 5%
|
|====== | 6%
|
|======= | 7%
|
|======== | 8%
|
|========= | 8%
|
|========= | 9%
|
|========== | 10%
|
|=========== | 10%
|
|=========== | 11%
|
|============ | 11%
|
|============ | 12%
|
|============= | 13%
|
|============== | 13%
|
|============== | 14%
|
|=============== | 14%
|
|=============== | 15%
|
|================ | 15%
|
|================ | 16%
|
|================= | 16%
|
|================= | 17%
|
|================== | 18%
|
|=================== | 18%
|
|=================== | 19%
|
|==================== | 19%
|
|==================== | 20%
|
|===================== | 21%
|
|====================== | 21%
|
|====================== | 22%
|
|======================= | 22%
|
|======================= | 23%
|
|======================== | 23%
|
|======================== | 24%
|
|========================= | 24%
|
|========================= | 25%
|
|========================== | 25%
|
|========================== | 26%
|
|=========================== | 26%
|
|=========================== | 27%
|
|============================ | 27%
|
|============================ | 28%
|
|============================= | 28%
|
|============================= | 29%
|
|============================== | 29%
|
|============================== | 30%
|
|=============================== | 30%
|
|=============================== | 31%
|
|================================ | 31%
|
|================================ | 32%
|
|================================= | 32%
|
|================================= | 33%
|
|================================== | 33%
|
|=================================== | 34%
|
|=================================== | 35%
|
|==================================== | 35%
|
|==================================== | 36%
|
|===================================== | 36%
|
|====================================== | 37%
|
|====================================== | 38%
|
|======================================= | 38%
|
|======================================== | 39%
|
|========================================= | 40%
|
|========================================= | 41%
|
|========================================== | 41%
|
|=========================================== | 42%
|
|============================================ | 43%
|
|============================================ | 44%
|
|============================================= | 44%
|
|============================================== | 45%
|
|=============================================== | 46%
|
|================================================ | 47%
|
|================================================= | 48%
|
|================================================== | 49%
|
|=================================================== | 50%
|
|==================================================== | 51%
|
|===================================================== | 52%
|
|====================================================== | 53%
|
|======================================================= | 54%
|
|======================================================== | 55%
|
|========================================================= | 56%
|
|========================================================== | 56%
|
|========================================================== | 57%
|
|=========================================================== | 58%
|
|============================================================ | 59%
|
|============================================================= | 59%
|
|============================================================= | 60%
|
|============================================================== | 61%
|
|=============================================================== | 62%
|
|================================================================ | 62%
|
|================================================================ | 63%
|
|================================================================= | 64%
|
|================================================================== | 64%
|
|================================================================== | 65%
|
|=================================================================== | 65%
|
|=================================================================== | 66%
|
|==================================================================== | 67%
|
|===================================================================== | 67%
|
|===================================================================== | 68%
|
|====================================================================== | 68%
|
|====================================================================== | 69%
|
|======================================================================= | 69%
|
|======================================================================= | 70%
|
|======================================================================== | 70%
|
|======================================================================== | 71%
|
|========================================================================= | 71%
|
|========================================================================= | 72%
|
|========================================================================== | 72%
|
|========================================================================== | 73%
|
|=========================================================================== | 73%
|
|=========================================================================== | 74%
|
|============================================================================ | 74%
|
|============================================================================ | 75%
|
|============================================================================= | 75%
|
|============================================================================= | 76%
|
|============================================================================== | 76%
|
|============================================================================== | 77%
|
|=============================================================================== | 77%
|
|=============================================================================== | 78%
|
|================================================================================ | 78%
|
|================================================================================ | 79%
|
|================================================================================= | 79%
|
|================================================================================== | 80%
|
|================================================================================== | 81%
|
|=================================================================================== | 81%
|
|=================================================================================== | 82%
|
|==================================================================================== | 82%
|
|===================================================================================== | 83%
|
|===================================================================================== | 84%
|
|====================================================================================== | 84%
|
|====================================================================================== | 85%
|
|======================================================================================= | 85%
|
|======================================================================================= | 86%
|
|======================================================================================== | 86%
|
|======================================================================================== | 87%
|
|========================================================================================= | 87%
|
|========================================================================================== | 88%
|
|========================================================================================== | 89%
|
|=========================================================================================== | 89%
|
|=========================================================================================== | 90%
|
|============================================================================================ | 90%
|
|============================================================================================= | 91%
|
|============================================================================================= | 92%
|
|============================================================================================== | 92%
|
|=============================================================================================== | 93%
|
|================================================================================================ | 94%
|
|================================================================================================ | 95%
|
|================================================================================================= | 95%
|
|================================================================================================== | 96%
|
|=================================================================================================== | 97%
|
|==================================================================================================== | 98%
|
|===================================================================================================== | 99%
|
|======================================================================================================| 100%
Inspect the results:
cdist_la_rtbl %>% head()
The number of rows we retrieved:
cdist_la_rtbl %>% count() %>% pull(n)
[1] 2629680
Many of the base R operations that work with in-memory tibbles may or many not work with remote tibbles. For example as we saw above cdist_la_rtbl %>% count()
works, but:
dim(cdist_la_rtbl)
[1] NA 8
nrow(cdist_la_rtbl)
[1] NA
In general, the best way to work with remote tibbles is with
DBI
packageSimple filtering, sorting, grouping and simple numeric summaries generally work fine with dplyr verbs:
cdist_la_rtbl %>%
filter(geoid == "0632") %>%
group_by(scenario, gcm, cvar) %>%
summarize(mean_temp = mean(val, na.rm = TRUE))
`summarise()` has grouped output by 'scenario', 'gcm'. You can override using the `.groups` argument.
Pro Tip:
collect()
at the end of a dplyr expression.
If your wrangling workflow involves a lot steps that are difficult or impossible to do with remote tibbles, a good strategy is to do your filtering and grouping with dplyr statements on the remote tibble, and then convert the results to a regular tibble with collect()
.
Below we convert the grouped summary table into a tibble so we can use pivot_wider()
(which doesn’t work on remote tibbles):
temp_long_tbl <- cdist_la_rtbl %>%
filter(geoid == "0632") %>%
group_by(scenario, gcm, cvar) %>%
summarize(mean_temp = mean(val, na.rm = TRUE)) %>%
collect()
`summarise()` has grouped output by 'scenario', 'gcm'. You can override using the `.groups` argument.
class(temp_long_tbl)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
temp_wide_tbl <- temp_long_tbl %>%
pivot_wider(names_from = cvar, values_from = mean_temp) %>%
mutate(tas_range = tasmax - tasmin)
temp_wide_tbl %>% head()
Create a histogram of the mean daily temperatures for one Congressional District and one emissions scenario, grouping the data by GCM. Does the distribution of mean average temperature look the same across GCMs?
## Your answer here
Another way to deal with large data needs is to download the data as raster or TIF files. The same API Request object can be used to get raster data if you feed it into ca_getrst_stars()
.
For additional info on downloading and analyzing rasters, see the 3 articles on Downloading Rasters.
Below we get a raster of observed historic temperature data for the Sierra climate region:
sierra_cap <- ca_loc_aoipreset(type = "climregions", idfld = "name", idval = "Sierra") %>%
ca_livneh(TRUE) %>%
ca_period("year") %>%
ca_cvar("pr") %>%
ca_years(start = 1970, end = 2010)
sierra_cap
Cal-Adapt API Request
Location(s):
AOI Preset: climregions
name(s): Sierra
Variable(s): pr
Temporal aggregration period(s): year
Livneh data: TRUE
Dates: 1970-01-01 to 2010-12-31
plot(sierra_cap, locagrid = TRUE)
sierra_cap %>% ca_preflight()
General issues
- none found
Issues for querying values
- A spatial aggregation function is required to query values from polygon areas. See `ca_options`.
Issues for downloading rasters
- none found
To fetch the data as TIFs, use :
tiff_dir <- "./data"
sierra_tiff_fn <- sierra_cap %>%
ca_getrst_stars(out_dir = tiff_dir, mask = TRUE, quiet = TRUE, overwrite = FALSE)
Pro Tip:
overwrite = FALSE
ca_getrst_stars()
works differently than retrieving tabular climate values. It returns a vector of TIF files that were downloaded. To work with them, you next have to load them back into R as stars objects (space-time arrays) using ca_stars_read()
:
sierra_stars_lst <- ca_stars_read(sierra_tiff_fn)
sierra_stars_lst[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
pr_year_livneh_name-Sierra 0.2009817 2.037323 2.821092 3.106411 3.863086 12.43626 123615
dimension(s):
from to offset delta refsys point values x/y
x 1 58 -121.688 0.0625 WGS 84 FALSE NULL [x]
y 1 71 40.125 -0.0625 WGS 84 FALSE NULL [y]
year 1 41 1970 1 NA NA NULL
To plot a stars objects, you have to decide which layer(s) to plot. In this case, each layer represents a year from 1970 to 2010. Below we plot 4 of the 40 years:
plot(sierra_stars_lst[[1]] %>% slice(index = seq(1,40,length.out =4), along = "year"),
axes = TRUE,
main = attributes(sierra_stars_lst[[1]])$ca_metadata$slug)
Not sure what the units are? You can double-check by viewing the metadata for the slug from the catalog:
ca_catalog_search("pr_year_livneh")
pr_year_livneh
name: Livneh yearly average precipitation historical
url: https://api.cal-adapt.org/api/series/pr_year_livneh/
tres: annual
begin: 1950-01-01T00:00:00Z
end: 2013-12-31T00:00:00Z
units: mm
num_rast: 64
id: 382
xmin: -124.5625
xmax: -113.375
ymin: 31.5625
ymax: 43.75
There is a lot more you can do with rasters, including pixel summaries, combining them into higher dimensional data cubes, spatially mosaicing them, etc. For more info, see the Rasters articles on the website.
Download historic precipitation data for the county where you live or work. [Answer]
## Example: Mendocino County
ca_aoipreset_geom("counties", quiet = TRUE) %>%
st_drop_geometry() %>%
filter(name == "Mendocino") %>%
select(name, state_name, fips)
name state_name fips
1 Mendocino California 06045
mendocino_cap <- ca_loc_aoipreset(type = "counties", idfld = "fips", idval = "06045") %>%
ca_livneh(TRUE) %>%
ca_period("year") %>%
ca_cvar("pr") %>%
ca_years(start = 1970, end = 2010)
plot(mendocino_cap)
mendocino_fn <- mendocino_cap %>%
ca_getrst_stars(out_dir = tempdir(), mask = TRUE, quiet = TRUE, overwrite = FALSE)
mendocino_stars_lst <- ca_stars_read(mendocino_fn)
mendocino_stars_lst[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
pr_year_livneh_fips-06045 0.9326811 2.792288 3.627976 3.892968 4.744806 11.229 7134
dimension(s):
from to offset delta refsys point values x/y
x 1 20 -124.062 0.0625 WGS 84 FALSE NULL [x]
y 1 21 40.0625 -0.0625 WGS 84 FALSE NULL [y]
year 1 41 1970 1 NA NA NULL
plot(mendocino_stars_lst[[1]] %>% slice(index = seq(1,40,length.out =4), along = "year"),
axes = TRUE,
main = attributes(mendocino_stars_lst[[1]])$ca_metadata$slug)