Overview

This notebook will demonstrate two ways of dealing with larger volumes of data:

A functional definition of “large data” might include:

Saving Cal-Adapt Data to SQLite Database

To manage large downloads and avoid downloading data twice, caladaptR provides ca_getvals_db(). ca_getvals_db() is very similar to ca_getvals_tbl(), but saves the data into a SQLite database file on your computer as it comes in. Before making an API request, it checks to see if the data have already been downloaded, and if so skips it. If you get disconnected during a download, you can run the command again and it’ll just pick up where it left off. ca_getvals_db() returns a ‘remote tibble’, which functions very similar to a regular ‘in-memory’ tibble however it points to the SQLite database file on disk.

For additional info on working with SQLite databases, see the Article on making Large Queries.

Example: Download Daily Data for Congressional Disticts

In this example we’ll download daily temperature data for 16 Congressional Districts in the LA region.


Setup

Load caladaptR and the other package we’re going to need. (If you haven’t installed these yet, see this setup script).

library(caladaptr)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
caladaptr (version 0.6.1)
URL: https://ucanr-igis.github.io/caladaptr
Bug reports: https://github.com/ucanr-igis/caladaptr/issues
library(units)
udunits database from C:/Users/Andy/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml
library(ggplot2)
library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(lubridate)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library(sf)
Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(tidyr)
library(tmap)
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)


Import the Congressional Districts Boundaries

The LA Congressional district boundaries are saved in the ‘data’ folder as a geopackage:

cdist_la_fn <- "./data/cdistricts_la.gpkg"
file.exists(cdist_la_fn)
[1] TRUE
cdist_la_sf <- st_read(cdist_la_fn)
Reading layer `cdistricts_la' from data source 
  `D:\GitHub\cal-adapt\caladaptr-res\docs\workshops\ca_intro_oct21\notebooks\data\cdistricts_la.gpkg' 
  using driver `GPKG'
Simple feature collection with 15 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -118.9449 ymin: 33.70403 xmax: -117.3529 ymax: 34.8233
Geodetic CRS:  WGS 84
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(cdist_la_sf) + tm_borders()

Pro Tip:

  • to use a sf object the location for an API Request, it must be in geographic coordinates (EPSG 4326)


Create the API Request

Here we use ca_loc_sf() as the location function for an API request for 30-years of modeled daily temperature data (minimum and maximum) for 4 GCMs and 2 RCPs:

cdist_la_cap <- ca_loc_sf(loc = cdist_la_sf, idfld = "geoid") %>% 
  ca_gcm(gcms[1:4]) %>%                                 
  ca_scenario(c("rcp45", "rcp85")) %>%
  ca_period("day") %>%
  ca_years(start = 2070, end = 2099) %>%
  ca_cvar(c("tasmin", "tasmax")) %>% 
  ca_options(spatial_ag = "mean")

cdist_la_cap
Cal-Adapt API Request
Location(s): 
  Simple Feature MULTIPOLYGON (15 feature(s))
  ID field: geoid
Variable(s): tasmin, tasmax
Temporal aggregration period(s): day
GCM(s): HadGEM2-ES, CNRM-CM5, CanESM2, MIROC5
Scenario(s): rcp45, rcp85
Dates: 2070-01-01 to 2099-12-31
Options:
  spatial ag: mean
 


Do the standard plotting and preflight checks:

plot(cdist_la_cap, locagrid = TRUE)
cdist_la_cap %>% ca_preflight()
General issues
 - none found
Issues for querying values
 - none found
Issues for downloading rasters
 - none found


Fetch Data

To copy downloaded data into a database, use ca_getvals_db() instead of ca_getvals_tbl(). ca_getvals_db() has two arguments which are mandatory:

db_fn - the file name of a SQLite database (will be created if it doesn’t exist)

db_tbl - the name of a table inside the database where the values should be saved

my_database_fn <- "./data/cdist_la_temp_data.sqlite"
  
cdist_la_rtbl <- cdist_la_cap %>% 
  ca_getvals_db(db_fn = my_database_fn, 
                db_tbl = "temp_data",
                quiet = TRUE)


Inspect the results:

cdist_la_rtbl %>% head()


The number of rows we retrieved:

cdist_la_rtbl %>% count() %>% pull(n)
[1] 2629680


Wrangling a Remote Tibble

Many of the base R operations that work with in-memory tibbles may or many not work with remote tibbles. For example as we saw above cdist_la_rtbl %>% count() works, but:

dim(cdist_la_rtbl)
[1] NA  8
nrow(cdist_la_rtbl)
[1] NA


In general, the best way to work with remote tibbles is with

  1. dplyr functions, or
  2. SQL statements passed using the DBI package

Simple filtering, sorting, grouping and simple numeric summaries generally work fine with dplyr verbs:

cdist_la_rtbl %>% 
  filter(geoid == "0632") %>% 
  group_by(scenario, gcm, cvar) %>% 
  summarize(mean_temp = mean(val, na.rm = TRUE))
`summarise()` has grouped output by 'scenario', 'gcm'. You can override using the `.groups` argument.


Pro Tip:

  • You can ‘convert’ a Remote Tibble to a regular in-memory Tibble by tacking on collect() at the end of a dplyr expression.


If your wrangling workflow involves a lot steps that are difficult or impossible to do with remote tibbles, a good strategy is to do your filtering and grouping with dplyr statements on the remote tibble, and then convert the results to a regular tibble with collect().

Below we convert the grouped summary table into a tibble so we can use pivot_wider() (which doesn’t work on remote tibbles):

temp_long_tbl <- cdist_la_rtbl %>% 
  filter(geoid == "0632") %>% 
  group_by(scenario, gcm, cvar) %>% 
  summarize(mean_temp = mean(val, na.rm = TRUE)) %>% 
  collect()
`summarise()` has grouped output by 'scenario', 'gcm'. You can override using the `.groups` argument.
class(temp_long_tbl)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
temp_wide_tbl <- temp_long_tbl %>% 
  pivot_wider(names_from = cvar, values_from = mean_temp) %>% 
  mutate(tas_range = tasmax - tasmin)

temp_wide_tbl %>% head()


Your Turn

Create a histogram of the mean daily temperatures for one Congressional District and one emissions scenario, grouping the data by GCM. Does the distribution of mean average temperature look the same across GCMs?

## Your answer here


Downloading Rasters

Another way to deal with large data needs is to download the data as raster or TIF files. The same API Request object can be used to get raster data if you feed it into ca_getrst_stars().

For additional info on downloading and analyzing rasters, see the 3 articles on Downloading Rasters.

Below we get a raster of observed historic temperature data for the Sierra climate region:

sierra_cap <- ca_loc_aoipreset(type = "climregions", idfld = "name", idval = "Sierra") %>% 
  ca_livneh(TRUE) %>% 
  ca_period("year") %>% 
  ca_cvar("pr") %>% 
  ca_years(start = 1970, end = 2010)

sierra_cap
Cal-Adapt API Request
Location(s): 
  AOI Preset: climregions
  name(s): Sierra
Variable(s): pr
Temporal aggregration period(s): year
Livneh data: TRUE
Dates: 1970-01-01 to 2010-12-31
 
plot(sierra_cap, locagrid = TRUE)

sierra_cap %>% ca_preflight()
General issues
 - none found
Issues for querying values
 - A spatial aggregation function is required to query values from polygon areas. See `ca_options`.
Issues for downloading rasters
 - none found


To fetch the data as TIFs, use :

tiff_dir <- "./data"

sierra_tiff_fn <- sierra_cap %>% 
  ca_getrst_stars(out_dir = tiff_dir, mask = TRUE, quiet = TRUE, overwrite = FALSE)

Pro Tip:


ca_getrst_stars() works differently than retrieving tabular climate values. It returns a vector of TIF files that were downloaded. To work with them, you next have to load them back into R as stars objects (space-time arrays) using ca_stars_read():

sierra_stars_lst <- ca_stars_read(sierra_tiff_fn)
sierra_stars_lst[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s):
                                 Min.  1st Qu.   Median     Mean  3rd Qu.     Max.   NA's
pr_year_livneh_name-Sierra  0.2009817 2.037323 2.821092 3.106411 3.863086 12.43626 123615
dimension(s):
     from to   offset   delta refsys point values x/y
x       1 58 -121.688  0.0625 WGS 84 FALSE   NULL [x]
y       1 71   40.125 -0.0625 WGS 84 FALSE   NULL [y]
year    1 41     1970       1     NA    NA   NULL    


To plot a stars objects, you have to decide which layer(s) to plot. In this case, each layer represents a year from 1970 to 2010. Below we plot 4 of the 40 years:

plot(sierra_stars_lst[[1]] %>% slice(index = seq(1,40,length.out =4), along = "year"), 
     axes = TRUE,
     main = attributes(sierra_stars_lst[[1]])$ca_metadata$slug)
downsample set to c(0,0,1)

Not sure what the units are? You can double-check by viewing the metadata for the slug from the catalog:

ca_catalog_search("pr_year_livneh")

pr_year_livneh
  name: Livneh yearly average precipitation historical
  url: https://api.cal-adapt.org/api/series/pr_year_livneh/
  tres: annual
  begin: 1950-01-01T00:00:00Z
  end: 2013-12-31T00:00:00Z
  units: mm
  num_rast: 64
  id: 410
  xmin: -124.5625
  xmax: -113.375
  ymin: 31.5625
  ymax: 43.75


There is a lot more you can do with rasters, including pixel summaries, combining them into higher dimensional data cubes, spatially mosaicing them, etc. For more info, see the Rasters articles on the website.


Your Turn

Download historic precipitation data for the county where you live or work. [Answer]

## Your answer here


---
title: "Large Queries and Caching Results"
output:
  html_notebook: 
    css: https://ucanr-igis.github.io/caladaptr-res/assets/nb_css01.css
    includes:
      after_body: https://ucanr-igis.github.io/caladaptr-res/assets/nb_footer01.html
---

# Overview

This notebook will demonstrate two ways of dealing with larger volumes of data:

- caching retrieved into a SQLite database  
- downloading data as TIFFs

A functional definition of "large data" might include:

- any volume of data that you wouldn't want to download twice  
- any volume of data that might bog down (or worse, crash) the Cal-Adapt server, and annoy the system administrators  

## Saving Cal-Adapt Data to SQLite Database

To manage large downloads and avoid downloading data twice, caladaptR provides `ca_getvals_db()`. `ca_getvals_db()` is very similar to `ca_getvals_tbl()`, but saves the data into a SQLite database file on your computer as it comes in. Before making an API request, it checks to see if the data have already been downloaded, and if so skips it. If you get disconnected during a download, you can run the command again and it'll just pick up where it left off. `ca_getvals_db()` returns a 'remote tibble', which functions very similar to a regular 'in-memory' tibble however it points to the SQLite database file on disk.

For additional info on working with SQLite databases, see the Article on making [Large Queries](https://ucanr-igis.github.io/caladaptr/articles/large-queries.html).

# Example: Download Daily Data for Congressional Disticts

In this example we'll download daily temperature data for 16 Congressional Districts in the LA region. 

\

## Setup

Load caladaptR and the other package we're going to need. (If you haven't installed these yet, see this [setup script](https://github.com/ucanr-igis/caladaptr-res/blob/main/docs/workshops/ca_intro_oct21/scripts/caladaptr_setup.R)). 

```{r chunk01, message=FALSE, warning=FALSE, results='hold'}
library(caladaptr)
library(units)
library(ggplot2)
library(dplyr)
library(lubridate)
library(sf)
library(tidyr)
library(tmap)
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
```

\

## Import the Congressional Districts Boundaries

The LA Congressional district boundaries are saved in the 'data' folder as a geopackage:

```{r chunk02}
cdist_la_fn <- "./data/cdistricts_la.gpkg"
file.exists(cdist_la_fn)

cdist_la_sf <- st_read(cdist_la_fn)

tmap_mode("view")
tm_shape(cdist_la_sf) + tm_borders()
```

**Pro Tip:**

- to use a sf object the location for an API Request, it must be in geographic coordinates (EPSG 4326)


\

## Create the API Request

Here we use `ca_loc_sf()` as the location function for an API request for 30-years of modeled daily temperature data (minimum and maximum) for 4 GCMs and 2 RCPs:

```{r chunk03}
cdist_la_cap <- ca_loc_sf(loc = cdist_la_sf, idfld = "geoid") %>% 
  ca_gcm(gcms[1:4]) %>%                                 
  ca_scenario(c("rcp45", "rcp85")) %>%
  ca_period("day") %>%
  ca_years(start = 2070, end = 2099) %>%
  ca_cvar(c("tasmin", "tasmax")) %>% 
  ca_options(spatial_ag = "mean")

cdist_la_cap
```

\

Do the standard plotting and preflight checks:

```{r chunk04}
plot(cdist_la_cap, locagrid = TRUE)
```


```{r chunk05}
cdist_la_cap %>% ca_preflight()
```

\

## Fetch Data

To copy downloaded data into a database, use `ca_getvals_db()` instead of `ca_getvals_tbl()`. `ca_getvals_db()` has two arguments which are mandatory:

`db_fn` - the file name of a SQLite database (will be created if it doesn't exist)

`db_tbl` - the name of a table inside the database where the values should be saved

```{r chunk06}
my_database_fn <- "./data/cdist_la_temp_data.sqlite"
  
cdist_la_rtbl <- cdist_la_cap %>% 
  ca_getvals_db(db_fn = my_database_fn, 
                db_tbl = "temp_data",
                quiet = TRUE)
```

\

Inspect the results:

```{r chunk07}
cdist_la_rtbl %>% head()
```

\

The number of rows we retrieved:

```{r chunk08}
cdist_la_rtbl %>% count() %>% pull(n)
```

\

## Wrangling a Remote Tibble

Many of the base R operations that work with in-memory tibbles may or many not work with remote tibbles. For example as we saw above `cdist_la_rtbl %>% count()` works, but:

```{r chunk09}
dim(cdist_la_rtbl)
nrow(cdist_la_rtbl)
```

\

In general, the best way to work with remote tibbles is with 

i) dplyr functions, or   
ii) SQL statements passed using the `DBI` package 

Simple filtering, sorting, grouping and simple numeric summaries generally work fine with dplyr verbs:

```{r chunk10}
cdist_la_rtbl %>% 
  filter(geoid == "0632") %>% 
  group_by(scenario, gcm, cvar) %>% 
  summarize(mean_temp = mean(val, na.rm = TRUE))
```

\


**Pro Tip:**

- You can 'convert' a Remote Tibble to a regular in-memory Tibble by tacking on `collect()` at the end of a dplyr expression.

\

If your wrangling workflow involves a lot steps that are difficult or impossible to do with remote tibbles, a good strategy is to do your filtering and grouping with dplyr statements on the remote tibble, and then convert the results to a regular tibble with `collect()`.

Below we convert the grouped summary table into a tibble so we can use `pivot_wider()` (which doesn't work on remote tibbles):

```{r chunk11}
temp_long_tbl <- cdist_la_rtbl %>% 
  filter(geoid == "0632") %>% 
  group_by(scenario, gcm, cvar) %>% 
  summarize(mean_temp = mean(val, na.rm = TRUE)) %>% 
  collect()

class(temp_long_tbl)

temp_wide_tbl <- temp_long_tbl %>% 
  pivot_wider(names_from = cvar, values_from = mean_temp) %>% 
  mutate(tas_range = tasmax - tasmin)

temp_wide_tbl %>% head()
```

\

#### Your Turn

Create a histogram of the mean daily temperatures for one Congressional District and one emissions scenario, grouping the data by GCM. Does the distribution of mean average temperature look the same across GCMs? 

```{r chunk12}
## Your answer here
```

\

# Downloading Rasters

Another way to deal with large data needs is to download the data as raster or TIF files. The same API Request object can be used to get raster data if you feed it into `ca_getrst_stars()`.

For additional info on downloading and analyzing rasters, see the 3 articles on [Downloading Rasters](https://ucanr-igis.github.io/caladaptr/articles/rasters-pt1.html).

Below we get a raster of observed historic temperature data for the Sierra climate region:

```{r chunk13}
sierra_cap <- ca_loc_aoipreset(type = "climregions", idfld = "name", idval = "Sierra") %>% 
  ca_livneh(TRUE) %>% 
  ca_period("year") %>% 
  ca_cvar("pr") %>% 
  ca_years(start = 1970, end = 2010)

sierra_cap

plot(sierra_cap, locagrid = TRUE)

sierra_cap %>% ca_preflight()
```

\

To fetch the data as TIFs, use :

```{r chunk14}
tiff_dir <- "./data"

sierra_tiff_fn <- sierra_cap %>% 
  ca_getrst_stars(out_dir = tiff_dir, mask = TRUE, quiet = TRUE, overwrite = FALSE)
```

**Pro Tip:**

 - to avoid downloading the same TIF multiple times, use the same output directory and set `overwrite = FALSE`
 
\
 
`ca_getrst_stars()` works differently than retrieving tabular climate values. It returns a vector of TIF files that were downloaded. To work with them, you next have to load them back into R as stars objects (space-time arrays) using `ca_stars_read()`:

```{r chunk15, paged.print = FALSE}
sierra_stars_lst <- ca_stars_read(sierra_tiff_fn)
sierra_stars_lst[[1]]
```

\

To plot a stars objects, you have to decide which layer(s) to plot. In this case, each layer represents a year from 1970 to 2010. Below we plot 4 of the 40 years:

```{r chunk16}
plot(sierra_stars_lst[[1]] %>% slice(index = seq(1,40,length.out =4), along = "year"), 
     axes = TRUE,
     main = attributes(sierra_stars_lst[[1]])$ca_metadata$slug)
```

Not sure what the units are? You can double-check by viewing the metadata for the slug from the catalog:

```{r chunk17}
ca_catalog_search("pr_year_livneh")
```

\

There is a **lot** more you can do with rasters, including pixel summaries, combining them into higher dimensional data cubes, spatially mosaicing them, etc. For more info, see the Rasters articles on the [website](https://ucanr-igis.github.io/caladaptr/).
 
\

#### Your Turn

Download historic precipitation data for the county where you live or work. [[Answer](https://bit.ly/3m7BcWM)]
 
```{r chunk18, paged.print = FALSE}
## Your answer here
```

\
