How Does CIMIS Compute Daily Summaries?

An exercise in reverse engineering

Author

Andy Lyons

Published

April 3, 2024

TLDR

The CIMIS API allows you to query weather station data by hour or day. The daily summaries appear to be generated by aggregating hourly data with the definition of a day as starting at 1am and ending at 24:00. This is a little unusual but that’s how they do it.

There is a small unexplained difference in the aggregated hourly data and the daily summaries. This could be due to rounding errors or the possibility that hourly and daily summaries are both generated from an even shorter time period (e.g., 5 minutes).

Overview

The CIMIS API allows you to query data in hourly or daily time intervals. This notebook does some analysis to understand exactly how the daily summaries are constructed from the hourly data.

Why is this useful? A very good question. This is useful because i) we need daily values for certain models, such as degree days or irrigation requirements, but ii) we might only have access to the hourly data (i.e, if we are accessing the data from the Synoptic API, which is a lot more stable and reliable than the CIMIS API1).

To figure this out, we’re going to import Hourly and Daily CIMIS Data for the same time period. We’ll then compute daily totals starting at different hours (we do this off camera in Excel for convenience), and compare the totals to the daily summaries.

Setup

suppressPackageStartupMessages({
  library(cimir)
  library(dplyr)
  library(lubridate)
})


Load my CIMIS Key:

my_cimis_key <- readLines("andy_cimis_key.txt", warn = FALSE)
set_key(my_cimis_key)


Define a directory for cached data

cache_dir <- here::here("cimis_cache")
dir.exists(cache_dir)
[1] TRUE


Check if the API is alive

# library(httr)
# (xx <- httr::GET(url = "http://et.water.ca.gov/api/station/125"))
# httr::status_code(xx) == 200


View CIMIS Items

cimis_items()


Get one month of daily precip data

precip_day_cache_fn <- file.path(cache_dir, "station152_day-precip_2024-03-01_2024-03-31.Rds")

if (file.exists(precip_day_cache_fn)) {
  cat("Going to use cached daily data \n")
  precip_day_tbl <- readRDS(precip_day_cache_fn)
} else {
  cat("Going to fetch daily data \n")
  precip_day_tbl <- cimis_data(targets = 152, start.date = "2024-03-01", end.date = "2024-03-31",
                                 items = "day-precip")
  saveRDS(precip_day_tbl, file = precip_day_cache_fn)
}
Going to use cached daily data 


Get one month of hourly precip data

precip_hly_cache_fn <- file.path(cache_dir, "station152_hourly-precip_2024-03-01_2024-03-31.Rds")

if (file.exists(precip_hly_cache_fn)) {
  cat("Going to use cached hourly data \n")
  precip_hly_tbl <- readRDS(precip_hly_cache_fn)

} else {
  cat("Going to fetch hourly data \n")
  precip_hly_tbl <- cimis_data(targets = 152, start.date = "2024-03-01", end.date = "2024-03-31",
                                 items = "hly-precip")
  saveRDS(precip_hly_tbl, file = precip_hly_cache_fn)
}
Going to use cached hourly data 
precip_hly_tbl |> 
  select(Name, Type, Date, Hour, Station, Scope, Item, Value) |> 
  slice(1:30)

Note: we see our first clue above. The first hour of data for March 1 is 1am, and the final hour is “2400”. Most people would consider midnight the first hour of the day, but CIMIS considers it the last minute (“24:00”) of the previous day.


Do some date conversion and pull out just a couple of fields

hly_tbl <- precip_hly_tbl |>
  mutate(date = ymd(Date), hour_num = as.numeric(Hour) / 100, dt = date + hours(hour_num)) |>
  select(dt, precip_hly = Value)
head(hly_tbl, n=30)
day_tbl <- precip_day_tbl |>
  mutate(date = ymd(Date)) |>
  select(date, precip_day = Value)
head(day_tbl)


Next, we export the hourly data to Excel, where it will be easier to explore the hour:

precip_hourly_xlsx_fn <- file.path(cache_dir, "station152_day-precip_2024-03-01_2024-03-31.xlsx")

if (file.exists(precip_hourly_xlsx_fn)) {
  cat("Excel file already exists. \n")
} else {
  cat("Going to save hourly data as Excel \n")
  library(openxlsx)
  openxlsx::write.xlsx(hly_tbl, file = precip_hourly_xlsx_fn)
}
Excel file already exists. 


The Excel investigation confirms our suspicions. We get a very good match between the sum of the hourly data and the daily summaries if we start counting at 1am. There is still a small discrepancy which we can’t explain, but is small and could be due to a rounding error (or maybe the data are recorded at an even finer temporal resolution than 1 hour).

To check the closeness of the summed up hourly totals (starting at 1am) and the daily summaries, we can run the following:

hly_tbl |> 
  mutate(midnight_offset = as.numeric(hour(dt) == 0), date_group = date(dt) - days(midnight_offset)) |> 
  group_by(date = date_group) |> 
  summarise(precip_hly_sum = sum(precip_hly), .groups = "drop") |> 
  left_join(day_tbl, by = "date") |> 
  mutate(diff = precip_hly_sum - precip_day)

Footnotes

  1. The CIMIS API is getting an upgrade in September 2024, so this may be a moot point by then.↩︎