Consider adding download caching to the client #29

Closed · Tracked by #90 · Fixed by #153
dshemetov opened this issue Mar 31, 2022 · 7 comments
Labels: enhancement (New feature or request), P1 (medium priority)

@dshemetov
Contributor

dshemetov commented Mar 31, 2022

I would think the following are higher priority than trying to implement streaming:

  • Tools to cache and perform smart updates on the cache.
  • ...

Originally posted by @brookslogan in #13 (comment)

We have work on adding data caching to evalcast here. There is the option of porting that logic into this package; it would be natural to keep the data-fetching/caching logic contained here. Then evalcast and the packages that supersede it can simply be updated to use this client and get caching functionality for free.

@brookslogan
Contributor

brookslogan commented Apr 13, 2022

A few caching approaches, ordered roughly by complexity and compared on featurefulness, efficiency, and which package they would belong in:

  • per-query lease cache (perhaps just in utility functions/parameters rather than integrated into the fetch internals): for every query a user performs, require/allow a cache file name/prefix; if the cache entry doesn't exist, query and store; if it exists but is too old, query and update; if it exists and is new enough, return it. (A minimal sketch of this option follows this list.)
  • evalcast caching port: simple-ish (see the functional PR linked above), but requires inspecting each endpoint we improve; caches latest and as_of queries over all geos of a geo type (so it will require updates for endpoints that don't use "*", maybe for those that use weeks instead of dates, that have different additional keys than covidcast (which has data_source, signal), etc.); can be inefficient in bandwidth and disk space; could belong in this package.
  • disk-backed, updating epi_archive: medium complexity (but only requires individual endpoint treatment if it lands in this package); roughly the same features as the evalcast port for delphi-epidata, but might also apply to non-delphi-epidata sources; more bandwidth- and disk-efficient; not sure which package.
  • smart cache: something like the updating epi_archive, but not requiring the entire history to be downloaded nor a fixed set of geos, and potentially compressing things by removing rows that match LOCF; highest complexity and requires inspecting each endpoint to cache; most featureful; adds overhead on some access patterns; would belong in this package.
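To make the first option concrete, here is a minimal sketch of a per-query lease cache as a standalone utility. All names here (`cached_fetch`, `max_age_s`) and the wrapped query are illustrative assumptions, not existing epidatr API:

```r
# Hypothetical lease-cache helper: `fetch` is any zero-argument function
# that performs the actual API request and returns a data frame.
cached_fetch <- function(fetch, cache_file, max_age_s = 24 * 60 * 60) {
  if (file.exists(cache_file)) {
    age_s <- as.numeric(Sys.time()) - as.numeric(file.mtime(cache_file))
    if (age_s <= max_age_s) {
      # Cache entry exists and is new enough: return it.
      return(readRDS(cache_file))
    }
  }
  # Cache entry is missing or too old: query and (re)store.
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Usage: wrap any query in a lease on a named cache file.
# df <- cached_fetch(function() <some epidatr query>, "my_query.rds")
```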

Anything requiring interaction with individual endpoints may suffer from needing to think about the details and bugs of each of those endpoints, so it would probably only be applied to a few of them.

I think the two major development approaches would be:

  • first implement the evalcast port, then the updating full-history fixed-geo-set epi_archive, then the smarter-updating epi_archive, and build the latter into the endpoints
  • jump straight to the updating full-history fixed-geo-set epi_archive and don't build it into the endpoints

And an orthogonal detail: we sometimes have updates to an issue within its publication window (plus replication delays), so real-time queries of that issue will not match retrospective queries of it. This can be addressed in a few ways, but it might mean caching requires some careful messaging and/or should be limited to advanced users. (This still ignores the possibility of non-recent history being mangled in exceptional circumstances.)

@brookslogan brookslogan added P1 medium priority P2 low priority and removed P1 medium priority labels Sep 30, 2022
@brookslogan
Contributor

brookslogan commented Sep 30, 2022

Priority here is hard to judge. [We'll want to play around with using epidatr more to try to get an intuition.]

  • We may want to wait for (or settle for) caching-and-updating epi_archives, which we hope to add to epiprocess or a new package; but that may not be worked on for a while, and its result would be more general-purpose (applying to things outside epidatr) yet less convenient than something integrated into epidatr. No one is actively working on these yet either.
  • If epidatr testing/users find that it's slow or that they miss caching, then we'll want to up the priority.
  • If repeated duplicate queries are causing issues for the API servers, then we'll want to up the priority. (Note that rate-limiting might obviate a need for automatic caching, but also increase the demand for either automatic or manual caching.)

@brookslogan brookslogan added the triage Should assign/revisit priority level, other tags label Sep 30, 2022
@dshemetov
Contributor Author

dshemetov commented Sep 30, 2022

Brainstorm: which package should contain the caching code is a tricky design question, and it's something we want to get right ahead of time: we don't want to add caching functionality and then have to remove it later.

  • it could live in this package (and/or the Python client)
    • con: we would need to duplicate the cache system in Python
  • it could live in epi_archive (if I understand correctly that the answer to my first question above is that it's a dynamic cache store)
    • con: the Python client would be left out
  • it could live in its own repository
    • con?: is designing a common interface for R/Python feasible?

Suggestion: I am leaning toward the last option, mostly because duplicating the caching system in two languages would take a lot of work. I also wonder if caching may be scope drift for epi_archive.

Observation: an updating caching system feels like duplicating the API server locally: query your local store first, then request the missing/stale pieces from the remote store. I wonder if it's a crazy idea to write the caching system as a local API (e.g. using hug).

@dshemetov dshemetov removed the triage Should assign/revisit priority level, other tags label May 3, 2023
@dsweber2 dsweber2 added the enhancement New feature or request label May 17, 2023
@dshemetov
Contributor Author

  • Using the memoise package in R, we can add disk caching to the downloads. This is similar to what we already have in evalcast::download_signals. (A sketch follows this list.)
    • Pro: we don't have to roll our own key-value caching system based on file identifiers; instead the cache key is based on function arguments.
    • Pro: max_age will allow us to make caches that naturally expire after a day. This allows fast iteration in one session without a persistent cache that goes stale.
    • Con: more code, more maintenance. Possibly something we could offer as an example in the docs instead.
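A minimal sketch of this approach, assuming a hypothetical `fetch_epidata` download function (memoise delegates storage to a cachem cache, and `cache_disk(max_age = ...)` gives entries that expire, here after one day):

```r
library(memoise)
library(cachem)

# Placeholder for the real download logic.
fetch_epidata <- function(url) {
  readLines(url)
}

# The cache key is derived from the function's arguments; entries older
# than max_age (in seconds) are treated as missing and re-fetched.
cached_fetch_epidata <- memoise(
  fetch_epidata,
  cache = cache_disk(dir = ".epidatr_cache", max_age = 24 * 60 * 60)
)
```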

@dshemetov dshemetov mentioned this issue Aug 23, 2023
@dsweber2
Contributor

dsweber2 commented Aug 23, 2023

So I wanted to jot down some (public) notes comparing "out of the box" caching methods. This post was the most extensive single discussion of options I could find (checked boxes indicate options that are superseded by others):

  • storr
    last update: 2020
    plays nicely with parallelism
    same dev as thor (the next one)
  • thor
    last update: 2023
    specifically made for disk caching
    wrapper for LMDB, a memory-mapped database written in C
    basically an R clone of py-lmdb
    kind of fights R's paradigms, and is heavily OO
    might provide cross-language cache portability
    however, it requires using serialize() by default, which I don't think makes it a good option: that isn't portable across languages and does some type mangling
  • R.cache
    last (CRAN) update: 2022
    most recent issue is 2022
    doesn't seem to have a way to manage the age of the disk cache?
  • memoise
    last (CRAN) update: 2021
    R age issues
    has an S3 caching function
    actually just a convenience wrapper for cachem
  • cachem
    last update: 2023
    apparently did their actual CRAN release this year
    tidyverse associated
    pruning may introduce problematic delays, and it happens every 5 seconds; that makes sense for a memory cache, but not so much for a storage/disk one. The defaults are definitely aimed more at memory (time in seconds, size of 200MB, etc.)
    update: it looks like we can use whatever save format we want via the read_fn and write_fn arguments, so we could make the files portable and more space-efficient, e.g. feather/arrow/csv/etc.
  • handwritten
    last update: never
    I don't wanna

I'm going to experiment with thor, but if folks have thoughts, lmk. We thankfully have a very natural key, the string of the URL, so we don't need the full complexity of a memoiser. This also means the cache can be shared between Python and R calls (as long as the user sets the directories correctly, of course).
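Here's a minimal sketch of what that thor experiment might look like, with the request URL as the key. The URL and fetch step are illustrative; storing the raw response text (e.g. CSV) rather than serialize() output is what would keep the cache readable from a Python LMDB client:

```r
library(thor)

# An LMDB environment backed by a directory on disk.
env <- mdb_env(".epidatr_cache_lmdb")

url <- "https://api.delphi.cmu.edu/epidata/covidcast/?..." # illustrative
csv_text <- env$get(url, missing_is_error = FALSE)
if (is.null(csv_text)) {
  # Cache miss: fetch the raw response text and store it under the URL.
  csv_text <- paste(readLines(url), collapse = "\n")
  env$put(url, csv_text)
}
df <- read.csv(text = csv_text)
```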

@dsweber2
Contributor

oh, we also need a default caching location:

  • here::here() so that it's project specific
  • tmp directory
  • suggestions?

@dsweber2 dsweber2 mentioned this issue Aug 24, 2023
@dshemetov
Contributor Author

dshemetov commented Aug 24, 2023

We converged on:

  • default to here::here(".epidatr_cache"), but allow an absolute path to be specified
  • the pro of here::here is that it's OS-independent; the con of possibly leaving a trail of caches on the user's machine is preferable to figuring out an OS-independent user-config file

Also, cachem has an 80-character limit on keys (and storr's limit is between 150 and 300). We decided to use a hash of the query URL as the key, instead of the full URL, to get around this; a sketch follows.
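A minimal sketch of the hashed-key approach with cachem (the cache directory and URL are illustrative; `rlang::hash` is one option for producing a short, fixed-length, lowercase-hex key that fits within cachem's key constraints):

```r
library(cachem)

cache <- cache_disk(dir = here::here(".epidatr_cache"))

url <- "https://api.delphi.cmu.edu/epidata/covidcast/?..." # illustrative
key <- rlang::hash(url) # fixed-length hex string, regardless of URL length

df <- cache$get(key)
if (is.key_missing(df)) {
  # Cache miss: fetch (placeholder) and store under the hashed key.
  df <- read.csv(url)
  cache$set(key, df)
}
```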

@dshemetov dshemetov added P1 medium priority and removed P2 low priority labels Aug 30, 2023