Consider adding download caching to the client #29

Closed · Tracked by #90 · Fixed by #153
dshemetov opened this issue Mar 31, 2022 · 7 comments
Labels: enhancement (New feature or request), P1 (medium priority)

@dshemetov
Contributor

dshemetov commented Mar 31, 2022

I would think the following are higher priority than trying to implement streaming:

  • Tools to cache and perform smart updates on the cache.
  • ...

Originally posted by @brookslogan in #13 (comment)

We have work on adding data caching to evalcast here. There is the option of porting that logic into this package; it would be natural to keep the data-fetching/caching logic contained here. Then evalcast and the packages that supersede it can simply be updated to use this client and get caching functionality for free.

@brookslogan
Contributor

brookslogan commented Apr 13, 2022

A few caching approaches, ordered roughly by complexity and compared on featurefulness, efficiency, and which package they would belong in:

  • per-query lease cache (perhaps just in utility functions/parameters rather than integrated into the fetch internals): for every query a user performs, require/allow a cache file name/prefix; if the cache entry doesn't exist, query and store; if it exists but is too old, query and update; if it exists and is new enough, return it. (A minimal sketch of this option follows this list.)
  • evalcast caching port: simple-ish (see the functional PR linked above), but requires inspecting each endpoint we improve; caches latest and as_of queries over all geos of a geo type (so it will require updates for endpoints that don't use "*", maybe for those that use weeks instead of dates, that have different additional keys than covidcast (which has data_source, signal), etc.); can be inefficient in bandwidth and disk space; could belong in this package.
  • disk-backed, updating epi_archive: medium complexity (but only requires individual endpoint treatment if it lands in this package); roughly the same features as the evalcast port for delphi-epidata, but might also apply to non-delphi-epidata sources; more bandwidth- and disk-efficient; not sure which package.
  • smart cache: something like the updating epi_archive, but not requiring the entire history to be downloaded nor a fixed set of geos, and potentially compressing things by removing rows that match LOCF; highest complexity and requires inspecting each endpoint to cache; most featureful; adds overhead on some access patterns; would belong in this package.
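To make the first option concrete, here is a minimal sketch of a per-query lease cache as a standalone utility. All names here (`cached_fetch`, `max_age_s`) and the wrapped query are illustrative assumptions, not existing epidatr API:

```r
# Hypothetical lease-cache helper: `fetch` is any zero-argument function
# that performs the actual API request and returns a data frame.
cached_fetch <- function(fetch, cache_file, max_age_s = 24 * 60 * 60) {
  if (file.exists(cache_file)) {
    age_s <- as.numeric(Sys.time()) - as.numeric(file.mtime(cache_file))
    if (age_s <= max_age_s) {
      # Cache entry exists and is new enough: return it.
      return(readRDS(cache_file))
    }
  }
  # Cache entry is missing or too old: query and (re)store.
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Usage: wrap any query in a lease on a named cache file.
# df <- cached_fetch(function() <some epidatr query>, "my_query.rds")
```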

Anything requiring interaction with individual endpoints may suffer from needing to think about the details and bugs of each of those endpoints, so it would probably only be applied to a few of them.

I think the two major development approaches would be:

  • first implement the evalcast port, then the updating full-history fixed-geo-set epi_archive, then the smarter-updating epi_archive, and build the latter into the endpoints
  • jump straight to the updating full-history fixed-geo-set epi_archive and don't build it into the endpoints

And an orthogonal detail: we sometimes have updates to an issue within its publication window (plus replication delays), so real-time queries of that issue will not match retrospective queries of it. This can be addressed in a few ways, but it might mean caching requires some careful messaging and/or should be limited to advanced users. (This still ignores the possibility of non-recent history being mangled in exceptional circumstances.)

@brookslogan brookslogan added P1 medium priority P2 low priority and removed P1 medium priority labels Sep 30, 2022
@brookslogan
Contributor

brookslogan commented Sep 30, 2022

Priority here is hard to judge. [We'll want to play around with using epidatr more to try to get an intuition.]

  • We may want to wait for (or settle for) caching-and-updating epi_archives, which we hope to add to epiprocess or a new package; but that may not be worked on for a while, and its result would be more general-purpose (applying to things outside epidatr) yet less convenient than something integrated into epidatr. No one is actively working on these yet either.
  • If epidatr testing/users find that it's slow or that they miss caching, then we'll want to up the priority.
  • If repeated duplicate queries are causing issues for the API servers, then we'll want to up the priority. (Note that rate-limiting might obviate a need for automatic caching, but also increase the demand for either automatic or manual caching.)

@brookslogan brookslogan added the triage Should assign/revisit priority level, other tags label Sep 30, 2022
@dshemetov
Contributor Author

dshemetov commented Sep 30, 2022

Brainstorm: which package should contain the caching code is a tricky design question, and it's something we want to get right ahead of time: we don't want to add caching functionality and then have to remove it later.

  • it could live in this package (and/or the Python client)
    • con: we would need to duplicate the cache system in Python
  • it could live in epi_archive (if I understand correctly that the answer to my first question above is that it's a dynamic cache store)
    • con: the Python client would be left out
  • it could live in its own repository
    • con?: is designing a common interface for R/Python feasible?

Suggestion: I am leaning toward the last option, mostly because duplicating the caching system in two languages would take a lot of work. I also wonder if caching may be scope drift for epi_archive.

Observation: an updating caching system feels like duplicating the API server locally: query your local store first, then request the missing/stale pieces from the remote store. I wonder if it's a crazy idea to write the caching system as a local API (e.g. using hug).

@dshemetov dshemetov removed the triage Should assign/revisit priority level, other tags label May 3, 2023
@dsweber2 dsweber2 added the enhancement New feature or request label May 17, 2023
@dshemetov
Contributor Author

  • Using the memoise package in R, we can add disk caching to the downloads. This is similar to what we already have in evalcast::download_signals. (A sketch follows this list.)
    • Pro: we don't have to roll our own key-value caching system based on file identifiers; instead the cache key is based on function arguments.
    • Pro: max_age will allow us to make caches that naturally expire after a day. This allows fast iteration in one session without a persistent cache that goes stale.
    • Con: more code, more maintenance. Possibly something we could offer as an example in the docs instead.
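A minimal sketch of this approach, assuming a hypothetical `fetch_epidata` download function (memoise delegates storage to a cachem cache, and `cache_disk(max_age = ...)` gives entries that expire, here after one day):

```r
library(memoise)
library(cachem)

# Placeholder for the real download logic.
fetch_epidata <- function(url) {
  readLines(url)
}

# The cache key is derived from the function's arguments; entries older
# than max_age (in seconds) are treated as missing and re-fetched.
cached_fetch_epidata <- memoise(
  fetch_epidata,
  cache = cache_disk(dir = ".epidatr_cache", max_age = 24 * 60 * 60)
)
```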

@dshemetov dshemetov mentioned this issue Aug 23, 2023
@dsweber2
Contributor

dsweber2 commented Aug 23, 2023

So I wanted to jot down some (public) notes comparing "out of the box" caching methods. This post was the most extensive single discussion of options I could find (checked boxes indicate options that are superseded by others):

  • storr
    last update: 2020
    plays nicely with parallelism
    same dev as thor (the next one)
  • thor
    last update: 2023
    specifically made for disk caching
    wrapper for LMDB, a memory-mapped database written in C
    basically an R clone of py-lmdb
    kind of fights R's paradigms, and is heavily OO
    might provide cross-language cache portability
    however, it requires using serialize() by default, which I don't think makes it a good option: that isn't portable across languages and does some type mangling
  • R.cache
    last (CRAN) update: 2022
    most recent issue is 2022
    doesn't seem to have a way to manage the age of the disk cache?
  • memoise
    last (CRAN) update: 2021
    R age issues
    has an S3 caching function
    actually just a convenience wrapper for cachem
  • cachem
    last update: 2023
    apparently did their actual CRAN release this year
    tidyverse associated
    pruning may introduce problematic delays, and it happens every 5 seconds; that makes sense for a memory cache, but not so much for a storage/disk one. The defaults are definitely aimed more at memory (time in seconds, size of 200MB, etc.)
    update: it looks like we can use whatever save format we want via the read_fn and write_fn arguments, so we could make the files portable and more space-efficient, e.g. feather/arrow/csv/etc.
  • handwritten
    last update: never
    I don't wanna

I'm going to experiment with thor, but if folks have thoughts, lmk. We thankfully have a very natural key, the string of the URL, so we don't need the full complexity of a memoiser. This also means the cache can be shared between Python and R calls (as long as the user sets the directories correctly, of course).
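Here's a minimal sketch of what that thor experiment might look like, with the request URL as the key. The URL and fetch step are illustrative; storing the raw response text (e.g. CSV) rather than serialize() output is what would keep the cache readable from a Python LMDB client:

```r
library(thor)

# An LMDB environment backed by a directory on disk.
env <- mdb_env(".epidatr_cache_lmdb")

url <- "https://api.delphi.cmu.edu/epidata/covidcast/?..." # illustrative
csv_text <- env$get(url, missing_is_error = FALSE)
if (is.null(csv_text)) {
  # Cache miss: fetch the raw response text and store it under the URL.
  csv_text <- paste(readLines(url), collapse = "\n")
  env$put(url, csv_text)
}
df <- read.csv(text = csv_text)
```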

@dsweber2
Contributor

oh, we also need a default caching location:

  • here::here() so that it's project specific
  • tmp directory
  • suggestions?

@dsweber2 dsweber2 mentioned this issue Aug 24, 2023
@dshemetov
Contributor Author

dshemetov commented Aug 24, 2023

We converged on:

  • default to here::here(".epidatr_cache"), but allow an absolute path to be specified
  • the pro of here::here is that it's OS-independent; the con of possibly leaving a trail of caches on the user's machine is preferable to figuring out an OS-independent user-config file

Also, cachem has an 80-character limit on keys (and storr's limit is between 150 and 300). We decided to use a hash of the query URL as the key, instead of the full URL, to get around this; a sketch follows.
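A minimal sketch of the hashed-key approach with cachem (the cache directory and URL are illustrative; `rlang::hash` is one option for producing a short, fixed-length, lowercase-hex key that fits within cachem's key constraints):

```r
library(cachem)

cache <- cache_disk(dir = here::here(".epidatr_cache"))

url <- "https://api.delphi.cmu.edu/epidata/covidcast/?..." # illustrative
key <- rlang::hash(url) # fixed-length hex string, regardless of URL length

df <- cache$get(key)
if (is.key_missing(df)) {
  # Cache miss: fetch (placeholder) and store under the hashed key.
  df <- read.csv(url)
  cache$set(key, df)
}
```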

@dshemetov dshemetov added P1 medium priority and removed P2 low priority labels Aug 30, 2023