
Package evaluation & first pass documentation #13


Closed
krivard opened this issue Sep 22, 2021 · 15 comments
Labels
P3 very low priority

Comments

@krivard
Contributor

krivard commented Sep 22, 2021

The goal of this task is to better understand whether this package does what we need it to, and permit other Delphi members to understand the same without having to read the source.

Delphi blog posts that include plots also include the code used to produce those plots. Reproduce as many of the plots as you can from the following blog posts, using this package instead of covidcastR/covidcast.

Use your newly acquired knowledge to repair the Roxygen docs so that they run successfully, and flesh out the package documentation with the basics.

@krivard
Contributor Author

krivard commented Sep 22, 2021

@rafaelcatoia this is a good candidate for you to pick up next

@dshemetov
Contributor

dshemetov commented Mar 18, 2022

@krivard @ryantibs I went through the feature list document and marked what is implemented, what is not, what is possibly out of scope, and what is not working correctly.

Feature Checklist

Implemented

  • Retrieve metadata from non-covidcast signals
  • Retrieve metadata from covidcast signals
  • Retrieve as_of a certain date (where supported)
  • Retrieve most recent (latest issue) data (where supported)
  • Retrieve given a certain lag (where supported)
  • Retrieve as_of a range of time values (where supported)
  • Authenticate via API key, passed in each call
  • Allow changing endpoint to staging server
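As a sketch of how the implemented retrieval modes map onto the underlying HTTP API (parameter names follow the public covidcast endpoint documentation; the package's own function names may differ), the four modes differ only in one query parameter. This only builds request URLs and does not hit the network:

```r
library(httr)

base <- "https://api.delphi.cmu.edu/epidata/covidcast/"
params <- list(
  data_source = "jhu-csse", signal = "confirmed_incidence_num",
  time_type = "day", geo_type = "state",
  time_values = "20210101-20210601", geo_value = "pa"
)

url_latest <- modify_url(base, query = params)                               # most recent issue
url_as_of  <- modify_url(base, query = c(params, list(as_of = "20210601")))  # snapshot as of a date
url_issues <- modify_url(base, query = c(params, list(issues = "20210501-20210601")))
url_lag    <- modify_url(base, query = c(params, list(lag = 3)))             # fixed reporting lag
# e.g. GET(url_as_of) would perform the actual fetch
```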

Not Implemented

  • Streaming mode support
    • Not implemented. Unclear what this means and unclear if feasible in R.
  • Asynchronous calls
    • Not implemented.
  • Optimized call batching
    • Not implemented. Requirement unclear.
  • Request multiple dataframes at once
    • Not implemented. Requirement unclear.
  • Included population data for msa, county, state
    • Not implemented.
  • Convert geo abbreviations to/from names to/from codes
    • Not implemented.

Not Implemented, Out of Scope or Unneeded

  • Apply shifts (lags/leads) to a signal
    • Out of scope (data processing/wrangling).
  • Plot choropleths for a signal
    • Out of scope (data plotting).
  • Add shape data to signal dataframe.
    • Out of scope (data plotting).
  • Append new data given locally stored data
    • Out of scope (data processing/wrangling).
  • Select earliest and latest issue for each observation in a signal
  • Convert arbitrary dataframe to signal format
    • Unneeded (signal format is in evalcast, which is being deprecated).
  • Correlate two returned signal dataframes together
    • Out of scope (data processing/wrangling).
  • Pivot a list of dataframes “wide” or “long”
    • Out of scope (data processing/wrangling).
  • CoffeeScript and JS support
    • Out of scope (separate package).
  • Plot bubble maps for a signal
    • Out of scope (data plotting).
  • Animate a signal over time
    • Out of scope (data plotting).

@krivard
Contributor Author

krivard commented Mar 21, 2022

Retrieve as_of a range of time values (where supported)

Do we have separate handling for as_of and issue? Range might only be supported for querying by issue.

@krivard
Contributor Author

krivard commented Mar 21, 2022

Piecewise clarifications:

Add shape data to signal dataframe

This is more plotting.

Streaming mode support / httr:retry

This is probably meant more in the sense of "can we begin processing the first few rows returned before the complete dataset has finished downloading" - which may be difficult/impossible in R?
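A rough illustration of that kind of incremental processing using a plain R connection. In a real client, `curl::curl(url)` would take the place of `file(path)` below so that rows could be handled as they arrive over HTTP; this is a sketch of the pattern, not package functionality:

```r
# Write a small CSV to stand in for an HTTP response body.
path <- tempfile(fileext = ".csv")
writeLines(c("geo,value", paste0("g", 1:25, ",", 1:25)), path)

# For streaming over HTTP, curl::curl("https://...") would replace file(path).
con <- file(path, open = "r")
header <- readLines(con, n = 1)
total <- 0
while (length(batch <- readLines(con, n = 10)) > 0) {
  # process each batch of rows here, before the rest has "arrived"
  total <- total + length(batch)
}
close(con)
```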

Convert arbitrary dataframe to signal format

This is to support researchers who want to compare an epidata dataset with a custom one -- a function to, given some arbitrary data frame and a column mapping, produce a data frame in epidata format. Really only makes sense if we provide data plotting/processing/wrangling utilities as well.
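For concreteness, a hypothetical helper along those lines (not existing package functionality; the epidata-style column names are illustrative):

```r
# Rename/select columns of an arbitrary data frame according to a
# user-supplied mapping, yielding epidata-style column names.
to_epidata_format <- function(df, mapping) {
  # mapping: named character vector, names are target columns,
  # values are the source columns in `df`.
  out <- df[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out
}

custom <- data.frame(fips = "42003", date = as.Date("2021-06-01"), cases = 10)
converted <- to_epidata_format(
  custom,
  c(geo_value = "fips", time_value = "date", value = "cases")
)
```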

Request multiple dataframes at once

This is either multiple signals at once (which should be possible for covidcast) or multiple epidata data frames at once (which may be in wildly different formats); the latter is probably difficult but useful.

@dshemetov
Contributor

dshemetov commented Mar 22, 2022

separate handling for `as_of` and `issue`...

It looks like specifying both is possible and not much validation is done.

@dshemetov
Contributor

@brookslogan any chance you know about R packages that allow streaming downloading / reading of CSV/JSON files? Probably not high priority or requiring a deep search.

@dshemetov
Contributor

dshemetov commented Mar 22, 2022

this is either multiple signals at once ...

Current implementation can do multiple signals at once. A user can combine a few covidcast calls with bind_rows to get multiple data sources and epidata dataframes together in a single object.
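A minimal sketch of that pattern, with small stand-in data frames in place of actually fetched results (column layout is illustrative):

```r
library(dplyr)

# Stand-ins for two fetched covidcast data frames with the same columns:
cases <- data.frame(
  signal = "confirmed_incidence_num", geo_value = "pa",
  time_value = as.Date("2021-06-01"), value = 120
)
deaths <- data.frame(
  signal = "deaths_incidence_num", geo_value = "pa",
  time_value = as.Date("2021-06-01"), value = 3
)

# One long frame; the signal column distinguishes the original calls.
combined <- bind_rows(cases, deaths)
```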

@ryantibs
Member

Thanks for diving in Dmitry!

Now that the other packages are taking shape (e.g., epiprocess, epipredict, and soon a geo aggregation package) I think it's become even more clear that we should consider data wrangling & plotting out of scope. So if delphi.epidata "just gives you data" then this seems like a perfectly good scope and will allow us to finish more quickly.

Misc comments:

  • No range for as_of, only for issue: that makes sense to me
  • Agree with Katie that converting outside data to given signal format not important given that we don't provide wrangling or plotting tools.
  • Just to be clear, we aren't defining a single signal format here, right? There's no unified class or structure; each endpoint could conceivably give you back something different in format?

Lastly, what would really help me to see how the package is shaping up, and help me see what decisions need to be made (if any), would be to have you flesh out the vignettes & documentation. Feel free to re-think what the structure of the vignettes is---think from the perspective of how best to teach a user about the array of offerings. It would be nice to have some significant examples from the FluView endpoint; and/or the vignettes can recreate examples from the blog posts Katie pointed out at the beginning of this issue.

@dshemetov
Contributor

Right, no signal format definitions here - just downloading a JSON file and then using a reader to convert it to common types (csv, data.frame, tibble). 👍 on keeping this package minimal and just "get data".

Will take a look at the blog posts, vignettes, and see what I can do.

@brookslogan
Contributor

Regarding downloading mechanism:

  • I haven't tried any streaming libraries. The most promising from a very short search is sparklyr, which also has functions that seem relevant (or duplicative?) to epiprocess and epipredict. But I'm not sure we would want to just use pre-packaged streaming, because I don't think we have guarantees about the ordering of API response rows across all API endpoints, and because of versioning.
  • It's still worth taking a look at sparklyr for epiprocess and epipredict. (And maybe arrow as well? Although I recall arrow causing some installation problems/waits.)
  • I think delphi.epidata currently uses csv as the transfer mechanism for most/all output formats.

@dshemetov
Contributor

dshemetov commented Mar 24, 2022

From a quick glance at sparklyr streaming, it looks like it relies on the Spark Streaming API, which requires us to be running a Spark engine. The closest thing to parsing a JSON file from an HTTP API request seems to be the TCP socket listener in the latter link? Seems a bit overkill for this application (Spark seems to be aimed at large data analytics processing, whereas this is just a small local HTTP request download). I might be wrong though, just trying to wrap my mind around it.

EDIT: Structured Streaming looks even more relevant.

@brookslogan
Contributor

Looks like Spark doesn't support streaming from https, so sparklyr is out for streaming support unless we have a clever way to implement a supported protocol without too much implementation hassle. (Plus, the Spark setup is frustrating.) Might still be worth looking at if it's really great at the stuff we're trying to do in epiprocess.

I'm not convinced our data isn't potentially medium/large --- think issues & counties. But if it's not, then we don't need streaming. If it's only medium, then we can do streaming ourselves via chunking the download. We may be forced to use chunking anyway because we don't have uniform guarantees about the order of results from the API endpoints.

I would think the following are higher priority than trying to implement streaming:

  • Tools to cache and perform smart updates on the cache.
  • Speeding up download processing via "consider more performant CSV parsing" (#6) or similar.
  • Optimizing cache format. (Is cache a collection of RDS files? How are they divided up? Is "cache" actually a database? ...)

And I'm not sure how high a priority the above are... I remember being pleased with how fast archive data downloads went with delphi.epidata when I tested it.
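As a sketch of the first bullet (hypothetical helper; not package functionality), a minimal RDS-file cache could look like this, with `key` identifying the query and `fetch_fn` a zero-argument function that performs the actual download:

```r
# Hypothetical RDS-file cache sketch. A real version would also need
# invalidation/"smart update" logic, which is the hard part.
fetch_cached <- function(key, fetch_fn, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: skip download
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

first  <- fetch_cached("demo-query", function() 1:3)                 # stores result
second <- fetch_cached("demo-query", function() stop("not called"))  # served from cache
```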

@krivard
Contributor Author

krivard commented Mar 28, 2022

Re:size of data:

For reference, a JHU dataset as-of 2021-10-13 for all signals and geos between 2020-01-22 and 2021-09-26 (613 days) is ~1GB in submission CSV format, uncompressed. JHU updates each day's figures ~20x on average, so anyone needing the complete version history (instead of just the most recent version <= 2021-10-13) would need to pull down at least ~20GB of data. API response CSVs could be ~4x that if they include the signal, geo type, time type, time value, and issue columns, which submission CSVs don't have.

So it's not terabytes, but it's still pretty nontrivial. The good news is that JHU is the worst-case scenario -- every county in the USA is represented and there's lots of versioning; the next worst (CHNG, HSP, Quidel) have lots of versioning but don't hit every county. The bad news is that JHU is used by literally everyone, so it's in our best interest to optimize its performance.
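The back-of-envelope arithmetic behind those figures (rounded to the estimates above):

```r
snapshot_gb     <- 1    # one full JHU snapshot, submission CSV, uncompressed
revisions       <- 20   # each day's figures are updated ~20x on average
history_gb      <- snapshot_gb * revisions          # ~20 GB of complete version history
api_column_mult <- 4    # extra signal/geo/time/issue columns in API response CSVs
api_history_gb  <- history_gb * api_column_mult     # ~80 GB as API responses
```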

@dshemetov
Contributor

dshemetov commented Mar 28, 2022

Point taken on that use case. We do have query row limits at this point, so the user would need to batch those 20GB calls down (just pulling the latest full history issue would be 3000 counties × 600 days ≈ 1.8×10^6 rows vs a 10^7 row limit). But it does make sense that the performance of this query is important. Probably a separate issue though.
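A sketch of what batching those calls down might look like (illustrative helper, not package functionality):

```r
# Split a long date range into fixed-size chunks so that each query stays
# well under the API row limit; issue one query per chunk, then combine.
batch_dates <- function(start, end, chunk_days) {
  start <- as.Date(start); end <- as.Date(end)
  starts <- seq(start, end, by = chunk_days)
  ends <- pmin(starts + chunk_days - 1, end)
  data.frame(start = starts, end = ends)
}

batches <- batch_dates("2020-01-22", "2021-09-26", 90)
# e.g. lapply over the rows of `batches`, fetch each window, then bind_rows()
```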

@dsweber2
Contributor

Dmitry and I went back through and made new issues for the remaining PRD docs.
