
Package evaluation & first pass documentation #13


Closed
krivard opened this issue Sep 22, 2021 · 15 comments
Labels
P3 very low priority

Comments

@krivard
Contributor

krivard commented Sep 22, 2021

The goal of this task is to better understand whether this package does what we need it to, and permit other Delphi members to understand the same without having to read the source.

Delphi blog posts that include plots also include the code used to produce those plots. Reproduce as many of the plots as you can from the following blog posts, using this package instead of covidcastR/covidcast.

Use your newly acquired knowledge to repair the Roxygen docs so that they run successfully, and flesh out the package documentation with the basics.

@krivard
Contributor Author

krivard commented Sep 22, 2021

@rafaelcatoia this is a good candidate for you to pick up next

@dshemetov
Contributor

dshemetov commented Mar 18, 2022

@krivard @ryantibs I went through the feature list document and marked what is implemented, what is not, what is possibly out of scope, and what is not working correctly.

Feature Checklist

Implemented

  • Retrieve metadata from non-covidcast signals
  • Retrieve metadata from covidcast signals
  • Retrieve as_of a certain date (where supported)
  • Retrieve most recent (latest issue) data (where supported)
  • Retrieve given a certain lag (where supported)
  • Retrieve as_of a range of time values (where supported)
  • Authenticate via API key, passed in each call
  • Allow changing endpoint to staging server
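As a sketch of how the implemented retrieval modes map onto the underlying HTTP API (parameter names follow the public covidcast endpoint documentation; the package's own function names may differ), the four modes differ only in one query parameter. This only builds request URLs and does not hit the network:

```r
library(httr)

base <- "https://api.delphi.cmu.edu/epidata/covidcast/"
params <- list(
  data_source = "jhu-csse", signal = "confirmed_incidence_num",
  time_type = "day", geo_type = "state",
  time_values = "20210101-20210601", geo_value = "pa"
)

url_latest <- modify_url(base, query = params)                               # most recent issue
url_as_of  <- modify_url(base, query = c(params, list(as_of = "20210601")))  # snapshot as of a date
url_issues <- modify_url(base, query = c(params, list(issues = "20210501-20210601")))
url_lag    <- modify_url(base, query = c(params, list(lag = 3)))             # fixed reporting lag
# e.g. GET(url_as_of) would perform the actual fetch
```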

Not Implemented

  • Streaming mode support
    • Not implemented. Unclear what this means and unclear if feasible in R.
  • Asynchronous calls
    • Not implemented.
  • Optimized call batching
    • Not implemented. Requirement unclear.
  • Request multiple dataframes at once
    • Not implemented. Requirement unclear.
  • Included population data for msa, county, state
    • Not implemented.
  • Convert geo abbreviations to/from names to/from codes
    • Not implemented.

Not Implemented, Out of Scope or Unneeded

  • Apply shifts (lags/leads) to a signal
    • Out of scope (data processing/wrangling).
  • Plot choropleths for a signal
    • Out of scope (data plotting).
  • Add shape data to signal dataframe.
    • Out of scope (data plotting).
  • Append new data given locally stored data
    • Out of scope (data processing/wrangling).
  • Select earliest and latest issue for each observation in a signal
  • Convert arbitrary dataframe to signal format
    • Unneeded (signal format is in evalcast, which is being deprecated).
  • Correlate two returned signal dataframes together
    • Out of scope (data processing/wrangling).
  • Pivot a list of dataframes “wide” or “long”
    • Out of scope (data processing/wrangling).
  • CoffeeScript and JS support
    • Out of scope (separate package).
  • Plot bubble maps for a signal
    • Out of scope (data plotting).
  • Animate a signal over time
    • Out of scope (data plotting).

@krivard
Contributor Author

krivard commented Mar 21, 2022

Retrieve as_of a range of time values (where supported)

Do we have separate handling for as_of and issue? Range might only be supported for querying by issue.

@krivard
Contributor Author

krivard commented Mar 21, 2022

Piecewise clarifications:

Add shape data to signal dataframe

This is more plotting.

Streaming mode support / httr:retry

This is probably meant more in the sense of "can we begin processing the first few rows returned before the complete dataset has finished downloading" - which may be difficult/impossible in R?
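A rough illustration of that kind of incremental processing using a plain R connection. In a real client, `curl::curl(url)` would take the place of `file(path)` below so that rows could be handled as they arrive over HTTP; this is a sketch of the pattern, not package functionality:

```r
# Write a small CSV to stand in for an HTTP response body.
path <- tempfile(fileext = ".csv")
writeLines(c("geo,value", paste0("g", 1:25, ",", 1:25)), path)

# For streaming over HTTP, curl::curl("https://...") would replace file(path).
con <- file(path, open = "r")
header <- readLines(con, n = 1)
total <- 0
while (length(batch <- readLines(con, n = 10)) > 0) {
  # process each batch of rows here, before the rest has "arrived"
  total <- total + length(batch)
}
close(con)
```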

Convert arbitrary dataframe to signal format

This is to support researchers who want to compare an epidata dataset with a custom one -- a function to, given some arbitrary data frame and a column mapping, produce a data frame in epidata format. Really only makes sense if we provide data plotting/processing/wrangling utilities as well.
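For concreteness, a hypothetical helper along those lines (not existing package functionality; the epidata-style column names are illustrative):

```r
# Rename/select columns of an arbitrary data frame according to a
# user-supplied mapping, yielding epidata-style column names.
to_epidata_format <- function(df, mapping) {
  # mapping: named character vector, names are target columns,
  # values are the source columns in `df`.
  out <- df[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out
}

custom <- data.frame(fips = "42003", date = as.Date("2021-06-01"), cases = 10)
converted <- to_epidata_format(
  custom,
  c(geo_value = "fips", time_value = "date", value = "cases")
)
```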

Request multiple dataframes at once

This is either multiple signals at once (which should be possible for covidcast) or multiple epidata data frames at once (which may be in wildly different formats); the latter is probably difficult but useful.

@dshemetov
Contributor

dshemetov commented Mar 22, 2022

separate handling for `as_of` and `issue`...

It looks like specifying both is possible and not much validation is done.

@dshemetov
Contributor

@brookslogan any chance you know about R packages that allow streaming downloading / reading of CSV/JSON files? Probably not high priority or requiring a deep search.

@dshemetov
Contributor

dshemetov commented Mar 22, 2022

this is either multiple signals at once ...

Current implementation can do multiple signals at once. A user can combine a few covidcast calls with bind_rows to get multiple data sources and epidata dataframes together in a single object.
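A minimal sketch of that pattern, with small stand-in data frames in place of actually fetched results (column layout is illustrative):

```r
library(dplyr)

# Stand-ins for two fetched covidcast data frames with the same columns:
cases <- data.frame(
  signal = "confirmed_incidence_num", geo_value = "pa",
  time_value = as.Date("2021-06-01"), value = 120
)
deaths <- data.frame(
  signal = "deaths_incidence_num", geo_value = "pa",
  time_value = as.Date("2021-06-01"), value = 3
)

# One long frame; the signal column distinguishes the original calls.
combined <- bind_rows(cases, deaths)
```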

@ryantibs
Member

Thanks for diving in Dmitry!

Now that the other packages are taking shape (e.g., epiprocess, epipredict, and soon a geo aggregation package) I think it's become even more clear that we should consider data wrangling & plotting out of scope. So if delphi.epidata "just gives you data" then this seems like a perfectly good scope and will allow us to finish more quickly.

Misc comments:

  • No range for as_of, only for issue: that makes sense to me
  • Agree with Katie that converting outside data to given signal format not important given that we don't provide wrangling or plotting tools.
  • Just to be clear, we aren't defining a single signal format here, right? There's no unified class or structure; each endpoint could conceivably give you back something different in format?

Lastly, what would really help me to see how the package is shaping up, and help me see what decisions need to be made (if any), would be to have you flesh out the vignettes & documentation. Feel free to re-think what the structure of the vignettes is---think from the perspective of how best to teach a user about the array of offerings. It would be nice to have some significant examples from the FluView endpoint; and/or the vignettes can recreate examples from the blog posts Katie pointed out at the beginning of this issue.

@dshemetov
Contributor

Right, no signal format definitions here - just downloading a JSON file and then using a reader to convert it to common types (csv, data.frame, tibble). 👍 on keeping this package minimal and just "get data".

Will take a look at the blog posts, vignettes, and see what I can do.

@brookslogan
Contributor

Regarding downloading mechanism:

  • I haven't tried any streaming libraries. The most promising from a very short search is sparklyr, which also has functions that seem relevant (or duplicative?) to epiprocess and epipredict. But I'm not sure we would want to just use pre-packaged streaming, because I don't think we have guarantees about the ordering of API response rows across all API endpoints, and because of versioning.
  • It's still worth taking a look at sparklyr for epiprocess and epipredict. (And maybe arrow as well? Although I recall arrow causing some installation problems/waits.)
  • I think delphi.epidata currently uses csv as the transfer mechanism for most/all output formats.

@dshemetov
Contributor

dshemetov commented Mar 24, 2022

From a quick glance at sparklyr streaming, it looks like it relies on the Spark Streaming API, which requires us to be running a Spark engine. The closest thing to parsing a JSON file from an HTTP API request seems to be the TCP socket listener in the latter link? Seems a bit overkill for this application (Spark seems to be aimed at large data analytics processing, whereas this is just a small local HTTP request download). I might be wrong though, just trying to wrap my mind around it.

EDIT: Structured Streaming looks even more relevant.

@brookslogan
Contributor

Looks like Spark doesn't support streaming from https, so sparklyr is out for streaming support unless we have a clever way to implement a supported protocol without too much implementation hassle. (Plus, the Spark setup is frustrating.) Might still be worth looking at if it's really great at the stuff we're trying to do in epiprocess.

I'm not convinced our data isn't potentially medium/large --- think issues & counties. But if it's not, then we don't need streaming. If it's only medium, then we can do streaming ourselves via chunking the download. We may be forced to use chunking anyway because we don't have uniform guarantees about the order of results from the API endpoints.

I would think the following are higher priority than trying to implement streaming:

  • Tools to cache and perform smart updates on the cache.
  • Speeding up download processing via "consider more performant CSV parsing" (#6) or similar.
  • Optimizing cache format. (Is cache a collection of RDS files? How are they divided up? Is "cache" actually a database? ...)

And I'm not sure how high a priority the above are... I remember being pleased with how fast archive data downloads went with delphi.epidata when I tested it.
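As a sketch of the first bullet (hypothetical helper; not package functionality), a minimal RDS-file cache could look like this, with `key` identifying the query and `fetch_fn` a zero-argument function that performs the actual download:

```r
# Hypothetical RDS-file cache sketch. A real version would also need
# invalidation/"smart update" logic, which is the hard part.
fetch_cached <- function(key, fetch_fn, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: skip download
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

first  <- fetch_cached("demo-query", function() 1:3)                 # stores result
second <- fetch_cached("demo-query", function() stop("not called"))  # served from cache
```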

@krivard
Contributor Author

krivard commented Mar 28, 2022

Re:size of data:

For reference, a JHU dataset as-of 2021-10-13 for all signals and geos between 2020-01-22 and 2021-09-26 (613 days) is ~1GB in submission CSV format, uncompressed. JHU updates each day's figures ~20x on average, so anyone needing the complete version history (instead of just the most recent version <= 2021-10-13) would need to pull down at least ~20GB of data. API response CSVs could be ~4x that if they include the signal, geo type, time type, time value, and issue columns, which submission CSVs don't have.

So it's not terabytes, but it's still pretty nontrivial. The good news is that JHU is the worst-case scenario -- every county in the USA is represented and there's lots of versioning; the next worst (CHNG, HSP, Quidel) have lots of versioning but don't hit every county. The bad news is that JHU is used by literally everyone, so it's in our best interest to optimize its performance.
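The back-of-envelope arithmetic behind those figures (rounded to the estimates above):

```r
snapshot_gb     <- 1    # one full JHU snapshot, submission CSV, uncompressed
revisions       <- 20   # each day's figures are updated ~20x on average
history_gb      <- snapshot_gb * revisions          # ~20 GB of complete version history
api_column_mult <- 4    # extra signal/geo/time/issue columns in API response CSVs
api_history_gb  <- history_gb * api_column_mult     # ~80 GB as API responses
```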

@dshemetov
Contributor

dshemetov commented Mar 28, 2022

Point taken on that use case. We do have query row limits at this point, so the user would need to batch those 20GB calls down (just pulling the latest full history issue would be 3000 counties × 600 days ≈ 1.8×10^6 rows vs a 10^7 row limit). But it does make sense that the performance of this query is important. Probably a separate issue though.
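A sketch of what batching those calls down might look like (illustrative helper, not package functionality):

```r
# Split a long date range into fixed-size chunks so that each query stays
# well under the API row limit; issue one query per chunk, then combine.
batch_dates <- function(start, end, chunk_days) {
  start <- as.Date(start); end <- as.Date(end)
  starts <- seq(start, end, by = chunk_days)
  ends <- pmin(starts + chunk_days - 1, end)
  data.frame(start = starts, end = ends)
}

batches <- batch_dates("2020-01-22", "2021-09-26", 90)
# e.g. lapply over the rows of `batches`, fetch each window, then bind_rows()
```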

@dsweber2
Contributor

Dmitry and I went back through and made new issues for the remaining PRD docs.
