Package evaluation & first pass documentation #13
@rafaelcatoia this is a good candidate for you to pick up next
@krivard @ryantibs I went through the feature list document and marked what is implemented, what is not, what is possibly out of scope, and what is not working correctly.

Feature Checklist

Implemented
Not Implemented
Not Implemented, Out of Scope or Unneeded
Do we have separate handling for ...?
Piecewise clarifications:

- This one is more about plotting.
- This is probably meant more in the sense of "can we begin processing the first few rows returned before the complete dataset has finished downloading" -- which may be difficult or impossible in R?
- This is to support researchers who want to compare an epidata dataset with a custom one -- a function that, given some arbitrary data frame and a column mapping, produces a data frame in epidata format (see the sketch after this list). It really only makes sense if we provide data plotting/processing/wrangling utilities as well.
- This is either multiple signals at once (which should be possible for covidcast) or multiple epidata data frames at once (which may be in wildly different formats); the latter is probably difficult but useful.
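For concreteness, a minimal sketch of what that data-frame-to-epidata converter might look like -- the function name, the mapping convention, and the target column names are all assumptions, not anything this package currently provides:

```r
# Hypothetical converter: given an arbitrary data frame and a column
# mapping, produce a data frame in epidata-style format.
library(dplyr)

as_epidata <- function(df, mapping) {
  # `mapping` is a named character vector: names are epidata-style
  # columns, values are the matching columns in `df`, e.g.
  # c(geo_value = "county_fips", time_value = "date", value = "cases")
  df %>%
    rename(all_of(mapping)) %>%
    select(all_of(names(mapping)))
}

# Example:
custom <- data.frame(county_fips = "42003",
                     date = as.Date("2021-10-01"),
                     cases = 12)
as_epidata(custom, c(geo_value = "county_fips",
                     time_value = "date",
                     value = "cases"))
```

The named-vector mapping keeps the call site explicit about which custom column feeds which epidata column.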
It looks like specifying both is possible, and not much validation is done.
@brookslogan any chance you know of R packages that allow streaming download / reading of CSV/JSON files? Probably not high priority, and not worth a deep search.
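For what it's worth, two CRAN options can at least do chunked/streaming reads, though neither maps cleanly onto a single JSON response from the Epidata API (jsonlite's streamer wants newline-delimited JSON). A sketch, with placeholder URLs:

```r
# readr processes a CSV in chunks without holding the whole file in
# memory (URL is a placeholder):
library(readr)
read_csv_chunked(
  "https://example.com/big.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    # called once per chunk of parsed rows; return values are accumulated
    nrow(chunk)
  }),
  chunk_size = 10000
)

# jsonlite streams newline-delimited JSON (NDJSON) from a connection
# (URL is a placeholder):
library(jsonlite)
stream_in(
  url("https://example.com/big.ndjson"),
  handler = function(page) {
    # called once per page of parsed records
    print(nrow(page))
  },
  pagesize = 5000
)
```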
The current implementation can fetch multiple signals at once. A user can combine a few ...
Thanks for diving in, Dmitry! Now that the other packages are taking shape (e.g., ...), some misc comments:
Lastly, what would really help me see how the package is shaping up, and what decisions need to be made (if any), would be for you to flesh out the vignettes and documentation. Feel free to rethink the structure of the vignettes; think from the perspective of how best to teach a user about the array of offerings. It would be nice to have some significant examples from the FluView endpoint, and/or the vignettes could recreate examples from the blog posts Katie pointed out at the beginning of this issue.
Right, no signal format definitions here -- just downloading a JSON file and then using a reader to convert it to common types (CSV, data.frame, tibble). 👍 on keeping this package minimal and just "get data". Will take a look at the blog posts and vignettes and see what I can do.
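To make that flow concrete, a minimal sketch -- the endpoint URL, query parameters, and response envelope follow the usual Epidata API shape, but they are assumptions here, not taken from this package's code:

```r
library(httr)
library(jsonlite)
library(tibble)

# Fetch one endpoint's JSON (endpoint and parameters are illustrative):
resp <- GET(
  "https://api.delphi.cmu.edu/epidata/fluview/",
  query = list(regions = "nat", epiweeks = "201801-201852")
)
stop_for_status(resp)

# The API typically wraps results as {result, message, epidata}:
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Convert the records to the common types mentioned above:
df  <- as.data.frame(parsed$epidata)
tbl <- as_tibble(df)
write.csv(df, "fluview_nat_2018.csv", row.names = FALSE)
```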
Regarding the downloading mechanism: ...
From a quick glance at sparklyr streaming, it looks like it relies on the Spark Streaming API, which requires us to be running a Spark engine. The closest thing to parsing a JSON file from an HTTP API request seems to be the TCP socket listener in the latter link? That seems like overkill for this application (Spark is aimed at large-scale data analytics, whereas this is just a small local HTTP download). I might be wrong, though; just trying to wrap my mind around it. EDIT: Structured Streaming looks even more relevant.
Looks like Spark doesn't support streaming over HTTPS, so sparklyr is out for streaming support unless we have a clever way to implement a supported protocol without too much implementation hassle. (Plus, the Spark setup is frustrating.) It might still be worth looking at if it's really great at the stuff we're trying to do in epiprocess. I'm not convinced our data isn't potentially medium/large -- think issues & counties. But if it's not, then we don't need streaming. If it's only medium, then we can do streaming ourselves by chunking the download. We may be forced to use chunking anyway, because we don't have uniform guarantees about the order of results from the API endpoints. I would think the following are higher priority than trying to implement streaming:

...

And I'm not sure how high a priority the above are... I remember being pleased with how fast archive data downloads went with delphi.epidata when I tested it.
Re: size of data: for reference, a JHU dataset as of 2021-10-13, covering all signals and geos between 2020-01-22 and 2021-09-26 (613 days), is ~1GB in submission CSV format, uncompressed. JHU updates each day's figures ~20x on average, so anyone needing the complete version history (instead of just the most recent version as of 2021-10-13) would need to pull down at least ~20GB of data. API response CSVs could be ~4x that, since they include the signal, geo type, time type, time value, and issue columns, which submission CSVs don't have. So it's not terabytes, but it's still pretty nontrivial. The good news is that JHU is the worst-case scenario -- every county in the USA is represented and there's lots of versioning; the next worst (CHNG, HSP, Quidel) have lots of versioning but don't hit every county. The bad news is that JHU is used by literally everyone, so it's in our best interest to optimize its performance.
Point taken on that use case. We do have query row limits at this point, so the user would need to batch those 20GB calls down (just pulling the latest full-history issue would be 3000 counties x 600 days = 1.8e6 rows, vs. the 1e7 row limit). But it does make sense that the performance of this query is important. Probably a separate issue, though.
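A sketch of how that batching could work, splitting the time range into monthly windows that stay well under the row limit; fetch_window() is a hypothetical placeholder for whatever single-window call the package exposes:

```r
# Split a large time range into monthly windows and fetch each one.
# fetch_window(start, end) is a hypothetical single-window fetcher,
# not part of this package.
fetch_all <- function(start_day, end_day, fetch_window, by = "month") {
  starts <- seq(as.Date(start_day), as.Date(end_day), by = by)
  ends   <- c(starts[-1] - 1, as.Date(end_day))
  # ~3000 counties x ~31 days ~= 9.3e4 rows per monthly batch,
  # comfortably under a 1e7 row limit.
  batches <- Map(fetch_window, starts, ends)
  do.call(rbind, batches)
}
```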
Dmitry and I went back through and made new issues for the remaining PRD docs.
The goal of this task is to better understand whether this package does what we need it to do, and to permit other Delphi members to understand the same without having to read the source.
Delphi blog posts that include plots also include the code used to produce those plots. Reproduce as many of the plots as you can from the following blog posts, using this package instead of covidcastR/covidcast.

Use your newly acquired knowledge to repair the Roxygen docs so that they run successfully, and flesh out the package documentation with the basics.
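For reference, a Roxygen block of roughly the shape being asked for might look like the following -- the fluview() signature and the epirange() helper are illustrative assumptions, not the package's actual interface:

```r
#' Fetch FluView data from the Delphi Epidata API
#'
#' @param regions Character vector of region codes, e.g. "nat".
#' @param epiweeks An epiweek range, e.g. epirange(201801, 201852).
#' @return A data frame with one row per region/epiweek combination.
#' @examples
#' \dontrun{
#' fluview(regions = "nat", epiweeks = epirange(201801, 201852))
#' }
#' @export
fluview <- function(regions, epiweeks) {
  # request construction and JSON parsing would go here
}
```

Running devtools::document() and then devtools::check() is one way to confirm the examples and docs build cleanly.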