Consolidate fetch_* interfaces #72


Closed
2 of 3 tasks
dshemetov opened this issue Mar 15, 2023 · 5 comments · Fixed by #99
Labels: enhancement (New feature or request), P0 (Top priority), question (Further information is requested)

Comments


dshemetov commented Mar 15, 2023

TODO

  1. Remove fetch_df
  2. Print a suggestion to use a fetch_* function when printing an EpidataCall object
  3. Consider further unification of fetch_classic, fetch_json, fetch_csv, fetch_tbl

Notes on the current interface

  • the fetching functions fetch_classic, fetch_json, and fetch_csv mirror the API's output formats, but they don't steer the user toward the likeliest use case: downloading data as a tibble, which is what fetch_tbl does
  • fetch_classic is not a great name: it sounds outdated, even though "classic" is the default return format for the API
  • it is not obvious that fetch_classic captures epidata status codes from the server (while the other fetch functions don't)
  • it is not obvious that fetch_classic returns a data.frame for most endpoints but a list for a few others (like the delphi endpoint)
  • it is not obvious that fetch_json returns a data.frame
  • it is not obvious that fetch_csv returns a plain CSV string
  • it is not obvious that fetch_tbl requests a CSV first, before converting it to a tibble

Proposed new interface

A default fetch function that returns a tibble for most endpoints and a list for the few endpoints that can't be safely converted to a tibble. fetch will call fetch_classic under the hood and decide whether to convert to a tibble based on the endpoint. fetch_json and fetch_csv would be removed.

  • Pro: simplifies fetching for the most likely user workflow (getting a tibble), which improves user onboarding
  • Pro: reduces maintenance costs of this package by reducing interface complexity
  • Con: higher variance in RAM usage (before converting to a tibble)
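
The endpoint-based dispatch could be sketched like this (hypothetical; the `endpoint` field on the call object and the `list_only_endpoints` vector are illustrative assumptions, not the final API):

     # Hypothetical sketch of the proposed default fetch().
     # Assumes an EpidataCall object `call` with an `endpoint` field, plus the
     # existing fetch_classic() and fetch_tbl() helpers.
     fetch <- function(call) {
       # Endpoints whose classic responses are nested and can't safely be
       # flattened into a tibble (illustrative; the real list is TBD).
       list_only_endpoints <- c("delphi")
       if (call$endpoint %in% list_only_endpoints) {
         fetch_classic(call) # return the nested list as-is
       } else {
         fetch_tbl(call) # flatten the classic response into a tibble
       }
     }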

Discussion of JSON vs CSV performance

After an extensive conversation and some tests with @brookslogan, we found that the CSV format is not substantially more performant than JSON for serving our data:

  • Bandwidth: uncompressed JSON is about ~3x the size of a CSV (for our data), but only ~1.25x compressed
  • Runtime: unzipping time is dominated by query/disk time (see benchmarks below)
  • RAM usage: our largest queries would be ~1GB in CSV uncompressed and therefore ~3GB in JSON, which is manageable by data science laptops

The pros of JSON over CSV are:

  • it is the universal return format of our API and supports hierarchical data, whereas CSV only handles tabular formats
  • it has no delimiter issues and is UTF-8 compliant by default (minor, since we get to choose the encoding server-side)

h/t to @brookslogan for these benchmarks

     http_headers <- httr::add_headers(
       "User-Agent" = paste0("epidatr/", version),
       "Accept-Encoding" = "gzip"
     )
     call <- covidcast(
       data_source = "jhu-csse",
       signals = "confirmed_7dav_incidence_prop",
       time_type = "day",
       geo_type = "state",
       time_values = epirange(20200601, 20220801),
       geo_values = "*"
     )
     microbenchmark::microbenchmark(
       fetch_json(call),
       fetch_csv(call),
       fetch_tbl(call),
       times = 10L
     )

Unit: seconds
             expr      min       lq     mean   median       uq      max neval
 fetch_json(call) 2.918282 3.011072 3.183227 3.082953 3.403508 3.629481    10
  fetch_csv(call) 2.811736 2.847303 3.197557 2.884883 2.957755 4.573654    10
  fetch_tbl(call) 2.889439 3.011481 3.376560 3.125979 3.846497 4.012702    10
dshemetov added the enhancement (New feature or request), question (Further information is requested), and P0 (Top priority) labels on Apr 18, 2023
@brookslogan (Contributor)

Pro: this actually gives us Epidata errors for everything. (We could get it while using csv delivery with some tweaks/details but I'm not sure the server does this now.)

Bug: delphi should be a list / nested structure, but fetch_classic doesn't actually respect this and turns it into a data frame. The culprit is probably

  r <- httr::content(res, "text", encoding = "UTF-8")
  m <- jsonlite::fromJSON(r)

vs. delphi_epidata.R's

content(res, 'parsed')
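
A minimal sketch of the suspected fix (assuming `res` is the httr response object): jsonlite::fromJSON() simplifies nested JSON into data frames by default, so either disabling that simplification or using httr's built-in parser preserves the nested structure the delphi endpoint returns:

     # Parse the response without flattening nested structures.
     r <- httr::content(res, "text", encoding = "UTF-8")
     m <- jsonlite::fromJSON(r, simplifyVector = FALSE) # keep nested lists as lists

     # or, as delphi_epidata.R does, let httr pick a parser:
     m <- httr::content(res, "parsed")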

Potential Con: on some (maybe all) operating systems and R versions, R cannot give RAM back to the OS after it's freed. So the RAM increase is not just before converting to a tibble but also after. Probably fine; if you're working with a really large amount of data, your downstream models and forecasts are probably really large too.
Con: removing csv functionality removes the ability to hand csv parsing off to other functions, e.g., data.table::fread, and maybe equivalents from arrow/spark/polars/..., so users can't opt in to more dependencies to potentially improve performance.
Question: is csv not UTF-8 compliant by default?


dshemetov commented Apr 20, 2023

Re: CSV and UTF-8, nope:

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term "CSV" might refer to any file that:

  • is plain text using a character encoding such as ASCII, various Unicode character encodings (e.g. UTF-8), EBCDIC, or Shift JIS,
  • consists of records (typically one record per line),
  • with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  • where every record has the same sequence of fields.

This point might be somewhat irrelevant, since we are in control of the CSV encoding server-side, so we can enforce a standard.


brookslogan commented Apr 20, 2023

Potential Con: users that want old Epidata row lists can't get them if they can't force classic format. [no, they can just use fetch_classic() if it's fixed]

Todo if implemented: fix fetch_tbl() and make fetch() (on non-classic format) properly stop for Epidata errors [get the data in classic format instead of csv, stop for errors, else format to tbl. Then fetch() is just dispatching to either fetch_tbl() or fetch_classic().]


I like the default fetch proposal a lot. A couple of design alternatives I recall from past discussions, thoughts?

A: current proposal:

fetch(call) # fetch via classic, format based on endpoint
fetch_tbl(call) # fetch via classic(?), format to tbl
fetch_classic(call) # fetch via classic, format to classic

B: one exported fetch:

fetch(call, format = c("default","tbl","classic", {any others})) # forward to nonexported functions below
# non-exported:
fetch_default(call) # same as `fetch()` in A
fetch_tbl(call) # fetch via classic(?), format to tbl
fetch_classic(call) # fetch via classic, format to classic

C: no exported fetch functions. Like B, but there's such a format arg in each endpoint function. Would likely need an extra format option or two, like "epidatacall" or "request_url"
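
Alternative B's single exported fetch() could be sketched with R's match.arg() idiom (hypothetical; the non-exported helpers are assumed to exist as named above):

     # Hypothetical sketch of alternative B: one exported fetch() that
     # validates `format` and forwards to non-exported helpers.
     fetch <- function(call, format = c("default", "tbl", "classic")) {
       format <- match.arg(format) # defaults to "default", errors on unknown values
       switch(format,
         default = fetch_default(call),
         tbl = fetch_tbl(call),
         classic = fetch_classic(call)
       )
     }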


Question: since users might be able to consume the csv option more performantly than our client code does, should we allow it?

Wish: improve naming of classic and fix any other bugs

Worry: at some point, "classic" also referred to a list of data frames in covidcast. Doesn't seem to be the case when I try now, but not sure if somewhere this name would trip us up.


brookslogan commented Apr 20, 2023

Question: should we make fetch_classic() and fetch() on classic-only endpoints also stop for status codes else return the epidata? That seems like a really common operation. The current classic error representation could be put into something like fetch_result() if we want. [this might already be implied in the proposed new interface, I'm just not sure.]

@dshemetov (Contributor, Author)

Re: alternative interface proposals

  • I don't like C due to code duplication
  • The difference between B and A is very small, but if I had to pick one it would be A, because it has fewer functions and I don't see a usability gain from B's extra functions

Re: allowing users to use csv themselves: as we discussed in Slack, let's wait before optimizing in that direction. We can focus on providing a single clean interface through JSON for now, and consider other formats for speed if the need arises.

Re: renaming classic, agreed, it's not great; if we give the user a single fetch function, then the user doesn't have to deal with it.

Re: classic being a list of data frames, maybe we can not export fetch_classic then (and in that case, don't export fetch_json either).
