Consolidate fetch_* interfaces #72


Closed
2 of 3 tasks
dshemetov opened this issue Mar 15, 2023 · 5 comments · Fixed by #99
Labels: enhancement (New feature or request), P0 (Top priority), question (Further information is requested)

Comments


dshemetov commented Mar 15, 2023

TODO

  1. Remove fetch_df
  2. Print a suggestion to use a fetch_* function when printing an EpidataCall object
  3. Consider further unification of fetch_classic, fetch_json, fetch_csv, fetch_tbl

Notes on the current interface

  • the fetching functions fetch_classic, fetch_json, and fetch_csv mirror the API's output formats, but they don't steer the user toward the likeliest use case: downloading data as a tibble, which is what fetch_tbl does
  • fetch_classic is not a great name: it sounds outdated, even though "classic" is the default return format for the API
  • it is not obvious that fetch_classic captures epidata status codes from the server (while the other fetch functions don't)
  • it is not obvious that fetch_classic returns a data.frame for most endpoints but a list for a few others (like the delphi endpoint)
  • it is not obvious that fetch_json returns a data.frame
  • it is not obvious that fetch_csv returns a plain CSV string
  • it is not obvious that fetch_tbl requests a CSV first, before converting it to a tibble

Proposed new interface

A default fetch function that returns a tibble for most endpoints and a list for the few endpoints that can't be safely converted to a tibble. fetch will call fetch_classic under the hood and decide whether to convert to a tibble based on the endpoint. fetch_json and fetch_csv would be removed.

  • Pro: simplifies fetching for the most likely user workflow (getting a tibble), which improves user onboarding
  • Pro: reduces maintenance costs of this package by reducing interface complexity
  • Con: higher variance in RAM usage (before converting to a tibble)
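
The endpoint-based dispatch could be sketched like this (hypothetical; the `endpoint` field on the call object and the `list_only_endpoints` vector are illustrative assumptions, not the final API):

     # Hypothetical sketch of the proposed default fetch().
     # Assumes an EpidataCall object `call` with an `endpoint` field, plus the
     # existing fetch_classic() and fetch_tbl() helpers.
     fetch <- function(call) {
       # Endpoints whose classic responses are nested and can't safely be
       # flattened into a tibble (illustrative; the real list is TBD).
       list_only_endpoints <- c("delphi")
       if (call$endpoint %in% list_only_endpoints) {
         fetch_classic(call) # return the nested list as-is
       } else {
         fetch_tbl(call) # flatten the classic response into a tibble
       }
     }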

Discussion of JSON vs CSV performance

After an extensive conversation and some tests with @brookslogan, we found that the CSV format is not substantially more performant than JSON for serving our data:

  • Bandwidth: uncompressed JSON is about ~3x the size of a CSV (for our data), but only ~1.25x compressed
  • Runtime: unzipping time is dominated by query/disk time (see benchmarks below)
  • RAM usage: our largest queries would be ~1GB in CSV uncompressed and therefore ~3GB in JSON, which is manageable by data science laptops

The pros of JSON over CSV are:

  • it is the universal return format of our API and supports hierarchical data, whereas CSV only handles tabular formats
  • it has no delimiter issues and is UTF-8 compliant by default (minor, since we get to choose the encoding server-side)

h/t to @brookslogan for these benchmarks

     http_headers <- httr::add_headers(
       "User-Agent" = paste0("epidatr/", version),
       "Accept-Encoding" = "gzip"
     )
     call <- covidcast(
       data_source = "jhu-csse",
       signals = "confirmed_7dav_incidence_prop",
       time_type = "day",
       geo_type = "state",
       time_values = epirange(20200601, 20220801),
       geo_values = "*"
     )
     microbenchmark::microbenchmark(
       fetch_json(call),
       fetch_csv(call),
       fetch_tbl(call),
       times = 10L
     )

Unit: seconds
             expr      min       lq     mean   median       uq      max neval
 fetch_json(call) 2.918282 3.011072 3.183227 3.082953 3.403508 3.629481    10
  fetch_csv(call) 2.811736 2.847303 3.197557 2.884883 2.957755 4.573654    10
  fetch_tbl(call) 2.889439 3.011481 3.376560 3.125979 3.846497 4.012702    10
dshemetov added the enhancement (New feature or request), question (Further information is requested), and P0 (Top priority) labels on Apr 18, 2023
@brookslogan (Contributor)

Pro: this actually gives us Epidata errors for everything. (We could get it while using csv delivery with some tweaks/details but I'm not sure the server does this now.)

Bug: delphi should be a list / nested structure, but fetch_classic doesn't actually respect this and turns it into a data frame. The culprit is probably

  r <- httr::content(res, "text", encoding = "UTF-8")
  m <- jsonlite::fromJSON(r)

vs. delphi_epidata.R's

content(res, 'parsed')
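
A minimal sketch of the suspected fix (assuming `res` is the httr response object): jsonlite::fromJSON() simplifies nested JSON into data frames by default, so either disabling that simplification or using httr's built-in parser preserves the nested structure the delphi endpoint returns:

     # Parse the response without flattening nested structures.
     r <- httr::content(res, "text", encoding = "UTF-8")
     m <- jsonlite::fromJSON(r, simplifyVector = FALSE) # keep nested lists as lists

     # or, as delphi_epidata.R does, let httr pick a parser:
     m <- httr::content(res, "parsed")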

Potential Con: on some (maybe all) operating systems and R versions, R cannot give RAM back to the OS after it's freed. So the RAM increase is not just before converting to a tibble but also after. Probably fine; if you're working with a really large amount of data, your downstream models and forecasts are probably really large too.
Con: removing csv functionality removes the ability to hand csv parsing off to other functions, e.g., data.table::fread, and maybe equivalents from arrow/spark/polars/..., so users can't opt in to more dependencies to potentially improve performance.
Question: is csv not UTF-8 compliant by default?


dshemetov commented Apr 20, 2023

Re: CSV and UTF-8, nope:

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term "CSV" might refer to any file that:

  • is plain text using a character encoding such as ASCII, various Unicode character encodings (e.g. UTF-8), EBCDIC, or Shift JIS,
  • consists of records (typically one record per line),
  • with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  • where every record has the same sequence of fields.

This point might be somewhat irrelevant, since we are in control of the CSV encoding server-side, so we can enforce a standard.


brookslogan commented Apr 20, 2023

Potential Con: users that want old Epidata row lists can't get them if they can't force classic format. [no, they can just use fetch_classic() if it's fixed]

Todo if implemented: fix fetch_tbl() and make fetch() (on non-classic format) properly stop for Epidata errors [get the data in classic format instead of csv, stop for errors, else format to tbl. Then fetch() is just dispatching to either fetch_tbl() or fetch_classic().]


I like the default fetch proposal a lot. A couple of design alternatives I recall from past discussions, thoughts?

A: current proposal:

fetch(call) # fetch via classic, format based on endpoint
fetch_tbl(call) # fetch via classic(?), format to tbl
fetch_classic(call) # fetch via classic, format to classic

B: one exported fetch:

fetch(call, format = c("default","tbl","classic", {any others})) # forward to nonexported functions below
# non-exported:
fetch_default(call) # same as `fetch()` in A
fetch_tbl(call) # fetch via classic(?), format to tbl
fetch_classic(call) # fetch via classic, format to classic

C: no exported fetch functions. Like B, but there's such a format arg in each endpoint function. Would likely need an extra format option or two, like "epidatacall" or "request_url"
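
Alternative B's single exported fetch() could be sketched with R's match.arg() idiom (hypothetical; the non-exported helpers are assumed to exist as named above):

     # Hypothetical sketch of alternative B: one exported fetch() that
     # validates `format` and forwards to non-exported helpers.
     fetch <- function(call, format = c("default", "tbl", "classic")) {
       format <- match.arg(format) # defaults to "default", errors on unknown values
       switch(format,
         default = fetch_default(call),
         tbl = fetch_tbl(call),
         classic = fetch_classic(call)
       )
     }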


Question: since users might be able to consume the csv option more performantly than our client code does, should we allow it?

Wish: improve naming of classic and fix any other bugs

Worry: at some point, "classic" also referred to a list of data frames in covidcast. Doesn't seem to be the case when I try now, but not sure if somewhere this name would trip us up.


brookslogan commented Apr 20, 2023

Question: should we make fetch_classic() and fetch() on classic-only endpoints also stop for status codes else return the epidata? That seems like a really common operation. The current classic error representation could be put into something like fetch_result() if we want. [this might already be implied in the proposed new interface, I'm just not sure.]

@dshemetov (Contributor, Author)

Re: alternative interface proposals

  • I don't like C due to code duplication
  • The difference between B and A is very small, but if I had to pick one it would be A, because it has fewer functions and I don't see a usability gain from B's extra functions

Re: allowing users to use csv themselves: as we discussed in Slack, let's wait before optimizing in that direction. We can focus on providing a single clean interface through JSON for now, and consider other formats for speed if the need arises.

Re: renaming classic, agreed, it's not great; if we give the user a single fetch function, then the user doesn't have to deal with it.

Re: classic being a list of data frames, maybe we can not export fetch_classic then (and in that case, don't export fetch_json either).
