Skip to content

covidcast endpoint - no mutual exclusivity check for issue, lag, and as_of fields #1159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dshemetov opened this issue May 3, 2023 · 7 comments

Comments

@dshemetov
Copy link
Contributor

dshemetov commented May 3, 2023

Some of these should probably give a validation error.

@melange396
Copy link
Collaborator

While its difficult to imagine a truly useful reason for specifying more than one of those arguments (that couldnt be achieved in a simpler way), theyre not technically incompatible. In many cases, a combination of them could make up a perfectly valid query (for some definition of "valid"). Someone might request an issue of january 1 and a lag of 1 with a time_value wildcard because they need to know a value for december 31 but were too lazy to do the calendar arithmetic... And that result would just as accurate as_of january 2 as today.

This opens up a can of worms: Should we alert if the issue or as_of date is earlier than the time_value? Should we alert if a time_value, issue, or as_of date is in the future? Should we alert if time_value plus lag is in the future??? There are a number of semantic constraints we could enforce, but i expect that our users are generally smart enough to know the difference. Im not sure if we should coddle or handhold our users with hopes of keeping them from making any fathomable kind of mistake -- and even if we try, someone will inevitably stumble upon something we hadnt thought of.

One exception to this is that as_of queries currently can be costly. We already have users who specify as_of="today". this, again, is technically not wrong... but such is life. Writing this up inspired me to create #1161 , which i have been meaning to do for a long while.

@dshemetov
Copy link
Contributor Author

dshemetov commented May 4, 2023

Hm, you make a great point, especially the part about semantic constraints turning into a complexity slippery slope.

From the perspective of minimizing API server complexity, it seems ideal to have the API be as close to an SQL query wrapper as possible (while satisfying other constraints like security/API keys and some minimal argument validation).

From the perspective of maximizing API usability, we should probably ask ourselves: is it easy to use our API wrong?

In one sense yes, since queries aren't guarded against mistakes. But practicality, I don't think so, since mixing the arguments I listed above is very much an opt-in: a user would have to write out those multiple arguments themselves, but by default these arguments don't collide.

So I think I'm convinced that we can leave these alone. This is making me rethink some guards present in epidatr, such as issues and lag mutually exclusive checks and the recently added Date validation. We can probably remove the mutual exclusivity checks, while Date validation is probably worth keeping.

We can probably close this issue, but keep it in mind as a record of what our API does when mixing these arguments. Before writing this, I thought that some of these arguments would take precedence over others.

@brookslogan
Copy link
Contributor

brookslogan commented May 5, 2023

We [probably shouldn't] leave these completely alone, because they no longer do the right thing in covidcast. I expect an issues plus as_of query to get me issues through the as_of [as @dshemetov described above], but it appears to just get me a normal as_of query.

library(epidatr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
cce = covidcast_epidata()
as_of_issue_data1 = cce$signals$`doctor-visits.doctor-visits:smoothed_cli`$call(
  "state", "az", 20230101, issues="*", as_of=20230105
) %>% fetch_tbl()
as_of_issue_data2 = cce$signals$`doctor-visits.doctor-visits:smoothed_cli`$call(
  "state", "az", 20230101, issues="*", as_of=20230106
) %>% fetch_tbl()
as_of_issue_data1 %>% distinct(issue)
#> # A tibble: 1 × 1
#>   issue     
#>   <date>    
#> 1 2023-01-05
as_of_issue_data2 %>% distinct(issue)
#> # A tibble: 1 × 1
#>   issue     
#>   <date>    
#> 1 2023-01-06

Created on 2023-05-04 with reprex v2.0.2

[Actually, I would like to use this type of query. It is a convenient way to do reproducible version-aware analyses. (Plus if we ever separate out different types of "versions" (e.g., what the data source said / would/should have said vs. what the API had available), I'd imagine we'd consider pairing data_source_issues="*" with api_as_of = analysis_datetime.) But it's not that critical right now of course, and #1085 would close the gap on convenience.]

Aside on the validation: it would be nice to get some warnings when requesting as_of beyond today / beyond our latest available issue / maybe some similar things. [The beyond-latest-issue check w]ould help detect when we've run something too early in the morning, or on a stale data source.]

@brookslogan brookslogan reopened this May 5, 2023
@krivard
Copy link
Contributor

krivard commented May 5, 2023

We do specifically say in the documentation to use only one of the three versioning parameters:

image

so I'm in favor of a validation error.

@brookslogan I'm not sure what is expected from issue-star+asof; asof is by definition ragged-edge ("the greatest issue less than X for each row") whereas issue queries are more columnar ("all rows posted on this date"). Also want to note, support for issue inequalities ("all issues less than X") are currently in review and should be available soon pending a minor ~enums refactor

@brookslogan
Copy link
Contributor

brookslogan commented May 5, 2023

To me, as_of sounds like a meta-level thing that asks: "what would this query have said as of this date?" (glossing over some details about issue labeling, potential issue data rewriting).

So for issues="*", as_of="2023-01-05", that'd be whatever the issues="*" query would have said "on" 2023-01-05 (a little more precisely, whenever the 2023-01-05 issue was released). That [follows the rule that] Dmitry wrote above:

returns all rows where issue in issue_range and issue <= as_of

I'm fine with a validation error as well though.

@dshemetov
Copy link
Contributor Author

dshemetov commented May 19, 2023

After thinking about this a bit, I feel fine with enforcing that only one of issue, as_of, lag can be specified at a time. The main reason is that it's just too confusing to reason about what their combinations will do. I don't think it's an overreach into semantic validation to be clear about which arguments are mutually exclusive - it's just a marker of "there be dragons ahead" and a posting of some guardrails, to prevent a user from falling into a pit of said dragons.

Case in point, the as_of + issue combo in @brookslogan's example. It's not clear whether it should:

  1. "return all rows where issue in issue_range and issue <= as_of" (just a fancy one-sided range)
  2. "return all rows where max(issue in issue_range) == as_of" (regular as_of query but on a subset of the rows; my guess is this is what the code is currently doing]

And if I understood @brookslogan's point correctly, the point about "what the issues="*" query would have said on 2023-01-05", well those seems like they will be covered by the upcoming inequalities PR, with something like issues="<=20230105", so we don't lose on usability by making these mutually exclusive.

Aside on the validation: it would be nice to get some warnings when requesting as_of beyond today / beyond our latest available issue / maybe some similar things. [The beyond-latest-issue check w]ould help detect when we've run something too early in the morning, or on a stale data source.]

To me, this seems like the kind of validation that belongs in a data analyst's / modeler's repo downstream, not as part of server code. I'm happy for things like this to exist in a modeling package, but to have this be in the server code feels like importing our own assumptions about data usage too far upstream.

@brookslogan
Copy link
Contributor

Seems like there is some agreement on adding server-side exclusivity checks.

(Asides:

  • issues= + as_of= shouldn't do 2, that's super weird. Makes more sense to just directly ask for issues = max issue or as of or absolutely no data, whichever they wanted. 1. makes sense. Current behavior is neither 1. nor 2.; it just does a regular as_of query and ignores the issues parameter altogether.
  • On issues= + as_of= being covered by issues inequality query: yep! (Until when/if we have multiple types of version tags.)
  • Other validation: I'm not not sure here. If they're requesting version data explicitly for a range including today but we don't have today's version in it yet, I feel a little nervous about the server/client continuing happily along in the same way as if we did have today's version in already. But this seems outside the scope of this Issue anyway.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants