http caching for metadata #1222
```diff
@@ -26,57 +26,66 @@ def __str__(self):
         return f"{self.source}:{self.signal}"


-def fetch_data(
-    time_types: Optional[List[str]],
-    geo_types: Optional[List[str]],
-    signals: Optional[List[SourceSignal]],
-):
-    # complain if the cache is more than 75 minutes old
-    max_age = 75 * 60
-
-    row = db.execute(
+# empty generator that never yields
+def _nonerator():
+    return
+    yield
+
+
+@bp.route("/", methods=("GET", "POST"))
+def handle():
+    time_types = extract_strings("time_types")
+    signals = [SourceSignal(v) for v in (extract_strings("signals") or [])]
+    geo_types = extract_strings("geo_types")
+
+    printer = create_printer(request.values.get("format"))
+
+    metadata = db.execute(
         text(
             "SELECT UNIX_TIMESTAMP(NOW()) - timestamp AS age, epidata FROM covidcast_meta_cache LIMIT 1"
         )
     ).fetchone()

-    if not row or not row["epidata"]:
+    if not metadata or "epidata" not in metadata:
         # the db table `covidcast_meta_cache` has no rows
         get_structured_logger('server_api').warning("no data in covidcast_meta cache")
-        return
-
-    age = row["age"]
-    if age > max_age and row["epidata"]:
-        get_structured_logger('server_api').warning("covidcast_meta cache is stale", cache_age=age)
-
-    epidata = loads(row["epidata"])
-
-    if not epidata:
-        return
+        return printer(_nonerator())

-    def filter_row(row: Dict):
-        if time_types and row.get("time_type") not in time_types:
-            return False
-        if geo_types and row.get("geo_type") not in geo_types:
-            return False
-        if not signals:
-            return True
-        for signal in signals:
-            # match source and (signal or no signal or signal = *)
-            if row.get("data_source") == signal.source and (
-                signal.signal == "*" or signal.signal == row.get("signal")
-            ):
-                return True
-        return False
+    metadata_list = loads(metadata["epidata"])

-    for row in epidata:
-        if filter_row(row):
-            yield row
+    if not metadata_list:
+        # the db table has a row, but there is no metadata about any signals in it
+        get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
+        return printer(_nonerator())
```
Optional: be nice to robot

If we wanted to not make SonarLint nervous, we could replace this with:

Suggested change

via experimentation:

```python
>>> def foo():
...     return
...     yield
...
>>> type(foo())
<class 'generator'>
>>> type(x for x in [])
<class 'generator'>
>>> list(foo())
[]
>>> list(x for x in [])
[]
```

alternately (since …):

```python
def _null_printer():
    return printer(x for x in [])

...

if not metadata_list:
    # the db table has a row, but there is no metadata about any signals in it
    get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
    return _null_printer()
```

Maybe I'm missing something, but I can't find where Sonar flagged this...? To convey intent, I prefer a named item over a weird empty comprehension just hanging out there. To that end, the …

I added an exception so it would build. The condition it fails is "no unreachable code", and the yield is unreachable.

As I said in another comment, it's reachable enough for Python to consider this a generator... but regardless, I changed it to the alternative empty generator.
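For context, the two shapes of empty generator being debated in this thread behave identically at runtime. Here is a standalone sketch; the `_empty_gen` name is illustrative and not from the PR:

```python
# Two ways to build a generator that yields nothing; both produce an empty iterable.
def _nonerator():
    # the unreachable `yield` makes this a generator function,
    # but `return` exits before anything is produced
    return
    yield

def _empty_gen():
    # same effect via a generator expression over an empty list,
    # which avoids the "unreachable code" lint finding
    return (x for x in [])

assert list(_nonerator()) == []
assert list(_empty_gen()) == []
```

Either form can be handed to a printer that expects an iterable of rows; the only difference is whether a static analyzer sees unreachable code.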
```diff
+    standard_age = 60 * 60  # expected metadata regeneration interval, in seconds (currently 60 mins)
+    # TODO: get this ^ from a config var? ideally, it should be set to the time between runs of
+    # src/acquisition/covidcast/covidcast_meta_cache_updater.py
```
Strongly recommended: change clarifying comment

I think this code is mixing up the metadata cache age (when did we last write to the metadata cache table) with the cache-control max-age (how long should the user hold on to this response before fetching again). We haven't cared about metadata cache age since late fall 2020, and should remove the associated checks. For the cache-control max-age, we should consider the following factors:

A 1-hour max-age is the shortest period that prevents unexpected 429s from otherwise-compliant use of the clients, given the above. It doesn't match the update cadence of the metadata cache table. I don't expect us to need to adjust max-age until we are adding new data and recomputing metadata dozens of times a day, or we enter another public health emergency, and likely not until both are true. In any of those scenarios, making a change via a new release instead of via a config variable is completely reasonable.

Suggested change

It's not mixing up the ages: the client should hold on to the response until a subsequent fetch can be expected to return a new value, which is the sum of the time of the last write to the table plus the write interval. It is independent of the rate-limiting window, as the metadata we serve is quite literally a cache, and this is just decorating it with HTTP directives to reveal accurate timing info to the client. FWIW, the metadata is updated every 2h (search …). I do like your "this should be updated..." text though, and I'll include that.
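For reference, here is the freshness arithmetic a standards-compliant HTTP cache applies to the two headers under discussion; a minimal illustrative sketch in the spirit of the HTTP caching spec (RFC 9111), not project code:

```python
# Illustrative freshness check (names and numbers are examples only):
# a cached response stays reusable while its current age is under max-age,
# where current age = the served Age header + time spent in the client cache.
import time

def is_fresh(stored_at: float, served_age: int, max_age: int) -> bool:
    resident_time = time.time() - stored_at   # seconds the client has held the response
    current_age = served_age + resident_time  # Age header + local residency
    return current_age < max_age

# Example: with max-age=3600 and a served Age of 2700 (a 45-minute-old cache row),
# a client honoring both headers would refetch after roughly 15 more minutes.
```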
```diff
+    age = metadata["age"]
+    if age > standard_age * 1.25:
+        # complain if the cache is too old (currently, more than 75 mins old)
+        get_structured_logger('server_api').warning("covidcast_meta cache is stale", cache_age=age)
```
krivard marked this conversation as resolved.
```diff
-@bp.route("/", methods=("GET", "POST"))
-def handle():
-    time_types = extract_strings("time_types")
-    signals = [SourceSignal(v) for v in (extract_strings("signals") or [])]
-    geo_types = extract_strings("geo_types")
-
-    return create_printer(request.values.get("format"))(filter_fields(fetch_data(time_types, geo_types, signals)))
+    def cache_entry_gen():
+        for entry in metadata_list:
+            if time_types and entry.get("time_type") not in time_types:
+                continue
+            if geo_types and entry.get("geo_type") not in geo_types:
+                continue
+            if not signals:
+                yield entry
+            for signal in signals:
+                # match source and (signal or no signal or signal = *)
+                if entry.get("data_source") == signal.source and (
+                    signal.signal == "*" or signal.signal == entry.get("signal")
+                ):
+                    yield entry
+
+    return printer(
+        filter_fields(cache_entry_gen()),
+        headers={
+            "Cache-Control": f"max-age={standard_age}, public",
+            "Age": f"{age}",
```
Issue: this won't work as-is

We don't want this --

Suggested change

I disagree. I think it will in fact work, especially because (as I mentioned above) the R lib doesn't even seem to respect the "Age" header. And we do want this -- when the time elapsed since the resource was cached plus … Alternatively, we could use …

Can you link me to this? I can't seem to find it.

Use the link from the opening comment ( https://github.com/r-lib/httr/blob/HEAD/R/cache.R ) and search for "age".
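One way to ground this kind of disagreement is to look at the headers a running server actually emits. A minimal sketch using the requests library; the local URL and route path are hypothetical and would need to match wherever the covidcast_meta endpoint is mounted in a given deployment:

```python
# Quick header inspection against a locally running server (hypothetical URL/path).
import requests

resp = requests.get("http://localhost:5000/covidcast_meta/")
print(resp.headers.get("Cache-Control"))  # expected per this PR: "max-age=3600, public"
print(resp.headers.get("Age"))            # seconds since the cache row was last written
```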
```diff
+            # TODO?: "Expires": f"{}",  # superseded by Cache-Control: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires
+        }
+    )
```
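For readers unfamiliar with the project's `printer` helper, the header decoration in the diff above corresponds to the following plain-Flask pattern. This is a sketch under that assumption, not the project's actual response machinery; the route path and payload are placeholders:

```python
# Minimal Flask sketch of attaching the same caching headers to a JSON response.
# The real endpoint streams filtered metadata through the project's
# printer/filter_fields helpers instead of jsonify.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/meta")
def meta():
    payload = [{"data_source": "src", "signal": "sig", "time_type": "day", "geo_type": "county"}]
    resp = jsonify(payload)
    resp.headers["Cache-Control"] = "max-age=3600, public"  # matches standard_age above
    resp.headers["Age"] = "2700"                            # example: cache row written 45 min ago
    return resp
```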
This line is unreachable and makes Sonar nervous.

It's reachable enough for Python to consider this a generator ¯\_(ツ)_/¯ We can probably replace `return; yield` with `return (x for x in [])`, similar to what you suggested before. Let me try that out.
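Going by the last two replies, the replacement the author says he switched to would presumably look something like this; a sketch only, since the later commits are not shown here:

```python
def _nonerator():
    # empty generator built from a generator expression: nothing is unreachable,
    # so static analysis stays quiet, and iterating the result still yields no rows
    return (x for x in [])
```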