
http caching for metadata #1222


Merged
merged 5 commits into from Jul 10, 2023
Changes from 3 commits
18 changes: 11 additions & 7 deletions src/server/_printer.py
@@ -46,13 +46,14 @@ def __init__(self):
         self.result: int = -1
         self._max_results: int = MAX_COMPATIBILITY_RESULTS if is_compatibility_mode() else MAX_RESULTS
 
-    def make_response(self, gen):
+    def make_response(self, gen, headers=None):
         return Response(
             gen,
             mimetype="application/json",
+            headers=headers,
         )

-    def __call__(self, generator: Iterable[Dict[str, Any]]) -> Response:
+    def __call__(self, generator: Iterable[Dict[str, Any]], headers=None) -> Response:
def gen():
self.result = -2 # no result, default response
began = False
@@ -84,7 +85,7 @@ def gen():
if r is not None:
yield r

-        return self.make_response(stream_with_context(gen()))
+        return self.make_response(stream_with_context(gen()), headers=headers)

@property
def remaining_rows(self) -> int:
@@ -223,8 +224,11 @@ def __init__(self, filename: Optional[str] = "epidata"):
super(CSVPrinter, self).__init__()
self._filename = filename

-    def make_response(self, gen):
-        headers = {"Content-Disposition": f"attachment; filename={self._filename}.csv"} if self._filename else {}
+    def make_response(self, gen, headers=None):
+        if headers is None:
+            headers = {}
+        if self._filename:
+            headers["Content-Disposition"] = f"attachment; filename={self._filename}.csv"
return Response(gen, mimetype="text/csv; charset=utf8", headers=headers)

def _begin(self):
@@ -296,8 +300,8 @@ class JSONLPrinter(APrinter):
a printer class writing in JSONLines format
"""

-    def make_response(self, gen):
-        return Response(gen, mimetype=" text/plain; charset=utf8")
+    def make_response(self, gen, headers=None):
+        return Response(gen, mimetype=" text/plain; charset=utf8", headers=headers)

def _begin(self):
if show_hard_api_key_warning():
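The change threaded through `_printer.py` above — an optional `headers` dict passed down to each printer's `make_response` — can be sketched without Flask. This is a minimal illustration, not the PR's code: `FakeResponse` is a hypothetical stand-in for `flask.Response`.

```python
# Sketch of the headers-threading pattern in this diff.
# FakeResponse is a hypothetical stand-in for flask.Response.

class FakeResponse:
    def __init__(self, body, mimetype, headers=None):
        self.body = body
        self.mimetype = mimetype
        self.headers = dict(headers or {})

class APrinter:
    # base printer: accept optional headers and forward them to the response
    def make_response(self, gen, headers=None):
        return FakeResponse(gen, mimetype="application/json", headers=headers)

    def __call__(self, generator, headers=None):
        return self.make_response(iter(generator), headers=headers)

class CSVPrinter(APrinter):
    def __init__(self, filename="epidata"):
        self._filename = filename

    # subclass merges its own header into whatever the caller supplied
    def make_response(self, gen, headers=None):
        if headers is None:
            headers = {}
        if self._filename:
            headers["Content-Disposition"] = f"attachment; filename={self._filename}.csv"
        return FakeResponse(gen, mimetype="text/csv; charset=utf8", headers=headers)
```

With this shape, an endpoint can pass caching headers through the printer without each subclass losing its format-specific headers.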
97 changes: 53 additions & 44 deletions src/server/endpoints/covidcast_meta.py
@@ -26,57 +26,66 @@ def __str__(self):
return f"{self.source}:{self.signal}"


-def fetch_data(
-    time_types: Optional[List[str]],
-    geo_types: Optional[List[str]],
-    signals: Optional[List[SourceSignal]],
-):
-    # complain if the cache is more than 75 minutes old
-    max_age = 75 * 60
-
-    row = db.execute(
+# empty generator that never yields
+def _nonerator():
+    return
+    yield
Contributor:
this line is unreachable and makes Sonar nervous

Collaborator (Author):
it's reachable enough for python to consider this a generator ¯\_(ツ)_/¯
we can probably replace return;yield with return (x for x in []), similar to what you suggested before. let me try that out.



+@bp.route("/", methods=("GET", "POST"))
+def handle():
+    time_types = extract_strings("time_types")
+    signals = [SourceSignal(v) for v in (extract_strings("signals") or [])]
+    geo_types = extract_strings("geo_types")
+
+    printer = create_printer(request.values.get("format"))
+
+    metadata = db.execute(
text(
"SELECT UNIX_TIMESTAMP(NOW()) - timestamp AS age, epidata FROM covidcast_meta_cache LIMIT 1"
)
).fetchone()

-    if not row or not row["epidata"]:
+    if not metadata or "epidata" not in metadata:
+        # the db table `covidcast_meta_cache` has no rows
         get_structured_logger('server_api').warning("no data in covidcast_meta cache")
-        return
-
-    age = row["age"]
-    if age > max_age and row["epidata"]:
-        get_structured_logger('server_api').warning("covidcast_meta cache is stale", cache_age=age)
-
-    epidata = loads(row["epidata"])
-
-    if not epidata:
-        return
+        return printer(_nonerator())

-    def filter_row(row: Dict):
-        if time_types and row.get("time_type") not in time_types:
-            return False
-        if geo_types and row.get("geo_type") not in geo_types:
-            return False
-        if not signals:
-            return True
-        for signal in signals:
-            # match source and (signal or no signal or signal = *)
-            if row.get("data_source") == signal.source and (
-                signal.signal == "*" or signal.signal == row.get("signal")
-            ):
-                return True
-        return False
+    metadata_list = loads(metadata["epidata"])
 
-    for row in epidata:
-        if filter_row(row):
-            yield row
+    if not metadata_list:
+        # the db table has a row, but there is no metadata about any signals in it
+        get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
+        return printer(_nonerator())
Contributor:
Optional: be nice to robot

if we wanted to not make sonarLint nervous we could replace this with

Suggested change:
-        return printer(_nonerator())
+        return printer(x for x in [])

via experimentation:

>>> def foo():
...     return
...     yield
...
>>> type(foo())
<class 'generator'>
>>> type(x for x in [])
<class 'generator'>
>>> list(foo())
[]
>>> list(x for x in [])
[]

Contributor:
alternately (since _nonerator is used on both L52 and L59) we could refactor to pull out the printer instead of just the generator:

def _null_printer():
    return printer(x for x in [])

...
    if not metadata_list:
        # the db table has a row, but there is no metadata about any signals in it
        get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
        return _null_printer()

Collaborator (Author):
maybe i'm missing something, but i can't find where sonar flagged this...?

to convey intent, i prefer a named item over a weird empty comprehension just hanging out there. to that end, the _null_printer() could work. we might even want to put it into the printer module.

Contributor (krivard, Jul 7, 2023):
I added an exception so it would build. the condition it fails is "no unreachable code" and the yield is unreachable.

Collaborator (Author):
as i said in another comment, it's reachable enough for python to consider this a generator... but regardless, i changed it to the alternative empty generator.
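A standalone restatement of the equivalence demonstrated in the REPL snippet above — both spellings produce a generator that yields nothing (function names here are illustrative, not from the PR):

```python
import types

def nonerator():
    """Named empty generator: the unreachable yield is what makes
    Python treat this function as a generator function."""
    return
    yield

def empty_genexpr():
    """Equivalent spelling: a generator expression over an empty list."""
    return (x for x in [])
```

Either form can back an "empty response" path; the thread's disagreement is only about which better conveys intent and which keeps static analyzers quiet.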


+    standard_age = 60 * 60  # expected metadata regeneration interval, in seconds (currently 60 mins)
+    # TODO: get this ^ from a config var? ideally, it should be set to the time between runs of
+    # src/acquisition/covidcast/covidcast_meta_cache_updater.py
Contributor:
Strongly recommended: change clarifying comment

I think this code is mixing up the metadata cache age (when did we last write to the metadata cache table) with the cache-control max-age (how long should the user hold on to this response before fetching again). We haven't cared about metadata cache age since late fall 2020, and should remove the associated checks.

For the cache-control max-age, we should consider the following factors:

  • the rate limiter resets every 1 hour
  • the meta cache is updated every 3 hours
  • new data is added only 7-8 times per day
  • the amount of new data added each day is small
  • since we are not in a public health emergency, there is no great need for up-to-the-minute accuracy, though of course it's nice to see the latest data, especially if you're trying to debug something

A 1-hour max-age is the shortest period that prevents unexpected 429s from otherwise-compliant use of the clients, given the above. It doesn't match the update cadence of the metadata cache table.

I don't expect us to need to adjust max-age until we are adding new data and recomputing metadata dozens of times a day, or we enter another public health emergency, and likely not until both are true. In any of those scenarios, making a change via a new release instead of via a config variable is completely reasonable.

Suggested change:
-    standard_age = 60 * 60  # expected metadata regeneration interval, in seconds (currently 60 mins)
-    # TODO: get this ^ from a config var? ideally, it should be set to the time between runs of
-    # src/acquisition/covidcast/covidcast_meta_cache_updater.py
+    # recommended rerequest period, in seconds (currently 60 minutes)
+    ## this should be updated if a stale cache will have undue impact on user activities, such as
+    ## if we start updating the metadata table much more frequently and having up-to-the-minute
+    ## metadata accuracy becomes important to users once more.
+    standard_age = 60 * 60

Collaborator (Author):
it's not mixing up the ages: the client should hold on to the response until a subsequent fetch can be expected to return a new value, which is the sum of the time of the last write to the table plus the write interval.

it is independent of the rate limiting window, as the metadata we serve is quite literally a cache and this is just decorating it with http directives to reveal accurate timing info to the client.

fwiw, the metadata is updated every 2h (search epidata_data.logger:"compute_covidcast_meta" or epidata_data.logger:"metadata_cache_updater" in elastic), and thus standard_age should actually be set to that. new commit coming shortly...

i do like your "this should be updated..." text though, and i'll include that.

+    age = metadata["age"]
+    if age > standard_age * 1.25:
+        # complain if the cache is too old (currently, more than 75 mins old)
+        get_structured_logger('server_api').warning("covidcast_meta cache is stale", cache_age=age)

-@bp.route("/", methods=("GET", "POST"))
-def handle():
-    time_types = extract_strings("time_types")
-    signals = [SourceSignal(v) for v in (extract_strings("signals") or [])]
-    geo_types = extract_strings("geo_types")
-
-    return create_printer(request.values.get("format"))(filter_fields(fetch_data(time_types, geo_types, signals)))
+    def cache_entry_gen():
+        for entry in metadata_list:
+            if time_types and entry.get("time_type") not in time_types:
+                continue
+            if geo_types and entry.get("geo_type") not in geo_types:
+                continue
+            if not signals:
+                yield entry
+            for signal in signals:
+                # match source and (signal or no signal or signal = *)
+                if entry.get("data_source") == signal.source and (
+                    signal.signal == "*" or signal.signal == entry.get("signal")
+                ):
+                    yield entry
+
+    return printer(
+        filter_fields(cache_entry_gen()),
+        headers={
+            "Cache-Control": f"max-age={standard_age}, public",
+            "Age": f"{age}",
Contributor:
Issue: this won't work as-is

We don't want this -- Age gets subtracted from Cache-Control.max-age when clients compute whether the response can be reused or not.

Suggested change:
-            "Age": f"{age}",

Reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#:~:text=Note%20that%20max,freshness%20lifetime.

Collaborator (Author):
i disagree. i think it will in fact work, especially because (as i mentioned above) the R lib doesn't even seem to respect the "Age" header.

and we do want this -- when the time elapsed since the resource was cached plus Age exceeds max-age, there is very likely to be a newly calculated metadata waiting to be retrieved (presuming max-age is set to the metadata regeneration interval).

alternatively, we could use "Cache-Control": f"max-age={max(standard_age-age, 1)}, public" to achieve a similar effect without using the "Age" header.

Contributor:
"as i mentioned above" -- can you link me to this? i can't seem to find it.

+            # TODO?: "Expires": f"{}", # superseded by Cache-Control: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires
+        }
+    )
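The Age-header disagreement above reduces to the freshness arithmetic HTTP clients apply (RFC 9111): a stored response stays fresh while the served Age plus the time the client has held it remains below max-age. A minimal sketch of that arithmetic — the helper name is illustrative, not from the PR:

```python
def seconds_until_stale(max_age: int, served_age: int, held_for: int) -> int:
    """Remaining freshness of a cached response under RFC 9111 semantics:
    current_age = Age header on the response + time the client has held it;
    the response is reusable while current_age < max-age."""
    return max(max_age - (served_age + held_for), 0)

# Entry was already 45 min old when served with "Age: 2700" and
# "Cache-Control: max-age=3600"; the client has held it 10 min:
with_age_header = seconds_until_stale(3600, 2700, 600)

# The author's alternative -- no Age header, but a reduced
# "max-age={max(standard_age - age, 1)}" -- expires at the same moment:
with_adjusted_max_age = seconds_until_stale(3600 - 2700, 0, 600)
```

Both spellings make compliant clients refetch at roughly the moment the server is next expected to regenerate the metadata cache, which is the behavior the author argues for.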