
flusurv data is stale #1247


Open
brookslogan opened this issue Jul 26, 2023 · 7 comments · May be fixed by #1278

@brookslogan
Contributor

The version of FluSurv-NET data available appears to be from 2021-05-28, containing data through the epiweek labeled with date 2020-04-19, while the upstream source has data for the 2022/2023 flu season.

library(epidatr)
dat = flusurv(locations = "network_all", epiweeks = epirange(201701, 202301)) %>% fetch()
max(dat$release_date)
#> [1] "2021-05-28"
max(dat$epiweek)
#> [1] "2020-04-19"
names(dat)
#>  [1] "release_date" "location"     "issue"        "epiweek"      "lag"         
#>  [6] "rate_age_0"   "rate_age_1"   "rate_age_2"   "rate_age_3"   "rate_age_4"  
#> [11] "rate_overall"

Created on 2023-07-26 with reprex v2.0.2

FluSurv-NET acquisition broke circa 2020-10-09 but was patched to ignore age groups that were introduced then. From the above sample, it looks like these/other age groups are still being ignored; upstream has 2 top-level age groups, 5 subgroups, and 8 subsubgroups; the API returns only 5 age groups. I believe age group changes may have broken flusurv acquisition at some other point in time as well, so that might be a top suspect for the current breakage.

The fluview* outage might be too late to be related. FluSurv-NET reporting is not year-round; it typically starts at some point during the flu season when activity levels / influenza hospitalization numbers are deemed high enough (I don't remember the precise rule) and ends at/after the end of the flu season, with some break in issues and/or gap in measurements before the next season.

@brookslogan
Contributor Author

brookslogan commented Jul 26, 2023

Some other notes on the upstream source. It:

  • has a few more breakdowns now which would be of interest: Sex, Race/Ethnicity, Season
  • shows a blank plot for Cumulative Rate for [the 2020-21 flu season, which would have started ~ Oct 2020] and "No Data Available" for corresponding Weekly Rates. [so maybe that + the fluview* outage could be connected]

@nmdefries nmdefries self-assigned this Aug 8, 2023
@nmdefries
Contributor

nmdefries commented Aug 18, 2023

Looking at the oldest and newest logs available for flusurv acquisition (can't find any before April 2023):

we always see "current issue: 202111" and

rows before: 212669
rows after: 212669 (+0)

So issues newer than March 2021 are not available/being fetched, and thus no new data is added to the DB. Although we don't have older logs, I'd assume that this has been going on since May 2021, which is the latest release_date available in our flusurv data.
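For reference, the 202111 issue can be mapped to calendar dates with a small helper. This is a minimal sketch assuming CDC MMWR week numbering (weeks run Sunday through Saturday, and week 1 is the week containing January 4); `epiweek_start` is a hypothetical name, not a function from the acquisition code.

```python
from datetime import date, timedelta


def epiweek_start(epiweek: int) -> date:
    """Return the Sunday that starts an epiweek given as YYYYWW (MMWR numbering assumed)."""
    year, week = divmod(epiweek, 100)
    jan4 = date(year, 1, 4)
    # date.weekday() is 0 for Monday, so Sunday maps to 6; shift so Sunday is 0.
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


print(epiweek_start(202111))  # 2021-03-14
```

Under that assumption, issue 202111 covers the week starting 2021-03-14, which is consistent with "March 2021" above.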

This matches my local testing. get_current_issue() returns 202111 using a magic URL. I wonder if we were supposed to switch to a different magic URL.

There are no errors; the pipeline appears to have been running successfully this whole time.

@nmdefries
Contributor

nmdefries commented Aug 18, 2023

The cdcfluview package successfully gets up-to-date data (loaddatetime is Aug 12, 2023; data is available through 2023w17), so it looks like the Flu3 endpoints changed. The old ones still work but aren't returning new data.

The CDC GIS GRASP API/AMF server appears to be meant solely for internal use; I can't find any documentation. Initially I couldn't find the new https://gis.cdc.gov/GRASP/Flu3/PostPhase03DataTool endpoint myself and assumed it was used somewhere in the dashboard source. Edit: The endpoint can be found by loading the source dashboard in Chrome, turning on the inspector, going to the Network tab, and reloading or otherwise interacting with the page. API queries will show up as network requests.

The returned data has also changed format, so we'll need to update flusurv.extract_from_object() to account for that. It doesn't appear possible to request specific locations from the new endpoint. Edit: You can request locations with a payload like

{
  "appversion": "Public",
  "key": "getdata",
  "injson": [
    {
      "seasonid": 62,
      "networkid": 2,
      "catchmentid": 22
    }
  ]
}

Not sure if seasonid is required.
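A rough sketch of building that request in Python with only the standard library. The payload fields and the endpoint URL are taken from this comment; `build_payload` and `build_request` are hypothetical helper names, the `Content-Type` header is an assumption about what the server expects, and whether the server accepts a payload without `seasonid` is untested. The request is constructed but not sent.

```python
import json
import urllib.request

FLUSURV_URL = "https://gis.cdc.gov/GRASP/Flu3/PostPhase03DataTool"


def build_payload(networkid, catchmentid, seasonid=None):
    """Build the location-selection payload; seasonid is left out when not given."""
    selection = {"networkid": networkid, "catchmentid": catchmentid}
    if seasonid is not None:
        selection = {"seasonid": seasonid, **selection}
    return {"appversion": "Public", "key": "getdata", "injson": [selection]}


def build_request(payload):
    """Wrap the payload in a POST request (header choice is an assumption)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        FLUSURV_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(networkid=2, catchmentid=22, seasonid=62)
request = build_request(payload)
# urllib.request.urlopen(request) would send it; that step is omitted here.
```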

To avoid fetching the same (large) JSON multiple times, my recommendation is to fetch once upfront in main(), and pass the resulting JSON to get_current_issue and get_data functions to extract data of interest. This also avoids different endpoints potentially returning data from different loaddatetime. (Currently we use GetPhase03InitApp for the loaddatetime and PostPhase03GetData for the location data. With no documentation, it's hard to guarantee their behavior.)
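A minimal sketch of that fetch-once shape. The function names `main`, `get_current_issue` and `get_data` mirror the ones mentioned here, but the signatures are hypothetical, and so is the response layout: the `rows` and `location` keys are invented for illustration, and placing `loaddatetime` at the top level is an assumption. The stand-in fetcher returns made-up placeholder values.

```python
from typing import Any, Callable, Dict, List, Tuple

Json = Dict[str, Any]


def get_current_issue(data: Json) -> str:
    # Assumption: the response carries a top-level "loaddatetime" field.
    return data["loaddatetime"]


def get_data(data: Json, location: str) -> List[Json]:
    # Assumption: observations sit under a hypothetical "rows" key,
    # each tagged with a hypothetical "location" field.
    return [row for row in data["rows"] if row["location"] == location]


def main(fetch: Callable[[], Json], locations: List[str]) -> Dict[str, Tuple[str, List[Json]]]:
    data = fetch()  # the only call that would hit the network
    issue = get_current_issue(data)  # same loaddatetime for every location
    return {loc: (issue, get_data(data, loc)) for loc in locations}


# Stand-in fetcher with placeholder values, used to count how often fetch runs.
calls = []


def fake_fetch() -> Json:
    calls.append(1)
    return {
        "loaddatetime": "Aug 12, 2023",
        "rows": [
            {"location": "network_all", "rate": 1.0},
            {"location": "CA", "rate": 2.0},
        ],
    }


result = main(fake_fetch, ["network_all", "CA"])
```

Passing the fetcher in as an argument keeps the extraction functions testable without network access.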

We should also consider how to avoid this type of issue in the future. Based on our mirror of the historical data, the source is updated infrequently so it's possible that we'd see fairly long periods without any data updates. This means that we can't just error out if no new data is returned.
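One possible shape for such a check, as a sketch only: warn (rather than fail) when the newest issue is older than some generous threshold. `check_staleness` is a hypothetical helper, and the 365-day default is an arbitrary placeholder that would need tuning against the off-season reporting gaps.

```python
import logging
from datetime import date

logger = logging.getLogger("flusurv_staleness")


def check_staleness(latest_issue_date: date, today: date, max_gap_days: int = 365) -> bool:
    """Return True and log a warning if the newest issue is older than max_gap_days."""
    gap_days = (today - latest_issue_date).days
    if gap_days > max_gap_days:
        logger.warning(
            "newest flusurv issue is %d days old (threshold: %d days)",
            gap_days,
            max_gap_days,
        )
        return True
    return False


# Dates from this thread: latest release_date in the DB vs. the day the issue was opened.
stale = check_staleness(date(2021, 5, 28), date(2023, 7, 26))
```

Returning a flag and logging keeps the pipeline running during legitimate quiet periods while still surfacing long gaps.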

@nmdefries
Contributor

nmdefries commented Aug 18, 2023

On recovering versioned data, again, because of the lack of documentation it's unclear to me if it's possible to request data from a particular loaddatetime. Edit: According to our CDC contacts, this is not possible.

@nmdefries
Contributor

nmdefries commented Aug 21, 2023

Since fluview* signals pull from another CDC dashboard, we should double-check that those signals are updating correctly. Their CDC API endpoints may also have changed.

@brookslogan
Contributor Author

The fluview outage also involved reporting the wrong "current" epiweek, but apparently was due to writing/querying from the wrong server/table. Doesn't seem like that is the case here, but I'm not 100% certain.

The fluview ones use Phase02 rather than Phase03, but still sounds like a good idea to check!

@nmdefries nmdefries linked a pull request Aug 23, 2023 that will close this issue
4 tasks
@nmdefries
Contributor

nmdefries commented Sep 15, 2023

In the motivating example,

library(epidatr)
dat = flusurv(locations = "network_all", epiweeks = epirange(201701, 202301)) %>% fetch()
...
names(dat)
#>  [1] "release_date" "location"     "issue"        "epiweek"      "lag"         
#>  [6] "rate_age_0"   "rate_age_1"   "rate_age_2"   "rate_age_3"   "rate_age_4"  
#> [11] "rate_overall"

the result doesn't have all the value columns we'd expect. Acquisition attempts to update these columns, plus rate_age_5, rate_age_6, and rate_age_7. Float fields need to be updated.
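The gap can be made explicit with a quick comparison. This sketch uses the column names from the output above and assumes the expected value columns are rate_age_0 through rate_age_7 plus rate_overall, as described in this comment; the variable names are illustrative only.

```python
# Column names as returned in the motivating example.
returned = [
    "release_date", "location", "issue", "epiweek", "lag",
    "rate_age_0", "rate_age_1", "rate_age_2", "rate_age_3", "rate_age_4",
    "rate_overall",
]

# Value columns that acquisition attempts to update (assumed list).
expected_value_columns = [f"rate_age_{i}" for i in range(8)] + ["rate_overall"]

missing = [col for col in expected_value_columns if col not in returned]
print(missing)  # ['rate_age_5', 'rate_age_6', 'rate_age_7']
```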

2 participants