
add healthdata.gov signals on hospitalizations #288


Closed
nickreich opened this issue Nov 17, 2020 · 20 comments · Fixed by #292

Comments

@nickreich

We are hoping to migrate the teams doing the hospitalization forecasting challenge at the COVID-19 Forecast Hub over to using the datasets recently (in the last few days) made publicly available at the link below.

Note API links on left-hand panel down the page a bit:
https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-state-timeseries

@capnrefsmmat
Contributor

Thanks. I'll move this over to our indicators repository and flag it to review when we do our next prioritization.

@capnrefsmmat capnrefsmmat transferred this issue from cmu-delphi/covidcast Nov 17, 2020
@dfarrow0
Contributor

dfarrow0 commented Nov 17, 2020

Thanks for the heads-up @nickreich!

I recommend scraping this as a new data source in epidata, in a new database table. If there are specific indicators to derive from this, I recommend computing those separately and storing them in the existing covidcast table.

@krivard
Contributor

krivard commented Nov 17, 2020

Just to note, the API links on the healthdata.gov page are for the metadata only, not the dataset itself. We'll probably need to write a scraper to harvest the CSV link from the page.

@nickreich
Author

In case it's useful, this snippet of code was shared with me as a way to access the data:

data.table::fread(
  jsonlite::fromJSON(
    "https://healthdata.gov/api/3/action/package_show?id=83b4a668-9321-4d8c-bc4f-2bef66c49050&page=0"
  )$result$resources[[1]]$url
)

@krivard
Contributor

krivard commented Nov 17, 2020

Right -- that's the metadata query.

@nickreich
Author

when I run this I see the actual data in my R session.

@krivard
Contributor

krivard commented Nov 17, 2020

I stand semi-corrected: the query URL https://healthdata.gov/api/3/action/package_show?id=83b4a668-9321-4d8c-bc4f-2bef66c49050&page=0 returns the metadata, and the metadata includes the URL for the CSV file at https://healthdata.gov/sites/default/files/reported_hospital_utilization_timeseries_20201115_2134.csv, which is then read by data.table. Fair enough!
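
For readers working in Python, the same two-step lookup might be sketched as follows. This is only a sketch: it assumes the metadata JSON has the shape implied by the R accessor earlier in the thread (`result`, then `resources`, then the first entry's `url`). The function names are hypothetical, and the handling of a list-valued `result` is a defensive assumption rather than something confirmed here.

```python
import json
import urllib.request

METADATA_URL = (
    "https://healthdata.gov/api/3/action/package_show"
    "?id=83b4a668-9321-4d8c-bc4f-2bef66c49050&page=0"
)


def csv_url_from_metadata(metadata):
    """Return the URL of the first resource listed in the metadata."""
    result = metadata["result"]
    # Assumption: the package may be wrapped in a one-element list.
    if isinstance(result, list):
        result = result[0]
    return result["resources"][0]["url"]


def fetch_csv_url(metadata_url=METADATA_URL):
    """Download the metadata (network call) and extract the CSV link."""
    with urllib.request.urlopen(metadata_url) as response:
        return csv_url_from_metadata(json.load(response))
```

Because the CSV filename carries a timestamp, the link has to be looked up again each time the dataset is republished rather than hard-coded.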

@nickreich
Author

it ain't pretty, that's for sure.

@krivard
Contributor

krivard commented Nov 17, 2020

Moving this to epidata for @dfarrow0 to add the full dataset as a separate endpoint. Once that's done, we'll bounce it back here for covidcast support.

@krivard krivard transferred this issue from cmu-delphi/covidcast-indicators Nov 17, 2020
@RoniRos
Member

RoniRos commented Nov 18, 2020

FWIW, Nick just pointed me to the data dictionary for this table. Worth storing with the metadata.

@nickreich
Author

Note that the most time-sensitive columns of this dataset to surface are

  • previous_day_admission_pediatric_covid_confirmed
  • previous_day_admission_adult_covid_confirmed


For Forecast Hub purposes, it may make sense to return the sum of these two values, as this is likely to emerge as the central hospitalization forecast target.
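
A minimal sketch of that sum in Python, assuming each row of the dataset is available as a dictionary keyed by column name. The function name is hypothetical, and returning `None` when either count is missing is one possible design choice, not an agreed rule.

```python
ADULT = "previous_day_admission_adult_covid_confirmed"
PEDIATRIC = "previous_day_admission_pediatric_covid_confirmed"


def total_confirmed_admissions(row):
    """Sum adult and pediatric confirmed COVID admissions for one row.

    Returns None when either count is missing, so that a partial
    report is not mistaken for a complete total.
    """
    adult = row.get(ADULT)
    pediatric = row.get(PEDIATRIC)
    if adult is None or pediatric is None:
        return None
    return adult + pediatric
```

For example, a row with illustrative values of 40 adult and 2 pediatric admissions yields 42, while a row missing either column yields `None`.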

@dfarrow0
Contributor

Thanks, noted.

It looks like the dataset was updated yesterday (11/18). I have a local version of the previous issue (11/16). When the code gets checked in, I'll manually load the previous issue so that we have a complete revision history in the API.

@nickreich are there versions of this dataset prior to 11/16? If so, do you have them?

@nickreich
Author

I don't know of prior versions. I think there were some available privately within the HHS Protect system, but I don't think they are public.

@nickreich
Author

Thanks @krivard and @dfarrow0. Is there another issue where I can track the progress of adding this to the covidcast system?

@dfarrow0
Contributor

Quick update: acquisition code has been merged and is scheduled to run twice daily. The data is now available via the Delphi Epidata API directly (sample URL), although the PyPI python package hasn't been updated yet — planning to do that tomorrow.

EpiVis, our relatively obscure web visualization for the API, has been updated to support this new dataset. As an example, here's an interactive plot of inpatient bed utilization over time for ND. Spoiler: it's at 75% and increasing.

@dfarrow0
Contributor

@nickreich I'll have to defer to @krivard for an ETA on COVIDcast.

But just to clarify, could you elaborate a little bit? For example, are you asking when it'll be available on the map at https://covidcast.cmu.edu/, or when it'll be available through the covidcast R package, or something else?

@RoniRos
Member

RoniRos commented Nov 19, 2020

EpiVis, our relatively obscure web visualization for the API, has been updated to support this new dataset.

Hurray, @dfarrow0 ! You must remember how much I love EpiVis!

@RoniRos
Member

RoniRos commented Nov 22, 2020

@nickreich may revise or elaborate, but for the purpose of the CDC forecasting activity, I think it's more important that this summed-up signal be available in the API, so we can point to it and tell all participants "This signal is to be used as the Ground Truth for hospitalization forecasting". While we can tell them that the ground truth is the sum of these two already-existing signals, that's kind of clumsy.
For the purpose of the CDC forecasting activity, I don't think there is a need to add the summed-up signal to the map. But it might be something we could consider separately.

@nickreich
Author

I agree that for modeling purposes, adding the sum is important to have a single signal for "ground truth".

I don't quite understand the distinction between the Epidata API and being available via the covidcast R package, say. @dfarrow0 that is what I was curious about. I don't care too much about the website availability, but do want to be able to access via an API somehow, and the more smooth that is, the better.

Final question: this weekend they seemed to post an incorrect file and then later corrected it. how does your system handle this?

@dfarrow0
Contributor

dfarrow0 commented Dec 3, 2020

I don't quite understand the distinction between the Epidata API and being available via the covidcast R package

Yeah, this is unclear, and we haven't been very consistent in general about how we refer to these things. The short version is this: the Epidata API contains various "endpoints", for example fluview, nowcast, covid_hosp, covidcast, etc. We have libraries in Python, R, and JavaScript called delphi-epidata, and those libraries can fetch data from all Epidata API endpoints. Separately, Delphi has much more specialized/powerful libraries in Python and R called covidcast, and these libraries fetch data solely from the covidcast endpoint of the Epidata API. For some time we've loosely called this the "COVIDcast Epidata API" (although IMO that's something of a misnomer since it's not a separate API).

In any case, the covid_hosp endpoint of the Epidata API hosts this HHS dataset, and so it's not available via the covidcast library, but it is available via the delphi-epidata library.

I agree that for modeling purposes, adding the sum is important to have a single signal for "ground truth".

As I understand it, the plan is to compute this sum and surface it via the covidcast endpoint, which would make it available in the covidcast library. Until then, someone would need to use the generic delphi-epidata library and compute the sum manually.

want to be able to access via an API somehow, and the more smooth that is, the better.

Definitely agree that smoother is better, and there's room for improvement. For now, you can access the data with the delphi-epidata library, or you can just curl it directly, for example: https://delphi.cmu.edu/epidata/api.php?source=covid_hosp&states=PA&dates=20201101
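
As a rough illustration, the same query URL can be assembled in Python from its three parameters. This sketch only mirrors the example URL above: the helper name is hypothetical, and it assumes a single state code and a single YYYYMMDD date, since no other parameter forms are described in this thread.

```python
from urllib.parse import urlencode

EPIDATA_URL = "https://delphi.cmu.edu/epidata/api.php"


def covid_hosp_url(state, date):
    """Build a covid_hosp query URL for one state and one YYYYMMDD date."""
    # Parameter names are taken from the example URL in this comment.
    params = {"source": "covid_hosp", "states": state, "dates": date}
    return EPIDATA_URL + "?" + urlencode(params)
```

Calling `covid_hosp_url("PA", "20201101")` reproduces the example URL exactly, which can then be passed to curl or any HTTP client.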


Final question: this weekend they seemed to post an incorrect file and then later corrected it. how does your system handle this?

The API stores each dataset along with a version number, which we refer to as "issue" (as in "The December issue of Nature magazine"). Anyway, when you query the Epidata API, it returns data from the most recent issue by default. If you request data from a specific issue, it'll return that instead. So when a correction is posted, that becomes the most recent version and will be returned in queries by default.
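
The behavior described here can be illustrated with a toy in-memory model. This is not Delphi's implementation; the class and method names are hypothetical, and the only assumption is that issues can be ordered so that "most recent" means the largest issue value.

```python
class IssueStore:
    """Toy model of issue-versioned data (illustration only)."""

    def __init__(self):
        self._issues = {}  # issue identifier -> dataset for that issue

    def add_issue(self, issue, dataset):
        """Store a dataset under its issue; earlier issues are kept."""
        self._issues[issue] = dataset

    def query(self, issue=None):
        """Return the requested issue, or the most recent one by default."""
        if not self._issues:
            raise LookupError("no issues stored")
        if issue is None:
            issue = max(self._issues)
        return self._issues[issue]


store = IssueStore()
store.add_issue(1, "file posted in error")
store.add_issue(2, "corrected file")
```

With this setup, `store.query()` returns the corrected file, while `store.query(issue=1)` still returns the earlier one, so the revision history stays available.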


5 participants