
Inspect HHS dataset for revision inconsistencies and produce patches #1351


Open
dshemetov opened this issue Nov 18, 2023 · 2 comments

@dshemetov
Contributor

dshemetov commented Nov 18, 2023

A number of factors have made our HHS pipeline fragile in recent months. I wrote some WIP code to compare what our API serves against what is posted on the healthdata.gov site. I'm posting it here, as it might be useful for debugging and for producing patches. It shows some recent discrepancies, though these could just be bugs in my comparison. I haven't tried it on older data.

If anyone feels motivated to dive into this, I'd welcome the help. Tagging myself so I remember to come back to this.

library(epidatr)
library(magrittr)
library(dplyr)
library(readr) # provides read_csv(), used below

#' Download versioned data from healthdata.gov. Since each row reports the
#' previous day's admissions, we subtract 1 from the date.
get_healthdata_from_url <- function(url) {
  url %>%
    read_csv() %>%
    mutate(
      state = tolower(state),
      date = as.Date(date) - 1
    ) %>%
    select(geo_value = state, time_value = date, value = previous_day_admission_influenza_confirmed) %>%
    arrange(geo_value, time_value)
}

#' Download versioned data from epidatr. The data is already adjusted for
#' previous day.
get_epidata_from_date <- function(date) {
  pub_covidcast(
    source = "hhs",
    signals = "confirmed_admissions_influenza_1d",
    geo_type = "state",
    time_type = "day",
    geo_values = "*",
    time_values = epirange(20200101, date),
    as_of = strftime(date, format = "%Y%m%d")
  ) %>%
    select(geo_value, time_value, value) %>%
    arrange(geo_value, time_value)
}

#' Find differences between healthdata.gov and epidatr.
compare_datasets <- function(healthdata, epidata) {
  inner_join(
    healthdata,
    epidata,
    by = c("geo_value", "time_value")
  ) %>%
    filter(value.x != value.y) %>%
    select(geo_value, time_value, value.x, value.y) %>%
    arrange(geo_value, time_value)
}

get_metadata <- function() {
  read_csv("https://healthdata.gov/api/views/qqte-vkut/rows.csv") %>%
    select(datetime = `Update Date`, link = `Archive Link`) %>%
    mutate(
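      # %I (12-hour clock) is required for %p (AM/PM) to take effect; with %H
      # the AM/PM marker is ignored when parsing.
      datetime = as.POSIXct(datetime, format = "%m/%d/%Y %I:%M:%S %p")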
    ) %>%
    arrange(desc(datetime))
}

metadata <- get_metadata()
data_comparisons <- metadata %>%
  # Look at data uploads from before the previous day (ones we have a chance of
  # having).
  filter(datetime <= Sys.time() - 60 * 60 * 24) %>%
  head(n = 6) %>%
  rowwise() %>%
  mutate(
    healthdata = list(get_healthdata_from_url(link)),
    # Get the data one day later, since our pipeline is set up to get data for
    # the previous day.
    epidata = list(get_epidata_from_date(as.Date(datetime) + 1)),
    comparison = list(compare_datasets(healthdata, epidata))
  )

# Data drop from 2023-11-15 compared to epidata as of 2023-11-16
data_comparisons[[1L, "comparison"]]

# A tibble: 135 × 4
   geo_value time_value value.x value.y
   <chr>     <date>       <dbl>   <dbl>
 1 ak        2022-12-14       7       8
 2 ak        2022-12-15       3       4
 3 ak        2022-12-16       6       9
 4 ak        2022-12-21      10      12
 5 ak        2022-12-23       4       5
 6 ak        2022-12-24       3       4
 7 ak        2022-12-25       3       4
 8 ak        2022-12-26       8      12
 9 ak        2022-12-27       6       7
10 ak        2022-12-28       7       9
# ℹ 125 more rows
# ℹ Use `print(n = ...)` to see more rows

# Data drop from 2023-11-10 compared to epidata as of 2023-11-11
data_comparisons[[3L, "comparison"]]

# A tibble: 339 × 4
   geo_value time_value value.x value.y
   <chr>     <date>       <dbl>   <dbl>
 1 ak        2022-12-14       7       8
 2 ak        2022-12-15       3       4
 3 ak        2022-12-16       6       9
 4 ak        2022-12-21      10      12
 5 ak        2022-12-23       4       5
 6 ak        2022-12-24       3       4
 7 ak        2022-12-25       3       4
 8 ak        2022-12-26       8      12
 9 ak        2022-12-27       6       7
10 ak        2022-12-28       7       9
# ℹ 329 more rows
# ℹ Use `print(n = ...)` to see more rows
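
One possible source of blind spots in the comparison above: `inner_join()` keeps only the (geo_value, time_value) pairs present in both sources, and `filter(value.x != value.y)` silently drops rows where either value is `NA`. A minimal sketch of a stricter variant follows, run on made-up toy data so it needs no network access. The helper name `compare_datasets_full` and the `status` labels are hypothetical and not part of the script above.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the original script): like
# compare_datasets(), but also reports rows that appear in only one source.
# Note that a row whose value is NA is treated the same as an absent row.
compare_datasets_full <- function(healthdata, epidata) {
  full_join(
    healthdata,
    epidata,
    by = c("geo_value", "time_value"),
    suffix = c(".healthdata", ".epidata")
  ) %>%
    mutate(
      status = case_when(
        is.na(value.healthdata) & is.na(value.epidata) ~ "match",
        is.na(value.healthdata) ~ "missing_in_healthdata",
        is.na(value.epidata) ~ "missing_in_epidata",
        value.healthdata != value.epidata ~ "value_differs",
        TRUE ~ "match"
      )
    ) %>%
    filter(status != "match") %>%
    arrange(geo_value, time_value)
}

# Toy inputs (made-up numbers, for illustration only).
healthdata <- tibble(
  geo_value = c("ak", "ak", "al"),
  time_value = as.Date(c("2022-12-14", "2022-12-15", "2022-12-14")),
  value = c(7, 3, 5)
)
epidata <- tibble(
  geo_value = c("ak", "ak", "ca"),
  time_value = as.Date(c("2022-12-14", "2022-12-15", "2022-12-14")),
  value = c(8, 3, 2)
)

result <- compare_datasets_full(healthdata, epidata)
print(result)
```

On the toy data this flags one differing value and one row missing from each side. Whether missing rows matter for patching depends on how the pipeline handles late or deleted reports, which this sketch does not address.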
@dshemetov dshemetov self-assigned this Nov 18, 2023
@dshemetov dshemetov removed their assignment Jan 3, 2024
@dsweber2
Contributor

dsweber2 commented Jun 3, 2024

Is this still relevant with HHS going archived?

@dshemetov
Contributor Author

I think so. Our historical HHS data doesn't accurately reflect what's in the healthdata.gov archives and we should probably still fix that at some point.
