Inspect HHS dataset for revision inconsistencies and produce patches #1351

dshemetov · 2023-11-18T02:49:29Z

A number of factors have made our HHS pipeline fragile in recent months. I wrote some WIP code to compare what our API has available against what is on the healthdata.gov site. Posting it here, as it might be useful for debugging and for making patches. It shows some recent discrepancies, though they could just be bugs in my comparison. I haven't tried it on older data.

If anyone feels motivated to dive into this, I'd welcome the help. Tagging myself so I remember to come back to this.

library(epidatr)
library(magrittr)
library(dplyr)
library(readr) # provides read_csv(), used below

#' Download versioned data from healthdata.gov. Since each row reports the
#' previous day's admissions, we subtract 1 from the date.
get_healthdata_from_url <- function(url) {
  url %>%
    read_csv() %>%
    mutate(
      state = tolower(state),
      date = as.Date(date) - 1
    ) %>%
    select(geo_value = state, time_value = date, value = previous_day_admission_influenza_confirmed) %>%
    arrange(geo_value, time_value)
}

#' Download versioned data from epidatr. The data is already adjusted for
#' the previous day.
get_epidata_from_date <- function(date) {
  pub_covidcast(
    source = "hhs",
    signals = "confirmed_admissions_influenza_1d",
    geo_type = "state",
    time_type = "day",
    geo_values = "*",
    time_values = epirange(20200101, date),
    as_of = strftime(date, format = "%Y%m%d")
  ) %>%
    select(geo_value, time_value, value) %>%
    arrange(geo_value, time_value)
}

#' Find differences between healthdata.gov and epidatr.
compare_datasets <- function(healthdata, epidata) {
  inner_join(
    healthdata,
    epidata,
    by = c("geo_value", "time_value")
  ) %>%
    filter(value.x != value.y) %>%
    select(geo_value, time_value, value.x, value.y) %>%
    arrange(geo_value, time_value)
}

get_metadata <- function() {
  read_csv("https://healthdata.gov/api/views/qqte-vkut/rows.csv") %>%
    select(datetime = `Update Date`, link = `Archive Link`) %>%
    mutate(
      # %p (AM/PM) is only honored together with %I (12-hour clock), not %H.
      datetime = as.POSIXct(datetime, format = "%m/%d/%Y %I:%M:%S %p")
    ) %>%
    arrange(desc(datetime))
}

metadata <- get_metadata()
data_comparisons <- metadata %>%
  # Look at data uploads from before the previous day (ones we have a chance of
  # having).
  filter(datetime <= Sys.time() - 60 * 60 * 24) %>%
  head(n = 6) %>%
  rowwise() %>%
  mutate(
    healthdata = list(get_healthdata_from_url(link)),
    # Get the data one day later, since our pipeline is set up to get data for
    # the previous day.
    epidata = list(get_epidata_from_date(as.Date(datetime) + 1)),
    comparison = list(compare_datasets(healthdata, epidata))
  )

# Data drop from 2023-11-15 compared to epidata as of 2023-11-16
data_comparisons[[1L, "comparison"]]

# A tibble: 135 × 4
   geo_value time_value value.x value.y
   <chr>     <date>       <dbl>   <dbl>
 1 ak        2022-12-14       7       8
 2 ak        2022-12-15       3       4
 3 ak        2022-12-16       6       9
 4 ak        2022-12-21      10      12
 5 ak        2022-12-23       4       5
 6 ak        2022-12-24       3       4
 7 ak        2022-12-25       3       4
 8 ak        2022-12-26       8      12
 9 ak        2022-12-27       6       7
10 ak        2022-12-28       7       9
# ℹ 125 more rows
# ℹ Use `print(n = ...)` to see more rows

# Data drop from 2023-11-10 compared to epidata as of 2023-11-11
data_comparisons[[3L, "comparison"]]

# A tibble: 339 × 4
   geo_value time_value value.x value.y
   <chr>     <date>       <dbl>   <dbl>
 1 ak        2022-12-14       7       8
 2 ak        2022-12-15       3       4
 3 ak        2022-12-16       6       9
 4 ak        2022-12-21      10      12
 5 ak        2022-12-23       4       5
 6 ak        2022-12-24       3       4
 7 ak        2022-12-25       3       4
 8 ak        2022-12-26       8      12
 9 ak        2022-12-27       6       7
10 ak        2022-12-28       7       9
# ℹ 329 more rows
# ℹ Use `print(n = ...)` to see more rows
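
One possible gap in the comparison above: `inner_join` only keeps (geo_value, time_value) pairs present in both sources, and `filter(value.x != value.y)` drops rows where either value is `NA`. A rough sketch of a variant that also reports those cases is below. The name `compare_datasets_full`, the `status` column, and the toy data are made up for illustration and are not part of the pipeline; the toy tibbles just mirror the column layout used above.

```r
library(dplyr)

# Hypothetical variant of compare_datasets(): a full join keeps keys that exist
# on only one side. Note that after the join, a row that is absent from a source
# and a row whose value is NA in that source look the same.
compare_datasets_full <- function(healthdata, epidata) {
  full_join(
    healthdata,
    epidata,
    by = c("geo_value", "time_value"),
    suffix = c("_healthdata", "_epidata")
  ) %>%
    mutate(
      status = case_when(
        is.na(value_healthdata) & is.na(value_epidata) ~ "match",
        is.na(value_healthdata) ~ "missing_in_healthdata",
        is.na(value_epidata) ~ "missing_in_epidata",
        value_healthdata != value_epidata ~ "value_differs",
        TRUE ~ "match"
      )
    ) %>%
    filter(status != "match") %>%
    arrange(geo_value, time_value)
}

# Toy data (made up), shaped like the outputs of the download functions above.
healthdata <- tibble(
  geo_value = c("ak", "ak", "al"),
  time_value = as.Date(c("2022-12-14", "2022-12-15", "2022-12-14")),
  value = c(7, 3, 5)
)
epidata <- tibble(
  geo_value = c("ak", "ak", "al"),
  time_value = as.Date(c("2022-12-14", "2022-12-15", "2022-12-16")),
  value = c(8, 3, 2)
)

result <- compare_datasets_full(healthdata, epidata)
print(result)
```

On the toy data this should flag one differing value and one key missing from each side, while the matching `ak` row for 2022-12-15 is dropped. Whether missing keys matter here is a judgment call, since the two sources may simply cover different date ranges.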

dsweber2 · 2024-06-03T16:05:58Z

Is this still relevant with HHS going archived?

dshemetov · 2024-06-03T16:54:01Z

I think so. Our historical HHS data doesn't accurately reflect what's in the healthdata.gov archives and we should probably still fix that at some point.

dshemetov self-assigned this Nov 18, 2023

dshemetov removed their assignment Jan 3, 2024

melange396 mentioned this issue Jun 6, 2024

Deactivate HHS cmu-delphi/covidcast-indicators#1965

Closed
