Skip to content

NSSP data for HRRs is not summed properly #2129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
melange396 opened this issue Feb 27, 2025 · 5 comments · Fixed by #2131
Closed

NSSP data for HRRs is not summed properly #2129

melange396 opened this issue Feb 27, 2025 · 5 comments · Fixed by #2131
Assignees
Labels
bug Something isn't working data quality Missing data, weird data, broken data

Comments

@melange396
Copy link
Contributor

it is computed with a basic sum instead of a weighted one like the other geo types we aggregate ourselves. as a result the scale is off and can be >100% (see nssp:pct_ed_visits_influenza @hrr:366 )

@melange396 melange396 added bug Something isn't working data quality Missing data, weird data, broken data labels Feb 27, 2025
@aysim319 aysim319 self-assigned this Feb 28, 2025
@minhkhul
Copy link
Contributor

minhkhul commented Feb 28, 2025

Probably adjust some wording/add warning in geomapper here too. This original geomap instruction was why I implemented custom weighting for msa and hhs but not hrr.

Example 2: to replace a geocode column with a new one, aggregating the data
with weights:
> gmpr = GeoMapper()
> df = gmpr.replace_geocode(df, "fips", "zip",
from_col="fips", new_col="geo_id",
date_col="timestamp", dropna=False)

@melange396
Copy link
Contributor Author

I just did another read-through of some GeoMapper code... It looks like .replace_geocode() can/should produce weighted sums for certain geo-level conversions (through a sneaky and convoluted path), but the precomputed fips-->hrr weights do not add up the way they should. Using your other approach that calls .aggregate_by_weighted_sum() w/ population should get around this.

@dshemetov
Copy link
Contributor

dshemetov commented Feb 28, 2025

I'm pretty familiar with GeoMapper, having written much of the current core logic in #217 and in #1960, so I figure I should provide some context. These functions contain a lot of implicit assumptions about the data, by the way they handle the missingness edge case, and these assumptions are hard to convey succinctly (more on that below). I don't think the fips->hrr weights are wrong. I think the TL;DR of the problem here is that

We should definitely use aggregate_by_weighted_sum() in this code path. And I should improve the GeoMapper docstring. Maybe add a "GeoMapper by example" notebook too.

Context:

("Source geo" and "target geo" below refer to the geos being translated. So for the bug in question, source geos are fips and target geos are hrrs).

NSSP data is all percent of state ED visits and it's full of incomplete reporting (i.e. not every county reports values), so (a) adding values across separate source geos doesn't make sense without some sort of population adjustment, (b) we need a strategy to handle missingness in source geos. Since the target geo needs to also report a percentage, it has to be normalized after aggregating and we're forced into a modeling choice by our choice of denominator:

  • if you normalize by the total population of the target geo, but the numerator only added present source geos, then that translates to an implicit zero-fill assumption on the missing source geos
  • if you normalize by the population of only the present source geos, then that translates to an implicit extrapolation from the present geos

David and I settled on the latter choice, since we figured that assuming nearby missing geos are similar is more correct than assuming they are zero. He wrote the original aggregate_by_weighted_sum() for this purpose and I generalized and ported it to GeoMapper. In hindsight, and as these things tend to go, we spent all our time getting the functionality right and no time choosing a good function name, so we ended up with this too-generic-to-be-descriptive thing 🙃.

@aysim319 aysim319 linked a pull request Mar 3, 2025 that will close this issue
@melange396
Copy link
Contributor Author

Reopening because this still needs some sort of "patching-like" action, either deleting the erroneous data or overwriting it with correct data, depending on available options (im not familiar enough with the issuance and revision behavior of the source data nor with our backups of it off the top of my head).

@melange396 melange396 reopened this Mar 12, 2025
@minhkhul
Copy link
Contributor

minhkhul commented Mar 19, 2025

todo:

  • delete old hrr data
    • took ~1hr late last night
    • DELETE epimetric_x FROM epimetric_x JOIN signal_dim USING (signal_key_id) JOIN geo_dim USING (geo_key_id) WHERE source='nssp' AND geo_type='hrr';
  • release covidcast indicators to the fix is applied for future runs.
  • patch old hrr data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data quality Missing data, weird data, broken data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants