-
Notifications
You must be signed in to change notification settings - Fork 16
Fix (?) Doctor Visits Megacounty Bug #513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Yeah, it's tricky to verify. I'll call the oldest version (pre-geomapper) There are 606 (out of 1727) rows where any of
Just for the megacounties, only one For non-megacounty changes, I'll mainly compare I don't know what changes were made to geomapper that caused these discrepancies for counties, so I can't make a judgement call on if these differences are acceptable. My initial thought would be that these changes seem minor enough? |
One concern is that there are some
|
Thanks, those plots are super useful. Agree that the differences seem pretty minor, but I'm not really basing this off of anything other than it looking "pretty close." One thing I noticed is that the new geomapper uses As for the nans....that's interesting....need to think a bit more about how that's happening |
So if nan implies it got rolled into a megacounty, then having a few |
@krivard in terms of differences, any guidance on what's acceptable? Since the function is now just the geomapper call + a groupby, any differences are probably geomapper implementation + mapping file differences, and in those cases we'd need to make a call on if we just use the geomapper as the canonical implementation, or add some additional corrections to get them to align better (as a basic example, adjusting the threshold by 1). My preference would be the former so everything's consistent, but only if that doesn't significantly affect downstream use cases. |
Prefer using the geomapper as the canonical implementation. If the resulting changes (from |
I tried running
The rest are still |
OK, I think I found it. @chinandrew you were totally right about forgetting to use
So the first 6 correspond exactly to the 6 in the above comment, and the rest have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the only potential change is from >
to >=
in the thresholding, but if other code uses the former, then it's not a big deal. Otherwise, looks good!
Great catch 🎉 |
🎉 🌮 💪 🎊 💯 Stellar work both of you! |
Description
Doctors visit megafips were decreasing after the georefactor. Looks like the bug was due to me having the geomapper calculate the window threshold on the already rolling-summed column instead of the actual visit column, which would effectively make the new threshold the double sum. Previous tests didn't catch this because the test data was only 1 day, and so there was no window to sum.
It's hard to test exactly against previous output because the geomapping is still a tiny bit different in some areas. At this point this function is basically just the geomapper call, so unit testing it is would be less useful to catch bugs like this vs a higher level integration test. DV package is a bit light on tests anyway, so that's probably a follow up issue once this is fixed.
For now, would be good to run these files on some real data to verify. I tried to verify by hand on one of the full test files available and there weren't huge changes, but when I dug in a bit to the intermediate steps there's some noticeable places where this PR lines up with the old code while the current buggy code doesnt (see below). Not the most comprehensive, so I'd probably feel better if someone else also took a stab at verifying this.
These are the dataframes captured before the groupby in the return of county_to_megacounty(), filtered by megacounties. The columns vary but the extra ones don't appear to be used downstream. Note the row count difference:
old implementation (47016ab)
this PR
prod (with bug)
Changelog
Fixes