Skip to content

Fix (?) Doctor Visits Megacounty Bug #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 19, 2020
Merged

Fix (?) Doctor Visits Megacounty Bug #513

merged 1 commit into from
Nov 19, 2020

Conversation

chinandrew
Copy link
Contributor

Description

Doctors visit megafips were decreasing after the georefactor. Looks like the bug was due to me having the geomapper calculate the window threshold on the already rolling-summed column instead of the actual visit column, which would effectively make the new threshold the double sum. Previous tests didn't catch this because the test data was only 1 day, and so there was no window to sum.

It's hard to test exactly against previous output because the geomapping is still a tiny bit different in some areas. At this point this function is basically just the geomapper call, so unit testing it is would be less useful to catch bugs like this vs a higher level integration test. DV package is a bit light on tests anyway, so that's probably a follow up issue once this is fixed.

For now, would be good to run these files on some real data to verify. I tried to verify by hand on one of the full test files available and there weren't huge changes, but when I dug in a bit to the intermediate steps there's some noticeable places where this PR lines up with the old code while the current buggy code doesnt (see below). Not the most comprehensive, so I'd probably feel better if someone else also took a stab at verifying this.

These are the dataframes captured before the groupby in the return of county_to_megacounty(), filtered by megacounties. The columns vary but the extra ones don't appear to be used downstream. Note the row count difference:

old implementation (47016ab)

(Pdb) result[result.PatCountyFIPS.str.endswith("000")]
     Covid_like  Flu_like  Mixed   Flu1  Denominator ServiceDate PatCountyFIPS Pat HRR Name  Pat HRR ID PatAgeGroup  RecentVisits  ToExclude
0         10405      7953  26466  18649      1326264  2020-02-01         01000          NaN       623.0         NaN     1326264.0      False
1            32       455    623   1271        84568  2020-02-01         02000          NaN        61.0         NaN       84568.0      False
2          2423      1859   6066   4697       259260  2020-02-01         04000          NaN       357.0         NaN      259260.0      False
3         16971     12361  41544  34968      2035103  2020-02-01         05000          NaN      4533.0         NaN     2035103.0      False
4         10366      9710  30666  19512      1492464  2020-02-01         06000          NaN      4521.0         NaN     1492464.0      False
..          ...       ...    ...    ...          ...         ...           ...          ...         ...         ...           ...        ...
157          13       201   1488     39         9207  2020-02-06         38000          NaN     14814.0         NaN        9207.0      False
158        1489      1377   5377   2300       105203  2020-02-06         46000          NaN     60698.0         NaN      105203.0      False
159         495       120    647   1619        58023  2020-02-06         47000          NaN     15278.0         NaN       58023.0      False
160         748       314   2810   1223        23320  2020-02-06         49000          NaN     31071.0         NaN      134870.0      False
161         311       603    842    141        95878  2020-02-06         55000          NaN     15809.0         NaN       95878.0      False

[162 rows x 12 columns]

this PR

(Pdb) data[data.PatCountyFIPS.str.endswith("000")]
      ServiceDate PatCountyFIPS  Covid_like  Flu_like  Mixed   Flu1  Denominator  Pat HRR ID
0      2020-02-01         01000       10405      7953  26466  18649      1326264       623.0
37     2020-02-01         02000          32       455    623   1271        84568        61.0
40     2020-02-01         04000        2423      1859   6066   4697       259260       357.0
51     2020-02-01         05000       16971     12361  41544  34968      2035103      4533.0
82     2020-02-01         06000       10366      9710  30666  19512      1492464      4521.0
...           ...           ...         ...       ...    ...    ...          ...         ...
13283  2020-02-06         38000          13       201   1488     39         9207     14814.0
13635  2020-02-06         46000        1489      1377   5377   2300       105203     60698.0
13677  2020-02-06         47000         495       120    647   1619        58023     15278.0
14005  2020-02-06         49000         748       314   2810   1223        23320     31071.0
14242  2020-02-06         55000         311       603    842    141        95878     15809.0

[162 rows x 8 columns]

prod (with bug)

(Pdb) data[data.PatCountyFIPS.str.endswith("000")]
      ServiceDate PatCountyFIPS  Covid_like  Flu_like  Mixed   Flu1  Denominator  Pat HRR ID  RecentVisits
0      2020-02-01         01000       10405      7953  26466  18649      1326264       623.0     1326264.0
37     2020-02-01         02000          32       455    623   1271        84568        61.0       84568.0
40     2020-02-01         04000        2423      1859   6066   4697       259260       357.0      259260.0
51     2020-02-01         05000       16971     12361  41544  34968      2035103      4533.0     2035103.0
82     2020-02-01         06000       10366      9710  30666  19512      1492464      4521.0     1492464.0
...           ...           ...         ...       ...    ...    ...          ...         ...           ...
13075  2020-02-06         30000          18       108   1348   1660        14178     14482.0       14178.0
13431  2020-02-06         38000          13       201   1488     39         9207     14814.0        9207.0
13783  2020-02-06         46000        1489      1377   5377   2300       105203     60698.0      105203.0
13825  2020-02-06         47000         495       120    647   1619        58023     15278.0       58023.0
14391  2020-02-06         55000         311       603    842    141        95878     15809.0       95878.0

[135 rows x 9 columns]

Changelog

  • Adjust the column that the geomapper looks at to count

Fixes

@chinandrew chinandrew requested a review from krivard November 13, 2020 08:59
@krivard krivard requested a review from mariajahja November 13, 2020 15:45
@mariajahja
Copy link
Member

Yeah, it's tricky to verify. I'll call the oldest version (pre-geomapper) orig, the prod with bug prod, and this pr with fix as fix.

There are 606 (out of 1727) rows where any of orig, prod, or fix differ from each other.

diff_rows = comp_data[comp_data.orig_val.ne(comp_data.prod_val) | 
          comp_data.orig_val.ne(comp_data.fix_val) |
          comp_data.fix_val.ne(comp_data.prod_val)]

diff_rows 

geo_id | orig_val | prod_val | fix_val
-- | -- | -- | --
01000 | 5.459617 | NaN | 5.088082
01003 | 1.663579 | 0.370079 | 0.370079
01011 | 3.106940 | 3.106941 | 3.106941
01013 | 5.192788 | 5.192788 | NaN
01025 | 4.826946 | 3.508243 | 3.508243
... | ... | ... | ...
55135 | 4.667270 | 4.667270 | 4.043397
56000 | 3.789481 | NaN | 3.869652
56033 | 1.521005 | 1.521005 | 1.552264
28039 | NaN | 10.307503 | 10.307503
78030 | NaN | 4.345646 | NaN

606 rows × 4 columns

Just for the megacounties, only one geo_id (02000) matches exactly between orig and fix. The other 39 differ: the mean difference is 0.11, and the max difference is 0.554 for geo_id=47000 (Tennessee). None of the megacounties in fix is nan. Plotting these values, you can see that prod is very missing, but the rest align relatively well.

diff_mega_val.pdf

For non-megacounty changes, I'll mainly compare prod to fix. There are 175 rows that don't match between the two. There are 60 rows where fix is nan, but prod produces values, which is expected, since we've increased the threshold. For the remaining 115 rows, the mean diff is 0.174, median is 0.091 the max diff is 3.668. Plotting these differences (along with orig), it doesn't look too bad.

diff_nonmega_val.pdf

I don't know what changes were made to geomapper that caused these discrepancies for counties, so I can't make a judgement call on if these differences are acceptable. My initial thought would be that these changes seem minor enough?

@mariajahja
Copy link
Member

One concern is that there are some nan values in the fixed version, but not the original.

comp_data[comp_data.orig_val.isnull() | comp_data.fix_val.isnull()]

geo_id | orig_val | prod_val | fix_val
-- | -- | -- | --
01013 | 5.192788 | 5.192788 | NaN
01057 | 5.766869 | 5.766869 | NaN
01065 | 8.056592 | 8.056592 | NaN
01067 | 0.350194 | 0.350194 | NaN
01095 | 8.614593 | 8.614593 | NaN
... | ... | ... | ...
54027 | 0.705369 | 0.705369 | NaN
54051 | 3.761527 | 3.761527 | NaN
54061 | 3.655365 | 3.655365 | NaN
28039 | NaN | 10.307503 | 10.307503
78030 | NaN | 4.345646 | NaN

61 rows × 4 columns

@chinandrew
Copy link
Contributor Author

chinandrew commented Nov 13, 2020

Thanks, those plots are super useful. Agree that the differences seem pretty minor, but I'm not really basing this off of anything other than it looking "pretty close." One thing I noticed is that the new geomapper uses data_roll["_thr_col_roll"] > thr_count while the old uses data["RecentVisits"] < threshold_visits, so minor differences would be expected if you have the == case

As for the nans....that's interesting....need to think a bit more about how that's happening

@chinandrew
Copy link
Contributor Author

So if nan implies it got rolled into a megacounty, then having a few prod counties be non-nan and fix be nan isn't unexpected since prod wasn't creating that many megacountes. I'm more confused why orig wasn't nan while fix is. Maybe it's that > vs < case I mentioned? Would it be easy to check what the rolling sum of those nan counties were? If they equal the threshold, they'd be kept in the old implementation but rolled up in the new implementation. I guess another way to check is to run fix with threshold - 1.

@chinandrew
Copy link
Contributor Author

@krivard in terms of differences, any guidance on what's acceptable?

Since the function is now just the geomapper call + a groupby, any differences are probably geomapper implementation + mapping file differences, and in those cases we'd need to make a call on if we just use the geomapper as the canonical implementation, or add some additional corrections to get them to align better (as a basic example, adjusting the threshold by 1). My preference would be the former so everything's consistent, but only if that doesn't significantly affect downstream use cases.

@krivard
Copy link
Contributor

krivard commented Nov 16, 2020

Prefer using the geomapper as the canonical implementation. If the resulting changes (from orig) are substantial and/or noticeable, then the judgment call is whether to log the change and notify the list, or retire the current set of signals and spin up new ones.

@mariajahja
Copy link
Member

mariajahja commented Nov 18, 2020

I tried running fix with threshold-1 (call it threshm1), and I'm not sure that's the only culprit. It appears that only 6 counties were the ones on the threshold (ignoring the last row):

comp_data[(comp_data.orig_val.isnull() | comp_data.fix_val.isnull())
          & ~comp_data.threshm1_val.isnull()]

 | geo_id | orig_val | prod_val | fix_val | threshm1_val
-- | -- | -- | -- | -- | --
01057 | 5.766869 | 5.766869 | NaN | 5.621354
01095 | 8.614593 | 8.614593 | NaN | 8.739494
13209 | 4.343007 | 4.343007 | NaN | 4.440178
21081 | 3.830144 | 3.830144 | NaN | 3.975959
23023 | 0.574681 | 0.574681 | NaN | 0.580926
29011 | 6.462598 | 6.462598 | NaN | 6.364793
28039 | NaN | 10.307503 | 10.307503 | 10.307503

The rest are still nan in threshm1 which implies they were still rolled up in the megacounties while orig kept them separate. I'll look further into trying to get the actual rolling sums, but will have to check it in both orig and fix to see.

@mariajahja
Copy link
Member

OK, I think I found it. @chinandrew you were totally right about forgetting to use toExclude properly in the original implementation. They were still included, the other line was only excluding them in the megacounty part. If I add an extra line to exclude them after the merge, and rerun orig, I get orig_exclude, which looks like this:

diff_rows = diff_rows[~(diff_rows.orig_val.isnull() & diff_rows.fix_val.isnull())]
(diff_rows[["geo_id", "orig_val", "orig_exclude_val", "fix_val"]]
 .sort_values(["orig_exclude_val"])
)

geo_id | orig_val | orig_exclude_val | fix_val
-- | -- | -- | --
23023 | 0.574681 | 0.580926 | NaN
21081 | 3.830144 | 3.975959 | NaN
13209 | 4.343007 | 4.440178 | NaN
01057 | 5.766869 | 5.621354 | NaN
29011 | 6.462598 | 6.364793 | NaN
01095 | 8.614593 | 8.739494 | NaN
01013 | 5.192788 | NaN | NaN
01065 | 8.056592 | NaN | NaN
01067 | 0.350194 | NaN | NaN
13101 | 0.622303 | NaN | NaN
13125 | 2.912591 | NaN | NaN
13195 | 4.850784 | NaN | NaN
15007 | 3.544752 | NaN | NaN
....
1640	53031	0.679050	NaN	NaN
1662	54027	0.705369	NaN	NaN
1667	54051	3.761527	NaN	NaN
1670	54061	3.655365	NaN	NaN
1725	28039	NaN	NaN	10.307503

So the first 6 correspond exactly to the 6 in the above comment, and the rest have orig=fix as nan.

Copy link
Member

@mariajahja mariajahja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only potential change is from > to >= in the thresholding, but if other code uses the former, then it's not a big deal. Otherwise, looks good!

@chinandrew
Copy link
Contributor Author

OK, I think I found it. @chinandrew you were totally right about forgetting to use toExclude properly in the original implementation. They were still included, the other line was only excluding them in the megacounty part. If I add an extra line to exclude them after the merge, and rerun orig, I get orig_exclude, which looks like this:

diff_rows = diff_rows[~(diff_rows.orig_val.isnull() & diff_rows.fix_val.isnull())]
(diff_rows[["geo_id", "orig_val", "orig_exclude_val", "fix_val"]]
 .sort_values(["orig_exclude_val"])
)

geo_id | orig_val | orig_exclude_val | fix_val
-- | -- | -- | --
23023 | 0.574681 | 0.580926 | NaN
21081 | 3.830144 | 3.975959 | NaN
13209 | 4.343007 | 4.440178 | NaN
01057 | 5.766869 | 5.621354 | NaN
29011 | 6.462598 | 6.364793 | NaN
01095 | 8.614593 | 8.739494 | NaN
01013 | 5.192788 | NaN | NaN
01065 | 8.056592 | NaN | NaN
01067 | 0.350194 | NaN | NaN
13101 | 0.622303 | NaN | NaN
13125 | 2.912591 | NaN | NaN
13195 | 4.850784 | NaN | NaN
15007 | 3.544752 | NaN | NaN
....
1640	53031	0.679050	NaN	NaN
1662	54027	0.705369	NaN	NaN
1667	54051	3.761527	NaN	NaN
1670	54061	3.655365	NaN	NaN
1725	28039	NaN	NaN	10.307503

So the first 6 correspond exactly to the 6 in the above comment, and the rest have orig=fix as nan.

Great catch 🎉

@krivard
Copy link
Contributor

krivard commented Nov 19, 2020

🎉 🌮 💪 🎊 💯

Stellar work both of you!

@krivard krivard merged commit c36214c into dv-package Nov 19, 2020
@krivard krivard deleted the dv-bug branch November 19, 2020 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants