ZeroDivisionError when groupby rank with method="dense" and pct=True #23864
…is 1. For some reason the grp_size is stored as 0 instead of 1 in this example, causing the error.
Added change to the what's new file.
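For readers who want to see the call pattern from the title, here is a minimal sketch. It reuses the first input from the test data added later in this PR (one group with a single non-NaN value that also appears in another group); the variable names are illustrative. The test in this PR expects `[0.5, 1.0, 1.0, NaN]` for column `B` on a pandas build that includes the fix.

```python
import numpy as np
import pandas as pd

# Inputs borrowed from the test data in this PR: group 2 has a single
# non-NaN value (1), and the same value also appears in group 1.
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]})

# The parameter combination named in the issue title.
result = df.groupby("A").rank(method="dense", pct=True)
print(result)
```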
the first thing to always write is a test
Codecov Report
@@ Coverage Diff @@
## master #23864 +/- ##
==========================================
+ Coverage 92.28% 92.35% +0.06%
==========================================
Files 161 161
Lines 51500 51557 +57
==========================================
+ Hits 47528 47613 +85
+ Misses 3972 3944 -28
Hello @Koustav-Samaddar! Thanks for updating the PR.
Comment last updated on December 01, 2018 at 20:52 Hours UTC
pandas/_libs/groupby_helper.pxi.in
Outdated
@@ -588,7 +588,13 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
    out[i, 0] = NAN
else:
    out[i, 0] = out[i, 0] / grp_sizes[i, 0]
you need to fill with NAN if the grp_sizes[i, 0] == 0
make this another elif condition, then the else can be the division.
Aren't Lines 588-589 taking care of the NAN edge case?
In my testing I have only ever been able to get grp_sizes[i, 0] == 0 (in Line 591) in the bugged case (pct=True, method="dense").
So my idea for the fix was that grp_sizes[i, 0] should never be 0 in the else block, and if it is, then it should be 1 (not modifying out[i, 0] is the same as dividing by 1). That is why there's no else block for the if on Line 590.
no, you need to add this case there as well, if the denominator is 0, the result is NaN
As it is right now, doing that would cause incorrect results (it would fail the test I added in test_rank.py). The reasoning behind this is:
- grp_sizes[i, 0] should legitimately be 0 only if all the values in a group are NaN. In that case the if statement on Lines 588-589 handles it, so execution never enters the else block in question.
- grp_sizes[i, 0] shouldn't be 0, but is, when a group has only one member with a non-NaN value, another index has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes BUG 23666.
Therefore my fix only performs the division by grp_sizes[i, 0] when it is not equal to 0, in the if block [Line 591], because the else condition only demands division by 1 (the corrected grp_sizes[i, 0]).
To summarise, making the result NaN when grp_sizes[i, 0] == 0 would stop the ZeroDivisionError from occurring in BUG 23666 but replace it with incorrect behaviour.
An additional note: my testing suggests there is a flaw in the calculation of grp_sizes[i, 0] itself under these circumstances, and that other parameters (not just method="dense") can also behave oddly because of an incorrect grp_sizes[i, 0]. Once I'm able to determine the extent of this, I'm going to create a new issue targeting the grp_sizes[i, 0] calculation bug. This fix is a stop-gap measure until then, since it corrects one instance of incorrect rank behaviour.
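The two positions in this thread can be contrasted with a small plain-Python sketch. This is not the Cython template from groupby_helper.pxi.in; the function names are hypothetical and the single row is an assumed stand-in for the buggy case described above (a valid dense rank of 1 paired with a group size wrongly stored as 0).

```python
import math

NAN = float("nan")


def pct_nan_on_zero(rank, grp_size):
    """Reviewer's suggestion: a zero denominator yields NaN."""
    if math.isnan(rank):
        return NAN
    elif grp_size == 0:
        return NAN
    else:
        return rank / grp_size


def pct_skip_on_zero(rank, grp_size):
    """Author's stop-gap: treat a zero size as 1, leaving the rank unchanged."""
    if math.isnan(rank):
        return NAN
    if grp_size != 0:
        return rank / grp_size
    return rank


# Hypothetical row: dense rank 1 in a group whose size was wrongly stored as 0.
rank, bad_size = 1.0, 0
print(pct_nan_on_zero(rank, bad_size))   # NaN, where the PR's test expects 1.0
print(pct_skip_on_zero(rank, bad_size))  # 1.0
```

Both variants avoid the ZeroDivisionError; they differ only in what they return when the stored size is 0 but the rank itself is valid.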
Please improve the test case to make an assertion about the result of this expression. It’s not enough to assert that it simply computes, which is the behavior now.
Doing that would aid this discussion.
Ah sorry, my bad. I forgot to push it.
Minor nit on comment placement but otherwise lgtm for this change. If you find a way to refactor as you keep looking at it certainly open to subsequent PRs
Also make sure you check the Travis failures. Think you have a styling issue somewhere
I skimmed through the Travis logs and wasn't able to find (or didn't understand) what the issue was. I didn't modify any of the source files that I saw mentioned in the logs. There was an error raised by scripts/tests/test_validate_docstrings.py but I couldn't understand the logs from it. I rechecked with pep8 and flake8 and couldn't find anything sticking out.
The failure is in @datapythonista FYI failures in that script don't provide a useful error message in CI currently. Not sure if that's on your radar
@WillAyd the error message If you mean that they are difficult to find, that's addressed in #22854, which should make them much easier and quicker to find. We have a permissions problem with Azure, that I expect to be fixed early next week, so with some luck we have it merged very soon.
I ran
The Travis failure was unrelated to the linting; it was caused by a URL that could not be fetched. I reran it, let's see if it succeeds now.
I don't buy this. why would a grp_size ever be 0?
If I have a group where all the values are TL;DR there are valid cases where
@datapythonista When I run ci/code_checks.sh on my local machine it doesn't raise any errors. However, in Travis I'm getting Any help is greatly appreciated!
and what is an example of this
Here's an example
If we were to do Here, the second group ( Hope this helps!
and what would the result frame look like
Sorry again.
I don't know how to link a snippet of code from a commit, but in test_rank.py I changed the testing function to include the permutations of grp_sizes and NaNs that you requested in a previous comment.
@Koustav-Samaddar ok that result looks ok to me. still have some comments.
pandas/tests/groupby/test_rank.py
Outdated
({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]}, {"B": [0.5, 1.0, 1.0, np.nan]}),
({"A": [1, 1, 2], "B": [1, 2, np.nan]}, {"B": [0.5, 1.0, np.nan]})
])
def test_rank_zero_div(df_dicts):
why are you passing a dict? just pass 2 arguments in the parameterize. this is very hard to read.
I'm not sure if I understand this correctly, so I followed the same styling as a previous test in the same file.
I hope the changes are more readable!
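One way to read the request for two arguments is sketched below: each case supplies the input data and the expected ranks as separate parameters instead of one combined object. This is an illustration built from the data in the hunk above, not necessarily the form that was merged; the parameter names `data` and `expected` are assumptions.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("data, expected", [
    ({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]}, [0.5, 1.0, 1.0, np.nan]),
    ({"A": [1, 1, 2], "B": [1, 2, np.nan]}, [0.5, 1.0, np.nan]),
])
def test_rank_zero_div(data, expected):
    df = pd.DataFrame(data)
    result = df.groupby("A").rank(method="dense", pct=True)
    # Assert on the values, not merely that the call completes.
    pd.testing.assert_frame_equal(result, pd.DataFrame({"B": expected}))
```

Keeping input and expected output as separate parameters lets each case be read left to right as "given this frame, expect these ranks".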
lgtm. @WillAyd over to you.
Thanks @Koustav-Samaddar !
git diff upstream/master -u -- "*.py" | flake8 --diff