ZeroDivisionError when groupby rank with method="dense" and pct=True #23864
Changes from 9 commits
3721987
a1ba61e
b8d343c
1be7457
f1db09c
7ce206c
976eccc
87ee207
34047da
88f5cb4
74f9e0b
@@ -290,3 +290,18 @@ def test_rank_empty_group():
    result = df.groupby(column).rank(pct=True)
    expected = DataFrame({"B": [0.5, np.nan, 1.0]})
    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("df_dicts", [
    ({"A": [1, 2], "B": [1, 1]}, {"B": [1.0, 1.0]}),
    ({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2]}, {"B": [0.5, 1.0, 0.5, 1.0]}),
    ({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]}, {"B": [0.5, 1.0, 1.0, np.nan]}),
    ({"A": [1, 1, 2], "B": [1, 2, np.nan]}, {"B": [0.5, 1.0, np.nan]})
])
def test_rank_zero_div(df_dicts):
why are you passing a dict? just pass 2 arguments in the parameterize. this is very hard to read.
I'm not sure if I understand this correctly, so I followed the same styling as a previous test in the same file. I hope the changes are more readable!
    # GH 23666
    df = DataFrame(df_dicts[0])

    result = df.groupby("A").rank(method="dense", pct=True)
    expected = DataFrame(df_dicts[1])
    tm.assert_frame_equal(result, expected)
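For reference, the expected values in the parametrized cases above can be re-derived with a small pure-Python sketch. It assumes that a dense percentage rank is the dense rank within the group divided by the number of distinct non-NaN values in that group, which is consistent with the expected frames in the test. `dense_pct_rank` is a hypothetical helper for illustration only, not a pandas function.

```python
import math


def dense_pct_rank(keys, values):
    """Dense rank of each value within its group, as a fraction.

    Assumption: the denominator is the number of distinct non-NaN
    values in the group. NaN inputs produce NaN outputs.
    """
    # Collect the distinct non-NaN values seen in each group.
    distinct = {}
    for key, value in zip(keys, values):
        if not math.isnan(value):
            distinct.setdefault(key, set()).add(value)

    ranks = []
    for key, value in zip(keys, values):
        if math.isnan(value):
            ranks.append(math.nan)
            continue
        ordered = sorted(distinct[key])
        # Dense rank is the 1-based position among the distinct values.
        ranks.append((ordered.index(value) + 1) / len(ordered))
    return ranks


print(dense_pct_rank([1, 2], [1, 1]))              # [1.0, 1.0]
print(dense_pct_rank([1, 1, 2, 2], [1, 2, 1, 2]))  # [0.5, 1.0, 0.5, 1.0]
```

The first call is the case from GH 23666: each group holds a single value, so each row should rank at 1.0 rather than raise.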
you need to fill with NaN if grp_sizes[i, 0] == 0
make this another elif condition, then the else can be the division.
Aren't Lines 588-589 taking care of the NaN edge case?
In my testing I have only ever been able to get grp_sizes[i, 0] == 0 (in Line 591) in the bugged case (pct=True, method="dense").
So my idea for the fix was that grp_sizes[i, 0] should never be 0 in the else block, and if it is, it should really be 1. Leaving out[i, 0] unmodified is the same as dividing by 1, which is why there is no else block for the if on Line 590.
no, you need to add this case there as well, if the denominator is 0, the result is NaN
As it is right now, doing that would cause incorrect results (it would fail the test I added in test_rank.py). The reasoning behind this is:
grp_sizes[i, 0] should legitimately be 0 only if all the values in a group are NaN. The if statement in Lines 588-589 takes care of this case, so execution never enters the else block in question.
grp_sizes[i, 0] shouldn't be 0, but is, when a group has only one member with a non-NaN value, another index has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes bug 23666.
Therefore my fix only performs the division by grp_sizes[i, 0] in the if block [Line 591] when it is not equal to 0, because the else condition only demands division by 1 (the corrected grp_sizes[i, 0]).
To summarise, making the result NaN when grp_sizes[i, 0] == 0 would stop the ZeroDivisionError from occurring in bug 23666 and replace it with incorrect behaviour.
An additional note: my testing suggests there is a flaw in the calculation of grp_sizes[i, 0] itself under these circumstances, and that other parameters (not just method="dense") can also show weird behaviour caused by an incorrect grp_sizes[i, 0]. Once I'm able to determine the extent of this, I'm going to create a new issue targeting the grp_sizes[i, 0] calculation bug. This fix is a stop-gap measure until then, since it corrects one instance of incorrect rank behaviour.
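The disagreement above comes down to what the normalization step should do when the recorded group size is 0 for a rank that is not NaN. A minimal pure-Python sketch of the two options follows. It is not the Cython code under review: `normalize_pct` and its `zero_policy` argument are hypothetical, the arrays are flattened to 1-D lists, and the input `grp_sizes=[1, 0]` is an assumed stand-in for the miscounted group size described above.

```python
import math


def normalize_pct(out, grp_sizes, zero_policy):
    """Scale raw ranks into fractions of their group size.

    zero_policy controls the case where the recorded size is 0
    but the rank itself is not NaN:
      "skip" leaves the rank untouched (same as dividing by 1),
      "nan"  writes NaN instead.
    """
    result = list(out)
    for i, rank in enumerate(out):
        if math.isnan(rank):
            # NaN ranks stay NaN regardless of the group size.
            result[i] = math.nan
        elif grp_sizes[i] != 0:
            result[i] = rank / grp_sizes[i]
        elif zero_policy == "nan":
            result[i] = math.nan
        # zero_policy == "skip": keep the raw rank as-is.
    return result


# Assumed inputs modelling {"A": [1, 2], "B": [1, 1]} with one size miscounted as 0.
raw_ranks = [1.0, 1.0]
sizes = [1, 0]
print(normalize_pct(raw_ranks, sizes, "skip"))
print(normalize_pct(raw_ranks, sizes, "nan"))
```

Under these assumed inputs, "skip" yields the `[1.0, 1.0]` that the new test expects for this frame, while "nan" also avoids the division by zero but returns NaN for the second row. Neither variant fixes the underlying group size count.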
Please improve the test case to make an assertion about the result of this expression. It's not enough to assert that it simply computes, which is the behavior now.
Doing that would aid this discussion.
Ah sorry, my bad. I forgot to push it.