ZeroDivisionError when groupby rank with method="dense" and pct=True #23864

Merged
merged 11 commits into from
Dec 3, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -1420,6 +1420,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`DataFrame.expanding` in which the ``axis`` argument was not being respected during aggregations (:issue:`23372`)
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` which caused missing values when the input function can accept a :class:`DataFrame` but renames it (:issue:`23455`).
- Bug in :func:`pandas.core.groupby.GroupBy.nth` where column order was not always preserved (:issue:`20760`)
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`).

Reshaping
^^^^^^^^^
5 changes: 4 additions & 1 deletion pandas/_libs/groupby_helper.pxi.in
@@ -588,7 +588,10 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
            if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
                out[i, 0] = NAN
            else:
                out[i, 0] = out[i, 0] / grp_sizes[i, 0]
Contributor

You need to fill with NAN if grp_sizes[i, 0] == 0.

Contributor

Make this another elif condition; then the else can be the division.

Contributor Author

@Koustav-Samaddar Nov 23, 2018

Aren't Lines 588-589 taking care of the NAN edge case?

In my testing I have only ever been able to get grp_sizes[i, 0] == 0 (in Line 591) in the bugged case (pct=True, method="dense").

So the idea behind my fix is that grp_sizes[i, 0] should never be 0 in the else block, and if it is, it should be treated as 1. Leaving out[i, 0] unmodified is the same as dividing by 1, which is why there is no else block for the if on Line 590.

Contributor

No, you need to add this case there as well: if the denominator is 0, the result is NaN.

Contributor Author

@Koustav-Samaddar Nov 23, 2018

As it is right now, doing that would cause incorrect results (would fail the test I added in test_rank.py). The reasoning behind this is:

  1. grp_sizes[i, 0] should legitimately be 0 only if all the values in a group are NaN. In that case the if statement on Lines 588-589 handles it, so execution never enters the else block in question.

  2. grp_sizes[i, 0] shouldn't be 0, but it is when a group has only one member with a non-NaN value, another index has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes BUG 23666.

Therefore my fix only performs the division by grp_sizes[i, 0] when it is not equal to 0 (the if block on Line 591), because the else case only demands division by 1 (the corrected grp_sizes[i, 0]).

To summarise, making the result NaN when grp_sizes[i, 0] == 0 would stop the ZeroDivisionError from occurring in BUG 23666 but replace it with incorrect behaviour.

An additional note: my testing suggests there is a flaw in the calculation of grp_sizes[i, 0] itself under these circumstances, and that other parameters (not just method="dense") can also behave oddly because of an incorrect grp_sizes[i, 0]. Once I'm able to determine the extent of this, I'm going to create a new issue targeting the grp_sizes[i, 0] calculation bug. This fix is a stop-gap measure until then, since it corrects one instance of incorrect rank behaviour.
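
The two positions in this thread differ only in what happens when a valid rank meets a zero group size. Below is a minimal pure-Python sketch of that single step. It is not the Cython in groupby_helper.pxi.in: `normalize_pct` and its `zero_policy` argument are hypothetical names introduced purely for illustration.

```python
import math


def normalize_pct(rank, grp_size, zero_policy):
    """Hypothetical stand-in for one iteration of the pct normalisation.

    zero_policy="keep" mimics the guard proposed in this PR (leave the rank
    unchanged, i.e. divide by an implied 1); zero_policy="nan" mimics the
    reviewer's suggestion (fill with NaN when the denominator is 0).
    """
    if math.isnan(rank):
        # NaN ranks stay NaN regardless of the group size.
        return math.nan
    if grp_size != 0:
        return rank / grp_size
    # A valid rank with a zero group size: the disputed case.
    if zero_policy == "keep":
        return rank
    return math.nan


# The case behind GH 23666: dense rank 1.0, but the recorded group size is 0.
print(normalize_pct(1.0, 0, "keep"))  # 1.0
print(normalize_pct(1.0, 0, "nan"))   # nan
```

Under the "keep" policy the single-member group gets 1.0, which is the value the new test expects; under the "nan" policy the same input produces NaN.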

Member

Please improve the test case to make an assertion about the result of this expression. It’s not enough to assert that it simply computes, which is the behavior now.

Doing that would aid this discussion.

Contributor Author

Ah sorry, my bad. I forgot to push it.

                # BUG 23666 - grp_sizes[i, 0] should never be 0
                # If it's zero, it means to be 1 a.k.a. no change
                if grp_sizes[i, 0] != 0:
                    out[i, 0] = out[i, 0] / grp_sizes[i, 0]
{{endif}}
{{endfor}}

12 changes: 12 additions & 0 deletions pandas/tests/groupby/test_rank.py
@@ -290,3 +290,15 @@ def test_rank_empty_group():
    result = df.groupby(column).rank(pct=True)
    expected = DataFrame({"B": [0.5, np.nan, 1.0]})
    tm.assert_frame_equal(result, expected)


def test_rank_zero_div():
    # GH 23666
    df = DataFrame({
        "A": [1, 2],
        "B": [1, 1]
    })

    result = df.groupby("A").rank(pct=True, method="dense")
    expected = DataFrame({"B": [1.0, 1.0]})
    tm.assert_frame_equal(result, expected)
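
To see why `[1.0, 1.0]` is the expected value, the computation can be reproduced by hand without pandas. The sketch below assumes that a dense percentage rank is the dense rank divided by the number of distinct non-NaN values in the group. `dense_pct_rank` is a hypothetical helper written for illustration, not a pandas function.

```python
import math
from collections import defaultdict


def dense_pct_rank(keys, values):
    """Dense percentage rank of each value within its group.

    Assumption: the divisor is the count of distinct non-NaN values in the
    group, so a single-member group always yields 1 / 1 = 1.0.
    """
    distinct_by_group = defaultdict(set)
    for key, value in zip(keys, values):
        if not math.isnan(value):
            distinct_by_group[key].add(value)
    ordered = {key: sorted(vals) for key, vals in distinct_by_group.items()}

    result = []
    for key, value in zip(keys, values):
        if math.isnan(value):
            result.append(math.nan)
        else:
            distinct = ordered[key]
            # Dense rank is the 1-based position among the distinct values.
            result.append((distinct.index(value) + 1) / len(distinct))
    return result


# Same data as test_rank_zero_div: groups A=1 and A=2 each hold the value 1.
print(dense_pct_rank([1, 2], [1, 1]))  # [1.0, 1.0]
```

Each group has exactly one distinct value, so the divisor is 1 for both rows. The same value appearing in another group does not affect the result.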