ZeroDivisionError when groupby rank with method="dense" and pct=True #23864

Koustav-Samaddar · 2018-11-23T01:26:54Z

closes ZeroDivisionError when groupby rank with method="dense" and pct=True #23666
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

…is 1 For some reason the grp_size is stored as 0 instead of 1 in this example causing the error.

Added change to the what's new file.

jreback · 2018-11-23T01:54:14Z

the first thing to always write is a test

codecov · 2018-11-23T01:59:02Z

Codecov Report

Merging #23864 into master will increase coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #23864      +/-   ##
==========================================
+ Coverage   92.28%   92.35%   +0.06%     
==========================================
  Files         161      161              
  Lines       51500    51557      +57     
==========================================
+ Hits        47528    47613      +85     
+ Misses       3972     3944      -28

Flag	Coverage Δ
#multiple	`90.75% <ø> (+0.06%)`	⬆️
#single	`42.46% <ø> (+0.14%)`	⬆️

Impacted Files	Coverage Δ
pandas/io/formats/console.py	`74.24% <0%> (-1.52%)`	⬇️
pandas/core/arrays/timedeltas.py	`96% <0%> (-0.45%)`	⬇️
pandas/plotting/_misc.py	`38.68% <0%> (-0.31%)`	⬇️
pandas/core/indexes/base.py	`96.32% <0%> (-0.17%)`	⬇️
pandas/core/arrays/datetimes.py	`98.37% <0%> (-0.14%)`	⬇️
pandas/tseries/offsets.py	`96.84% <0%> (-0.14%)`	⬇️
pandas/core/ops.py	`94.14% <0%> (-0.14%)`	⬇️
pandas/core/config.py	`87.04% <0%> (-0.13%)`	⬇️
pandas/io/sas/sas_xport.py	`90.14% <0%> (-0.1%)`	⬇️
pandas/core/groupby/ops.py	`96.72% <0%> (-0.07%)`	⬇️
... and 74 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 20ae454...74f9e0b.

Moved my test case from a local file to the appropriate test file in the repo.

pep8speaks · 2018-11-23T02:19:10Z

Hello @Koustav-Samaddar! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/tests/groupby/test_rank.py !

Comment last updated on December 01, 2018 at 20:52 Hours UTC

jreback · 2018-11-23T03:38:08Z

pandas/_libs/groupby_helper.pxi.in

@@ -588,7 +588,13 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
                if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
                    out[i, 0] = NAN
                else:
-                    out[i, 0] = out[i, 0] / grp_sizes[i, 0]


you need to fill with NAN if the grp_sizes[i, 0] == 0

make this another elif condition, then the else can be the division.

Aren't lines 588-589 taking care of the NaN edge case?

In my testing, I have only been able to get grp_sizes[i, 0] == 0 (on line 591) in the bugged case (pct=True, method="dense").

So the idea behind my fix is that grp_sizes[i, 0] should never be 0 in the else block; if it is, it should really be 1. Leaving out[i, 0] unmodified is the same as dividing by 1, which is why there is no else branch for the if on line 590.

no, you need to add this case there as well, if the denominator is 0, the result is NaN

As it is right now, doing that would produce incorrect results (it would fail the test I added in test_rank.py). The reasoning is:

grp_sizes[i, 0] should legitimately be 0 only if all the values in a group are NaN. The if statement on lines 588-589 handles this case, so such rows never enter the else block in question.

grp_sizes[i, 0] shouldn't be 0, but is, when a group has only one member with a non-NaN value, another row has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes BUG 23666.

Therefore my fix performs the division by grp_sizes[i, 0] only when it is not equal to 0 (the if block on line 591), because the remaining case only requires division by 1 (the corrected grp_sizes[i, 0]).

To summarise, making the result NaN when grp_sizes[i, 0] == 0 would stop the ZeroDivisionError in BUG 23666 but replace it with incorrect behaviour.

An additional note: my testing suggests the flaw is in the calculation of grp_sizes[i, 0] itself under these circumstances, and that parameters other than method="dense" can also behave oddly because of an incorrect grp_sizes[i, 0]. Once I'm able to determine the extent of this, I'm going to create a new issue targeting the grp_sizes[i, 0] calculation. Until then, this fix is a stop-gap that corrects one instance of incorrect rank behaviour.

Please improve the test case to make an assertion about the result of this expression. It’s not enough to assert that it simply computes, which is the behavior now.

Doing that would aid this discussion.

Ah sorry, my bad. I forgot to push it.
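
The two positions in this thread can be compared with a small pure-Python model of the pct=True normalisation step. This is only a sketch, not the Cython source: the function normalize_pct, its on_zero parameter, and the sample inputs are hypothetical names and values chosen for illustration, and the zero group size for the third row simply mimics the miscounted case described above.

```python
import math


def normalize_pct(ranks, grp_sizes, on_zero="keep"):
    """Toy model of dividing each rank by its group size.

    on_zero="keep" models the PR author's approach (skip the division),
    on_zero="nan" models the reviewer's suggestion (emit NaN instead).
    """
    out = []
    for rank, size in zip(ranks, grp_sizes):
        if math.isnan(rank):
            # NaN values never receive a percentage rank.
            out.append(math.nan)
        elif size == 0:
            # The disputed branch: non-NaN rank, but a group size of 0.
            out.append(rank if on_zero == "keep" else math.nan)
        else:
            out.append(rank / size)
    return out


# Hypothetical inputs: the third row has a valid rank but a size of 0,
# the fourth row is NaN and belongs to an all-NaN group.
ranks = [1.0, 2.0, 1.0, math.nan]
sizes = [2, 2, 0, 0]

kept = normalize_pct(ranks, sizes, on_zero="keep")
as_nan = normalize_pct(ranks, sizes, on_zero="nan")
print(kept)
print(as_nan)
```

Under these assumptions the two approaches differ only on the third row: skipping the division leaves it at 1.0, while the NaN variant discards a rank that the author argues is valid.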

jreback · 2018-11-23T14:39:34Z

no, you need to add this case there as well, if the denominator is 0, the result is NaN

WillAyd

Minor nit on comment placement but otherwise lgtm for this change. If you find a way to refactor as you keep looking at it certainly open to subsequent PRs

WillAyd · 2018-11-23T17:04:13Z

Also make sure you check the Travis failures. Think you have a styling issue somewhere

Koustav-Samaddar · 2018-11-23T18:14:32Z

> Also make sure you check the Travis failures. Think you have a styling issue somewhere

I skimmed through the Travis logs and wasn't able to find (/didn't understand) what the issue was. I didn't modify any of the source files that I saw were mentioned in the logs. There was an error/issue raised by scripts/tests/test_validate_docstrings.py but I couldn't understand the logs from it.

I rechecked with pep8 and flake8 and couldn't find anything sticking out.

WillAyd · 2018-11-23T18:57:16Z

The failure is in ci/code_checks.sh; running it locally should give you visibility into the issue.

@datapythonista FYI failures in that script don't provide a useful error message in CI currently. Not sure if that's on your radar

datapythonista · 2018-11-24T14:59:09Z

@WillAyd the error message ./pandas/tests/groupby/test_rank.py:304:44: W292 no newline at end of file is coming from flake8.

If you mean that they are difficult to find, that's addressed in #22854, which should make them much easier and quicker to find. We have a permissions problem with Azure, that I expect to be fixed early next week, so with some luck we have it merged very soon.

Koustav-Samaddar · 2018-11-24T22:42:35Z

I ran LINT=true ci/code_checks.sh on my machine and it didn't fail anything. However, on code submission, Travis is still failing.

datapythonista · 2018-11-25T00:06:40Z

The Travis failure was unrelated to the linting; it was caused by a URL that could not be fetched. I reran it, so let's see if it succeeds now.

jreback · 2018-11-25T02:39:36Z

> grp_sizes[i, 0] shouldn't be 0, but is, when a group has only one member with a non-NaN value, another row has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes BUG 23666.

I don't buy this. why would a grp_size ever be 0?

Koustav-Samaddar · 2018-11-25T16:48:20Z

> > grp_sizes[i, 0] shouldn't be 0, but is, when a group has only one member with a non-NaN value, another row has the same value in a different group, and method="dense". In this case the correct grp_sizes[i, 0] is 1, and it being 0 is what causes BUG 23666.

> I don't buy this. why would a grp_size ever be 0?

If I have a group where all the values are NaN, then grp_sizes[i, 0] should be zero. However, the if out[i, 0] == NaN block catches all the members (not sure if that's the right term) of such a group, so they never reach the grp_sizes[i, 0] == 0 check.

TL;DR: there are valid cases where grp_sizes[i, 0] == 0, but they will never enter the condition that checks if grp_sizes[i, 0] == 0.

Koustav-Samaddar · 2018-11-25T16:51:42Z

@datapythonista When I run ci/code_checks.sh on my local machine it doesn't raise any errors. However, in Travis I'm getting: The command "ci/code_checks.sh" exited with 1. I'm still unable to ascertain from the Travis logs what's failing.

Any help is greatly appreciated!

jreback · 2018-11-25T17:00:25Z

> there are valid cases where grp_sizes[i, 0] == 0

and what is an example of this
don’t reference the existing code
rather show an input and an output that has no group size yet has a result

Koustav-Samaddar · 2018-11-25T18:24:58Z

> > there are valid cases where grp_sizes[i, 0] == 0

> and what is an example of this
> don’t reference the existing code
> rather show an input and an output that has no group size yet has a result

Here's an example df:

   A    B
0  1  3.0
1  1  4.0
2  2  NaN
3  2  NaN

If we were to do df.groupby("A") this would result in two groups:
{1: Int64Index([0, 1], dtype='int64'), 2: Int64Index([2, 3], dtype='int64')}

Here, the second group (A == 2) is an example of a valid group with grp_sizes[i, 0] == 0.

Hope this helps!
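
The example frame can be rebuilt with the public pandas API to inspect the groups. This is a sketch only: it assumes that grp_sizes corresponds to the number of non-NaN values per group, as the comment above describes, and uses GroupBy.count() as a stand-in for it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The example frame from the comment above.
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [3.0, 4.0, np.nan, np.nan]})

grouped = df.groupby("A")

# Row labels belonging to each group, converted to plain lists.
groups = {key: list(index) for key, index in grouped.groups.items()}

# Number of non-NaN values of B per group (assumed stand-in for grp_sizes).
non_nan_counts = grouped["B"].count().to_dict()

print(groups)
print(non_nan_counts)
```

The second group contains two rows but no non-NaN values, which is the sense in which its size is 0 here.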

jreback · 2018-11-25T18:46:44Z

and what would the result frame look like

Koustav-Samaddar · 2018-11-25T19:25:05Z

Sorry again.
On running df.groupby('A').rank(method="dense", pct=True) the correct result should be:

     B
0  0.5
1  1.0
2  NaN
3  NaN

I don't know how to link a snippet of code from a commit, but in test_rank.py I changed the testing function to include the permutations of grp_sizes and NaNs that you requested in a previous comment.
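
The expected frame above can be checked directly. This is a minimal sketch, assuming a pandas version that already includes the fix from this PR (0.24.0 or later).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [3.0, 4.0, np.nan, np.nan]})

# Dense percentage rank within each group of A; NaN values stay NaN.
result = df.groupby("A").rank(method="dense", pct=True)

expected = pd.DataFrame({"B": [0.5, 1.0, np.nan, np.nan]})

# Raises AssertionError if the frames differ; NaNs in the same
# positions are treated as equal.
pd.testing.assert_frame_equal(result, expected)
```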

jreback · 2018-11-27T02:26:25Z

@Koustav-Samaddar ok that result looks ok to me. still have some comments.

jreback · 2018-11-25T02:35:59Z

pandas/tests/groupby/test_rank.py

+    ({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]}, {"B": [0.5, 1.0, 1.0, np.nan]}),
+    ({"A": [1, 1, 2], "B": [1, 2, np.nan]}, {"B": [0.5, 1.0, np.nan]})
+])
+def test_rank_zero_div(df_dicts):


why are you passing a dict? just pass 2 arguments in the parametrize. this is very hard to read.

I'm not sure if I understand this correctly, so I followed the same styling as a previous test in the same file.

I hope the changes are more readable!
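
One possible reading of the request is to pass the inputs and the expected output as separate parametrized arguments rather than as dicts. The sketch below uses the two cases shown in the diff above; the argument names input_key, input_value and output_value are assumptions for illustration, and it uses the public pd.testing.assert_frame_equal rather than whatever helper the repository's test file imports.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("input_key, input_value, output_value", [
    ([1, 1, 2, 2], [1, 2, 1, np.nan], [0.5, 1.0, 1.0, np.nan]),
    ([1, 1, 2], [1, 2, np.nan], [0.5, 1.0, np.nan]),
])
def test_rank_zero_div(input_key, input_value, output_value):
    # GH 23666
    df = pd.DataFrame({"A": input_key, "B": input_value})

    result = df.groupby("A").rank(method="dense", pct=True)
    expected = pd.DataFrame({"B": output_value})

    # Assert on the values, not merely that the call does not raise.
    pd.testing.assert_frame_equal(result, expected)
```

Each tuple reads left to right as group keys, values, and expected ranks, so a failing case points at one line of the parameter list.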

jreback · 2018-12-03T01:20:04Z

lgtm. @WillAyd over to you.

WillAyd · 2018-12-03T05:35:30Z

Thanks @Koustav-Samaddar !

Koustav-Samaddar added 2 commits November 22, 2018 18:10

BUG: Fixed groupby.rank w/ method="dense", pct=True if any grp count …

3721987

…is 1 For some reason the grp_size is stored as 0 instead of 1 in this example causing the error.

BUG: Fixed groupby.rank w/ method="dense", pct=True

a1ba61e

Added change to the what's new file.

BUG 23666

b8d343c

Moved my test case from a local file to the appropriate test file in the repo.

Fixed pep8 in the tests file

1be7457

WillAyd requested changes Nov 23, 2018

pandas/_libs/groupby_helper.pxi.in

pandas/tests/groupby/test_rank.py

doc/source/whatsnew/v0.24.0.rst

WillAyd added the Groupby and Error Reporting labels Nov 23, 2018

WillAyd requested changes Nov 23, 2018

pandas/tests/groupby/test_rank.py

jreback changed the title ~~Bug 23666~~ ZeroDivisionError when groupby rank with method="dense" and pct=True Nov 23, 2018

Made the requested modifications

f1db09c

WillAyd requested changes Nov 23, 2018

pandas/tests/groupby/test_rank.py

jreback requested changes Nov 23, 2018

Updated test case to assert expected result

7ce206c

WillAyd requested changes Nov 23, 2018

pandas/_libs/groupby_helper.pxi.in

Moved & shrunk comment block outside if block

976eccc

jreback requested changes Nov 23, 2018

pandas/tests/groupby/test_rank.py

Koustav-Samaddar added 2 commits November 24, 2018 16:20

Fixed pep8 formatting to have empty line at end of file

87ee207

Updated to have more test cases for better coverage

34047da

jreback requested changes Nov 27, 2018

Koustav-Samaddar added 2 commits December 1, 2018 15:51

Made stylistic changes to code and test according to request

88f5cb4

Removed comment describing bug

74f9e0b

jreback added this to the 0.24.0 milestone Dec 3, 2018

jreback approved these changes Dec 3, 2018

WillAyd approved these changes Dec 3, 2018

WillAyd merged commit b7bdf7c into pandas-dev:master Dec 3, 2018

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ZeroDivisionError when groupby rank with method="dense" and pct=True (p…

261ed9b

…andas-dev#23864)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ZeroDivisionError when groupby rank with method="dense" and pct=True (p…

2d88192

…andas-dev#23864)
