Skip to content

ZeroDivisionError when groupby rank with method="dense" and pct=True #23864

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Dec 3, 2018
Merged

ZeroDivisionError when groupby rank with method="dense" and pct=True #23864

merged 11 commits into from
Dec 3, 2018

Conversation

Koustav-Samaddar
Copy link
Contributor

@Koustav-Samaddar Koustav-Samaddar commented Nov 23, 2018

…is 1

For some reason the grp_size is stored as 0 instead of 1 in this example
causing the error.
Added change to the what's new file.
@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

the first thing to always write is a test

@codecov
Copy link

codecov bot commented Nov 23, 2018

Codecov Report

Merging #23864 into master will increase coverage by 0.06%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23864      +/-   ##
==========================================
+ Coverage   92.28%   92.35%   +0.06%     
==========================================
  Files         161      161              
  Lines       51500    51557      +57     
==========================================
+ Hits        47528    47613      +85     
+ Misses       3972     3944      -28
Flag Coverage Δ
#multiple 90.75% <ø> (+0.06%) ⬆️
#single 42.46% <ø> (+0.14%) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/console.py 74.24% <0%> (-1.52%) ⬇️
pandas/core/arrays/timedeltas.py 96% <0%> (-0.45%) ⬇️
pandas/plotting/_misc.py 38.68% <0%> (-0.31%) ⬇️
pandas/core/indexes/base.py 96.32% <0%> (-0.17%) ⬇️
pandas/core/arrays/datetimes.py 98.37% <0%> (-0.14%) ⬇️
pandas/tseries/offsets.py 96.84% <0%> (-0.14%) ⬇️
pandas/core/ops.py 94.14% <0%> (-0.14%) ⬇️
pandas/core/config.py 87.04% <0%> (-0.13%) ⬇️
pandas/io/sas/sas_xport.py 90.14% <0%> (-0.1%) ⬇️
pandas/core/groupby/ops.py 96.72% <0%> (-0.07%) ⬇️
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20ae454...74f9e0b. Read the comment docs.

Moved my test case from local file to approporiate test file in the repo.
@pep8speaks
Copy link

pep8speaks commented Nov 23, 2018

Hello @Koustav-Samaddar! Thanks for updating the PR.

Comment last updated on December 01, 2018 at 20:52 Hours UTC

@WillAyd WillAyd added Groupby Error Reporting Incorrect or improved errors from pandas labels Nov 23, 2018
@jreback jreback changed the title Bug 23666 ZeroDivisionError when groupby rank with method="dense" and pct=True Nov 23, 2018
@@ -588,7 +588,13 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
out[i, 0] = NAN
else:
out[i, 0] = out[i, 0] / grp_sizes[i, 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to fill with NAN if the grp_sizes[i, 0] == 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this another elif condition, then the else can be the division.

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Nov 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't Lines 588-589 taking care of the NAN edge case?

In my testing I have only ever been able to get grp_size[i, 0] == 0 (in Line 591) in the bugged case (pct=True, method="dense").

So my idea of the fix was that grp_size[i, 0] should never be 0 in the else block and if it is, then it should be 1 (not modifying out[i, 0] would be the same as div by 1) and so there's no else block for the if on Line 590.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, you need to add this case there as well, if the denominator is 0, the result is NaN

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Nov 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is right now, doing that would cause incorrect results (would fail the test I added in test_rank.py). The reasoning behind this is:

  1. grp_size[i, 0] should legitimately be 0 only if all the values in a group are NaN. In this case the if statement in Lines 588-589 takes care of this case and therefore, will never enter the else block in question.

  2. grp_size[i, 0] shouldn't be 0 but is in the case where there is a group with only one member with a non NaN value and another index also has the same value but in a different group && method="density". In this case the correct grp_size[i, 0] should be 1 and it being 0 is what causes BUG 23666.

Therefore my fix is only performing the division by grp_size[i, 0] when it is not equal to 0 in the if block [Line 591] because the else condition only demands division by 1 (the corrected grp_size[i, 0]).

To summarise, making the result NaN on grp_size[i, 0] would stop the ZeroDivisionError from occurring in BUG 23666 and replace it with incorrect behaviour.

An additional note: the conclusion of my testing is that there is a flaw in the calculation of grp_sizes[i, 0] itself under these circumstances, but in my testing I have uncovered that there are other parameters (not just method="dense") that can have weird behaviour that is caused due to incorrect grp_sizes[i, 0]. Once, I'm able to determine the extent of this I'm going to create a new issue that would target the fixing of grp_size[i, 0] calculation bug. This fix is a stop-gap measure till then since this would correct one instance of incorrect rank behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please improve the test case to make an assertion about the result of this expression. It’s not enough to assert that it simply computes which is the behavior now.

Doing that would aid this discussion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, my bad. I forgot to push it.

@@ -588,7 +588,13 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
out[i, 0] = NAN
else:
out[i, 0] = out[i, 0] / grp_sizes[i, 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, you need to add this case there as well, if the denominator is 0, the result is NaN

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit on comment placement but otherwise lgtm for this change. If you find a way to refactor as you keep looking at it certainly open to subsequent PRs

@WillAyd
Copy link
Member

WillAyd commented Nov 23, 2018

Also make sure you check the Travis failures. Think you have a styling issue somewhere

@Koustav-Samaddar
Copy link
Contributor Author

Koustav-Samaddar commented Nov 23, 2018

Also make sure you check the Travis failures. Think you have a styling issue somewhere

I skimmed through the Travis logs and wasn't able to find (/didn't understand) what the issue was. I didn't modify any of the source files that I saw were mentioned in the logs. There was an error/issue raised by scripts/tests/test_validate_docstrings.py but I couldn't understand the logs from it.

I rechecked with pep8 and flake8 and couldn't find anything sticking out.

@WillAyd
Copy link
Member

WillAyd commented Nov 23, 2018

The failure is in ci/code_checks.sh if you run locally should give you visibility into issue.

@datapythonista FYI failures in that script don't provide a useful error message in CI currently. Not sure if that's on your radar

@datapythonista
Copy link
Member

@WillAyd the error message ./pandas/tests/groupby/test_rank.py:304:44: W292 no newline at end of file is coming from flake8.

If you mean that they are difficult to find, that's addressed in #22854, which should make them much easier and quicker to find. We have a permissions problem with Azure, that I expect to be fixed early next week, so with some luck we have it merged very soon.

@Koustav-Samaddar
Copy link
Contributor Author

I ran LINT=true ci/code_checks.sh on my machine and it didn't fail anything. However, on code submission, Travis is still failing.

@datapythonista
Copy link
Member

The travis failure was unrelated to the linting, it was caused by a url that could not be fetched. I rerun it let's see if it succeeds now.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2018

grp_size[i, 0] shouldn't be 0 but is in the case where there is a group with only one member with a non NaN value and another index also has the same value but in a different group && method="density". In this case the correct grp_size[i, 0] should be 1 and it being 0 is what causes BUG 23666.

I don't buy this. why would a grp_size ever be 0?

@Koustav-Samaddar
Copy link
Contributor Author

Koustav-Samaddar commented Nov 25, 2018

grp_size[i, 0] shouldn't be 0 but is in the case where there is a group with only one member with a non NaN value and another index also has the same value but in a different group && method="density". In this case the correct grp_size[i, 0] should be 1 and it being 0 is what causes BUG 23666.

I don't buy this. why would a grp_size ever be 0?

If I have a group where all the values are NaNs, then grp_size[i, 0] should be zero. However, the if out[i, 0] == NaN block makes it so that all the members (not sure if right term) of such a group get caught there, and never make it past to the check of if grp_size[i, 0] == 0.

TL;DR there are valid cases where grp_size[i, 0] == 0 but they will never enter the condition that checks if grp_size[i, 0] == 0

@Koustav-Samaddar
Copy link
Contributor Author

Koustav-Samaddar commented Nov 25, 2018

@datapythonista When I run ci/code_checks.sh on my local machine it doesn't raise any errors. However, in Travis I'm getting The command "ci/code_checks.sh" exited with 1.. I still am unable to ascertain from the Travis logs what's going wrong/failing.

Any help is greatly appreciated!

@jreback
Copy link
Contributor

jreback commented Nov 25, 2018

there are valid cases where grp_size[i, 0] == 0

and what is an example of this
don’t reference the existing code
rather show an input an output that has no group size yet has a result

@Koustav-Samaddar
Copy link
Contributor Author

Koustav-Samaddar commented Nov 25, 2018

there are valid cases where grp_size[i, 0] == 0

and what is an example of this
don’t reference the existing code
rather show an input an output that has no group size yet has a result

Here's an example df:

   A    B
0  1  3.0
1  1  4.0
2  2  NaN
3  2  NaN

If we were to do df.groupby("A") this would result in two groups:
{1: Int64Index([0, 1], dtype='int64'), 2: Int64Index([2, 3], dtype='int64')}

Here, the second group (A == 2) is an example of a valid group with grp_size[i, 0] == 0.

Hope this helps!

@jreback
Copy link
Contributor

jreback commented Nov 25, 2018

and what would the result frame look like

@Koustav-Samaddar
Copy link
Contributor Author

Koustav-Samaddar commented Nov 25, 2018

Sorry again.
On running df.groupby('A').rank(method="dense", pct=True) the correct result should be:

     B
0  0.5
1  1.0
2  NaN
3  NaN

I don't know how to link a snippet of code from a commit, but in test_rank.py I changed the testing function to include the permutations of grp_sizes and NaNs that you requested in a previous comment.

@jreback
Copy link
Contributor

jreback commented Nov 27, 2018

@Koustav-Samaddar ok that result looks ok to me. still have some comments.

({"A": [1, 1, 2, 2], "B": [1, 2, 1, np.nan]}, {"B": [0.5, 1.0, 1.0, np.nan]}),
({"A": [1, 1, 2], "B": [1, 2, np.nan]}, {"B": [0.5, 1.0, np.nan]})
])
def test_rank_zero_div(df_dicts):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you passing a dict? just pass 2 arguments in the parameterize. this is very hard to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand this correctly, so I followed the same styling as a previous test in the same file.

I hope the changes are more readable!

@jreback jreback added this to the 0.24.0 milestone Dec 3, 2018
@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

lgtm. @WillAyd over to you.

@WillAyd WillAyd merged commit b7bdf7c into pandas-dev:master Dec 3, 2018
@WillAyd
Copy link
Member

WillAyd commented Dec 3, 2018

Thanks @Koustav-Samaddar !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZeroDivisionError when groupby rank with method="dense" and pct=True
5 participants