BUG: NaN should have pct rank of NaN #22600

Merged
merged 1 commit into from
Sep 8, 2018

gfyoung
Member

@gfyoung gfyoung commented Sep 5, 2018

Closes #22519.

@gfyoung added the Groupby, Missing-data, and Regression labels Sep 5, 2018
@gfyoung added this to the 0.23.5 milestone Sep 5, 2018
@pep8speaks

Hello @gfyoung! Thanks for submitting the PR.

@@ -584,7 +584,10 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,

     if pct:
         for i in range(N):
-            out[i, 0] = out[i, 0] / grp_sizes[i, 0]
+            if out[i, 0] != out[i, 0] or out[i, 0] == NAN:
Member

Can this all be simplified to just if out[i, 0] != NAN:?

Member Author

I don't think those are semantically equivalent. Did you mean out[i, 0] == NAN ?
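For readers following the distinction drawn here, a minimal sketch in plain Python (not the Cython template under review) shows how the three comparisons behave for IEEE 754 floats. The `values` list and the variable names are made up for illustration; whether the templated Cython code behaves identically for every dtype is not something this sketch establishes.

```python
import math

nan = float("nan")
values = [0.5, nan, 1.0]  # illustrative data, not taken from the PR

# `x != nan` is True for every float, NaN included,
# so it cannot single out missing values.
not_equal_nan = [x != nan for x in values]

# `x == nan` is False for every float, NaN included.
equal_nan = [x == nan for x in values]

# `x != x` is True only for NaN; this is the self-comparison idiom in the diff.
self_compare = [x != x for x in values]

print(not_equal_nan)  # [True, True, True]
print(equal_nan)      # [False, False, False]
print(self_compare)   # [False, True, False]
```

Under these semantics, a guard written as `!= NAN` would let every value through, which is why the two spellings are not interchangeable.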

Member

What I meant was: only do the division if out[i, 0] != NAN; otherwise leave the value as is.

Member Author

Ah, gotcha! 🙂 Let's try it and see what happens.

Member Author

@WillAyd : Good idea, but unfortunately, the tests I added disagree with it. You need both conditionals when checking. Thus, this code needs to stay as is.

Member

Hmm OK I got you. Is it a particular type that's failing?

For some reason I thought some of the work @realead was doing was supposed to remove the need for comparisons like out[i, 0] != out[i, 0] to figure out whether a value is NA, though it's entirely possible I have misunderstood that.

Contributor

The "right" handling of NA would only apply to algorithms using a hash map, which is not the case here, if I see it correctly.
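As a loose analogy for why hash-map-based algorithms need dedicated NA handling, here is a small sketch using a plain Python `dict`. It does not use pandas' internal hash tables; it only illustrates that two distinct NaN objects never compare equal, so a naive hash map treats them as separate keys. The names `a`, `b`, and `table` are illustrative.

```python
a = float("nan")
b = float("nan")  # a second, distinct NaN object

table = {}
table[a] = "first"
table[b] = "second"   # b != a and b is not a, so this becomes a new key
table[a] = "updated"  # same object as before: the identity check finds the key

print(len(table))  # 2
```

The rank loop in this PR walks an array by index and never looks values up by key, which is consistent with the point that hash-map NA handling does not apply to it.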

Contributor

can you add a comment on what is going on here

Member Author

Yep, done.
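To make "what is going on here" concrete, the following is a plain-Python sketch of the percentage step, not the Cython template itself. The helper name `pct_from_ranks` and the sample inputs are hypothetical, and the sketch assumes (as the whatsnew discussion of a ZeroDivisionError suggests) that a row whose group has no valid values carries a group size of zero.

```python
import math


def pct_from_ranks(ranks, grp_sizes):
    """Hypothetical helper: turn ranks into percentage ranks, keeping NaN as NaN."""
    out = list(ranks)
    for i in range(len(out)):
        if out[i] != out[i]:
            # NaN ranks are excluded from percentage rankings and stay NaN,
            # which also skips the division by a possibly-zero group size.
            out[i] = math.nan
        else:
            out[i] = out[i] / grp_sizes[i]
    return out


# Illustrative inputs: two ranked rows in a group of size 2, then a NaN row
# whose group size is assumed to be 0.
result = pct_from_ranks([1.0, 2.0, math.nan], [2, 2, 0])
print(result)  # [0.5, 1.0, nan]
```

In pure Python, dividing the NaN rank by the zero group size unconditionally (`math.nan / 0`) raises `ZeroDivisionError`, so checking for NaN before dividing avoids that path entirely.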

@gfyoung force-pushed the group-rank-nan branch 2 times, most recently from 0ac5ae9 to 3f3f30b, September 5, 2018 19:18
Member

@WillAyd WillAyd left a comment

minor comment on whatsnew but otherwise lgtm

@@ -23,6 +23,9 @@ Fixed Regressions
- Constructing a DataFrame with an index argument that wasn't already an
instance of :class:`~pandas.core.Index` was broken in `4efb39f
<https://github.com/pandas-dev/pandas/commit/4efb39f01f5880122fa38d91e12d217ef70fad9e>`_ (:issue:`22227`).
- Calling :meth:`DataFrameGroupBy.rank` and :meth:`SeriesGroupBy.rank` with empty groups
and ``pct=True`` was broken in `c1068d9
Member

Do we typically reference commits where regressions occurred in whatsnew notes? I would think it better to just call out the ZeroDivisionError instead of the commit.

Member Author

@gfyoung gfyoung Sep 5, 2018

Just following what was done above. This is a relatively new addition to whatsnew, but I don't see any reason to buck this trend.

Member

Gotcha. Ultimately indifferent on the commit reference, though I think calling out the ZeroDivisionError is a much more useful indicator when either googling or looking at the whatsnew to see what has actually changed (rather than clicking through to the issue or commit).

Member Author

That's fair. I'll let the CI run its course and then make the addition.

@codecov

codecov bot commented Sep 6, 2018

Codecov Report

Merging #22600 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #22600   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50783    50783           
=======================================
  Hits        46749    46749           
  Misses       4034     4034
Flag Coverage Δ
#multiple 90.46% <ø> (ø) ⬆️
#single 42.3% <ø> (ø) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70c9003...b904ec2. Read the comment docs.

@gfyoung
Member Author

gfyoung commented Sep 7, 2018

The Circle failure is due to a Hypothesis timeout, which is unrelated to my PR.

cc @jreback

@gfyoung
Member Author

gfyoung commented Sep 7, 2018

@jreback : Made the requested change and all is green. PTAL.

@jreback jreback merged commit e6843c4 into pandas-dev:master Sep 8, 2018
@jreback
Contributor

jreback commented Sep 8, 2018

thanks @gfyoung I think this will backport cleanly.

@lumberbot-app

lumberbot-app bot commented Sep 8, 2018

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
$ git checkout 0.23.x
$ git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
$ git cherry-pick -m1 e6843c4b9754ae149cc6ff5cd58db05138327b74
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
$ git commit -am 'Backport PR #22600: BUG: NaN should have pct rank of NaN'
  4. Push to a named branch:
$ git push YOURFORK 0.23.x:auto-backport-of-pr-22600-on-0.23.x
  5. Create a PR against branch 0.23.x. I would have named this PR:

"Backport PR #22600 on branch 0.23.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

@gfyoung
Member Author

gfyoung commented Sep 8, 2018

@jreback : The cherry-pick is not cooperating. I'll go backport it myself.

@gfyoung gfyoung deleted the group-rank-nan branch September 8, 2018 04:37
gfyoung added a commit to forking-repos/pandas that referenced this pull request Sep 8, 2018
gfyoung added a commit to forking-repos/pandas that referenced this pull request Sep 11, 2018
gfyoung added a commit that referenced this pull request Sep 11, 2018
aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018