-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: make dense ranks results scale to 100 percent #21203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -167,11 +167,11 @@ def test_infs_n_nans(grps, vals, ties_method, ascending, na_option, exp): | |||
('dense', True, 'keep', False, | |||
[1., 1., np.nan, 3., 1., 2., np.nan, np.nan]), | |||
('dense', True, 'keep', True, | |||
[0.2, 0.2, np.nan, 0.6, 0.2, 0.4, np.nan, np.nan]), | |||
[1. / 3., 1. / 3., np.nan, 3. / 3., 1. / 3., 2. / 3., np.nan, np.nan]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use literals instead of expressions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@that will be 0.3333333333333333. A little hard to read. Still use literal ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yea don’t do that. If it’s minimal effort to replace the data with something more easily divisible please do so. Otherwise then I’m ok with it as is
pandas/_libs/groupby_helper.pxi.in
Outdated
grp_na_count = 0 | ||
val_start = i + 1 | ||
grp_start = i + 1 | ||
lab_start = i + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same thing as grp_start
?
pandas/_libs/groupby_helper.pxi.in
Outdated
grp_sizes[_as[j], 0] = i - grp_start + 1 - grp_na_count | ||
dups = sum_ranks = 0 | ||
if pct: | ||
if tiebreak != TIEBREAK_DENSE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this here is it not possible to alter the handling of the tie_count
variable incrementing above depending on whether or not we are using pct
and TIEBREAK_DENSE
together? Seems like it would be simpler to do that then to implement another branch for assigning percents, if that's possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean move the logic to here ?
elif tiebreak == TIEBREAK_DENSE:
for j in range(i - dups + 1, i + 1):
out[_as[j], 0] = grp_vals_seen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm well I suppose that that would alter the non-pct items, but my overall point is I feel like this can be done more succinctly. In either case (TIEBREAK_DENSE or not) it's a matter of dividing by the right denominator to get the correct value.
So I think it would be cleaner to do something along the lines of:
if TIEBREAK_DENSE:
denominator = ...
else:
denominator = ...
for j in range(lab_start, i + 1):
out[_as[j], 0] = out[_as[j], 0] / denominator
Codecov Report
@@ Coverage Diff @@
## master #21203 +/- ##
=======================================
Coverage 91.84% 91.84%
=======================================
Files 153 153
Lines 49543 49543
=======================================
Hits 45504 45504
Misses 4039 4039
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note. otherwise lgtm.
I guess this should be in 0.24 as its a numeric change. any takers for 0.23.1? |
I think its minor enough to be in 0.23.1 but don't have that strong a preference either way |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, let's move to 0.23.1 @WillAyd merge when ready.
pandas/_libs/groupby_helper.pxi.in
Outdated
grp_sizes[_as[j], 0] = i - grp_start + 1 - grp_na_count | ||
else: | ||
for j in range(grp_start, i + 1): | ||
grp_sizes[_as[j], 0] = (grp_tie_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but can you put the -
on the previous line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break line after the operator ?
for j in range(grp_start, i + 1):
grp_sizes[_as[j], 0] = (grp_tie_count -
(grp_na_count > 0))
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -111,7 +111,7 @@ Offsets | |||
Numeric | |||
^^^^^^^ | |||
|
|||
- | |||
- Bug in :func:`pandas.core.groupby.GroupBy.rank` where results did not scale to 100% when specifying ``method='dense'`` and ``pct=True`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok let's move to 0.23.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Thanks @peterpanmj ! |
(cherry picked from commit b237b11)
(cherry picked from commit b237b11)
git diff upstream/master -u -- "*.py" | flake8 --diff