Commit 0d199e4

peterpanmj authored and jreback committed
BUG: Fix problems in group rank when both nans and infinity are present pandas-dev#20561 (pandas-dev#20681)
1 parent 7e75e4a commit 0d199e4

3 files changed, +75 -13 lines changed

doc/source/whatsnew/v0.23.0.txt

+6 -2
@@ -221,6 +221,12 @@ Current Behavior:
 
     s.rank(na_option='top')
 
+These bugs were squashed:
+
+- Bug in :meth:`DataFrame.rank` and :meth:`Series.rank` when ``method='dense'`` and ``pct=True`` in which percentile ranks were not being used with the number of distinct observations (:issue:`15630`)
+- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``ascending='False'`` failed to return correct ranks for infinity if ``NaN`` were present (:issue:`19538`)
+- Bug in :func:`DataFrameGroupBy.rank` where ranks were incorrect when both infinity and ``NaN`` were present (:issue:`20561`)
+
 .. _whatsnew_0230.enhancements.round-trippable_json:
 
 JSON read/write round-trippable with ``orient='table'``
@@ -1082,14 +1088,12 @@ Offsets
 
 Numeric
 ^^^^^^^
-- Bug in :meth:`DataFrame.rank` and :meth:`Series.rank` when ``method='dense'`` and ``pct=True`` in which percentile ranks were not being used with the number of distinct observations (:issue:`15630`)
 - Bug in :class:`Series` constructor with an int or float list where specifying ``dtype=str``, ``dtype='str'`` or ``dtype='U'`` failed to convert the data elements to strings (:issue:`16605`)
 - Bug in :class:`Index` multiplication and division methods where operating with a ``Series`` would return an ``Index`` object instead of a ``Series`` object (:issue:`19042`)
 - Bug in the :class:`DataFrame` constructor in which data containing very large positive or very large negative numbers was causing ``OverflowError`` (:issue:`18584`)
 - Bug in :class:`Index` constructor with ``dtype='uint64'`` where int-like floats were not coerced to :class:`UInt64Index` (:issue:`18400`)
 - Bug in :class:`DataFrame` flex arithmetic (e.g. ``df.add(other, fill_value=foo)``) with a ``fill_value`` other than ``None`` failed to raise ``NotImplementedError`` in corner cases where either the frame or ``other`` has length zero (:issue:`19522`)
 - Multiplication and division of numeric-dtyped :class:`Index` objects with timedelta-like scalars returns ``TimedeltaIndex`` instead of raising ``TypeError`` (:issue:`19333`)
-- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``ascending='False'`` failed to return correct ranks for infinity if ``NaN`` were present (:issue:`19538`)
 - Bug where ``NaN`` was returned instead of 0 by :func:`Series.pct_change` and :func:`DataFrame.pct_change` when ``fill_method`` is not ``None`` (:issue:`19873`)
 
 

pandas/_libs/groupby_helper.pxi.in

+20 -11
@@ -417,25 +417,33 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
                         ndarray[int64_t] labels,
                         bint is_datetimelike, object ties_method,
                         bint ascending, bint pct, object na_option):
-    """Provides the rank of values within each group
+    """
+    Provides the rank of values within each group.
 
     Parameters
     ----------
     out : array of float64_t values which this method will write its results to
     values : array of {{c_type}} values to be ranked
     labels : array containing unique label for each group, with its ordering
         matching up to the corresponding record in `values`
-    is_datetimelike : bool
+    is_datetimelike : bool, default False
         unused in this method but provided for call compatibility with other
         Cython transformations
-    ties_method : {'keep', 'top', 'bottom'}
+    ties_method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
+        * average: average rank of group
+        * min: lowest rank in group
+        * max: highest rank in group
+        * first: ranks assigned in order they appear in the array
+        * dense: like 'min', but rank always increases by 1 between groups
+    ascending : boolean, default True
+        False for ranks by high (1) to low (N)
+        na_option : {'keep', 'top', 'bottom'}, default 'keep'
+    pct : boolean, default False
+        Compute percentage rank of data within each group
+    na_option : {'keep', 'top', 'bottom'}, default 'keep'
         * keep: leave NA values where they are
         * top: smallest rank if ascending
         * bottom: smallest rank if descending
-    ascending : boolean
-        False for ranks by high (1) to low (N)
-    pct : boolean
-        Compute percentage rank of data within each group
 
     Notes
     -----
@@ -508,7 +516,8 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
 
             # if keep_na, check for missing values and assign back
             # to the result where appropriate
-            if keep_na and masked_vals[_as[i]] == nan_fill_val:
+
+            if keep_na and mask[_as[i]]:
                 grp_na_count += 1
                 out[_as[i], 0] = nan
             else:
@@ -548,9 +557,9 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
             # reset the dups and sum_ranks, knowing that a new value is coming
             # up. the conditional also needs to handle nan equality and the
             # end of iteration
-            if (i == N - 1 or (
-                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) and not
-                    (mask[_as[i]] and mask[_as[i+1]]))):
+            if (i == N - 1 or
+                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) or
+                    (mask[_as[i]] ^ mask[_as[i+1]])):
                 dups = sum_ranks = 0
                 val_start = i
                 grp_vals_seen += 1
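
The effect of the changed tie-boundary condition in the last hunk can be checked in isolation. What follows is a minimal pure-Python sketch, not the Cython code itself. The names `sorted_pairs`, `old_boundary`, `new_boundary` and `tie_runs` are hypothetical. The sketch ignores group labels and assumes a single group sorted ascending, with NaN replaced by +inf as the fill value and ordered after the real infinities.

```python
import math

INF = math.inf

# (masked value, is_nan) pairs in sorted order for one group (assumed layout).
# NaN has been filled with +inf, so only the mask tells it apart from a
# genuine infinity.
sorted_pairs = [
    (-INF, False), (-INF, False), (1.0, False),
    (INF, False), (INF, False), (INF, True), (INF, True),
]


def old_boundary(pairs, i):
    """Condition from the removed lines: values differ and not both NaN."""
    return i == len(pairs) - 1 or (
        pairs[i][0] != pairs[i + 1][0]
        and not (pairs[i][1] and pairs[i + 1][1]))


def new_boundary(pairs, i):
    """Condition from the added lines: values differ or NaN status differs."""
    return (i == len(pairs) - 1
            or pairs[i][0] != pairs[i + 1][0]
            or (pairs[i][1] ^ pairs[i + 1][1]))


def tie_runs(pairs, is_boundary):
    """Return the length of each run of elements treated as tied."""
    runs, start = [], 0
    for i in range(len(pairs)):
        if is_boundary(pairs, i):
            runs.append(i - start + 1)
            start = i + 1
    return runs


old_runs = tie_runs(sorted_pairs, old_boundary)
new_runs = tie_runs(sorted_pairs, new_boundary)
```

Under these assumptions the old condition produces runs of 2, 1 and 4 elements, merging the two real infinities with the two filled NaNs, while the new condition produces runs of 2, 1, 2 and 2.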

pandas/tests/groupby/test_groupby.py

+49
@@ -1965,6 +1965,55 @@ def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
         exp_df = DataFrame(exp * len(grps), columns=['val'])
         assert_frame_equal(result, exp_df)
 
+    @pytest.mark.parametrize("grps", [
+        ['qux'], ['qux', 'quux']])
+    @pytest.mark.parametrize("vals", [
+        [-np.inf, -np.inf, np.nan, 1., np.nan, np.inf, np.inf],
+    ])
+    @pytest.mark.parametrize("ties_method,ascending,na_option,exp", [
+        ('average', True, 'keep', [1.5, 1.5, np.nan, 3, np.nan, 4.5, 4.5]),
+        ('average', True, 'top', [3.5, 3.5, 1.5, 5., 1.5, 6.5, 6.5]),
+        ('average', True, 'bottom', [1.5, 1.5, 6.5, 3., 6.5, 4.5, 4.5]),
+        ('average', False, 'keep', [4.5, 4.5, np.nan, 3, np.nan, 1.5, 1.5]),
+        ('average', False, 'top', [6.5, 6.5, 1.5, 5., 1.5, 3.5, 3.5]),
+        ('average', False, 'bottom', [4.5, 4.5, 6.5, 3., 6.5, 1.5, 1.5]),
+        ('min', True, 'keep', [1., 1., np.nan, 3., np.nan, 4., 4.]),
+        ('min', True, 'top', [3., 3., 1., 5., 1., 6., 6.]),
+        ('min', True, 'bottom', [1., 1., 6., 3., 6., 4., 4.]),
+        ('min', False, 'keep', [4., 4., np.nan, 3., np.nan, 1., 1.]),
+        ('min', False, 'top', [6., 6., 1., 5., 1., 3., 3.]),
+        ('min', False, 'bottom', [4., 4., 6., 3., 6., 1., 1.]),
+        ('max', True, 'keep', [2., 2., np.nan, 3., np.nan, 5., 5.]),
+        ('max', True, 'top', [4., 4., 2., 5., 2., 7., 7.]),
+        ('max', True, 'bottom', [2., 2., 7., 3., 7., 5., 5.]),
+        ('max', False, 'keep', [5., 5., np.nan, 3., np.nan, 2., 2.]),
+        ('max', False, 'top', [7., 7., 2., 5., 2., 4., 4.]),
+        ('max', False, 'bottom', [5., 5., 7., 3., 7., 2., 2.]),
+        ('first', True, 'keep', [1., 2., np.nan, 3., np.nan, 4., 5.]),
+        ('first', True, 'top', [3., 4., 1., 5., 2., 6., 7.]),
+        ('first', True, 'bottom', [1., 2., 6., 3., 7., 4., 5.]),
+        ('first', False, 'keep', [4., 5., np.nan, 3., np.nan, 1., 2.]),
+        ('first', False, 'top', [6., 7., 1., 5., 2., 3., 4.]),
+        ('first', False, 'bottom', [4., 5., 6., 3., 7., 1., 2.]),
+        ('dense', True, 'keep', [1., 1., np.nan, 2., np.nan, 3., 3.]),
+        ('dense', True, 'top', [2., 2., 1., 3., 1., 4., 4.]),
+        ('dense', True, 'bottom', [1., 1., 4., 2., 4., 3., 3.]),
+        ('dense', False, 'keep', [3., 3., np.nan, 2., np.nan, 1., 1.]),
+        ('dense', False, 'top', [4., 4., 1., 3., 1., 2., 2.]),
+        ('dense', False, 'bottom', [3., 3., 4., 2., 4., 1., 1.])
+    ])
+    def test_infs_n_nans(self, grps, vals, ties_method, ascending, na_option,
+                         exp):
+        # GH 20561
+        key = np.repeat(grps, len(vals))
+        vals = vals * len(grps)
+        df = DataFrame({'key': key, 'val': vals})
+        result = df.groupby('key').rank(method=ties_method,
+                                        ascending=ascending,
+                                        na_option=na_option)
+        exp_df = DataFrame(exp * len(grps), columns=['val'])
+        assert_frame_equal(result, exp_df)
+
     @pytest.mark.parametrize("grps", [
         ['qux'], ['qux', 'quux']])
     @pytest.mark.parametrize("vals", [
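
The expected values in the `'average'` rows of `test_infs_n_nans` can be derived by hand. The sketch below is a plain-Python reference calculation for a single group, not pandas code. `reference_rank` is a hypothetical helper that covers only `method='average'`, and it assumes that `'top'` places NaN before all real values and `'bottom'` after them, regardless of `ascending`, which is what the parametrized expectations show.

```python
import math


def reference_rank(vals, ascending=True, na_option="keep"):
    """Average-method ranks for one group (illustrative reference only)."""
    is_nan = [math.isnan(v) for v in vals]
    # Real values, infinities included, in ranking order.
    present = sorted((v for v in vals if not math.isnan(v)),
                     reverse=not ascending)
    n_nan = sum(is_nan)
    # NaNs ranked first ('top') push every real value's rank up by n_nan.
    offset = n_nan if na_option == "top" else 0

    def avg_rank(v):
        first = present.index(v) + 1   # 1-based position of the first tie
        count = present.count(v)       # number of tied values
        return offset + first + (count - 1) / 2

    if na_option == "keep":
        nan_rank = math.nan
    elif na_option == "top":
        nan_rank = (1 + n_nan) / 2
    else:  # 'bottom'
        nan_rank = len(present) + (1 + n_nan) / 2

    return [nan_rank if m else avg_rank(v) for v, m in zip(vals, is_nan)]


vals = [-math.inf, -math.inf, math.nan, 1.0, math.nan, math.inf, math.inf]
keep_ranks = reference_rank(vals)
top_ranks = reference_rank(vals, na_option="top")
desc_bottom_ranks = reference_rank(vals, ascending=False, na_option="bottom")
```

With the test's input, the infinities keep their own tie group at either end of the ordering and the NaN positions are filled according to `na_option` alone.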
