Implement high performance rolling_rank #9481

PH82 · 2015-02-13T07:18:16Z

xref SO issue here

Im looking to set the rolling rank on a dataframe. Having posted, discussed and analysed the code it looks like the suggested way would be to use the pandas Series.rank function as an argument in rolling_apply. However on large datasets the performance is particularly poor. I have tried different implementations and using bottlenecks rank method orders of magnitude faster, but that only offers the average option for ties. It is also still some way off the performance of rolling_mean. I have previously implemented a rolling rank function which monitors changes on a moving window (in a similar way to algos.roll_mean I believe) rather that recalculating the rank from scratch on each window. Below is an example to highlight the performance, it should be possible to implement a rolling rank with comparable performance to rolling_mean.

python: 2.7.3
pandas: 0.15.2
scipy: 0.10.1
bottleneck: 0.7.0

rollWindow = 240
df = pd.DataFrame(np.random.randn(100000,4), columns=list('ABCD'), index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
df.iloc[-3:-1]['A'] = 7.5
df.iloc[-1]['A'] = 5.5

df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
 #28.9secs (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
 #70.89secs (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
 #3.64secs (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
 #3.56secs (only provides competition/min ranking for ties, not necessarily correct result)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
 #0.008secs

def rollingRankOnSeries (array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s)-1]

def rollingRankSciPy (array):
     return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck (array):
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort (array):
    return array.size - array.argsort().argsort()[-1]
```python

I think this is likely to be a common request for users looking to use pandas for analysis on large datasets and thought it would be a useful addition to the pandas moving statistics/moments suite?

jreback · 2015-02-13T14:08:24Z

I have added it to the master issue. Its actuallly pretty straightforward to do this. Note that all that is necessary is a new method, rolling_idxmax, which is essentially rolling_max but it tracks the index as well; so all that is needed is a modification to algos.roll_max2 to also record the index.

want to do a pull-request here?

PH82 · 2015-02-16T07:47:08Z

Yes that sounds like what I was thinking, pandas.stats.moments.rolling_rank with the same arguments as rank, allowing the user to select a competition method (min, avg....) and whether ranks should be ascending. Not sure I'm familiar enough with the pandas code yet to make any modifications my self to implement the best solution in the environment for this though. Happy to test though.

cing · 2015-04-14T01:20:21Z

I got rolling_indmax and rolling_indxmin working just as @jreback said, recording the index. No extra memory allocation or computation on top of the normal algorithm, just an additional return array from roll_max2/roll_min2 holding the indices. At the PyCon sprint @jreback suggested I return an int64 from the Cython code to store the indices, but I'm pretty sure it has to be a float64 due to possible NaN's, right?

Also, @PH82 requested rolling_idxaverage but is this really necessary?

shoyer · 2015-04-14T02:27:01Z

@cing you can use -1 to mark missing values values as long as you only use positive indices otherwise (that's pretty standard in pandas)

PH82 · 2015-07-21T14:37:06Z

@cing, The average I mention is to determine how to rank ties. Common methods for how to rank ties are:

Values	Rank (Tie Method=Min)	Rank (Tie Method=Max)	Rank (Tie Method=Avg)
3	4	4	4
1	1	1	1
2	2	3	2.5
4	5	5	5
2	2	3	2.5

cing · 2015-07-21T14:44:06Z

I'll have to take a look at this again, some of the test cases didn't succeed for rolling_idxmax and rolling_idxmin. I didn't implement a rolling_idxavg though, is that truly useful?

PH82 · 2015-07-21T14:57:08Z

I would say so, for my immediate purposes no, but it is a valid tie method
and quite likely that it would be used in the future, in fact I can think
of scenarios where I would use it. The rolling_rank I think should
implement as much of the non-rolling method as possible.

Dmoonleo · 2016-05-26T12:36:01Z

I am glad to see there is an issue ticket already, for this rolling rank.
May I know the status of this issue? How could I help on this?

jreback · 2016-05-26T12:41:48Z

@Dmoonleo the labels indicate the status, meaning that its open and unclaimed. you are welcome to submit a pull-request.

mcherkassky · 2017-08-23T19:05:13Z

Is this going to be implemented?

TomAugspurger · 2017-08-23T19:13:25Z

@mcherkassky you're welcome take take a crack at it. Let us know if you need help getting started.

cing · 2017-08-23T19:38:52Z

I still have the code I wrote back from the good ol' days, but it's worth taking a fresh look because window.pyx has been touched a few times. As a novice, I'd just comment that the windowed rolling algorithm is a bit more complicated than the "novice" tag might suggest! Nothing teamwork can't crack though.

mcherkassky · 2017-08-23T20:33:15Z

does anyone have an algorithm reference to implement the rolling rank? it doesn't seem trivial

i'm happy to give it a stab but i don't really follow the discussion above about rolling_idxmax/rolling_idxmin and how that helps get the rolling rank

sluo1989 · 2017-12-07T18:26:11Z

Is there anyone still working on the efficient implementation of rolling rank?

jreback · 2017-12-07T18:41:49Z

no would love to have this!

VanyTang · 2018-12-11T12:23:53Z

I propose an algorithm to calculate rolling_rank efficiently.

Suppose window size is fixed, and rank is defined when the window number is sorted in monotone increasing.

We can use a balanced tree to store window data, as it only takes O(logM) for insert, delete and finding operations, where M is the window size.

In the procedure of insert, delete operation, we should maintain a field called size of each node representing the count of nodes of this sub-tree, which will be used in calculating rank. If a balanced tree maintain the size field initially (such as Size-Balanced Tree or Weight Balanced Tree), it would be better.

Due to numbers are organized orderly in a balanced tree, when we want to get the rank of a number in this window, we can just find the corresponding node starting from the tree root, and sum the size of all the left child tree through the finding path. The sum is the rank. This operation is also O(logM).

From the above, when calculating a rolling rank in a length N sequence where window size is M, the total time complexity is O(NlogM). It's much better than that of the naive algorithm (sort and get the rank in each window) O(NMlogM). This algorithm could be M times fast than the original one.

bmpalatiello · 2019-01-23T22:39:18Z

If there is still interest, my workaround for this is:

import bottleneck as bk

norm_rank = bk.move_rank(x.values, n, axis=0)
denorm = (((norm_rank + 1) / 2) * (n - 1)) + 1
descend = (n - denorm) + 1

The bk.move_rank function returns a normalized rank between -1 and 1. So taking the normalized rank and reverse engineering it to return the actual rank. Then essentially making it descending=True. Obviously the only potential downside is it only provides average ranking for ties.

Running it on my small laptop:

window = 240
x = pd.DataFrame(np.random.randn(100000,4), columns=list('ABCD'), index=pd.date_range('1/1/2000', periods=100000, freq='1H'))

# Original rollingRankBottleneck above
6.04 s ± 302 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This version
411 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cottrell · 2020-04-27T16:18:06Z

In case it's useful I'm using it like this:

def rank_last_value(x, shift= 2, pct=False):
    """
    One step of the procedure to get the rolling rank of the last value
    according to the previous window of values.

    Use the bottlneck routine below instead. It is much faster. Keep this here for testing.
    Use with .rolling(window=100).apply(rank_last_value, raw=True) for example
    """
    args = np.argsort(x)
    rank = np.argwhere(args == (x.shape[0] - 1))[0][0]
    if pct:
        return (rank + 1)/ (x.shape[0] + shift)
    return rank

def rolling_rank(x, window, pct=False, min_prob=None):
    """
    Get the rolling rank of the last value according to the previous window of values.
    """
    # https://github.com/pandas-dev/pandas/issues/9481
    import bottleneck as bk
    norm_rank = bk.move_rank(x, window, axis=0) # [-1, 1]
    u = (norm_rank + 1) / 2 # [0, 1]
    if pct:
        if min_prob is None:
            min_prob = 1 / (window + 1)
            return u * (1 - 2 * min_prob) + min_prob # [min_prob, 1 - min_prob]
    rank = u * (window - 1)
    return np.round(rank)

def _test_rolling_rank_against_rank_last_value():
    import pandas as pd
    x = np.random.randn(1000)
    window = 30
    aa = pd.Series(x).rolling(window).apply(rank_last_value, raw=True).values
    bb = rolling_rank(x, window=window)
    assert np.allclose(aa, bb, equal_nan=True)

contribu · 2020-09-05T08:23:55Z

I implemented it.

Computational complexity( n: input length w: rolling window size )

rollingrank (proposed): O(n * log(w))
bottleneck.move_rank: O(n * w)

Try
https://github.com/contribu/rollingrank

jreback · 2020-09-05T14:52:06Z

@contribu thanks for the implementaton. Ideally this would port almost directly to cython and embedded in the current infrastructure. we don't have very much c++ code in pandas and mostly use cython. if you could do this would be fantastic.

leeong05 mentioned this issue Feb 13, 2015

API/ENH: master issue for pd.rolling_apply #8659

Closed

14 tasks

jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance labels Feb 13, 2015

jreback added this to the 0.17.0 milestone Feb 13, 2015

jreback added Good as first PR labels Apr 2, 2015

PH82 closed this as completed Jul 21, 2015

PH82 reopened this Jul 21, 2015

TomAugspurger added the good first issue label Oct 11, 2017

jreback added good first issue and removed good first issue Difficulty Novice labels Dec 15, 2017

jreback mentioned this issue Apr 9, 2018

Rolling Rank issues in pandas #20602

Closed

jbrockmendel removed the Effort Low label Oct 21, 2019

ghuls mentioned this issue Jan 15, 2020

Feature: Add "random" rank in the group for DataFrame.rank and similar functions. #31051

Closed

mroeschke added Window rolling, ewma, expanding and removed Numeric Operations Arithmetic, Comparison, and Logical operations good first issue labels Apr 12, 2021

jreback mentioned this issue Aug 31, 2021

ENH: Rolling rank #43338

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Sep 2, 2021

jreback closed this as completed in #43338 Sep 14, 2021

contribu mentioned this issue Mar 2, 2022

Segmentation fault when trying to run rollingrank. contribu/rollingrank#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement high performance rolling_rank #9481

Implement high performance rolling_rank #9481

PH82 commented Feb 13, 2015

jreback commented Feb 13, 2015

PH82 commented Feb 16, 2015

cing commented Apr 14, 2015

shoyer commented Apr 14, 2015

PH82 commented Jul 21, 2015

cing commented Jul 21, 2015

PH82 commented Jul 21, 2015

Dmoonleo commented May 26, 2016

jreback commented May 26, 2016

mcherkassky commented Aug 23, 2017

TomAugspurger commented Aug 23, 2017

cing commented Aug 23, 2017

mcherkassky commented Aug 23, 2017 •

edited

Loading

sluo1989 commented Dec 7, 2017

jreback commented Dec 7, 2017

VanyTang commented Dec 11, 2018 •

edited

Loading

bmpalatiello commented Jan 23, 2019

cottrell commented Apr 27, 2020

contribu commented Sep 5, 2020

jreback commented Sep 5, 2020

Implement high performance rolling_rank #9481

Implement high performance rolling_rank #9481

Comments

PH82 commented Feb 13, 2015

jreback commented Feb 13, 2015

PH82 commented Feb 16, 2015

cing commented Apr 14, 2015

shoyer commented Apr 14, 2015

PH82 commented Jul 21, 2015

cing commented Jul 21, 2015

PH82 commented Jul 21, 2015

Dmoonleo commented May 26, 2016

jreback commented May 26, 2016

mcherkassky commented Aug 23, 2017

TomAugspurger commented Aug 23, 2017

cing commented Aug 23, 2017

mcherkassky commented Aug 23, 2017 • edited Loading

sluo1989 commented Dec 7, 2017

jreback commented Dec 7, 2017

VanyTang commented Dec 11, 2018 • edited Loading

bmpalatiello commented Jan 23, 2019

cottrell commented Apr 27, 2020

contribu commented Sep 5, 2020

jreback commented Sep 5, 2020

mcherkassky commented Aug 23, 2017 •

edited

Loading

VanyTang commented Dec 11, 2018 •

edited

Loading