Feature: Add "random" rank in the group for DataFrame.rank and similar functions. #31051

ghuls · 2020-01-15T19:55:07Z

Feature: Add "random" rank in the group for DataFrame.rank and similar functions.

DataFrame.rank(self, axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’, ‘random’}, default ‘average’

How to rank the group of records that have the same value (i.e. ties):

average: average rank of the group

min: lowest rank in the group

max: highest rank in the group

first: ranks assigned in order they appear in the array

dense: like ‘min’, but rank always increases by 1 between groups

random: ranks assigned randomly (between min and max), but each value is picked only once
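To make the proposed 'random' semantics concrete, here is a minimal sketch for a single Series. The helper name random_rank and its seed parameter are hypothetical and not part of pandas. It sorts by value, uses random noise as the tie-breaker, and, unlike rank with na_option='keep', does not treat NaN specially.

```python
import numpy as np
import pandas as pd

def random_rank(s, seed=None):
    """Hypothetical helper: ascending ranks with ties broken in random order."""
    rng = np.random.default_rng(seed)
    values = s.to_numpy()
    # np.lexsort uses the LAST key as the primary sort key:
    # primary = values, secondary = random noise that orders the ties.
    order = np.lexsort((rng.random(len(values)), values))
    ranks = np.empty(len(values), dtype=np.int64)
    # The element at sorted position i receives rank i + 1.
    ranks[order] = np.arange(1, len(values) + 1)
    return pd.Series(ranks, index=s.index, name=s.name)

s = pd.Series([1, 2, 1, 3, 3, 2, 1, 3], index=list('abcdefgh'))
print(random_rank(s))
```

Each tie group still occupies the same block of ranks it would get from 'min' to 'max' (for example 1 to 3 for the value 1), only the order inside the block changes between runs.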

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [10]: df
Out[10]:
   0
a  1
b  2
c  1
d  3
e  3
f  2
g  1
h  3

# Example run 1
In [11]: df.rank(method='random')
Out[11]:
   0
a  3
b  4
c  1
d  8
e  6
f  5
g  2
h  7

# Example run 2
In [12]: df.rank(method='random')
Out[12]:
   0
a  2
b  5
c  1
d  7
e  8
f  4
g  3
h  6

It would be nice if it could be implemented in pandas as I have huge dataframes (100G) for which I need this feature.

In the worst case, is there a way to do something like this with multiple pandas commands?
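One possible combination of existing pandas commands, sketched under the assumption that the index is unique: shuffle the rows, rank with method='first' so that ties are broken by the (now random) order of appearance, then restore the original row order. Note that with several columns the same row shuffle is reused for every column.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=list('abcdefgh'))

# 1. shuffle all rows
shuffled = df.sample(frac=1)
# 2. rank: ties are resolved by order of appearance in the shuffled frame
# 3. reindex: put the rows back in their original order (needs a unique index)
random_ranks = shuffled.rank(method='first').reindex(df.index)
print(random_ranks)
```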

A slightly related issue: #9481

jreback · 2020-01-15T22:44:30Z

-1 on this

you simply want random numbers between min and max w/o replacement

np.random.randn already does this
no need to add to rank
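For reference, np.random.randn draws floats from a standard normal distribution, so it does not by itself give distinct ranks between min and max. Sampling without replacement corresponds to a permutation. A small sketch of the difference, using illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng()

# randn: floats drawn from a standard normal distribution, not bounded ranks
normal_draws = np.random.randn(3)

# permutation: the given ranks in random order, each used exactly once
tie_ranks = np.arange(1, 4)  # ranks 1, 2, 3 for a group of three ties
shuffled_ranks = rng.permutation(tie_ranks)

print(normal_draws)
print(shuffled_ranks)
```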

ghuls · 2020-01-15T23:27:39Z

@jreback I don't want arbitrary numbers between min and max for a group.
Within each group of ties I want the ranks from min to max assigned in random order, each used exactly once.

I finally found a way to do it with np.random.shuffle.
I assume there is a better way than a for loop (apply, transform, ...? I couldn't figure out the correct way to do it). Any suggestions?

In [1]: import numpy as np
     ...: import pandas as pd
     ...: df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
     ...:
     ...: full = pd.concat([df, df.rank(method='first'), df.rank(method='min'), df.rank(method='max'), df.rank(method='dense'), df], axis=1, sort=False)
     ...: full.columns = ['value', 'first', 'min', 'max', 'dense', 'ranked']
     ...:
     ...: print(full)
     ...:
     ...: full_group = full.groupby('dense')
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0       1
b      2    4.0  4.0  5.0    2.0       2
c      1    2.0  1.0  3.0    1.0       1
d      3    6.0  6.0  8.0    3.0       3
e      3    7.0  6.0  8.0    3.0       3
f      2    5.0  4.0  5.0    2.0       2
g      1    3.0  1.0  3.0    1.0       1
h      3    8.0  6.0  8.0    3.0       3

In [2]: for region_idxs in full_group.indices.values():
     ...:      tie_scores_shuffle = full.iloc[region_idxs, full.columns.get_loc('first')]
     ...:      if tie_scores_shuffle._is_view:
     ...:          tie_scores_shuffle = tie_scores_shuffle.copy()
     ...:      np.random.shuffle(tie_scores_shuffle)
     ...:      full.iloc[region_idxs, full.columns.get_loc('ranked')] = tie_scores_shuffle
     ...:
     ...: print(full)
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0     3.0
b      2    4.0  4.0  5.0    2.0     5.0
c      1    2.0  1.0  3.0    1.0     2.0
d      3    6.0  6.0  8.0    3.0     7.0
e      3    7.0  6.0  8.0    3.0     6.0
f      2    5.0  4.0  5.0    2.0     4.0
g      1    3.0  1.0  3.0    1.0     1.0
h      3    8.0  6.0  8.0    3.0     8.0

In [3]: for region_idxs in full_group.indices.values():
     ...:      tie_scores_shuffle = full.iloc[region_idxs, full.columns.get_loc('first')]
     ...:      if tie_scores_shuffle._is_view:
     ...:          tie_scores_shuffle = tie_scores_shuffle.copy()
     ...:      np.random.shuffle(tie_scores_shuffle)
     ...:      full.iloc[region_idxs, full.columns.get_loc('ranked')] = tie_scores_shuffle
     ...:
     ...: print(full)
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0     1.0
b      2    4.0  4.0  5.0    2.0     5.0
c      1    2.0  1.0  3.0    1.0     2.0
d      3    6.0  6.0  8.0    3.0     7.0
e      3    7.0  6.0  8.0    3.0     8.0
f      2    5.0  4.0  5.0    2.0     4.0
g      1    3.0  1.0  3.0    1.0     3.0
h      3    8.0  6.0  8.0    3.0     6.0
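The loop above can probably be replaced by a single groupby/transform call. This is only a sketch: it rebuilds a reduced version of the full frame with just the 'value' and 'first' columns, and relies on transform accepting a function that returns an array of the same length as the group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=list('abcdefgh'))
full = pd.DataFrame({
    'value': df[0],
    'first': df[0].rank(method='first'),
})

# Within each group of equal values, hand out that group's
# 'first' ranks again, but in random order.
full['ranked'] = (
    full.groupby('value')['first']
        .transform(lambda ranks: rng.permutation(ranks.to_numpy()))
)
print(full)
```

This avoids the private _is_view attribute and the explicit iloc assignments.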

jreback · 2020-01-16T04:24:41Z

you can just do something like

def f(x):
    # shuffle the integers from x.min() up to and including x.max()
    return np.random.permutation(np.arange(x.min(), x.max() + 1))

df.groupby(..)[col].apply(f)

in any event this is out of scope for a method in pandas

ghuls · 2020-03-06T13:50:28Z

The following does what I want:

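def rank_func(x):
    # x holds one group of tied 'min' ranks; spread them over
    # min .. min + n - 1 in random order
    start = x.iloc[0]
    return np.random.permutation(np.arange(start, start + x.shape[0]))

def rank_column_func(x):
    # group each column by its own 'min' rank so that ties share a group
    return x.groupby(x).transform(rank_func)

df.rank(method='min').apply(rank_column_func, axis='rows')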
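Because the output differs on every run, a property check can be handier than comparing against a fixed result. The helper is_valid_tie_break below is hypothetical. It only verifies that every rank from 1 to n is used once and that the values never decrease when read in rank order.

```python
import numpy as np
import pandas as pd

def is_valid_tie_break(values, ranks):
    """Hypothetical check: `ranks` is a strict ranking consistent with `values`."""
    values = np.asarray(values)
    ranks = np.asarray(ranks)
    n = len(values)
    # every rank 1..n appears exactly once
    uses_each_rank_once = np.array_equal(np.sort(ranks), np.arange(1, n + 1))
    # reading the values in rank order must never decrease
    in_rank_order = values[np.argsort(ranks)]
    non_decreasing = bool(np.all(np.diff(in_rank_order) >= 0))
    return uses_each_rank_once and non_decreasing

def rank_func(x):
    start = x.iloc[0]
    return np.random.permutation(np.arange(start, start + x.shape[0]))

def rank_column_func(x):
    return x.groupby(x).transform(rank_func)

df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=list('abcdefgh'))
ranked = df.rank(method='min').apply(rank_column_func, axis=0)
print(is_valid_tie_break(df[0], ranked[0]))
```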

jreback closed this as completed Jan 16, 2020

jreback added the Groupby label Jan 16, 2020

jreback added this to the No action milestone Jan 16, 2020
