Feature: Add "random" rank in the group for DataFrame.rank and similar functions. #31051

Closed
ghuls opened this issue Jan 15, 2020 · 4 comments

ghuls commented Jan 15, 2020

Feature: Add "random" rank in the group for DataFrame.rank and similar functions.

DataFrame.rank(self, axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’, ‘random’}, default ‘average’

How to rank the group of records that have the same value (i.e. ties):

  • average: average rank of the group
  • min: lowest rank in the group
  • max: highest rank in the group
  • first: ranks assigned in order they appear in the array
  • dense: like ‘min’, but rank always increases by 1 between groups
  • random (proposed): ranks within a tie group are assigned in random order, drawn from the group's min to max ranks, with each rank used only once
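One way to approximate the proposed behaviour with existing pandas calls is to shuffle the rows first and then rank with method='first', since 'first' breaks ties in order of appearance. This is only a sketch: `random_tie_rank` is a hypothetical helper name, not a pandas function, and it assumes the index is unique so the result can be put back in the original row order.

```python
import pandas as pd

# Same values as the example in this issue.
df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=list("abcdefgh"))


def random_tie_rank(s: pd.Series, seed=None) -> pd.Series:
    """Hypothetical helper: rank `s`, breaking ties in random order.

    Assumes `s` has a unique index.
    """
    # Put the rows in a random order.
    shuffled = s.sample(frac=1, random_state=seed)
    # 'first' ranks ties in order of appearance, which is now random.
    ranks = shuffled.rank(method="first")
    # Restore the original row order.
    return ranks.reindex(s.index)


ranked = random_tie_rank(df[0], seed=0)
print(ranked)
```

Each tie group still receives exactly the ranks between its 'min' and 'max' rank; only the assignment inside the group changes from run to run (or from seed to seed).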

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [10]: df
Out[10]:
   0
a  1
b  2
c  1
d  3
e  3
f  2
g  1
h  3

# Example run 1
In [11]: df.rank(method='random')
Out[11]:
   0
a  3
b  4
c  1
d  8
e  6
f  5
g  2
h  7

# Example run 2
In [12]: df.rank(method='random')
Out[12]:
   0
a  2
b  5
c  1
d  7
e  8
f  4
g  3
h  6

It would be nice if this could be implemented in pandas, as I have huge dataframes (100G) for which I need this feature.

Failing that, is there a way to do something like this by combining existing pandas commands?
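For large inputs, one possible NumPy-only approach is to sort by the values with a random secondary key, so that ties end up in random order, and then write the positions back as ranks. The sketch below is an illustration, not a pandas feature: `random_tie_rank_np` is a hypothetical name, and it assumes a 1-D array without NaN values.

```python
import numpy as np


def random_tie_rank_np(values, seed=None):
    """Hypothetical helper: 1-based ranks with ties broken at random.

    Assumes `values` is 1-D and contains no NaN.
    """
    values = np.asarray(values)
    n = values.shape[0]
    rng = np.random.default_rng(seed)
    # Random secondary key that decides the order inside each tie group.
    tiebreak = rng.random(n)
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((tiebreak, values))
    ranks = np.empty(n, dtype=np.int64)
    # The element at sorted position i gets rank i + 1.
    ranks[order] = np.arange(1, n + 1)
    return ranks


ranks = random_tie_rank_np([1, 2, 1, 3, 3, 2, 1, 3], seed=0)
print(ranks)
```

This needs a single sort and no Python-level loop over groups, which may matter at the data sizes mentioned above.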

A slightly related issue: #9481

jreback commented Jan 15, 2020

-1 on this

you simply want random numbers between min and max w/o replacement

np.random.permutation already does this
no need to add to rank

ghuls commented Jan 15, 2020

@jreback But I don't want just any number between min and max for a given group.
Within each group I want a random ranking between min and max, with no rank used twice.
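The requirement can be written down as a small check: every rank must lie between the 'min' and 'max' rank of its tie group, and no rank may occur twice. The function name `is_valid_random_rank` is made up for this illustration; it only uses the existing rank methods.

```python
import pandas as pd


def is_valid_random_rank(values: pd.Series, ranks: pd.Series) -> bool:
    """Hypothetical check for the requested 'random' tie-breaking."""
    lo = values.rank(method="min")  # lowest rank allowed for each row
    hi = values.rank(method="max")  # highest rank allowed for each row
    within_bounds = ((ranks >= lo) & (ranks <= hi)).all()
    return bool(within_bounds and ranks.is_unique)


values = pd.Series([1, 2, 1, 3, 3, 2, 1, 3], index=list("abcdefgh"))

# The ranks from "Example run 1" in the feature request.
run_1 = pd.Series([3, 4, 1, 8, 6, 5, 2, 7], index=values.index)
# Plain 'min' ranks repeat within a tie group, so they do not qualify.
all_min = values.rank(method="min")

print(is_valid_random_rank(values, run_1))
print(is_valid_random_rank(values, all_min))
```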

I finally found a way to solve it using np.random.shuffle.
I assume there is a better way than a for loop (apply, transform, ...? I couldn't figure out the correct way to do it). Any suggestions?

In [1]: import numpy as np
     ...: import pandas as pd
     ...: df = pd.DataFrame([1, 2, 1, 3, 3, 2, 1, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
     ...:
     ...: full = pd.concat([df, df.rank(method='first'), df.rank(method='min'), df.rank(method='max'), df.rank(method='dense'), df], axis=1, sort=False)
     ...: full.columns = ['value', 'first', 'min', 'max', 'dense', 'ranked']
     ...:
     ...: print(full)
     ...:
     ...: full_group = full.groupby('dense')
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0       1
b      2    4.0  4.0  5.0    2.0       2
c      1    2.0  1.0  3.0    1.0       1
d      3    6.0  6.0  8.0    3.0       3
e      3    7.0  6.0  8.0    3.0       3
f      2    5.0  4.0  5.0    2.0       2
g      1    3.0  1.0  3.0    1.0       1
h      3    8.0  6.0  8.0    3.0       3

In [2]: for region_idxs in full_group.indices.values():
     ...:      # Shuffle a NumPy copy of this tie group's 'first' ranks (no private _is_view check needed)
     ...:      tie_scores_shuffle = full.iloc[region_idxs, full.columns.get_loc('first')].to_numpy(copy=True)
     ...:      np.random.shuffle(tie_scores_shuffle)
     ...:      full.iloc[region_idxs, full.columns.get_loc('ranked')] = tie_scores_shuffle
     ...:
     ...: print(full)
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0     3.0
b      2    4.0  4.0  5.0    2.0     5.0
c      1    2.0  1.0  3.0    1.0     2.0
d      3    6.0  6.0  8.0    3.0     7.0
e      3    7.0  6.0  8.0    3.0     6.0
f      2    5.0  4.0  5.0    2.0     4.0
g      1    3.0  1.0  3.0    1.0     1.0
h      3    8.0  6.0  8.0    3.0     8.0

In [3]: for region_idxs in full_group.indices.values():
     ...:      # Shuffle a NumPy copy of this tie group's 'first' ranks (no private _is_view check needed)
     ...:      tie_scores_shuffle = full.iloc[region_idxs, full.columns.get_loc('first')].to_numpy(copy=True)
     ...:      np.random.shuffle(tie_scores_shuffle)
     ...:      full.iloc[region_idxs, full.columns.get_loc('ranked')] = tie_scores_shuffle
     ...:
     ...: print(full)
     ...:
     ...:
   value  first  min  max  dense  ranked
a      1    1.0  1.0  3.0    1.0     1.0
b      2    4.0  4.0  5.0    2.0     5.0
c      1    2.0  1.0  3.0    1.0     2.0
d      3    6.0  6.0  8.0    3.0     7.0
e      3    7.0  6.0  8.0    3.0     8.0
f      2    5.0  4.0  5.0    2.0     4.0
g      1    3.0  1.0  3.0    1.0     3.0
h      3    8.0  6.0  8.0    3.0     6.0

jreback commented Jan 16, 2020

you can just do something like

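def f(x):
    # np.arange excludes its stop value, so add 1 to include x.max()
    return np.random.permutation(np.arange(x.min(), x.max() + 1))
# transform (rather than apply) keeps the result aligned with the original rows
df.groupby(..)[col].transform(f)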

in any event this is out of scope for a method in pandas
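A runnable variant of this suggestion, as a sketch: within a tie group the method='first' ranks are exactly the integers from the group's min rank to its max rank, so permuting them per group gives a random ranking without repeats. The Series `s`, the seed, and the variable names below are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the example repeatable
s = pd.Series([1, 2, 1, 3, 3, 2, 1, 3], index=list("abcdefgh"))

# 'first' ranks: consecutive integers inside every tie group.
first = s.rank(method="first")

# Group the ranks by the original values and permute them within each group.
# transform expects one value per row of the group, which a permutation provides.
random_rank = first.groupby(s).transform(lambda r: rng.permutation(r.to_numpy()))

print(random_rank)
```

Using transform instead of apply returns a Series aligned with the original index, so the result can be assigned back as a new column directly.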

jreback closed this as completed Jan 16, 2020
jreback added this to the No action milestone Jan 16, 2020

ghuls commented Mar 6, 2020

The following does what I want:

def rank_func(x):
    # x holds the shared 'min' rank of one tie group; use positional access via .iloc
    return np.random.permutation(np.arange(x.iloc[0], x.iloc[0] + x.shape[0]))

def rank_column_func(x):
    return x.groupby(x).transform(rank_func)

df.rank(method='min').apply(rank_column_func, axis='rows')
