PERF: groupby rank is slow when tie count is big #21237

Closed
peterpanmj opened this issue May 29, 2018 · 1 comment · Fixed by #21285
Labels: Groupby, Performance

Comments

@peterpanmj (Contributor)

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()

608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [32]: %%timeit
    ...: t = df.A.rank()
1.27 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [33]: %%timeit
    ...: t = df.groupby("B").apply(pd.Series.rank)
    ...:
6.51 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
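The IPython cells above can also be reproduced as a standalone script (a minimal sketch; `timeit.timeit` stands in for the `%%timeit` magic, and absolute numbers will vary by machine):

```python
# Standalone version of the benchmark above. Since df has a single
# group (B == 1 everywhere), df.groupby("B").rank() and df["A"].rank()
# compute the same ranks, which makes the timing gap directly comparable.
import timeit

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

grouped = timeit.timeit(lambda: df.groupby("B").rank(), number=5) / 5
plain = timeit.timeit(lambda: df["A"].rank(), number=5) / 5
print(f"groupby rank: {grouped * 1e3:.2f} ms, plain rank: {plain * 1e3:.2f} ms")
```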

Problem description

groupby rank is much slower than a plain rank when there are many ties.

Expected Output

In [42]: df1 = pd.DataFrame({"A":np.random.rand(30000) ,"B":[1]*30000})

In [44]: %%timeit
    ...: t = df1.groupby("B").apply(pd.Series.rank)
    ...:
10.1 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %%timeit
    ...: t = df1.groupby("B").rank()
    ...:
4.77 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 3b770fa
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+32.g3b770fa07
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.7
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd (Member) commented May 29, 2018

Not surprised by this as it is even called out in the comments of that function:

# this implementation is inefficient because it will

Investigation and a PR for a more efficient implementation would certainly be welcome!
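As one illustration of the direction such an implementation could take (a hedged sketch only, not pandas' actual internals; the helper name `group_rank_average` is made up), average-method ranks within groups can be computed in O(n log n) with a single lexsort and vectorized tie-run detection, so the cost does not grow with the number of ties:

```python
# Illustrative sketch: 'average'-method rank within groups using one
# lexsort plus vectorized run detection. Not pandas' implementation.
import numpy as np


def group_rank_average(values, labels):
    """Rank `values` within each group given by `labels` (ties -> average)."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    n = len(values)

    # Sort by (group, value) so equal values within a group are contiguous.
    order = np.lexsort((values, labels))
    sorted_labels = labels[order]
    sorted_values = values[order]

    # 0-based position of each element within its group.
    group_start = np.empty(n, dtype=bool)
    group_start[0] = True
    group_start[1:] = sorted_labels[1:] != sorted_labels[:-1]
    start_idx = np.maximum.accumulate(np.where(group_start, np.arange(n), 0))
    pos = np.arange(n) - start_idx

    # A new tie-run begins at each group boundary or value change.
    new_run = group_start.copy()
    new_run[1:] |= sorted_values[1:] != sorted_values[:-1]
    run_id = np.cumsum(new_run) - 1
    run_first = np.flatnonzero(new_run)
    run_last = np.append(run_first[1:], n)  # exclusive end of each run

    # Average of the 1-based ranks spanned by each tie-run.
    avg = (pos[run_first] + pos[run_last - 1]) / 2.0 + 1.0
    ranks_sorted = avg[run_id]

    out = np.empty(n, dtype=float)
    out[order] = ranks_sorted
    return out
```

A real PR would of course also need the other tie methods (`min`, `max`, `first`, `dense`), NaN handling, and the `ascending`/`pct` options, but the key point is that tie handling here is fully vectorized.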
