df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()
608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [32]: %%timeit
    ...: t = df.A.rank()
1.27 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [33]: %%timeit
    ...: t = df.groupby("B").apply(pd.Series.rank)
    ...:
6.51 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
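groupby rank is much slower than rank without groupby when there are a lot of ties: 608 ms versus 1.27 ms for the 30,000-row frame above. With random values (almost no ties), groupby rank is not slow: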
In [42]: df1 = pd.DataFrame({"A": np.random.rand(30000), "B": [1] * 30000})

In [44]: %%timeit
    ...: t = df1.groupby("B").apply(pd.Series.rank)
    ...:
10.1 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %%timeit
    ...: t = df1.groupby("B").rank()
    ...:
4.77 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
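For readers without an IPython session, the two measurements above can be approximated with a plain script. This is a minimal sketch using the standard-library timeit module; the helper names make_frames and time_rank are hypothetical, the column "A" is selected explicitly before ranking, and the timings it prints depend on the installed pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


def make_frames(reps=10000, seed=0):
    """Build the two single-group frames from the report: tied and untied values."""
    n = 3 * reps
    rng = np.random.default_rng(seed)
    tied = pd.DataFrame({"A": [1, 2, 3] * reps, "B": [1] * n})
    untied = pd.DataFrame({"A": rng.random(n), "B": [1] * n})
    return tied, untied


def time_rank(df, repeat=3):
    """Return the best wall-clock time in seconds for each ranking approach."""
    candidates = {
        "groupby.rank": lambda: df.groupby("B")["A"].rank(),
        "Series.rank": lambda: df["A"].rank(),
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in candidates.items()
    }


tied, untied = make_frames()

# With a single group, both approaches should produce the same ranks.
assert np.allclose(tied.groupby("B")["A"].rank(), tied["A"].rank())

for label, frame in [("many ties", tied), ("no ties", untied)]:
    for name, seconds in time_rank(frame).items():
        print(f"{label:>9} | {name:<12} | {seconds * 1e3:8.2f} ms")
```

Taking the minimum over a few repeats reduces noise from other processes; it does not reproduce the mean ± std. dev. figures that %%timeit reports.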
Output of pd.show_versions():

pandas: 0.24.0.dev0+32.g3b770fa07
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.7
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Not surprised by this as it is even called out in the comments of that function:
pandas/pandas/_libs/groupby_helper.pxi.in
Line 524 in b2eec25
Investigation and a PR for a more efficient implementation would certainly be welcome!
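One direction such an investigation could take is to sort once by (group, value) and then give every run of equal values its average rank in a single vectorized pass, so that the cost does not grow with the number of ties. The following is an illustrative NumPy sketch only, not the pandas implementation and not the change made in the linked PR. The function name grouped_average_rank is hypothetical, and the sketch assumes no missing values and covers only ascending ranks with average tie-breaking.

```python
import numpy as np


def grouped_average_rank(values, groups):
    """Average rank of each value within its group (ascending, ties averaged).

    Hypothetical helper for illustration; assumes no NaN in `values`.
    """
    values = np.asarray(values)
    groups = np.asarray(groups)
    n = len(values)

    # Sort by group first, then by value within each group.
    order = np.lexsort((values, groups))
    v = values[order]
    g = groups[order]

    # A new tie run starts wherever the group or the value changes.
    new_run = np.ones(n, dtype=bool)
    new_run[1:] = (v[1:] != v[:-1]) | (g[1:] != g[:-1])
    run_id = np.cumsum(new_run) - 1
    run_start = np.flatnonzero(new_run)
    run_len = np.diff(np.append(run_start, n))

    # For every sorted position, the position where its group begins.
    new_group = np.ones(n, dtype=bool)
    new_group[1:] = g[1:] != g[:-1]
    group_start = np.maximum.accumulate(np.where(new_group, np.arange(n), 0))

    # 1-based rank of the first element of each run within its group,
    # shifted to the middle of the run to get the average rank.
    first_pos = run_start - group_start[run_start] + 1
    avg_rank = first_pos + (run_len - 1) / 2.0

    # Scatter the per-run ranks back to the original row order.
    ranks = np.empty(n, dtype=float)
    ranks[order] = avg_rank[run_id]
    return ranks


# Two groups; the first contains a three-way tie on the value 1.
print(grouped_average_rank([1, 1, 1, 2, 5, 4], [0, 0, 0, 0, 1, 1]))
```

The sort dominates the running time at O(n log n); every step after it is linear in the number of rows, regardless of how many rows share a value.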
PERF: improve performance of groupby rank (pandas-dev#21237)
cb3f778
fbb05d4
PERF: improve performance of groupby rank (#21237) (#21285)
2a33926
PERF: improve performance of groupby rank (pandas-dev#21237) (pandas-dev#21285)
c78c269
591248f