Performance issue while groupby.shift if fill_value specified #26615
Comments
Thanks for the report - the alternate solution you've identified is spot on. If you'd like to take a look at improving performance with the `fill_value` keyword we would certainly take PRs!
@BeforeFlight are you interested in working on this?
@TomAugspurger honestly I'm very much a novice on GitHub (and, one might say, in pandas as well), so I have almost never worked with 'branches - pull requests' schemes. I would have to take quite some time to read the docs / set up an environment / test the environment / fully understand TDD / ... Besides, I'm now encountering another pandas performance issue, with HDFStore, which slows my work a great deal. But I'm not really sure it's a bug (one may skim my 'investigations' to judge whether it is worth it at all?), so I haven't opened another issue. So I think the community just doesn't have that much time to wait for my pull requests :) (hypothetical for now). But if this issue somehow isn't solved by then, it will become my first focus for sure.
But I can offer some considerations even now, which may be helpful. I think that the reason for such delay of execution And as for implementation - I suppose it shouldn't be done in several steps like at the end of my 1st comment (remember
@BeforeFlight thanks for the initial insights. So one possibility to make this significantly faster is to still use a vectorized solution even when
Looks like
Test code:
Suppose we want to group by level 0 of the index and shift each group (by 2 steps, for example).
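The original test code did not survive on this page, so the following is only a minimal sketch of the setup described above. The frame shape, the column names `a`, `b`, `c`, the index level names and the random data are all assumptions made for illustration, not the reporter's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical test data (sizes and names are assumptions):
# a two-level MultiIndex and three uint8 columns.
rng = np.random.default_rng(0)
n_groups, rows_per_group = 1000, 10
index = pd.MultiIndex.from_product(
    [range(n_groups), range(rows_per_group)], names=["group", "row"]
)
df = pd.DataFrame(
    rng.integers(0, 255, size=(len(index), 3), dtype=np.uint8),
    index=index,
    columns=["a", "b", "c"],
)

# Group by level 0 of the index and shift each group by 2 rows.
# The first 2 rows of every group have no source value and become NaN,
# which forces the uint8 columns to be upcast to a float dtype.
shifted = df.groupby(level=0).shift(2)
```

With this setup every group loses its last two values and gains two missing values at the top, which is why the `uint8` dtype cannot be kept without a fill value.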
Now to the problem: if we want to preserve our `uint8` dtype, we have to get rid of the `None`s by setting a fill value, 0 for example. But the execution time then becomes HUGE:
Expected time of execution should be comparable with the following:
If we take the first shifted dataframe, produced without `fill_value`, and add a few lines of code to achieve the same result:
Output:
This adds only a few ms, not 5 seconds.
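As a rough illustration of the two routes being compared, here is a small sketch on a hypothetical six-row frame (the data and the names `key` and `val` are assumptions, not the reporter's code). It only shows that both routes give the same values; the timings quoted in the report refer to pandas 0.24.2 and a much larger frame, and are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: two groups of three uint8 values each.
df = pd.DataFrame(
    {
        "key": [0, 0, 0, 1, 1, 1],
        "val": np.array([1, 2, 3, 4, 5, 6], dtype=np.uint8),
    }
).set_index("key")

grouped = df.groupby(level=0)

# Direct route: pass fill_value to the grouped shift
# (the slow path described in this report).
direct = grouped.shift(2, fill_value=0)

# Workaround: shift without fill_value, then fill the missing
# values and cast back to the original dtype in separate steps.
workaround = grouped.shift(2).fillna(0).astype("uint8")
```

To compare the cost of the two routes on a larger frame, each expression can be wrapped in `timeit` or IPython's `%timeit`.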
pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None