Improve poor performance of groupby resampling #14079
Comments
I didn't look in detail at your example, but note that the two functions return different results, so they do not seem to be equivalent.
Duplicate of #11296. Pull requests are welcome!
The actual grouping / resample is not the bottleneck here at all; rather, it is the fact that …
Yeah, to be clear, I wasn't proposing the other code as a replacement, just to show the performance issue (it gets worse on even bigger data sets). I didn't think it was worth the extra fiddling to get the column ordering and dtypes to match. The general application is converting transaction records into an unbalanced panel data set, which I can then run standard regression methods on.

I had assumed that it was mostly the groupby that was causing the slowness. Is that wrong? Based on that assumption, I thought one solution for my use case would be for resample to support pad operations on non-unique indexes (or more complete support for resampling MultiIndexes, which is essentially what my DATE and ID pairs represent). Currently if I do:

I get:

Whereas it would be nice if it could handle non-unique indexes by treating the non-ascending dates as boundaries and return something like:

But I guess that's moot if the problem is that reindex filling relies on looping. As for pull requests, I found the code that does the reindexing / resampling pretty opaque, so I would need a few pointers to get started.
Number of rows: 960,000
When trying to resample transaction data where there are infrequent transactions for a large number of people, I get horrible performance. Compare the function annualize with the clunkier but faster annualize2 below. Ideally, resample should be able to handle MultiIndex data and resample on one of the dimensions without the need to resort to groupby.
Code Sample, a copy-pastable example if possible
Expected Output
Output of `pd.show_versions()`:
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None