-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
performance regression in ewm.corr(pairwise=True) #17917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
if you replace this does it pass the tests? want to do a PR? (would also appreciate an asv for this). |
What's an ASV? |
Thanks. I realized my patch above won't work if arg1.columns or arg2.columns is a MultiIndex. I can fix it but I need a way of combining two indices into a MultiIndex where one (or both) of the indices could be a MultiIndex. I.e. if idx1 and idx2 are Index objects, then I need an expression to compute idx equivalent to: idx = concat([DataFrame([], columns=idx2)] * len(idx1), keys=idx1).index The above is too slow since it creates too many unnecessary empty DataFrames. If someone can provide me with a fast way of doing that I can get this working. Apologies, this is probably the wrong venue for this...I don't ever submit patches. |
not really sure what you need, pls show an example
|
* PERF: remove use of Panel & perf in rolling corr/cov closes #17917
* PERF: remove use of Panel & perf in rolling corr/cov closes pandas-dev#17917
Problem description
The deprecation of Panel in the 0.20.x releases has introduced a severe performance regression in ewm.corr(pairwise=True) for a common case when this function is called on a long time series (e.g. a dataframe with 1 million rows and 6 columns). The issue is the last 3 lines of code in this section of core/window.py:
The result is converted from a Panel to a DataFrame by running a concat along an axis that is typically very long. This is killing performance for me compared to the 0.19 releases.
My solution was to replace the last 3 lines with:
This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.
I'd really appreciate getting a workaround in though - clearly from the TODO comment the developers were at least aware this was a performance issue when they added the code. I'm happy to rejig the above if it has problems.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: