performance regression in ewm.corr(pairwise=True) #17917

Closed
grumpyquant opened this issue Oct 19, 2017 · 5 comments · Fixed by #19257
Labels
Numeric Operations · Performance
Milestone
0.23.0

Comments

@grumpyquant

Problem description

The deprecation of Panel in the 0.20.x releases introduced a severe performance regression in ewm.corr(pairwise=True) for the common case where the function is called on a long time series (e.g. a DataFrame with 1 million rows and 6 columns). The issue is the last three lines of code in this section of core/window.py:

                # TODO: not the most efficient (perf-wise)
                # though not bad code-wise
                from pandas import Panel, MultiIndex, concat

                with warnings.catch_warnings(record=True):
                    p = Panel.from_dict(results).swapaxes('items', 'major')
                    if len(p.major_axis) > 0:
                        p.major_axis = arg1.columns[p.major_axis]
                    if len(p.minor_axis) > 0:
                        p.minor_axis = arg2.columns[p.minor_axis]

                if len(p.items):
                    result = concat(
                        [p.iloc[i].T for i in range(len(p.items))],
                        keys=p.items)

The result is converted from a Panel to a DataFrame by running a concat along an axis that is typically very long. This is killing performance for me compared to the 0.19 releases.
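A minimal sketch of the kind of call that hits this path (shapes mirror the description above; the span value is arbitrary):

import numpy as np
import pandas as pd

# roughly the shape described above: ~1 million rows, 6 columns
df = pd.DataFrame(np.random.randn(10**6, 6), columns=list('abcdef'))

# on 0.20.x most of the time is spent in the final concat over p.items,
# which has one entry per row of the input
result = df.ewm(span=20).corr(pairwise=True)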

My solution was to replace the last 3 lines with:

                    result = DataFrame(
                        p.values.reshape((p.shape[0], p.shape[1] * p.shape[2])),
                        index=p.items,
                        columns=MultiIndex.from_product((arg1.columns, arg2.columns))
                    )
                    result = result.stack(dropna=False)

This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.

I'd really appreciate getting a fix in, though; judging by the TODO comment, the developers were already aware this was a performance issue when they added the code. I'm happy to rework the above if it has problems.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Oct 19, 2017

if you replace this does it pass the tests? want to do a PR? (would also appreciate an asv for this).
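(For context, asv is airspeed velocity, the benchmark suite pandas runs from asv_bench/benchmarks/. A benchmark exercising this path could look roughly like the sketch below; the class and method names are illustrative, not necessarily what was merged with the eventual fix.)

import numpy as np
import pandas as pd


class EWMPairwiseCorr(object):
    # setup() builds the input frame; asv times the time_* methods
    def setup(self):
        self.df = pd.DataFrame(np.random.randn(10**5, 6))

    def time_corr_pairwise(self):
        self.df.ewm(span=20).corr(pairwise=True)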

@jreback jreback added the Performance and Numeric Operations labels Oct 19, 2017
@grumpyquant
Author

What's an ASV?

@jreback
Contributor

jreback commented Oct 19, 2017

@grumpyquant
Author

grumpyquant commented Oct 19, 2017

Thanks.

I realized my patch above won't work if arg1.columns or arg2.columns is a MultiIndex. I can fix it but I need a way of combining two indices into a MultiIndex where one (or both) of the indices could be a MultiIndex.

I.e. if idx1 and idx2 are Index objects, then I need an expression to compute idx equivalent to:

idx = concat([DataFrame([], columns=idx2)] * len(idx1), keys=idx1).index

The above is too slow since it creates too many unnecessary empty DataFrames.

If someone can provide me with a fast way of doing that, I can get this working. Apologies, this is probably the wrong venue for this; I've never submitted a patch before.

@jreback
Contributor

jreback commented Oct 19, 2017

not really sure what you need, pls show an example

In [15]: idx1 = pd.Index(list('abc'))

In [16]: idx2 = pd.Index([1, 2, 3])

In [17]: pd.MultiIndex.from_arrays([idx1, idx2])
Out[17]: 
MultiIndex(levels=[['a', 'b', 'c'], [1, 2, 3]],
           labels=[[0, 1, 2], [0, 1, 2]])
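from_arrays pairs the two indices element-wise, whereas the case above needs the full cross product, with either input possibly a MultiIndex. One way to build that directly from levels and codes, without materialising empty DataFrames, is sketched below; cross_index is a hypothetical helper, not existing pandas API, and it uses the newer codes= spelling (older releases call the same argument labels=):

import numpy as np
import pandas as pd

def cross_index(idx1, idx2):
    # cross product of two Index objects, either of which may be a MultiIndex
    def levels_codes(idx):
        if isinstance(idx, pd.MultiIndex):
            return list(idx.levels), list(idx.codes)
        codes, uniques = pd.factorize(idx)
        return [pd.Index(uniques)], [codes]

    lev1, cod1 = levels_codes(idx1)
    lev2, cod2 = levels_codes(idx2)
    n1, n2 = len(idx1), len(idx2)
    # left index varies slowest: repeat its codes, tile the right codes
    codes = ([np.repeat(c, n2) for c in cod1] +
             [np.tile(c, n1) for c in cod2])
    return pd.MultiIndex(levels=lev1 + lev2, codes=codes)

For flat inputs this is essentially MultiIndex.from_product; going through levels and codes is what lets MultiIndex inputs pass through without building every tuple explicitly.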

@jreback jreback added this to the 0.21.1 milestone Oct 20, 2017
@jreback jreback modified the milestones: 0.21.1, Next Major Release Dec 2, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 15, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 15, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 16, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 27, 2018
jreback added a commit to jreback/pandas that referenced this issue Feb 1, 2018
jreback added a commit that referenced this issue Feb 1, 2018
* PERF: remove use of Panel & perf in rolling corr/cov

closes #17917
harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018
* PERF: remove use of Panel & perf in rolling corr/cov

closes pandas-dev#17917