performance regression in ewm.corr(pairwise=True) #17917

Closed
grumpyquant opened this issue Oct 19, 2017 · 5 comments · Fixed by #19257
Labels
Numeric Operations · Performance
Milestone
0.23.0

Comments

@grumpyquant

Problem description

The deprecation of Panel in the 0.20.x releases introduced a severe performance regression in ewm.corr(pairwise=True) for the common case where the function is called on a long time series (e.g. a DataFrame with 1 million rows and 6 columns). The issue is the last three lines of code in this section of core/window.py:

                # TODO: not the most efficient (perf-wise)
                # though not bad code-wise
                from pandas import Panel, MultiIndex, concat

                with warnings.catch_warnings(record=True):
                    p = Panel.from_dict(results).swapaxes('items', 'major')
                    if len(p.major_axis) > 0:
                        p.major_axis = arg1.columns[p.major_axis]
                    if len(p.minor_axis) > 0:
                        p.minor_axis = arg2.columns[p.minor_axis]

                if len(p.items):
                    result = concat(
                        [p.iloc[i].T for i in range(len(p.items))],
                        keys=p.items)

The result is converted from a Panel to a DataFrame by running a concat along an axis that is typically very long. This is killing performance for me compared to the 0.19 releases.
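A minimal sketch of the kind of call that hits this path (shapes mirror the description above; the span value is arbitrary):

import numpy as np
import pandas as pd

# roughly the shape described above: ~1 million rows, 6 columns
df = pd.DataFrame(np.random.randn(10**6, 6), columns=list('abcdef'))

# on 0.20.x most of the time is spent in the final concat over p.items,
# which has one entry per row of the input
result = df.ewm(span=20).corr(pairwise=True)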

My solution was to replace the last 3 lines with:

                    result = DataFrame(
                        p.values.reshape((p.shape[0], p.shape[1] * p.shape[2])),
                        index=p.items,
                        columns=MultiIndex.from_product((arg1.columns, arg2.columns))
                    )
                    result = result.stack(dropna=False)

This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.

I'd really appreciate getting a fix in, though; judging by the TODO comment, the developers were already aware this was a performance issue when they added the code. I'm happy to rework the above if it has problems.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Oct 19, 2017

if you replace this does it pass the tests? want to do a PR? (would also appreciate an asv for this).
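(For context, asv is airspeed velocity, the benchmark suite pandas runs from asv_bench/benchmarks/. A benchmark exercising this path could look roughly like the sketch below; the class and method names are illustrative, not necessarily what was merged with the eventual fix.)

import numpy as np
import pandas as pd


class EWMPairwiseCorr(object):
    # setup() builds the input frame; asv times the time_* methods
    def setup(self):
        self.df = pd.DataFrame(np.random.randn(10**5, 6))

    def time_corr_pairwise(self):
        self.df.ewm(span=20).corr(pairwise=True)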

@jreback jreback added the Performance and Numeric Operations labels Oct 19, 2017
@grumpyquant
Author

What's an ASV?

@jreback
Contributor

jreback commented Oct 19, 2017

@grumpyquant
Author

grumpyquant commented Oct 19, 2017

Thanks.

I realized my patch above won't work if arg1.columns or arg2.columns is a MultiIndex. I can fix it but I need a way of combining two indices into a MultiIndex where one (or both) of the indices could be a MultiIndex.

I.e. if idx1 and idx2 are Index objects, then I need an expression to compute idx equivalent to:

idx = concat([DataFrame([], columns=idx2)] * len(idx1), keys=idx1).index

The above is too slow since it creates too many unnecessary empty DataFrames.

If someone can provide me with a fast way of doing that, I can get this working. Apologies, this is probably the wrong venue for this; I've never submitted a patch before.

@jreback
Contributor

jreback commented Oct 19, 2017

not really sure what you need, pls show an example

In [15]: idx1 = pd.Index(list('abc'))

In [16]: idx2 = pd.Index([1, 2, 3])

In [17]: pd.MultiIndex.from_arrays([idx1, idx2])
Out[17]: 
MultiIndex(levels=[['a', 'b', 'c'], [1, 2, 3]],
           labels=[[0, 1, 2], [0, 1, 2]])
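from_arrays pairs the two indices element-wise, whereas the case above needs the full cross product, with either input possibly a MultiIndex. One way to build that directly from levels and codes, without materialising empty DataFrames, is sketched below; cross_index is a hypothetical helper, not existing pandas API, and it uses the newer codes= spelling (older releases call the same argument labels=):

import numpy as np
import pandas as pd

def cross_index(idx1, idx2):
    # cross product of two Index objects, either of which may be a MultiIndex
    def levels_codes(idx):
        if isinstance(idx, pd.MultiIndex):
            return list(idx.levels), list(idx.codes)
        codes, uniques = pd.factorize(idx)
        return [pd.Index(uniques)], [codes]

    lev1, cod1 = levels_codes(idx1)
    lev2, cod2 = levels_codes(idx2)
    n1, n2 = len(idx1), len(idx2)
    # left index varies slowest: repeat its codes, tile the right codes
    codes = ([np.repeat(c, n2) for c in cod1] +
             [np.tile(c, n1) for c in cod2])
    return pd.MultiIndex(levels=lev1 + lev2, codes=codes)

For flat inputs this is essentially MultiIndex.from_product; going through levels and codes is what lets MultiIndex inputs pass through without building every tuple explicitly.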

@jreback jreback added this to the 0.21.1 milestone Oct 20, 2017
@jreback jreback modified the milestones: 0.21.1, Next Major Release Dec 2, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 15, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 15, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 16, 2018
jreback added a commit to jreback/pandas that referenced this issue Jan 27, 2018
jreback added a commit to jreback/pandas that referenced this issue Feb 1, 2018
jreback added a commit that referenced this issue Feb 1, 2018
* PERF: remove use of Panel & perf in rolling corr/cov

closes #17917
harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018
* PERF: remove use of Panel & perf in rolling corr/cov

closes pandas-dev#17917