DataFrame __getitem__ ~100X slower on Pandas 0.19.1 vs 0.18.1 possibly getitem caching? #14930


Closed
dragoljub opened this issue Dec 20, 2016 · 7 comments
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance), Regression (Functionality that used to work in a prior pandas version)
Milestone: 0.19.2

@dragoljub

# Run on Pandas 0.19.1

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000,5000), columns=['C'+str(c) for c in range(5000)])

%%prun
for col in df:
    df[col]

         65011 function calls (65010 primitive calls) in 2.012 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    1.935    0.000    1.955    0.000 base.py:1362(__contains__)
     5000    0.025    0.000    2.002    0.000 frame.py:2033(__getitem__)
        1    0.011    0.011    2.012    2.012 <string>:2(<module>)
     5000    0.010    0.000    0.016    0.000 frame.py:2059(_getitem_column)
     5001    0.007    0.000    0.007    0.000 {method 'view' of 'numpy.ndarray' objects}
     5001    0.004    0.000    0.012    0.000 base.py:492(values)
     5000    0.004    0.000    0.007    0.000 generic.py:1381(_get_item_cache)
     5000    0.004    0.000    0.018    0.000 base.py:1275(<lambda>)
     5000    0.003    0.000    0.014    0.000 base.py:874(_values)
     5000    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
     5000    0.002    0.000    0.003    0.000 common.py:438(_apply_if_callable)
     5000    0.002    0.000    0.002    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
        1    0.000    0.000    2.012    2.012 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:833(__iter__)
        1    0.000    0.000    0.000    0.000 generic.py:401(_info_axis)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 base.py:1315(__iter__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 4.39 s

Problem description

It appears that _get_item_cache() or Index.__contains__ may have something to do with the slowdown. This also affects other functionality such as df.info(), which is now ~100X slower as well.
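
As a quick way to confirm where the time goes, here is a minimal, self-contained timing sketch (not from the original report; the column name and call count are arbitrary) that exercises only the column-membership check highlighted in the profile above:

import timeit

import numpy as np
import pandas as pd

# Same shape as the report: a wide frame with many columns.
df = pd.DataFrame(np.random.randn(1000, 5000),
                  columns=['C' + str(c) for c in range(5000)])

# The profile above attributes nearly all of the time to Index.__contains__,
# so time 5000 membership checks against the column Index directly.
elapsed = timeit.timeit(lambda: 'C2500' in df.columns, number=5000)
print('5000 column __contains__ calls took %.3f s' % elapsed)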

Expected Output

# Run on Pandas 0.18.1

In [6]: %%prun
   ...: for col in df:
   ...:     df[col]
   ...:
         45011 function calls (45010 primitive calls) in 0.032 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    0.012    0.000    0.028    0.000 frame.py:1973(__getitem__)
     5000    0.005    0.000    0.008    0.000 frame.py:1999(_getitem_column)
     5000    0.004    0.000    0.005    0.000 base.py:1233(__contains__)
        1    0.004    0.004    0.032    0.032 <string>:2(<module>)
     5000    0.002    0.000    0.003    0.000 generic.py:1345(_get_item_cache)
     5000    0.001    0.000    0.002    0.000 common.py:1846(_apply_if_callable)
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     5000    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
        1    0.000    0.000    0.032    0.032 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:808(__iter__)
        1    0.000    0.000    0.000    0.000 base.py:440(values)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 generic.py:381(_info_axis)
        1    0.000    0.000    0.000    0.000 base.py:1186(__iter__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 40 ms

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback jreback added this to the 0.19.2 milestone Dec 20, 2016
@jreback
Contributor

jreback commented Dec 20, 2016

I have a fix for this.

@jreback
Contributor

jreback commented Dec 20, 2016

I changed this in 3c96442 to make it lazy, but it keeps reinitializing, which is not great.

@jreback jreback added the Indexing, Performance, and Regression labels Dec 20, 2016
@dragoljub
Author

@jreback Thanks for looking into this. 👍

Can you provide some info about over_size_threshold? It appears this issue does not affect all frames, just those with many columns. What is over_size_threshold currently set to? Is there a way to prevent creation of hash tables on the column index?

Was this regression due to something that improved performance of row-based indexing but possibly hurt columnar access?

@jreback
Contributor

jreback commented Dec 20, 2016

See the referenced issues; they explain everything.

@jreback
Contributor

jreback commented Dec 20, 2016

@dragoljub the short answer is this.

When the over_size_threshold was triggered (index length > 1M), a different path was followed where the hashtable was constructed for any type of indexing operation (once it is constructed it stays until collected). The issue is that for these large indexes we actually use searchsorted rather than the hashtable to look things up, so we were needlessly constructing it and using memory for it.

The solution was to do our uniqueness checking at the same time as we are already checking for sortedness (e.g. is_monotonic_increasing).

However, it seems that this was constructing the index mapping each time (bad).

cc @llllllllll
cc @ssanderson
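
To make the above concrete, here is a toy sketch (my own illustration, not pandas internals; the class and attribute names are hypothetical) of the intended behavior: a monotonic index answers lookups via searchsorted and never builds a hashtable, and the monotonicity/uniqueness flags are computed once and cached so repeated __contains__ calls stay cheap:

import numpy as np

class LazyIndexEngine:
    """Illustrative stand-in for an index engine; not the pandas code."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self._is_monotonic = None
        self._is_unique = None
        self._mapping = None  # hashtable built only if really needed

    def _check_sorted_and_unique(self):
        # One pass decides both properties; cache the results.
        diffs = np.diff(self._values)
        self._is_monotonic = bool(np.all(diffs >= 0))
        if self._is_monotonic:
            # Sorted and no repeated neighbors means unique.
            self._is_unique = bool(np.all(diffs != 0))
        else:
            self._is_unique = len(np.unique(self._values)) == len(self._values)

    @property
    def is_unique(self):
        if self._is_unique is None:
            self._check_sorted_and_unique()
        return self._is_unique

    def __contains__(self, key):
        if self._is_monotonic is None:
            self._check_sorted_and_unique()
        if self._is_monotonic:
            # Binary search instead of a hashtable for sorted values.
            pos = np.searchsorted(self._values, key)
            return pos < len(self._values) and self._values[pos] == key
        if self._mapping is None:
            # Fall back to a dict-based mapping, built at most once and kept.
            self._mapping = {v: i for i, v in enumerate(self._values)}
        return key in self._mapping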

@ssanderson
Contributor

ssanderson commented Dec 21, 2016

@jreback I took a look at your proposed fix. I think it looks correct, but there might be a simpler fix that just makes `initialized` synonymous with `self.mapping is not None` (or drops `initialized` entirely in favor of the `is not None` check), and changes the gating behavior on the `is_unique` property. It's certainly possible that I missed something though: keeping all the different flags straight in my head is tricky.
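
A minimal sketch of the simplification being suggested here (hypothetical names, not the actual pandas engine code): derive `initialized` from whether the mapping exists, so the flag can never drift out of sync and the hashtable is only built on first real use:

class Engine(object):
    """Toy stand-in for an index engine; attribute names are illustrative."""

    def __init__(self, values):
        self.values = list(values)
        self.mapping = None  # hashtable is built lazily, at most once

    @property
    def initialized(self):
        # No separate boolean flag to keep in sync with the mapping.
        return self.mapping is not None

    def _ensure_mapping(self):
        if self.mapping is None:
            self.mapping = {v: i for i, v in enumerate(self.values)}

    @property
    def is_unique(self):
        # Only build the mapping when the question actually requires it.
        self._ensure_mapping()
        return len(self.mapping) == len(self.values)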

@dragoljub
Author

@jreback Thanks for the explanation and quick response time on this. 😄 This will be a great last-minute perf fix for 0.19.2!

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Dec 24, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue

(cherry picked from commit 07c83ee)
ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue