DataFrame __getitem__ ~100X slower on Pandas 0.19.1 vs 0.18.1 possibly getitem caching? #14930


Closed
dragoljub opened this issue Dec 20, 2016 · 7 comments
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance), Regression (Functionality that used to work in a prior pandas version)
Milestone: 0.19.2

@dragoljub

# Run on Pandas 0.19.1

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000,5000), columns=['C'+str(c) for c in range(5000)])

%%prun
for col in df:
    df[col]

         65011 function calls (65010 primitive calls) in 2.012 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    1.935    0.000    1.955    0.000 base.py:1362(__contains__)
     5000    0.025    0.000    2.002    0.000 frame.py:2033(__getitem__)
        1    0.011    0.011    2.012    2.012 <string>:2(<module>)
     5000    0.010    0.000    0.016    0.000 frame.py:2059(_getitem_column)
     5001    0.007    0.000    0.007    0.000 {method 'view' of 'numpy.ndarray' objects}
     5001    0.004    0.000    0.012    0.000 base.py:492(values)
     5000    0.004    0.000    0.007    0.000 generic.py:1381(_get_item_cache)
     5000    0.004    0.000    0.018    0.000 base.py:1275(<lambda>)
     5000    0.003    0.000    0.014    0.000 base.py:874(_values)
     5000    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
     5000    0.002    0.000    0.003    0.000 common.py:438(_apply_if_callable)
     5000    0.002    0.000    0.002    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
        1    0.000    0.000    2.012    2.012 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:833(__iter__)
        1    0.000    0.000    0.000    0.000 generic.py:401(_info_axis)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 base.py:1315(__iter__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 4.39 s

Problem description

It appears that _get_item_cache() or Index.__contains__ may have something to do with the slowdown. This also affects other functionality such as df.info(), which is now ~100X slower as well.
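
As a quick way to confirm where the time goes, here is a minimal, self-contained timing sketch (not from the original report; the column name and call count are arbitrary) that exercises only the column-membership check highlighted in the profile above:

import timeit

import numpy as np
import pandas as pd

# Same shape as the report: a wide frame with many columns.
df = pd.DataFrame(np.random.randn(1000, 5000),
                  columns=['C' + str(c) for c in range(5000)])

# The profile above attributes nearly all of the time to Index.__contains__,
# so time 5000 membership checks against the column Index directly.
elapsed = timeit.timeit(lambda: 'C2500' in df.columns, number=5000)
print('5000 column __contains__ calls took %.3f s' % elapsed)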

Expected Output

# Run on Pandas 0.18.1

In [6]: %%prun
   ...: for col in df:
   ...:     df[col]
   ...:
         45011 function calls (45010 primitive calls) in 0.032 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    0.012    0.000    0.028    0.000 frame.py:1973(__getitem__)
     5000    0.005    0.000    0.008    0.000 frame.py:1999(_getitem_column)
     5000    0.004    0.000    0.005    0.000 base.py:1233(__contains__)
        1    0.004    0.004    0.032    0.032 <string>:2(<module>)
     5000    0.002    0.000    0.003    0.000 generic.py:1345(_get_item_cache)
     5000    0.001    0.000    0.002    0.000 common.py:1846(_apply_if_callable)
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     5000    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
        1    0.000    0.000    0.032    0.032 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:808(__iter__)
        1    0.000    0.000    0.000    0.000 base.py:440(values)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 generic.py:381(_info_axis)
        1    0.000    0.000    0.000    0.000 base.py:1186(__iter__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 40 ms

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback jreback added this to the 0.19.2 milestone Dec 20, 2016
@jreback
Contributor

jreback commented Dec 20, 2016

I have a fix for this.

@jreback
Contributor

jreback commented Dec 20, 2016

I changed this in 3c96442 to make it lazy, but it keeps reinitializing, which is not great.

@jreback jreback added the Indexing, Performance, and Regression labels Dec 20, 2016
@dragoljub
Author

@jreback Thanks for looking into this. 👍

Can you provide some info about over_size_threshold? It appears this issue does not affect all frames, just those with many columns. What is over_size_threshold currently set to? Is there a way to prevent creation of hash tables on the column index?

Was this regression due to something that improved performance of row-based indexing but possibly hurt columnar access?

@jreback
Contributor

jreback commented Dec 20, 2016

See the referenced issues; they explain everything.

@jreback
Contributor

jreback commented Dec 20, 2016

@dragoljub the short answer is this.

When the over_size_threshold was triggered (index length > 1M), a different path was followed where the hashtable was constructed for any type of indexing operation (once it is constructed it stays until collected). The issue is that for these large indexes we actually use searchsorted rather than the hashtable to look things up, so we were needlessly constructing it and using memory for it.

The solution was to do our uniqueness checking at the same time as we are already checking for sortedness (e.g. is_monotonic_increasing).

However, it seems that this was constructing the index mapping each time (bad).

cc @llllllllll
cc @ssanderson
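
To make the above concrete, here is a toy sketch (my own illustration, not pandas internals; the class and attribute names are hypothetical) of the intended behavior: a monotonic index answers lookups via searchsorted and never builds a hashtable, and the monotonicity/uniqueness flags are computed once and cached so repeated __contains__ calls stay cheap:

import numpy as np

class LazyIndexEngine:
    """Illustrative stand-in for an index engine; not the pandas code."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self._is_monotonic = None
        self._is_unique = None
        self._mapping = None  # hashtable built only if really needed

    def _check_sorted_and_unique(self):
        # One pass decides both properties; cache the results.
        diffs = np.diff(self._values)
        self._is_monotonic = bool(np.all(diffs >= 0))
        if self._is_monotonic:
            # Sorted and no repeated neighbors means unique.
            self._is_unique = bool(np.all(diffs != 0))
        else:
            self._is_unique = len(np.unique(self._values)) == len(self._values)

    @property
    def is_unique(self):
        if self._is_unique is None:
            self._check_sorted_and_unique()
        return self._is_unique

    def __contains__(self, key):
        if self._is_monotonic is None:
            self._check_sorted_and_unique()
        if self._is_monotonic:
            # Binary search instead of a hashtable for sorted values.
            pos = np.searchsorted(self._values, key)
            return pos < len(self._values) and self._values[pos] == key
        if self._mapping is None:
            # Fall back to a dict-based mapping, built at most once and kept.
            self._mapping = {v: i for i, v in enumerate(self._values)}
        return key in self._mapping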

@ssanderson
Contributor

ssanderson commented Dec 21, 2016

@jreback I took a look at your proposed fix. I think it looks correct, but there might be a simpler fix that just makes `initialized` synonymous with `self.mapping is not None` (or drops `initialized` entirely in favor of the `is not None` check), and changes the gating behavior on the `is_unique` property. It's certainly possible that I missed something though: keeping all the different flags straight in my head is tricky.
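
A minimal sketch of the simplification being suggested here (hypothetical names, not the actual pandas engine code): derive `initialized` from whether the mapping exists, so the flag can never drift out of sync and the hashtable is only built on first real use:

class Engine(object):
    """Toy stand-in for an index engine; attribute names are illustrative."""

    def __init__(self, values):
        self.values = list(values)
        self.mapping = None  # hashtable is built lazily, at most once

    @property
    def initialized(self):
        # No separate boolean flag to keep in sync with the mapping.
        return self.mapping is not None

    def _ensure_mapping(self):
        if self.mapping is None:
            self.mapping = {v: i for i, v in enumerate(self.values)}

    @property
    def is_unique(self):
        # Only build the mapping when the question actually requires it.
        self._ensure_mapping()
        return len(self.mapping) == len(self.values)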

@dragoljub
Author

@jreback Thanks for the explanation and quick response time on this. 😄 This will be a great last-minute perf fix for 0.19.2!

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Dec 24, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue

(cherry picked from commit 07c83ee)
ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue