PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well #45681

Closed
2 of 3 tasks
philipphu opened this issue Jan 28, 2022 · 4 comments · Fixed by #45931
Labels: Indexing, MultiIndex, Performance
Milestone: 1.5

Comments


philipphu commented Jan 28, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I am experiencing slow access with .loc on a DataFrame with ~380 million rows and a two level MultiIndex.
In this reproducible example, df.loc[1000] takes ~2.5 seconds when the DataFrame has 100 million rows.

import timeit

import distutils  # otherwise pd.show_versions() in pandas 1.4.0 threw "assert '_distutils' in core.__file__, core.__file__"
import numpy as np
import pandas as pd

n_entries = 10 ** 7
n_rows_per_entry = 10

print("pd.__version__", pd.__version__, "\n")
print(pd.show_versions(), "\n")
print("n_entries", f"{n_entries:.2e}")
print("n_rows_per_entry", n_rows_per_entry)
print("len(df)", f"{n_entries * n_rows_per_entry:.2e}", "\n")

base_setup = f"""
import numpy as np
import pandas as pd

n_entries = {n_entries}
n_rows_per_entry = {n_rows_per_entry}

idx = pd.MultiIndex.from_product(
    [np.arange(0, n_entries), np.arange(n_rows_per_entry)])
df = pd.DataFrame(
    data=np.random.rand(n_entries * n_rows_per_entry), index=idx, columns=["a"]
)
"""

setup_with_one_call = base_setup + f"df.loc[{n_entries - 1}]"

repeat = 5

statements = [
    "df.loc[(1000, slice(None))]",
    "df.loc[1000]",
    "df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]",
]
for setup_name, setup in [["base_setup", base_setup], ["setup_with_one_call", setup_with_one_call]]:
    print("\n", setup_name, "\n")
    for statement in statements:
        print(f"timeit {statement}")
        timings = timeit.repeat(stmt=statement, setup=setup, repeat=repeat, number=1)
        print(f"{np.mean(timings):.2e} ± {np.std(timings):.2e} s ({repeat} rounds.)", "\n")

This will print (pd.__version__ and pd.show_versions() output omitted here):

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
2.46e+00 ± 8.80e-02 s (5 rounds.) 

timeit df.loc[1000]
2.44e+00 ± 1.68e-01 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
6.27e-04 ± 1.81e-04 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
9.52e-01 ± 3.88e-02 s (5 rounds.) 

timeit df.loc[1000]
9.01e-01 ± 1.36e-01 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
2.42e-04 ± 2.21e-05 s (5 rounds.) 

I also ran the same code on two older versions of pandas (0.25.3 and 1.0.0). While 1.0.0 seems to be a little better than 1.4.0, the 0.25.3 version shows good performance.

I noticed that while performance with the "base_setup" is comparable across all pandas versions, issuing a df.loc[x] "warm-up call" at an arbitrary position x makes pandas 0.25.3 perform subsequent df.loc calls as fast as df.iloc (~4 orders of magnitude faster than the initial df.loc call), whereas the two newer pandas versions still have painfully slow df.loc calls.
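
To make the pattern explicit, the warm-up used by setup_with_one_call above boils down to the following (a minimal sketch, under the assumption that the first lookup pays a one-off cost for cached MultiIndex structures that later lookups can reuse):

# Assumed warm-up pattern: one throwaway .loc lookup before the measured calls.
# The hypothesis is that the first lookup builds cached index structures that
# subsequent lookups reuse; per the timings above this restores fast lookups
# on 0.25.3 but not on 1.0.0 / 1.4.0.
_ = df.loc[n_entries - 1]   # warm-up call at an arbitrary position
block = df.loc[1000]        # subsequent lookup; fast on 0.25.3, still slow on 1.x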

Edit: the environments I used for testing were fresh conda environments, created with:

conda create --name pd_test_025 -c conda-forge pandas=0.25 --yes
conda create --name pd_test_100 -c conda-forge pandas=1.0.0 --yes
conda create --name pd_test_14 -c conda-forge pandas=1.4 --yes

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.49.1.el7.x86_64
Version : #1 SMP Tue Nov 9 16:09:48 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.utf-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 60.5.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None

Prior Performance

pandas 0.25.3

pd.__version__ 0.25.3 


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.49.1.el7.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.utf-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.22.1
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
None 

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
1.07e+00 ± 2.44e-02 s (5 rounds.) 

timeit df.loc[1000]
1.07e+00 ± 1.54e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.94e-04 ± 4.48e-05 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
4.12e-04 ± 1.71e-05 s (5 rounds.) 

timeit df.loc[1000]
3.97e-04 ± 1.07e-05 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.51e-04 ± 2.33e-05 s (5 rounds.) 

pandas 1.0.0

pd.__version__ 1.0.0 


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.49.1.el7.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.utf-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.0
numpy            : 1.22.1
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
None 

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
1.88e+00 ± 6.38e-02 s (5 rounds.) 

timeit df.loc[1000]
1.87e+00 ± 4.57e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.97e-04 ± 9.39e-05 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
8.19e-01 ± 7.36e-03 s (5 rounds.) 

timeit df.loc[1000]
8.36e-01 ± 1.36e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.25e-04 ± 3.32e-05 s (5 rounds.) 
philipphu added the Needs Triage and Performance labels on Jan 28, 2022
philipphu changed the title from "PERF:" to "PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well" on Jan 28, 2022

phofl commented Jan 28, 2022

Could you profile the code to check where the time is spent? The code base has changed a lot since 0.25.3, so I don't think we can figure anything out from the diff alone.

jbrockmendel commented

Looks like it is effectively all in ndarray.searchsorted within MultiIndex.get_level_indexer. Would have to look at the old code to see how that was avoided.
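
For illustration, a conceptual sketch of what such a lookup amounts to (this is not the actual get_level_indexer implementation): locating the contiguous block of rows for one first-level key via searchsorted over that level's integer codes.

import numpy as np
import pandas as pd

# Conceptual sketch only, not pandas internals: find the contiguous block of
# rows whose first-level code equals the requested key via binary search.
idx = pd.MultiIndex.from_product([np.arange(1_000_000), np.arange(10)])
codes = np.asarray(idx.codes[0])                   # integer codes of level 0
start = np.searchsorted(codes, 1000, side="left")
stop = np.searchsorted(codes, 1000, side="right")
print(start, stop)                                 # 10000 10010 for this layout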


philipphu commented Jan 28, 2022

Could you profile the code to check where the time is spent? The code base has changed a lot since 0.25.3, so I don't think we can figure anything out from the diff alone.

I don't know much about profiling, so please excuse me if this is not exactly what you are looking for. I generated a call graph in PyCharm using the following code:

import distutils  # otherwise pd.show_versions() in pandas 1.4.0 threw "assert '_distutils' in core.__file__, core.__file__"
import time

import numpy as np
import pandas as pd

n_entries = 10 ** 7
n_rows_per_entry = 50

print("pd.__version__", pd.__version__, "\n")
print(pd.show_versions(), "\n")
print("n_entries", f"{n_entries:.2e}")
print("n_rows_per_entry", n_rows_per_entry)
print("len(df)", f"{n_entries * n_rows_per_entry:.2e}", "\n")

def gen_df(n_entries, n_rows_per_entry):
    idx = pd.MultiIndex.from_product(
        [np.arange(0, n_entries), np.arange(n_rows_per_entry)])
    df = pd.DataFrame(
        data=np.random.rand(n_entries * n_rows_per_entry), index=idx, columns=["a"]
    )
    return df


def access(df):
    s = time.time()
    a = df.loc[1000]
    e = time.time()
    print(e - s)


df = gen_df(n_entries=n_entries, n_rows_per_entry=n_rows_per_entry)
access(df=df)

The output was

pd.__version__ 1.3.5 


INSTALLED VERSIONS
------------------
commit           : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python           : 3.9.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.2.0
Version          : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.3.5
numpy            : 1.20.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.6.0
Cython           : None
pytest           : 6.2.5
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.5.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 6.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.54.1
None 

n_entries 1.00e+07
n_rows_per_entry 50
len(df) 5.00e+08 

7.486079931259155

Attached is a PNG of the call graph and the generated pstat file (I had to add the .txt extension to the pstat file for GitHub to let me upload it).

pandas_test_profile_call_graph

pandas_test_profile.pstat.txt

So from what I can see, within access, the calls to pandas._libs.algos.ensure_int64, pandas._libs.algos.is_lexsorted, and the 'searchsorted' method of 'numpy.ndarray' objects are driving the time consumption.
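
For reference, a profile like the attached one can also be captured directly with cProfile/pstats (a minimal sketch reusing the gen_df/access helpers from the snippet above; the output filename is arbitrary):

import cProfile
import pstats

# Profile a single .loc access and print the entries with the largest
# cumulative time; the calls named above appear near the top of this view.
df = gen_df(n_entries=n_entries, n_rows_per_entry=n_rows_per_entry)
cProfile.runctx("access(df=df)", globals(), locals(), "pandas_test_profile.pstat")
pstats.Stats("pandas_test_profile.pstat").sort_stats("cumulative").print_stats(15)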


jreback commented Jan 28, 2022

this might be very similar to #38650

mroeschke added the Indexing and MultiIndex labels and removed the Needs Triage label on Feb 10, 2022
jreback added this to the 1.5 milestone on Feb 11, 2022