PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well #45681

Closed
2 of 3 tasks
philipphu opened this issue Jan 28, 2022 · 4 comments · Fixed by #45931
Labels: Indexing, MultiIndex, Performance
Milestone: 1.5

Comments


philipphu commented Jan 28, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I am experiencing slow access with .loc on a DataFrame with ~380 million rows and a two level MultiIndex.
In this reproducible example, df.loc[1000] takes ~2.5 seconds when the DataFrame has 100 million rows.

import timeit

import distutils  # otherwise pd.show_versions() in pandas 1.4.0 threw "assert '_distutils' in core.__file__, core.__file__"
import numpy as np
import pandas as pd

n_entries = 10 ** 7
n_rows_per_entry = 10

print("pd.__version__", pd.__version__, "\n")
print(pd.show_versions(), "\n")
print("n_entries", f"{n_entries:.2e}")
print("n_rows_per_entry", n_rows_per_entry)
print("len(df)", f"{n_entries * n_rows_per_entry:.2e}", "\n")

base_setup = f"""
import numpy as np
import pandas as pd

n_entries = {n_entries}
n_rows_per_entry = {n_rows_per_entry}

idx = pd.MultiIndex.from_product(
    [np.arange(0, n_entries), np.arange(n_rows_per_entry)])
df = pd.DataFrame(
    data=np.random.rand(n_entries * n_rows_per_entry), index=idx, columns=["a"]
)
"""

setup_with_one_call = base_setup + f"df.loc[{n_entries - 1}]"

repeat = 5

statements = [
    "df.loc[(1000, slice(None))]",
    "df.loc[1000]",
    "df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]",
]
for setup_name, setup in [["base_setup", base_setup], ["setup_with_one_call", setup_with_one_call]]:
    print("\n", setup_name, "\n")
    for statement in statements:
        print(f"timeit {statement}")
        timings = timeit.repeat(stmt=statement, setup=setup, repeat=repeat, number=1)
        print(f"{np.mean(timings):.2e} ± {np.std(timings):.2e} s ({repeat} rounds.)", "\n")

This will print (pd.__version__ and pd.show_versions() output omitted here):

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
2.46e+00 ± 8.80e-02 s (5 rounds.) 

timeit df.loc[1000]
2.44e+00 ± 1.68e-01 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
6.27e-04 ± 1.81e-04 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
9.52e-01 ± 3.88e-02 s (5 rounds.) 

timeit df.loc[1000]
9.01e-01 ± 1.36e-01 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
2.42e-04 ± 2.21e-05 s (5 rounds.) 

I also ran the same code on two older versions of pandas (0.25.3 and 1.0.0). While 1.0.0 seems to be a little better than 1.4.0, the 0.25.3 version shows good performance.

I noticed that while performance with the "base_setup" is comparable across all pandas versions, issuing a df.loc[x] "warm-up call" at an arbitrary position x makes pandas 0.25.3 perform subsequent df.loc calls as fast as df.iloc (~4 orders of magnitude faster than the initial df.loc call), whereas the two newer pandas versions still have painfully slow df.loc calls.
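
To make the pattern explicit, the warm-up used by setup_with_one_call above boils down to the following (a minimal sketch, under the assumption that the first lookup pays a one-off cost for cached MultiIndex structures that later lookups can reuse):

# Assumed warm-up pattern: one throwaway .loc lookup before the measured calls.
# The hypothesis is that the first lookup builds cached index structures that
# subsequent lookups reuse; per the timings above this restores fast lookups
# on 0.25.3 but not on 1.0.0 / 1.4.0.
_ = df.loc[n_entries - 1]   # warm-up call at an arbitrary position
block = df.loc[1000]        # subsequent lookup; fast on 0.25.3, still slow on 1.x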

Edit: the environments I used for testing were fresh conda environments, created with:

conda create --name pd_test_025 -c conda-forge pandas=0.25 --yes
conda create --name pd_test_100 -c conda-forge pandas=1.0.0 --yes
conda create --name pd_test_14 -c conda-forge pandas=1.4 --yes

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.49.1.el7.x86_64
Version : #1 SMP Tue Nov 9 16:09:48 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.utf-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 60.5.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None

Prior Performance

pandas 0.25.3

pd.__version__ 0.25.3 


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.49.1.el7.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.utf-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.22.1
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
None 

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
1.07e+00 ± 2.44e-02 s (5 rounds.) 

timeit df.loc[1000]
1.07e+00 ± 1.54e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.94e-04 ± 4.48e-05 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
4.12e-04 ± 1.71e-05 s (5 rounds.) 

timeit df.loc[1000]
3.97e-04 ± 1.07e-05 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.51e-04 ± 2.33e-05 s (5 rounds.) 

pandas 1.0.0

pd.__version__ 1.0.0 


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.49.1.el7.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.utf-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.0
numpy            : 1.22.1
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
None 

n_entries 1.00e+07
n_rows_per_entry 10
len(df) 1.00e+08 


 base_setup 

timeit df.loc[(1000, slice(None))]
1.88e+00 ± 6.38e-02 s (5 rounds.) 

timeit df.loc[1000]
1.87e+00 ± 4.57e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.97e-04 ± 9.39e-05 s (5 rounds.) 


 setup_with_one_call 

timeit df.loc[(1000, slice(None))]
8.19e-01 ± 7.36e-03 s (5 rounds.) 

timeit df.loc[1000]
8.36e-01 ± 1.36e-02 s (5 rounds.) 

timeit df.iloc[[i for i in range(1000, 1000 + n_rows_per_entry)]]
3.25e-04 ± 3.32e-05 s (5 rounds.) 
philipphu added the Needs Triage and Performance labels on Jan 28, 2022
philipphu changed the title from "PERF:" to "PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well" on Jan 28, 2022

phofl commented Jan 28, 2022

Could you profile the code to check where the time is spent? The code base has changed a lot since 0.25.3, so I don't think we can figure anything out from the diff alone.

jbrockmendel commented

Looks like it is effectively all in ndarray.searchsorted within MultiIndex.get_level_indexer. Would have to look at the old code to see how that was avoided.
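
For illustration, a conceptual sketch of what such a lookup amounts to (this is not the actual get_level_indexer implementation): locating the contiguous block of rows for one first-level key via searchsorted over that level's integer codes.

import numpy as np
import pandas as pd

# Conceptual sketch only, not pandas internals: find the contiguous block of
# rows whose first-level code equals the requested key via binary search.
idx = pd.MultiIndex.from_product([np.arange(1_000_000), np.arange(10)])
codes = np.asarray(idx.codes[0])                   # integer codes of level 0
start = np.searchsorted(codes, 1000, side="left")
stop = np.searchsorted(codes, 1000, side="right")
print(start, stop)                                 # 10000 10010 for this layout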


philipphu commented Jan 28, 2022

Could you profile the code to check where the time is spent? The code base has changed a lot since 0.25.3, so I don't think we can figure anything out from the diff alone.

I don't know much about profiling, so please excuse me if this is not exactly what you are looking for. I generated a call graph in PyCharm using the following code:

import distutils  # otherwise pd.show_versions() in pandas 1.4.0 threw "assert '_distutils' in core.__file__, core.__file__"
import time

import numpy as np
import pandas as pd

n_entries = 10 ** 7
n_rows_per_entry = 50

print("pd.__version__", pd.__version__, "\n")
print(pd.show_versions(), "\n")
print("n_entries", f"{n_entries:.2e}")
print("n_rows_per_entry", n_rows_per_entry)
print("len(df)", f"{n_entries * n_rows_per_entry:.2e}", "\n")

def gen_df(n_entries, n_rows_per_entry):
    idx = pd.MultiIndex.from_product(
        [np.arange(0, n_entries), np.arange(n_rows_per_entry)])
    df = pd.DataFrame(
        data=np.random.rand(n_entries * n_rows_per_entry), index=idx, columns=["a"]
    )
    return df


def access(df):
    s = time.time()
    a = df.loc[1000]
    e = time.time()
    print(e - s)


df = gen_df(n_entries=n_entries, n_rows_per_entry=n_rows_per_entry)
access(df=df)

The output was

pd.__version__ 1.3.5 


INSTALLED VERSIONS
------------------
commit           : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python           : 3.9.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.2.0
Version          : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.3.5
numpy            : 1.20.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.6.0
Cython           : None
pytest           : 6.2.5
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.5.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 6.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.54.1
None 

n_entries 1.00e+07
n_rows_per_entry 50
len(df) 5.00e+08 

7.486079931259155

Attached is a PNG of the call graph and the generated pstat file (I had to add the .txt extension to the pstat file for GitHub to let me upload it).

pandas_test_profile_call_graph

pandas_test_profile.pstat.txt

So from what I can see, within access, the calls to pandas._libs.algos.ensure_int64, pandas._libs.algos.is_lexsorted, and the 'searchsorted' method of 'numpy.ndarray' objects are driving the time consumption.
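
For reference, a profile like the attached one can also be captured directly with cProfile/pstats (a minimal sketch reusing the gen_df/access helpers from the snippet above; the output filename is arbitrary):

import cProfile
import pstats

# Profile a single .loc access and print the entries with the largest
# cumulative time; the calls named above appear near the top of this view.
df = gen_df(n_entries=n_entries, n_rows_per_entry=n_rows_per_entry)
cProfile.runctx("access(df=df)", globals(), locals(), "pandas_test_profile.pstat")
pstats.Stats("pandas_test_profile.pstat").sort_stats("cumulative").print_stats(15)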


jreback commented Jan 28, 2022

this might be very similar to #38650

mroeschke added the Indexing and MultiIndex labels and removed the Needs Triage label on Feb 10, 2022
jreback added this to the 1.5 milestone on Feb 11, 2022