Non-unique integer index performance depending on index creation method #26064

Closed
quantuminternet opened this issue Apr 12, 2019 · 3 comments · Fixed by #47551
Labels
Benchmark Performance (ASV) benchmarks good first issue

Comments

@quantuminternet

Code Sample, a copy-pastable example if possible

import pandas
pandas.__version__
# '0.24.2'
df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
df['B'] = df['A'] + 10**15
df = df.set_index('B').sort_index()
# First access to index might be slow, so get that out of the way
df.loc[1000000000000000]

%timeit df.loc[1000000000000000]
# 71.2 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

index_name = df.index.name
df.index = df.index.values.tolist()
df.index.name = index_name

%timeit df.loc[1000000000000000]
# 422 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

The issue seems to occur when a non-unique but sorted column, itself the result of a previous calculation, is used as an index. Row access with .loc is much slower than it could be. When all index values are copied into a Python list and then reinserted into the index, performance improves by a large factor, although the DataFrames before and after look exactly the same.

In the first case, lookup time depends on the size of the DataFrame as well as the size of the offset. In the second case (after copying the values to a list and back), the operation appears to take constant time. I assume the two cases use different indexing methods.

I have found no way to determine in advance whether an index will be affected by this or how it will perform. For code that has to do many random-access lookups in this DataFrame, a performance difference of more than a factor of 100 determines whether it can run in reasonable time or not.

The workaround is easy enough, but preemptively applying it everywhere seems like bad practice, because you should not have to track the exact history of operations performed on the index.

There are different ways to 'rebuild' the index for this DataFrame, but as far as I can tell, all of them involve moving the index values into a non-pandas/NumPy structure and back again. .reset_index().set_index('B') and .copy(deep=True) have no effect.
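The list round-trip described above can be packaged into a small helper. The function name `rebuild_index` is hypothetical (not a pandas API); this is a minimal sketch of the workaround, assuming the goal is simply to force pandas to construct a fresh index object from plain Python values:

```python
import pandas as pd

def rebuild_index(df):
    """Return a copy of df whose index has been round-tripped through a
    Python list, forcing pandas to build a fresh index object.
    (Hypothetical helper illustrating the workaround from this issue.)"""
    out = df.copy()
    out.index = pd.Index(df.index.values.tolist(), name=df.index.name)
    return out

# Usage: the rebuilt frame is equal to the original, name included.
df = pd.DataFrame({"A": [0, 0, 1, 1]},
                  index=pd.Index([10, 10, 11, 11], name="B"))
df2 = rebuild_index(df)
```

The round trip through `tolist()` is what distinguishes this from `.copy(deep=True)`, which (per the report above) reuses the existing index machinery.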

Expected Output

Ideally, pandas would choose the correct indexing method on set_index() or sort_index() and automatically arrange the data for optimized access.

If this is not possible for technical reasons or would create slowdowns for other use cases, I would like a method to check which indexing method will be used and a method to optimize the data for indexing.
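There is no public API for this, but the public `is_unique` and `is_monotonic_increasing` attributes expose the properties that influence lookup strategy, and the internal `_engine` attribute (unofficial, subject to change between versions) reveals which engine class actually backs the index. A minimal inspection sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": range(6)}, index=[1, 1, 2, 2, 3, 3])
idx = df.index

# Public attributes relevant to how lookups are performed:
print(idx.is_unique)                # False
print(idx.is_monotonic_increasing)  # True

# Internal engine class backing the index
# (private attribute, for inspection only):
print(type(idx._engine).__name__)
```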

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.24.2
pytest: 3.4.2
pip: 19.0.3
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.16.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.1
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.8.1
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jul 21, 2019
@mroeschke
Member

Looks like the performance is comparable now. Could use an asv benchmark

In [14]: df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
    ...: df['B'] = df['A'] + 10**15
    ...: df = df.set_index('B').sort_index()
    ...: # First access to index might be slow, so get that out of the way
    ...: df.loc[1000000000000000]

Out[14]:
                  A
B
1000000000000000  0
1000000000000000  0

In [15]:

In [15]: %timeit df.loc[1000000000000000]
36.6 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [16]: index_name = df.index.name
    ...: df.index = df.index.values.tolist()
    ...: df.index.name = index_name

In [17]: %timeit df.loc[1000000000000000]
71.5 µs ± 38.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks good first issue and removed Performance Memory or execution speed performance labels Jul 2, 2021
@quantuminternet
Author

Using the same machine and pandas 1.1.4 I now get

84.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The performance is similar now, and also five times faster than before.

I can reproduce the issue with 0.25.1, so it must have been fixed somewhere between 0.25.1 and 1.1.4.

@kevinanker
Contributor

I also found that the problem has disappeared. As suggested by @mroeschke, I added an asv benchmark for non-unique indices in DataFrames, based on the existing benchmark for indices in Series.
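The actual benchmark lives in #47551; as an illustration of the asv convention (a plain class with a `setup` method and `time_*` methods), a benchmark for this scenario might look like the following. The class and method names are illustrative, not the ones merged into pandas:

```python
import numpy as np
import pandas as pd

class NonUniqueSortedLoc:
    """Time scalar .loc lookups on a sorted, non-unique integer index
    (illustrative asv-style benchmark; names are hypothetical)."""

    def setup(self):
        # Each value appears twice, mirroring the reproducer in this issue.
        values = np.arange(10**5) // 2 + 10**15
        self.df = pd.DataFrame({"A": np.arange(10**5)}, index=values)
        self.key = values[0]
        # Warm the index engine so first-lookup cost is excluded from timing.
        self.df.loc[self.key]

    def time_loc_scalar(self):
        self.df.loc[self.key]
```

asv calls `setup` once per benchmark environment and then times `time_loc_scalar` repeatedly, so the one-off cost of building the index engine is kept out of the measurement.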
