Non-unique integer index performance depending on index creation method #26064

Closed
quantuminternet opened this issue Apr 12, 2019 · 3 comments · Fixed by #47551
Labels
Benchmark Performance (ASV) benchmarks good first issue

Comments

@quantuminternet

Code Sample, a copy-pastable example if possible

import pandas
pandas.__version__
# '0.24.2'
df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
df['B'] = df['A'] + 10**15
df = df.set_index('B').sort_index()
# First access to index might be slow, so get that out of the way
df.loc[1000000000000000]

%timeit df.loc[1000000000000000]
# 71.2 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

index_name = df.index.name
df.index = df.index.values.tolist()
df.index.name = index_name

%timeit df.loc[1000000000000000]
# 422 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

The issue seems to occur when a non-unique but sorted column, itself the result of a previous calculation, is used as an index. Row access with .loc is much slower than it could be. When all index values are copied into a Python list and then reinserted into the index, performance improves by a large factor, although the DataFrames before and after look exactly the same.

In the first case, lookup time depends on the size of the DataFrame as well as the size of the offset. In the second case (after copying the values to a list and back), the operation appears to take constant time. I assume the two cases use different indexing methods.

I have found no way to determine in advance whether an index will be affected by this or how it will perform. For code that has to do many random-access lookups in this DataFrame, a performance difference of more than a factor of 100 determines whether it can run in reasonable time or not.

The workaround is easy enough, but preemptively applying it everywhere seems like bad practice, because you should not have to track the exact history of operations performed on the index.

There are different ways to 'rebuild' the index for this DataFrame, but as far as I can tell, all of them involve moving the index values into a non-pandas/NumPy structure and back again. .reset_index().set_index('B') and .copy(deep=True) have no effect.
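The list round-trip described above can be packaged into a small helper. The function name `rebuild_index` is hypothetical (not a pandas API); this is a minimal sketch of the workaround, assuming the goal is simply to force pandas to construct a fresh index object from plain Python values:

```python
import pandas as pd

def rebuild_index(df):
    """Return a copy of df whose index has been round-tripped through a
    Python list, forcing pandas to build a fresh index object.
    (Hypothetical helper illustrating the workaround from this issue.)"""
    out = df.copy()
    out.index = pd.Index(df.index.values.tolist(), name=df.index.name)
    return out

# Usage: the rebuilt frame is equal to the original, name included.
df = pd.DataFrame({"A": [0, 0, 1, 1]},
                  index=pd.Index([10, 10, 11, 11], name="B"))
df2 = rebuild_index(df)
```

The round trip through `tolist()` is what distinguishes this from `.copy(deep=True)`, which (per the report above) reuses the existing index machinery.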

Expected Output

Ideally, pandas would choose the correct indexing method on set_index() or sort_index() and automatically arrange the data for optimized access.

If this is not possible for technical reasons or would create slowdowns for other use cases, I would like a method to check which indexing method will be used and a method to optimize the data for indexing.
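There is no public API for this, but the public `is_unique` and `is_monotonic_increasing` attributes expose the properties that influence lookup strategy, and the internal `_engine` attribute (unofficial, subject to change between versions) reveals which engine class actually backs the index. A minimal inspection sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": range(6)}, index=[1, 1, 2, 2, 3, 3])
idx = df.index

# Public attributes relevant to how lookups are performed:
print(idx.is_unique)                # False
print(idx.is_monotonic_increasing)  # True

# Internal engine class backing the index
# (private attribute, for inspection only):
print(type(idx._engine).__name__)
```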

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.24.2
pytest: 3.4.2
pip: 19.0.3
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.16.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.1
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.8.1
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jul 21, 2019
@mroeschke
Member

Looks like the performance is comparable now. Could use an asv benchmark

In [14]: df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
    ...: df['B'] = df['A'] + 10**15
    ...: df = df.set_index('B').sort_index()
    ...: # First access to index might be slow, so get that out of the way
    ...: df.loc[1000000000000000]

Out[14]:
                  A
B
1000000000000000  0
1000000000000000  0

In [15]:

In [15]: %timeit df.loc[1000000000000000]
36.6 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [16]: index_name = df.index.name
    ...: df.index = df.index.values.tolist()
    ...: df.index.name = index_name

In [17]: %timeit df.loc[1000000000000000]
71.5 µs ± 38.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks good first issue and removed Performance Memory or execution speed performance labels Jul 2, 2021
@quantuminternet
Author

Using the same machine and pandas 1.1.4 I now get

84.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The performance is similar now, and also five times faster than before.

I can reproduce the issue with 0.25.1, so it must have been fixed somewhere between 0.25.1 and 1.1.4.

@kevinanker
Contributor

I also found that the problem has disappeared. As suggested by @mroeschke, I added an asv benchmark for non-unique indices in DataFrames, based on the existing benchmark for indices in Series.
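The actual benchmark lives in #47551; as an illustration of the asv convention (a plain class with a `setup` method and `time_*` methods), a benchmark for this scenario might look like the following. The class and method names are illustrative, not the ones merged into pandas:

```python
import numpy as np
import pandas as pd

class NonUniqueSortedLoc:
    """Time scalar .loc lookups on a sorted, non-unique integer index
    (illustrative asv-style benchmark; names are hypothetical)."""

    def setup(self):
        # Each value appears twice, mirroring the reproducer in this issue.
        values = np.arange(10**5) // 2 + 10**15
        self.df = pd.DataFrame({"A": np.arange(10**5)}, index=values)
        self.key = values[0]
        # Warm the index engine so first-lookup cost is excluded from timing.
        self.df.loc[self.key]

    def time_loc_scalar(self):
        self.df.loc[self.key]
```

asv calls `setup` once per benchmark environment and then times `time_loc_scalar` repeatedly, so the one-off cost of building the index engine is kept out of the measurement.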
