import pandas
pandas.__version__  # '0.24.2'

df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
df['B'] = df['A'] + 10**15
df = df.set_index('B').sort_index()

# First access to index might be slow, so get that out of the way
df.loc[1000000000000000]

%timeit df.loc[1000000000000000]
# 71.2 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

index_name = df.index.name
df.index = df.index.values.tolist()
df.index.name = index_name

%timeit df.loc[1000000000000000]
# 422 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Problem description
The issue seems to occur when using a non-unique but sorted column as an index, where that column is the result of a previous calculation. Access to rows with .loc is much slower than it could be. When all index values are copied into a Python list and then reinserted into the index, performance improves by a large factor, although the DataFrames before and after look exactly the same.
In the first case, performance depends on the size of the DataFrame as well as the size of the offset. In the second case (after copying the values to a list and back), the operation seems to take constant time. I assume the two cases use different indexing methods.
I have found no way to determine in advance whether an index will be affected by this or how it will perform. For code that has to do many random-access lookups in this DataFrame, a performance difference of more than a factor of 100 determines whether it can run in reasonable time or not.
The workaround is easy enough, but to me it is bad practice to apply it preemptively all the time, because you should not have to track the exact history of operations that have been performed on the index.
There are different ways to 'rebuild' the index for this DataFrame, but as far as I can tell, all of them involve moving the index values into a non-pandas/numpy structure and back again. .reset_index().set_index('B') and .copy(deep=True) have no effect.
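For reference, the list round-trip workaround can be wrapped in a small helper. This is a sketch; `rebuild_index` is a hypothetical name, not a pandas API:

```python
import pandas as pd

def rebuild_index(df):
    """Round-trip the index through a plain Python list to force pandas
    to construct a fresh Index (and, with it, a fresh lookup engine)."""
    name = df.index.name
    df.index = df.index.values.tolist()
    df.index.name = name
    return df

# Usage on a small frame with a non-unique, sorted index:
df = pd.DataFrame({'A': [0, 0, 1]}, index=pd.Index([5, 5, 6], name='B'))
df = rebuild_index(df)
```

The values and index name are unchanged afterwards; only the internal Index object is rebuilt.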
Expected Output
Ideally, pandas would choose the correct indexing method in set_index() or sort_index() and automatically arrange the data for optimized access.
If this is not possible for technical reasons or would create slowdowns for other use cases, I would like a method to check which indexing method will be used, and a method to optimize the data for indexing.
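As a partial stand-in for such a check, pandas does expose public Index properties that influence which lookup path .loc takes (this inspects properties only; it does not reveal the cached engine state that causes the slowdown above):

```python
import pandas as pd

# A sorted index with duplicate labels, like the 'B' column above.
idx = pd.Index([10**15, 10**15, 10**15 + 1], name='B')

unique = idx.is_unique                   # False: duplicate labels present
monotonic = idx.is_monotonic_increasing  # True: values are sorted
```

A non-unique but monotonic index allows slice-based lookups, which is the fast path the workaround appears to trigger.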
Looks like the performance is comparable now. Could use an asv benchmark
In [14]: df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
...: df['B'] = df['A'] + 10**15
...: df = df.set_index('B').sort_index()
...: # First access to index might be slow, so get that out of the way
...: df.loc[1000000000000000]
Out[14]:
A
B
1000000000000000 0
1000000000000000 0
In [15]:
In [15]: %timeit df.loc[1000000000000000]
36.6 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [16]: index_name = df.index.name
...: df.index = df.index.values.tolist()
...: df.index.name = index_name
In [17]: %timeit df.loc[1000000000000000]
71.5 µs ± 38.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also found that the problem has disappeared. As suggested by @mroeschke, I added an asv benchmark for non-unique indices in DataFrames, based on the existing benchmark for non-unique indices in Series.
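Such a benchmark looks roughly like this (a sketch following asv's class conventions; the class and method names are illustrative, not the exact code that was merged):

```python
import numpy as np
import pandas as pd

class NonUniqueSortedIndex:
    # asv-style benchmark: setup() builds the DataFrame once per
    # measurement round; time_loc_scalar() measures a scalar .loc
    # lookup on a sorted, non-unique integer index.
    def setup(self):
        N = 10**5
        self.df = pd.DataFrame({'A': np.arange(N) // 2})
        self.df.index = pd.Index(self.df['A'] + 10**15, name='B')

    def time_loc_scalar(self):
        self.df.loc[10**15]
```

asv discovers methods prefixed with `time_` automatically and reports their mean runtime across pandas commits.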
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8
pandas: 0.24.2
pytest: 3.4.2
pip: 19.0.3
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.16.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.1
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.8.1
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
gcsfs: None