
Intermittent error fetching value from multi-indexed dataframe #39585

Closed

Stevinson opened this issue Feb 3, 2021 · 41 comments

@Stevinson

Question about pandas

I have a dataframe with a MultiIndex from which I am attempting to access a row. However, the lookup seemingly fails stochastically, on around 1 in 10 runs. I see this behaviour both locally and on prod. The dataframe can be recreated with the following:

from datetime import timedelta, date
import pandas as pd
import pytz
from pandas import Timestamp

utc = pytz.UTC

data = {
    "date": [
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).replace(minute=59, second=59, microsecond=999999),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
    ],
    "a": ["alpha", "alpha", "beta", "alpha", "beta", "alpha", "beta"],
    "b": [100, 100, 100, 100, 100, 100, 100],
    "c": [100, 100, 100, 100, 100, 100, 100],
    "d": [0, 0, 0, 0, 0, 0, 0],
    "e": [100, 100, 100, 100, 100, 100, 100],
    "f": [0, 0, 0, 0, 0, 0, 0],
    "g": [0, 0, 0, 0, 0, 0, 0],
    "h": ["A", "B", "C", "D", "E", "F", "G"],
}
df = pd.DataFrame(data)

breakdown = df.groupby(["date", "a"]).sum()
done = breakdown.loc[date(2020, 6, 3), "beta"]

I do not know if it is my incorrect usage that is causing this behaviour or a bug.

I originally encountered the issue on pandas 1.1.4 (**) with the error:

TypeError: '<' not supported between instances of 'int' and 'slice'

and on 1.2.1 (*) I see the same intermittent errors but with the error message:

KeyError: 'beta'

Version info

(*)

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.2.1
numpy            : 1.19.5
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 21.0
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

(**)

INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.1.4
numpy            : 1.20.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.3.1
setuptools       : 49.6.0.post20201009
Cython           : None
pytest           : 6.2.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.20.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.17
tables           : None
tabulate         : 0.8.7
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@Stevinson Stevinson added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Feb 3, 2021
@Stevinson Stevinson changed the title QST: Intermitent error fetching value from multi-indexed dataframe Feb 3, 2021
@Stevinson Stevinson changed the title Intermitent error fetching value from multi-indexed dataframe Intermittent error fetching value from multi-indexed dataframe Feb 3, 2021
@phofl
Member

phofl commented Feb 3, 2021

Yes, this was changed before 1.2

"beta" is interpreted as a column, which does not exist, hence the KeyError.
You have to use breakdown.loc[(date(2020, 6, 3), "beta")]
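The distinction can be shown on a minimal stand-alone frame (a sketch; the index labels are illustrative, not the OP's data):

```python
import pandas as pd

# Toy frame with a two-level row index like breakdown's ("date", "a").
mi = pd.MultiIndex.from_tuples(
    [("2020-06-03", "alpha"), ("2020-06-03", "beta")], names=["date", "a"]
)
frame = pd.DataFrame({"b": [100, 100], "c": [100, 100]}, index=mi)

# frame.loc["2020-06-03", "beta"] reads "beta" as a *column* label,
# which does not exist. Passing both index levels as one tuple selects
# the row instead:
row = frame.loc[("2020-06-03", "beta")]
print(row["b"])  # 100

# An explicit column slice makes the intent unambiguous:
row2 = frame.loc[("2020-06-03", "beta"), :]
```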

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021
@Stevinson
Author

Stevinson commented Feb 3, 2021

Thanks @phofl. Unfortunately I'm still seeing the intermittent errors with this change.

@phofl
Member

phofl commented Feb 3, 2021

b    100
c    100
d      0
e    100
f      0
g      0
Name: (2020-06-03, beta), dtype: int64

This is returned on master with the change

@Stevinson
Author

I'm confused. When I run that code snippet locally, on Heroku, and in an online Python editor, I see the error occur on roughly 1 in 10 runs when each run starts a new process. However, if I run the snippet in an infinite loop I do not see it error.

@phofl
Member

phofl commented Feb 4, 2021

I am not sure why this should ever fail when indexing with a tuple. I started it roughly 30 times and it did not fail. Could you provide the traceback of a failure?

@Stevinson
Author

Traceback (most recent call last):
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'beta'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edward/tmp/stochastic_error.py", line 30, in <module>
    done = breakdown.loc[(date(2020, 6, 3), "beta")]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 831, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 807, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1124, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1073, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3723, in xs
    return self[key]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'beta'

@phofl
Member

phofl commented Feb 4, 2021

Not really sure how this happens, can you show breakdown before the failure?

@Stevinson
Author

Yes, breakdown is successfully created before the failure:

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

@phofl
Member

phofl commented Feb 4, 2021

Thanks,

could you do me one last favour and try
breakdown.loc[(date(2020, 6, 3), "beta"), :]

@Stevinson
Author

Traceback (most recent call last):
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3732, in xs
    loc, new_index = index._get_loc_level(
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/multi.py", line 3034, in _get_loc_level
    return (self._engine.get_loc(key), None)
  File "pandas/_libs/index.pyx", line 705, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
TypeError: unsupported operand type(s) for +: 'slice' and 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edward/tmp/stochastic_error.py", line 31, in <module>
    breakdown.loc[(date(2020, 6, 3), "beta"), :]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 791, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 865, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1124, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1073, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3736, in xs
    raise TypeError(f"Expected label or tuple of labels, got {key}") from e
TypeError: Expected label or tuple of labels, got (datetime.date(2020, 6, 3), 'beta')

@phofl
Member

phofl commented Feb 4, 2021

This is weird, I am getting

b    100
c    100
d      0
e    100
f      0
g      0
Name: (2020-06-03, beta), dtype: int64

again
done = breakdown.loc[(date(2020, 6, 3), "beta"), :]

You are on pandas 1.2.1 right?

cc @jbrockmendel Thoughts on how to handle this?

@Stevinson
Author

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.2.1
numpy            : 1.19.5
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 21.0
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@jbrockmendel
Member

@phofl I gotta run, but tentatively this looks weird:

key = date(2020, 6, 3)
mi = breakdown.index

>>> mi.get_level_values(0).get_loc(key)
slice(0, 3, None)

>>> mi._get_level_indexer(key, 0)   # <--i.e. mi.get_loc(key)
slice(0, 2, None)

@phofl
Member

phofl commented Feb 4, 2021

Yep, can see the changing result now too.

This is a groupby issue. I am getting

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

as breakdown in the slice(0,2) cases while

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

is returned in the other cases

@phofl phofl added Bug Groupby and removed Indexing Related to indexing on series/frames, not to indexes themselves Usage Question labels Feb 4, 2021
@phofl
Member

phofl commented Feb 4, 2021

This is happening in

k = kh_get_pymap(self.table, <PyObject*>val)

The return value of kh_get_pymap is wrong in the slice(0, 2) case.

I would need a bit of help to debug this further.

@jbrockmendel
Member

How do we get to that kh_get_pymap call? My intuition is that we shouldn't get there with slice objects.

@phofl
Member

phofl commented Feb 5, 2021

This is happening in groupby; it is not related to indexing. See my second-to-last comment. The groupby produces different groups, which then causes the KeyError.

@jbrockmendel
Member

OK, but do you know what the call stack looks like that gets to this line?

@phofl
Member

phofl commented Feb 5, 2021

Yes, sorry, I should have thought about this earlier. I added a raise ValueError right before kh_get_pymap is called to show the stack:

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.3/scratches/scratch_4.py", line 426, in <module>
    breakdown = df.groupby(["date", "a"]).sum()
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1670, in sum
    result = self._agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1044, in _agg_general
    result = self._cython_agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1037, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1135, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 376, in apply
    result = func(self.values, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1114, in blk_func
    result = self.grouper._cython_operation(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 610, in _cython_operation
    out_shape = (self.ngroups,) + values.shape[1:]
  File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.__get__
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 327, in ngroups
    return len(self.result_index)
  File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.__get__
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 340, in result_index
    codes = self.reconstructed_codes
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 331, in reconstructed_codes
    codes = self.codes
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 257, in codes
    return [ping.codes for ping in self.groupings]
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 257, in <listcomp>
    return [ping.codes for ping in self.groupings]
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/grouper.py", line 567, in codes
    self._make_codes()
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/grouper.py", line 599, in _make_codes
    codes, uniques = algorithms.factorize(
  File "/home/developer/PycharmProjects/pandas/pandas/core/algorithms.py", line 724, in factorize
    codes, uniques = factorize_array(
  File "/home/developer/PycharmProjects/pandas/pandas/core/algorithms.py", line 528, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5336, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5263, in pandas._libs.hashtable.PyObjectHashTable._unique
ValueError

Process finished with exit code 1

@jbrockmendel
Member

Maybe use breakpoint() instead of raising ValueError to track down the call args? I still think it's really weird to get here with a slice.

@phofl
Member

phofl commented Feb 13, 2021

Not sure if I understand you correctly, but we are getting there with

breakdown = df.groupby(["date", "a"]).sum()

not with the indexing, so not related to the slice?

@jbrockmendel
Member

I'm having trouble reproducing this on master, can you still get it?

@jbrockmendel
Member

And in 1.1.4 I'm getting the failure in the line done = breakdown.loc[date(2020, 6, 3), "beta"].

@phofl
Member

phofl commented Feb 14, 2021

The error is raised there, yes, but the cause of the error is that the groupby line returns

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

most of the time (14 out of 15 or so) but sometimes we get

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

which then raises a KeyError, because the key is obviously not in there; the KeyError is actually just a consequential failure. I'll try to reproduce on master.

@jbrockmendel
Member

Of course, 2 minutes later I get it on master.

With breakdown as defined in the OP:

mi = breakdown.index
key = (date(2020, 6, 3), "beta")

indices = [0 if pd.isna(v) else lev.get_loc(v) + 1 for lev, v in zip(mi.levels, key)]
inds = mi.levels[0].get_loc(key[0])

The indices = line is part of BaseMultiIndexCodesEngine.get_loc, and the last line is just the relevant part of the indices = line. On runs that don't raise, inds == 1. On runs that raise, inds == slice(0, 2), which raises TypeError when trying to add 1 in the indices = line.

So if I'm right so far (@phofl definitely worth double-checking), then we need to figure out why that get_loc is non-deterministic.
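For context, Index.get_loc returns a slice rather than an int whenever the label occurs more than once in a monotonic index, which is why inds can suddenly stop supporting + 1 (a stand-alone sketch with illustrative string labels, not the OP's dtypes):

```python
import pandas as pd

# A unique label yields an int position; contiguous duplicates yield a slice.
idx = pd.Index(["2020-06-03", "2020-06-03", "2020-06-04"])

print(idx.get_loc("2020-06-04"))  # 2
print(idx.get_loc("2020-06-03"))  # slice(0, 2, None)

# slice + 1 is exactly the TypeError seen in BaseMultiIndexCodesEngine.get_loc.
```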

@phofl
Member

phofl commented Feb 14, 2021

GroupBy is non-deterministic, see my explanation above. This is because of the line I marked with the traceback.

@jbrockmendel
Member

Gotcha, I'm on the totally wrong track. Thanks.

@phofl
Member

phofl commented Feb 14, 2021

I wasn't explaining it very well either, I think. Sorry about that.

I am not familiar enough with

kh_get_pymap(self.table, <PyObject*>val) 

to debug this on my own. The input values are the same but the output varies sometimes.

@jbrockmendel
Member

@realead is probably the best person to ask about kh_get_pymap

@realead
Contributor

realead commented Feb 14, 2021

@phofl I think I can explain why kh_get_pymap returns different indexes for different runs.

We use the hash-function of PyObject for the hash-map. The hash functions for str and bytes are salted, see here :

By default, the hash() values of str and bytes objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
...
See also PYTHONHASHSEED.

That means that in different runs of the interpreter, the resulting hashes can be different. Different hash values will lead to different places in the hash map and thus different returned indexes.

You can verify this theory by setting the environment variable PYTHONHASHSEED to 0; after that there should be no differences. Alternatively, try out different seeds and find one for which the problem is always present.

Another observation that supports my theory: the different behaviors were never observed in the same run of the interpreter, only in different runs.
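This can be checked from the standard library alone; the sketch below (the helper name is mine) computes hash('beta') in fresh interpreters under different PYTHONHASHSEED values:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Compute hash('beta') in a brand-new Python process with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('beta'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With a fixed seed the str hash is reproducible across interpreter runs...
assert hash_in_fresh_interpreter(0) == hash_in_fresh_interpreter(0)
# ...while different seeds (standing in for the default per-process salt)
# give different hashes, hence different hash-table layouts per run.
assert hash_in_fresh_interpreter(0) != hash_in_fresh_interpreter(1)
```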

@phofl
Member

phofl commented Feb 18, 2021

@realead thx for the explanation; not sure if I get everything, will have to look this up.

One thing I would like to add:

We are using

k = kh_get_pymap(self.table, <PyObject*>val)

and if

if k == self.table.n_buckets:

then k has not been set.

The first value we pass in is
2020-06-03 15:59:59.999999+00:00 and the second is 2020-06-03. Most of the time kh_get_pymap returns 16, which is self.table.n_buckets, for both cases. But sometimes kh_get_pymap returns 5 for 2020-06-03, and this is what causes the inconsistency. Do you have an idea how this can happen?

@realead
Contributor

realead commented Feb 18, 2021

@phofl

I have added

print("Val", val, "hash", hash(val), "k", k)

directly after the call of kh_get_pymap(self.table, <PyObject*>val)

For the original example, I see the following trace in case of a successful run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash -5723614487856817989 k 16
Val 2020-06-03 hash -5723614487856817989 k 9
Val 2020-06-04 hash 4682208479840147490 k 16
Val 2020-06-04 hash 4682208479840147490 k 12
Val 2020-06-05 hash -3412894078602059675 k 16
Val 2020-06-05 hash -3412894078602059675 k 15

The behavior is as expected: the first call with a value always returns 16, meaning the key isn't in the map (after this the key gets added to the map), so a second call with the same value returns not 16 but the index where the value was put.

Here are traces for a failed run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash 5740215080242998926 k 16
Val 2020-06-03 hash 5740215080242998926 k 13
Val 2020-06-04 hash 6389150965059193424 k 16
Val 2020-06-04 hash 6389150965059193424 k 3
Val 2020-06-05 hash -1444037058501195733 k 16
Val 2020-06-05 hash -1444037058501195733 k 9

First thing: the hash values (and thus indexes k) are different (reason is PYTHONHASHSEED as explained in my first comment), but it doesn't change the behavior, which looks correct to me.

I'm not sure kh_get_pymap is the issue here...

@phofl
Member

phofl commented Feb 18, 2021

This is what I am getting in case of a failed run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash -223041617241465135 k 5
Val 2020-06-03 hash -223041617241465135 k 5
Val 2020-06-04 hash -5855114764210332090 k 16
Val 2020-06-04 hash -5855114764210332090 k 2
Val 2020-06-05 hash 7680890035772906995 k 16
Val 2020-06-05 hash 7680890035772906995 k 9

The 5 in the second row is the issue, I think?

Edit: I also added your print line directly below the pymap call.

@realead
Contributor

realead commented Feb 18, 2021

Ok, the first element (2020-06-03 15:59:59.999999+00:00) is of type pandas._libs.tslibs.timestamps.Timestamp, the second (2020-06-03) is of type datetime.date.

The issue is that while they have different hash values, they are equal:

>>> data["date"][0] == data["date"][1]
True

This is a problem because hash maps require that "a equals b => hash(a) == hash(b)", which does not hold here.

Because the hash-values are different, there is sometimes a collision (k=5) and sometimes no collision (k=16).

However, I can see the error, even if this collision doesn't happen. Thus it could be a red herring.
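The broken contract is easy to reproduce without pandas; SaltedKey below is a hypothetical stand-in for the Timestamp/date pair (equal values, different hashes):

```python
class SaltedKey:
    """Hypothetical key type: instances compare equal by value but hash
    with an extra per-instance salt, violating a == b => hash(a) == hash(b)."""

    def __init__(self, value, salt):
        self.value = value
        self.salt = salt

    def __eq__(self, other):
        return self.value == other.value   # equality ignores the salt...

    def __hash__(self):
        return hash((self.value, self.salt))  # ...but the hash does not

a = SaltedKey("2020-06-03", salt=1)
b = SaltedKey("2020-06-03", salt=2)
assert a == b              # the keys are "the same"
assert hash(a) != hash(b)  # yet they usually land in different buckets

# A hash table keyed by `a` can therefore miss a lookup of `b`, which is
# the kind of intermittent collision/non-collision seen in factorize().
```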

@phofl
Member

phofl commented Feb 18, 2021

Thx for the explanation, makes sense.
Just to be clear: you can see the KeyError when breakdown is

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

? I could not reproduce an error in this case.

@realead
Contributor

realead commented Feb 18, 2021

just to be clear: You can see the KeyError, when breakdown is

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
...

Yes, I also see sometimes

date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
...

without an error.

@jbrockmendel
Member

>>> data["date"][0] == data["date"][1]
True

Luckily this behavior in Timestamp.__richcmp__ is deprecated. Still a while before that gets enforced.

@Stevinson
Author

Is the current state-of-play waiting for the offending behaviour of Timestamp.__richcmp__ to be deprecated?

Is there a temporary workaround that could be used to avoid this error in the meantime?

@jbrockmendel
Member

Is there a temporary workaround that could be used to avoid this error in the meantime?

The only thing that comes to mind is to not use date objects.
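One way to do that (a sketch, not an official recommendation; normalize_key is a hypothetical helper) is to coerce every key to a tz-aware Timestamp before grouping, so all group keys share one type and hash consistently:

```python
from datetime import date

import pandas as pd

def normalize_key(d):
    """Hypothetical helper: coerce date/datetime keys to tz-aware Timestamps."""
    ts = pd.Timestamp(d)
    return ts.tz_localize("UTC") if ts.tz is None else ts

# Mixed key types, as in the original repro:
keys = [pd.Timestamp("2020-06-03 15:59:59.999999", tz="UTC"), date(2020, 6, 3)]
norm = [normalize_key(k) for k in keys]
assert all(isinstance(k, pd.Timestamp) and k.tz is not None for k in norm)

# Applied before grouping, e.g.:
# df["date"] = df["date"].map(normalize_key)
# breakdown = df.groupby(["date", "a"]).sum()
```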

@rhshadrach
Member

>>> data["date"][0] == data["date"][1]
True

luckily this behavior in Timestamp.__richcmp__ is deprecated. Still a while before that gets enforced.

@jbrockmendel @phofl - with the deprecation now enforced, are we good to close? I currently get Cannot compare Timestamp with datetime.date with the OP's example, and haven't been able to reproduce it if I remove the date objects.

@matiaslindgren
Contributor

matiaslindgren commented Jul 21, 2024

Non-deterministic behaviour can also be seen in #57922, but with tuple subclasses created with collections.namedtuple.
