
Intermittent error fetching value from multi-indexed dataframe #39585

Closed

Stevinson opened this issue Feb 3, 2021 · 41 comments

@Stevinson

Question about pandas

I have a dataframe with a MultiIndex from which I am attempting to access a row. However, the lookup seemingly fails stochastically, on around 1 in 10 runs. I see this behaviour both locally and on prod. The dataframe can be recreated with the following:

from datetime import timedelta, date
import pandas as pd
import pytz
from pandas import Timestamp

utc = pytz.UTC

data = {
    "date": [
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).replace(minute=59, second=59, microsecond=999999),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
        Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
    ],
    "a": ["alpha", "alpha", "beta", "alpha", "beta", "alpha", "beta"],
    "b": [100, 100, 100, 100, 100, 100, 100],
    "c": [100, 100, 100, 100, 100, 100, 100],
    "d": [0, 0, 0, 0, 0, 0, 0],
    "e": [100, 100, 100, 100, 100, 100, 100],
    "f": [0, 0, 0, 0, 0, 0, 0],
    "g": [0, 0, 0, 0, 0, 0, 0],
    "h": ["A", "B", "C", "D", "E", "F", "G"],
}
df = pd.DataFrame(data)

breakdown = df.groupby(["date", "a"]).sum()
done = breakdown.loc[date(2020, 6, 3), "beta"]

I do not know if it is my incorrect usage that is causing this behaviour or a bug.

I originally encountered the issue on pandas 1.1.4 (**) with the error:

TypeError: '<' not supported between instances of 'int' and 'slice'

and on 1.2.1 (*) I see the same intermittent errors but with the error message:

KeyError: 'beta'

Version info

(*)

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.2.1
numpy            : 1.19.5
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 21.0
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

(**)

INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.1.4
numpy            : 1.20.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.3.1
setuptools       : 49.6.0.post20201009
Cython           : None
pytest           : 6.2.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.20.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.17
tables           : None
tabulate         : 0.8.7
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@Stevinson Stevinson added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Feb 3, 2021
@Stevinson Stevinson changed the title QST: Intermitent error fetching value from multi-indexed dataframe Feb 3, 2021
@Stevinson Stevinson changed the title Intermitent error fetching value from multi-indexed dataframe Intermittent error fetching value from multi-indexed dataframe Feb 3, 2021
@phofl
Member

phofl commented Feb 3, 2021

Yes, this was changed before 1.2

"beta" is interpreted as a column, which does not exist, hence the KeyError.
You have to use breakdown.loc[(date(2020, 6, 3), "beta")]
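The distinction can be shown on a minimal stand-alone frame (a sketch; the index labels are illustrative, not the OP's data):

```python
import pandas as pd

# Toy frame with a two-level row index like breakdown's ("date", "a").
mi = pd.MultiIndex.from_tuples(
    [("2020-06-03", "alpha"), ("2020-06-03", "beta")], names=["date", "a"]
)
frame = pd.DataFrame({"b": [100, 100], "c": [100, 100]}, index=mi)

# frame.loc["2020-06-03", "beta"] reads "beta" as a *column* label,
# which does not exist. Passing both index levels as one tuple selects
# the row instead:
row = frame.loc[("2020-06-03", "beta")]
print(row["b"])  # 100

# An explicit column slice makes the intent unambiguous:
row2 = frame.loc[("2020-06-03", "beta"), :]
```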

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021
@Stevinson
Author

Stevinson commented Feb 3, 2021

Thanks @phofl. Unfortunately I'm still seeing the intermittent errors with this change.

@phofl
Member

phofl commented Feb 3, 2021

b    100
c    100
d      0
e    100
f      0
g      0
Name: (2020-06-03, beta), dtype: int64

This is returned on master with the change

@Stevinson
Author

I'm confused. When I run that code snippet locally, on Heroku, and in an online Python editor, I see the error occur on roughly 1 in 10 runs when each run starts a new process. However, if I run the snippet in an infinite loop I do not see it error.

@phofl
Member

phofl commented Feb 4, 2021

I am not sure why this should ever fail when indexing with a tuple. I started it roughly 30 times and it did not fail. Could you provide the traceback of a failure?

@Stevinson
Author

Traceback (most recent call last):
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'beta'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edward/tmp/stochastic_error.py", line 30, in <module>
    done = breakdown.loc[(date(2020, 6, 3), "beta")]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 831, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 807, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1124, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1073, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3723, in xs
    return self[key]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'beta'

@phofl
Member

phofl commented Feb 4, 2021

Not really sure how this happens, can you show breakdown before the failure?

@Stevinson
Author

Yes, breakdown is successfully created before the failure:

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

@phofl
Member

phofl commented Feb 4, 2021

Thanks,

could you do me one last favour and try
breakdown.loc[(date(2020, 6, 3), "beta"), :]

@Stevinson
Author

Traceback (most recent call last):
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3732, in xs
    loc, new_index = index._get_loc_level(
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexes/multi.py", line 3034, in _get_loc_level
    return (self._engine.get_loc(key), None)
  File "pandas/_libs/index.pyx", line 705, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
TypeError: unsupported operand type(s) for +: 'slice' and 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edward/tmp/stochastic_error.py", line 31, in <module>
    breakdown.loc[(date(2020, 6, 3), "beta"), :]
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 791, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 865, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1124, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/indexing.py", line 1073, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/edward/miniconda3/envs/playground/lib/python3.9/site-packages/pandas/core/generic.py", line 3736, in xs
    raise TypeError(f"Expected label or tuple of labels, got {key}") from e
TypeError: Expected label or tuple of labels, got (datetime.date(2020, 6, 3), 'beta')

@phofl
Member

phofl commented Feb 4, 2021

This is weird, I am getting

b    100
c    100
d      0
e    100
f      0
g      0
Name: (2020-06-03, beta), dtype: int64

again
done = breakdown.loc[(date(2020, 6, 3), "beta"), :]

You are on pandas 1.2.1 right?

cc @jbrockmendel Thoughts on how to handle this?

@Stevinson
Author

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.9.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.2.1
numpy            : 1.19.5
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 21.0
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@jbrockmendel
Member

@phofl I gotta run, but tentatively this looks weird:

key = date(2020, 6, 3)
mi = breakdown.index

>>> mi.get_level_values(0).get_loc(key)
slice(0, 3, None)

>>> mi._get_level_indexer(key, 0)   # <--i.e. mi.get_loc(key)
slice(0, 2, None)

@phofl
Member

phofl commented Feb 4, 2021

Yep, can see the changing result now too.

This is a groupby issue. I am getting

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

as breakdown in the slice(0,2) cases while

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

is returned in the other cases

@phofl phofl added Bug Groupby and removed Indexing Related to indexing on series/frames, not to indexes themselves Usage Question labels Feb 4, 2021
@phofl
Member

phofl commented Feb 4, 2021

This is happening in

k = kh_get_pymap(self.table, <PyObject*>val)

The return value of kh_get_pymap is wrong in the slice(0, 2) case.

I would need a bit of help to debug this further.

@jbrockmendel
Member

How do we get to that kh_get_pymap call? My intuition is that we shouldn't get there with slice objects.

@phofl
Member

phofl commented Feb 5, 2021

This is happening in groupby; it is not related to indexing. See my second-to-last comment. The groupby produces different groups, which then causes the KeyError.

@jbrockmendel
Member

OK, but do you know what the call stack looks like that gets to this line?

@phofl
Member

phofl commented Feb 5, 2021

Yes, sorry, I should have thought about this earlier. I added a raise ValueError right before kh_get_pymap is called to show the stack:

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.3/scratches/scratch_4.py", line 426, in <module>
    breakdown = df.groupby(["date", "a"]).sum()
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1670, in sum
    result = self._agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1044, in _agg_general
    result = self._cython_agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1037, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1135, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 376, in apply
    result = func(self.values, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1114, in blk_func
    result = self.grouper._cython_operation(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 610, in _cython_operation
    out_shape = (self.ngroups,) + values.shape[1:]
  File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.__get__
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 327, in ngroups
    return len(self.result_index)
  File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.__get__
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 340, in result_index
    codes = self.reconstructed_codes
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 331, in reconstructed_codes
    codes = self.codes
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 257, in codes
    return [ping.codes for ping in self.groupings]
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 257, in <listcomp>
    return [ping.codes for ping in self.groupings]
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/grouper.py", line 567, in codes
    self._make_codes()
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/grouper.py", line 599, in _make_codes
    codes, uniques = algorithms.factorize(
  File "/home/developer/PycharmProjects/pandas/pandas/core/algorithms.py", line 724, in factorize
    codes, uniques = factorize_array(
  File "/home/developer/PycharmProjects/pandas/pandas/core/algorithms.py", line 528, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5336, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5263, in pandas._libs.hashtable.PyObjectHashTable._unique
ValueError

Process finished with exit code 1

@jbrockmendel
Member

Maybe use breakpoint() instead of raising ValueError to track down the call args? I still think it's really weird to get here with a slice.

@phofl
Member

phofl commented Feb 13, 2021

Not sure if I understand you correctly, but we are getting there with

breakdown = df.groupby(["date", "a"]).sum()

not with the indexing, so not related to the slice?

@jbrockmendel
Member

I'm having trouble reproducing this on master, can you still get it?

@jbrockmendel
Member

And in 1.1.4 I'm getting the failure in the line done = breakdown.loc[date(2020, 6, 3), "beta"].

@phofl
Member

phofl commented Feb 14, 2021

The error is raised there, yes, but the cause of the error is that the groupby line returns

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

most of the time (14 out of 15 or so) but sometimes we get

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

which then raises a KeyError, because the key is obviously not in there; the KeyError is actually just a consequential failure. I'll try to reproduce on master.

@jbrockmendel
Member

Of course, 2 minutes later I get it on master.

With breakdown as defined in the OP:

mi = breakdown.index
key = (date(2020, 6, 3), "beta")

indices = [0 if pd.isna(v) else lev.get_loc(v) + 1 for lev, v in zip(mi.levels, key)]
inds = mi.levels[0].get_loc(key[0])

The indices = line is part of BaseMultiIndexCodesEngine.get_loc, and the last line is just the relevant part of the indices = line. On runs that don't raise, inds == 1. On runs that raise, inds == slice(0, 2), which raises TypeError when trying to add 1 in the indices = line.

So if I'm right so far (@phofl definitely worth double-checking), then we need to figure out why that get_loc is non-deterministic.
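For context, Index.get_loc returns a slice rather than an int whenever the label occurs more than once in a monotonic index, which is why inds can suddenly stop supporting + 1 (a stand-alone sketch with illustrative string labels, not the OP's dtypes):

```python
import pandas as pd

# A unique label yields an int position; contiguous duplicates yield a slice.
idx = pd.Index(["2020-06-03", "2020-06-03", "2020-06-04"])

print(idx.get_loc("2020-06-04"))  # 2
print(idx.get_loc("2020-06-03"))  # slice(0, 2, None)

# slice + 1 is exactly the TypeError seen in BaseMultiIndexCodesEngine.get_loc.
```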

@phofl
Member

phofl commented Feb 14, 2021

GroupBy is non-deterministic, see my explanation above. This is because of the line I marked with the traceback.

@jbrockmendel
Member

Gotcha, I'm on the totally wrong track. Thanks.

@phofl
Member

phofl commented Feb 14, 2021

I wasn't explaining it very well either, I think. Sorry about that.

I am not familiar enough with

kh_get_pymap(self.table, <PyObject*>val) 

to debug this on my own. The input values are the same but the output varies sometimes.

@jbrockmendel
Member

@realead is probably the best person to ask about kh_get_pymap

@realead
Contributor

realead commented Feb 14, 2021

@phofl I think I can explain why kh_get_pymap returns different indexes for different runs.

We use the hash-function of PyObject for the hash-map. The hash functions for str and bytes are salted, see here :

By default, the hash() values of str and bytes objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
...
See also PYTHONHASHSEED.

That means that in different runs of the interpreter, the resulting hashes can be different. Different hash values will lead to different places in the hash map and thus different returned indexes.

You can verify this theory by setting the environment variable PYTHONHASHSEED to 0; after that there should be no differences. Alternatively, try out different seeds and find one for which the problem is always present.

Another observation that supports my theory: the different behaviors were never observed in the same run of the interpreter, only in different runs.
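This can be checked from the standard library alone; the sketch below (the helper name is mine) computes hash('beta') in fresh interpreters under different PYTHONHASHSEED values:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Compute hash('beta') in a brand-new Python process with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('beta'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With a fixed seed the str hash is reproducible across interpreter runs...
assert hash_in_fresh_interpreter(0) == hash_in_fresh_interpreter(0)
# ...while different seeds (standing in for the default per-process salt)
# give different hashes, hence different hash-table layouts per run.
assert hash_in_fresh_interpreter(0) != hash_in_fresh_interpreter(1)
```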

@phofl
Member

phofl commented Feb 18, 2021

@realead thx for the explanation; not sure if I get everything, will have to look this up.

One thing I would like to add:

We are using

k = kh_get_pymap(self.table, <PyObject*>val)

and if

if k == self.table.n_buckets:

then k has not been set.

The first value we pass in is
2020-06-03 15:59:59.999999+00:00 and the second is 2020-06-03. Most of the time kh_get_pymap returns 16, which is self.table.n_buckets, for both cases. But sometimes kh_get_pymap returns 5 for 2020-06-03, and this is what causes the inconsistency. Do you have an idea how this can happen?

@realead
Contributor

realead commented Feb 18, 2021

@phofl

I have added

print("Val", val, "hash", hash(val), "k", k)

directly after the call of kh_get_pymap(self.table, <PyObject*>val)

For the original example, I see the following trace in case of a successful run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash -5723614487856817989 k 16
Val 2020-06-03 hash -5723614487856817989 k 9
Val 2020-06-04 hash 4682208479840147490 k 16
Val 2020-06-04 hash 4682208479840147490 k 12
Val 2020-06-05 hash -3412894078602059675 k 16
Val 2020-06-05 hash -3412894078602059675 k 15

The behavior is as expected: the first call with a value always returns 16, meaning the key isn't in the map (after this the key gets added to the map), so a second call with the same value returns not 16 but the index where the value was put.

Here are traces for a failed run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash 5740215080242998926 k 16
Val 2020-06-03 hash 5740215080242998926 k 13
Val 2020-06-04 hash 6389150965059193424 k 16
Val 2020-06-04 hash 6389150965059193424 k 3
Val 2020-06-05 hash -1444037058501195733 k 16
Val 2020-06-05 hash -1444037058501195733 k 9

First thing: the hash values (and thus indexes k) are different (reason is PYTHONHASHSEED as explained in my first comment), but it doesn't change the behavior, which looks correct to me.

I'm not sure kh_get_pymap is the issue here...

@phofl
Member

phofl commented Feb 18, 2021

This is what I am getting in case of a failed run:

Val 2020-06-03 15:59:59.999999+00:00 hash -1166986026689279599 k 16
Val 2020-06-03 hash -223041617241465135 k 5
Val 2020-06-03 hash -223041617241465135 k 5
Val 2020-06-04 hash -5855114764210332090 k 16
Val 2020-06-04 hash -5855114764210332090 k 2
Val 2020-06-05 hash 7680890035772906995 k 16
Val 2020-06-05 hash 7680890035772906995 k 9

The 5 in the second row is the issue, I think?

Edit: I also added your print line directly below the pymap call.

@realead
Contributor

realead commented Feb 18, 2021

Ok, the first element (2020-06-03 15:59:59.999999+00:00) is of type pandas._libs.tslibs.timestamps.Timestamp, the second (2020-06-03) is of type datetime.date.

The issue is that while they have different hash values, they are equal:

>>> data["date"][0] == data["date"][1]
True

This is a problem because hash maps require that "a equals b => hash(a) == hash(b)", which does not hold here.

Because the hash-values are different, there is sometimes a collision (k=5) and sometimes no collision (k=16).

However, I can see the error, even if this collision doesn't happen. Thus it could be a red herring.
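The broken contract is easy to reproduce without pandas; SaltedKey below is a hypothetical stand-in for the Timestamp/date pair (equal values, different hashes):

```python
class SaltedKey:
    """Hypothetical key type: instances compare equal by value but hash
    with an extra per-instance salt, violating a == b => hash(a) == hash(b)."""

    def __init__(self, value, salt):
        self.value = value
        self.salt = salt

    def __eq__(self, other):
        return self.value == other.value   # equality ignores the salt...

    def __hash__(self):
        return hash((self.value, self.salt))  # ...but the hash does not

a = SaltedKey("2020-06-03", salt=1)
b = SaltedKey("2020-06-03", salt=2)
assert a == b              # the keys are "the same"
assert hash(a) != hash(b)  # yet they usually land in different buckets

# A hash table keyed by `a` can therefore miss a lookup of `b`, which is
# the kind of intermittent collision/non-collision seen in factorize().
```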

@phofl
Member

phofl commented Feb 18, 2021

Thx for the explanation, makes sense.
Just to be clear: you can see the KeyError when breakdown is

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
2020-06-03                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-04                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0
2020-06-05                       alpha  100  100  0  100  0  0
                                 beta   100  100  0  100  0  0

? I could not reproduce an error in this case.

@realead
Contributor

realead commented Feb 18, 2021

just to be clear: You can see the KeyError, when breakdown is

                                          b    c  d    e  f  g
date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  100  100  0  100  0  0
...

Yes, I also see sometimes

date                             a                            
2020-06-03 15:59:59.999999+00:00 alpha  200  200  0  200  0  0
...

without an error.

@jbrockmendel
Member

>>> data["date"][0] == data["date"][1]
True

Luckily this behavior in Timestamp.__richcmp__ is deprecated. Still a while before that gets enforced.

@Stevinson
Author

Is the current state-of-play waiting for the offending behaviour of Timestamp.__richcmp__ to be deprecated?

Is there a temporary workaround that could be used to avoid this error in the meantime?

@jbrockmendel
Member

Is there a temporary workaround that could be used to avoid this error in the meantime?

The only thing that comes to mind is to not use date objects.
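One way to do that (a sketch, not an official recommendation; normalize_key is a hypothetical helper) is to coerce every key to a tz-aware Timestamp before grouping, so all group keys share one type and hash consistently:

```python
from datetime import date

import pandas as pd

def normalize_key(d):
    """Hypothetical helper: coerce date/datetime keys to tz-aware Timestamps."""
    ts = pd.Timestamp(d)
    return ts.tz_localize("UTC") if ts.tz is None else ts

# Mixed key types, as in the original repro:
keys = [pd.Timestamp("2020-06-03 15:59:59.999999", tz="UTC"), date(2020, 6, 3)]
norm = [normalize_key(k) for k in keys]
assert all(isinstance(k, pd.Timestamp) and k.tz is not None for k in norm)

# Applied before grouping, e.g.:
# df["date"] = df["date"].map(normalize_key)
# breakdown = df.groupby(["date", "a"]).sum()
```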

@rhshadrach
Member

>>> data["date"][0] == data["date"][1]
True

luckily this behavior in Timestamp.__richcmp__ is deprecated. Still a while before that gets enforced.

@jbrockmendel @phofl - with the deprecation now enforced, are we good to close? I currently get Cannot compare Timestamp with datetime.date with the OP's example, and haven't been able to reproduce it if I remove the date objects.

@matiaslindgren
Contributor

matiaslindgren commented Jul 21, 2024

Non-deterministic behaviour can also be seen in #57922, but with tuple subclasses created with collections.namedtuple.
