
Inconsistencies in dataframe.dtypes.isin() - I am getting inconsistent results with different runs of the same code. #32443


Open
maleknia opened this issue Mar 4, 2020 · 20 comments
Labels
Bug hashing hash_pandas_object isin isin method Upstream issue Issue related to pandas dependency

Comments

@maleknia

maleknia commented Mar 4, 2020

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas

data = [[2, 'tom', 10], [3, 'nick', 15], [4, 'juli', 14]]

myDf = pandas.DataFrame(data, columns=['a', 'b', 'c'])

print(myDf.dtypes)

print(myDf.dtypes.isin(['int64']))

Problem description

I am getting inconsistent results when running the same code repeatedly.

The first run gives me:

a  False
b  False
c  False

A later run (sometimes it takes up to 10 attempts) gives me:

a  True
b  False
c  True

The second result is the correct one and matches myDf.dtypes:

a     int64
b    object
c     int64

To reiterate, I am not changing the code, but when I keep running it I sometimes get the first output and sometimes the second. The input, code, and interpreter stay the same. I tried it on different machines and interpreters and still get the same inconsistent results.

Python version: 3.7
pandas version: 1.0

Expected Output

Consistent results (the current output matches the expected one only on some runs).

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@TomAugspurger
Contributor

Do you have a minimal reproducible example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports. Ideally you wouldn't need to read from CSV to reproduce this.

Can you also check the formatting on your post? https://guides.github.com/features/mastering-markdown/

@mroeschke mroeschke added the Needs Info Clarification about behavior needed to assess issue label Mar 4, 2020
@maleknia
Author

maleknia commented Mar 4, 2020

Updated the info and format.

@TomAugspurger
Contributor

Thanks. I'm not able to reproduce locally.

@maleknia
Author

maleknia commented Mar 5, 2020

May I ask how many times you have run this?

@maleknia
Author

> Thanks. I'm not able to reproduce locally.

Hello again. I am following up on this. Would you please elaborate on the result you are getting? I want to emphasize that the inconsistency only shows up after running the same script about 10 times (the results differ across those runs).

@TomAugspurger
Contributor

When you say re-run, do you mean within a single Python interpreter session, or multiple invocations of python script.py?

@MarcoGorelli
Member

I'm seeing this too. I saved a file called script.py with the code you pasted, and then ran:

(pandas-dev) m.gorelli@ws-1808:~/pandas$ python script.py 
a     int64
b    object
c     int64
dtype: object
a    False
b    False
c    False
dtype: bool
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python script.py 
a     int64
b    object
c     int64
dtype: object
a    False
b    False
c    False
dtype: bool
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python script.py 
a     int64
b    object
c     int64
dtype: object
a    False
b    False
c    False
dtype: bool
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python script.py 
a     int64
b    object
c     int64
dtype: object
a    False
b    False
c    False
dtype: bool
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python script.py 
a     int64
b    object
c     int64
dtype: object
a     True
b    False
c     True
dtype: bool

@TomAugspurger
Contributor

Interesting, thanks. NumPy dtypes are a bit strange, since I think they aren't always singletons, and different instances may hash differently. If you always use the same object, do you see inconsistencies?

In [14]: intd = myDf.dtypes[0]

In [15]: myDf.dtypes.isin([intd])

@MarcoGorelli
Member

@TomAugspurger nope, with that I always get

a     int64
b    object
c     int64
dtype: object
a     True
b    False
c     True
dtype: bool

as output

@simonjayhawkins
Member

> I'm seeing this too. I saved a file called script.py with the code you pasted, and then ran:

Also seeing this (on master and 0.25.3).

@simonjayhawkins simonjayhawkins added Dtype Conversions Unexpected or buggy dtype conversions Bug and removed Needs Info Clarification about behavior needed to assess issue labels Mar 31, 2020
@TomAugspurger
Contributor

Thanks. We need to determine whether this is something we want to / can support.

I think the thing to figure out is when NumPy dtypes are not singletons or how we can ensure they factorize the same.

@mike1936

I have run into a problem similar to this issue: the result of pd.DataFrame.dtypes.isin([np.float64]) is unpredictable between Python sessions.

Reproducible Example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2], 'b': [1.0, 2.0]})
assert df.dtypes.isin([np.float64])['b']

The code in the reproducible example behaves differently between Python sessions. Sometimes the assertion passes, sometimes it fails.


INSTALLED VERSIONS
------------------
commit : 06d2301
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

pandas : 1.4.1
numpy : 1.23.0
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.2
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@billtubbs

I spent a fair bit of time today puzzled by this inconsistent behaviour. See the following Stack Overflow discussions about it:

  1. Why is df.dtypes.isin not giving expected results when passed a list of strings?
    (this question was closed as it was deemed a duplicate of the following one)
  2. dtype comparison: == and isin produce different results for "object"

In my opinion this problem is worth addressing as a data science application should not have random (i.e. non-deterministic) behaviour creeping in in unexpected places.

In my case the non-deterministic behaviour occurred when I passed a collection of strings to pd.DataFrame.dtypes.isin:

df.dtypes.isin(['int64', 'float64'])

While this use may be considered incorrect, I think others are likely to use it unwittingly: string dtypes are accepted in other situations (e.g. df.dtypes.A == 'float64'), no exception is raised, and at first glance the output (although random) may give the impression that the code is functioning as expected.

@rhshadrach
Member

The nondeterministic behavior and the use of strings make me think this is due to PYTHONHASHSEED. The following script demonstrates this by producing no output:

import os
from subprocess import call

# Child script: checks that, for a given PYTHONHASHSEED, isin() gives a fixed result.
script = """
import os
import pandas as pd

data = [[2, 'tom', 10], [3, 'nick', 15], [4, 'juli', 14]]
df = pd.DataFrame(data, columns=['a','b','c'])
result = df.dtypes.isin(['int64'])

# Seeds observed to give the True/False/True pattern; all others give all False.
if os.environ["PYTHONHASHSEED"] in ["1", "5", "6", "13", "15"]:
    expected = pd.Series([True, False, True], index=df.columns)
else:
    expected = pd.Series([False, False, False], index=df.columns)
pd.testing.assert_series_equal(result, expected)
"""

cmd = ['python', '-c', script]
env = os.environ.copy()  # copy so the parent process environment is not modified
for _ in range(10):
    for seed in range(20):
        # The seed must be set before the child interpreter starts.
        env['PYTHONHASHSEED'] = str(seed)
        call(cmd, env=env)

@billtubbs - do you get the same result?

@billtubbs

@rhshadrach I don't understand your code, but when I ran it as a script it produced no output. When I ran it in the interpreter it printed a series of 0s.

@rhshadrach
Member

rhshadrach commented Oct 31, 2023

Thanks! The code sets PYTHONHASHSEED and then verifies that with this set, the result is deterministic. When the value of the hash seed is 1, 5, 6, 13, or 15 you get the True False True pattern and otherwise you get False False False (for seeds up to 20). The trick is that you must set this seed before starting up the Python interpreter, which is why you need to use subprocess.call.
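To illustrate why the seed has to be set before the interpreter starts, here is a minimal standard-library sketch. The helper name string_hash_under_seed is made up for this example; it launches a child interpreter with a chosen PYTHONHASHSEED and reports the hash of a string in that child.

```python
import os
import subprocess
import sys


def string_hash_under_seed(text, seed):
    """Return hash(text) as computed by a fresh interpreter started with the given seed.

    Hypothetical helper for illustration only.
    """
    env = os.environ.copy()
    env["PYTHONHASHSEED"] = str(seed)  # only read at interpreter startup
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate interpreter sessions with the same seed agree on the string hash.
first = string_hash_under_seed("int64", 1)
second = string_hash_under_seed("int64", 1)
print(first == second)

# A different seed will, in general, give a different hash for the same string.
print(string_hash_under_seed("int64", 2))
```

Without PYTHONHASHSEED set, each new interpreter picks its own random seed, which would be consistent with the run-to-run differences reported in this thread.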

@rhshadrach
Member

rhshadrach commented Oct 31, 2023

I think this is what's going on: hash(np.dtype("int64")) is computed as hash(repr(np.dtype("int64"))) which evaluates to hash("dtype('int64')"). pandas isin works by creating a hash table of needles, and then taking each value in the haystack and seeing if its hash occurs in the table. When it gets a likely-positive based on hashes, it then tests for equality. So you only get the True value here when hash("dtype('int64')") == hash("int64") modulo the size of the hash map. This is nondeterministic by default because Python's hashing of strings is nondeterministic.

In short, this issue arises because you have a == b but hash(a) != hash(b). It seems to me this is a NumPy issue, but perhaps there isn't anything that can be done about it. I plan to report this to NumPy (if it isn't already up as an issue).
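The mechanism described above can be sketched with a toy bucketed lookup. This is not pandas' actual implementation; LooseDtype, toy_isin, and the hash_of callbacks are hypothetical names, and the hashes are fixed integers so the outcome is reproducible. The point is only that an equality test happens solely when the haystack value and the needle land in the same bucket.

```python
class LooseDtype:
    """Hypothetical stand-in for an object that compares equal to its string name."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, LooseDtype):
            return self.name == other.name
        return self.name == other


def toy_isin(haystack, needles, hash_of, n_buckets=4):
    """Toy model: bucket needles by hash modulo table size, then test equality in-bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for needle in needles:
        buckets[hash_of(needle) % n_buckets].append(needle)
    return [
        any(value == needle for needle in buckets[hash_of(value) % n_buckets])
        for value in haystack
    ]


values = [LooseDtype("int64"), LooseDtype("object")]

# Strings hash to 1, LooseDtype objects to 5: both fall in bucket 1 (5 % 4 == 1).
same_bucket = toy_isin(values, ["int64"], lambda x: 1 if isinstance(x, str) else 5)

# Strings hash to 1, LooseDtype objects to 6: buckets 1 and 2, so equality is never tried.
different_bucket = toy_isin(values, ["int64"], lambda x: 1 if isinstance(x, str) else 6)

print(same_bucket)       # [True, False]
print(different_bucket)  # [False, False]
```

In this model the "match" depends entirely on whether the two unrelated hashes happen to collide modulo the table size, which mirrors the seed-dependent True/False pattern reported earlier.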

@rhshadrach rhshadrach added Upstream issue Issue related to pandas dependency hashing hash_pandas_object and removed Dtype Conversions Unexpected or buggy dtype conversions labels Oct 31, 2023
@rhshadrach
Member

This is numpy/numpy#7242

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Oct 31, 2023
@billtubbs

billtubbs commented Oct 31, 2023

I don't fully understand the issue, but even if the root cause is a problem with NumPy, can something be done in pandas to avoid the non-deterministic behaviour of pd.Series.isin when the series values are numpy.dtype and the collection to search in contains strings? Either do the conversion (e.g. map(np.dtype, values) or np.vectorize(np.dtype)(values)), or perhaps raise an exception or warning.

As a bare minimum a warning message could be added to the docstring for pandas.Series.isin regarding the problem when checking values of type numpy.dtype.
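As a user-side sketch of the conversion idea (not a change to pandas itself), the string names can be turned into np.dtype objects before calling isin, so both sides of the lookup are dtype objects. The DataFrame below is built with explicit int64 columns purely for illustration; select_dtypes is shown as an alternative that accepts dtype names directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([2, 3, 4], dtype="int64"),
        "b": ["tom", "nick", "juli"],
        "c": np.array([10, 15, 14], dtype="int64"),
    }
)

# Convert the string names to dtype objects first, so the values being
# searched for are np.dtype instances rather than plain strings.
wanted = [np.dtype(name) for name in ["int64", "float64"]]
mask = df.dtypes.isin(wanted)
print(mask)

# select_dtypes accepts dtype names directly and returns the matching columns.
int_columns = list(df.select_dtypes(include=["int64"]).columns)
print(int_columns)
```

This matches the earlier observation in the thread that passing the dtype object itself, rather than a string, gave the same result on every run.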

@rhshadrach
Member

This isn't a phenomenon unique to NumPy. Anytime you violate a == b implies hash(a) == hash(b), this will happen. NumPy dtypes are just an example of an object that violates this.

I don't believe we should special-case the logic for NumPy dtypes. To have consistent behavior in pandas, we would then need to special case in any place that puts values into a hashmap (indexing, groupby). This would hurt performance and maintainability.

While it leaks the internals of isin, I'd be okay with adding a line to the docstring like

> Operates via a hashmap and will not perform reliably if the objects in question violate hashing requirements. In particular, if strings are involved then results may vary with different Python interpreter sessions.
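The hashing requirement mentioned here (a == b implies hash(a) == hash(b)) can be illustrated in plain Python, independent of NumPy or pandas. The Code class below is a hypothetical example that deliberately breaks the requirement, using integer hashes so the behavior does not depend on the hash seed.

```python
class Code:
    """Hypothetical wrapper that equals a plain int but deliberately hashes differently."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Code):
            return self.value == other.value
        return self.value == other

    def __hash__(self):
        # Breaks the contract on purpose: equal objects get different hashes.
        return hash(self.value) + 1


code = Code(5)

equal = code == 5                  # equality says they are the same
same_hash = hash(code) == hash(5)  # but the hashes disagree (6 vs 5)
found_in_set = 5 in {code}         # so a hash-based container misses the match

print(equal, same_hash, found_in_set)  # True False False
```

Any hash-based lookup, whether a Python set or a library's internal hash table, can miss equal objects in this situation, which is why the problem is not specific to NumPy dtypes.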

@rhshadrach rhshadrach removed the Closing Candidate May be closeable, needs more eyeballs label Oct 31, 2023

9 participants