Inconsistencies in DataFrame.dtypes.isin(): different results across runs of the same code #32443
Do you have a minimal reproducible example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports. Ideally you wouldn't need to read from CSV to reproduce this. Can you also check the formatting on your post? https://guides.github.com/features/mastering-markdown/ |
Updated the info and format. |
Thanks. I'm not able to reproduce locally. |
May I ask how many times you have run this? |
Hello again. I am trying to follow up on this. Could you please elaborate on the result you are getting? I want to emphasize that the inconsistency only shows up after running the same script about 10 times (the results differ across those runs). |
When you say re-run, do you mean within a single Python interpreter session, or across multiple sessions? |
I'm seeing this too. I saved a file called
|
Interesting, thanks. NumPy dtypes are a bit strange, since I think they aren't always singletons, and different instances may hash differently. If you always use the same object, do you see inconsistencies? In [14]: intd = myDf.dtypes[0]
In [15]: myDf.dtypes.isin([intd]) |
@TomAugspurger nope, with that I always get the same output. |
also seeing this. (on master and 0.25.3) |
Thanks. We need to determine whether this is something we want to / can support. I think the thing to figure out is when NumPy dtypes are not singletons or how we can ensure they factorize the same. |
I have encountered the same problem described in this issue: unpredictable behavior of pd.DataFrame.dtypes.isin([np.float64]) assertions across Python sessions. Reproducible Example:
The code in the reproducible example behaves differently across Python sessions. Sometimes the assertion passes, sometimes it fails.
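The example itself did not survive the thread's formatting; below is a minimal sketch of the kind of reproducer described (the DataFrame contents are an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical data; any all-float frame exhibits the issue.
df = pd.DataFrame({"a": [1.0, 2.0]})

# Sometimes passes and sometimes raises AssertionError, depending on
# the interpreter session (i.e. on the session's seeded hashing).
assert df.dtypes.isin([np.float64]).all()
```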
INSTALLED VERSIONS
------------------
commit : 06d2301
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936
pandas : 1.4.1 |
I spent a fair bit of time today puzzled by this inconsistent behaviour. See the following Stack Overflow discussions about it:
In my opinion this problem is worth addressing, as a data science application should not have random (i.e. non-deterministic) behaviour creeping in at unexpected places. In my case the non-deterministic behaviour occurred when I passed a collection of strings to df.dtypes.isin(['int64', 'float64']). While this use may be considered incorrect, I think it is likely that others use it unwittingly, since string dtype names are accepted in other situations (e.g. |
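The comment's examples were cut off above; here are two places where string dtype names are accepted (these examples are mine, not necessarily the ones the commenter had in mind):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [1.5]})

# String dtype names work fine in other pandas APIs, which is why
# passing them to isin() feels natural:
print(df.select_dtypes(include=["int64"]))  # selects column "a"
print(df["b"].astype("float32").dtype)      # float32
```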
The nondeterministic behavior and the use of strings make me think this is due to PYTHONHASHSEED. The following script demonstrates this by producing no output:
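The script itself was lost in formatting; here is a sketch of the kind of check described, re-running a small isin() snippet twice under each pinned hash seed and printing only seeds whose two runs disagree (the snippet and the seed range are assumptions):

```python
import os
import subprocess
import sys

# Hypothetical snippet standing in for the original reproducer.
snippet = (
    "import pandas as pd; "
    "df = pd.DataFrame({'a': [1]}); "
    "print(df.dtypes.isin(['int64']).all())"
)

for seed in range(16):
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    runs = [
        subprocess.run(
            [sys.executable, "-c", snippet],
            env=env,
            capture_output=True,
            text=True,
        ).stdout
        for _ in range(2)
    ]
    # With the seed pinned the two runs should agree; only a seed whose
    # runs disagree would produce output here.
    if runs[0] != runs[1]:
        print(f"seed {seed}: nondeterministic")
```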
@billtubbs - do you get the same result? |
@rhshadrach I don't understand your code, but when I ran it as a script it produced no output. When I ran it in the interpreter it printed a series of |
Thanks! The code sets PYTHONHASHSEED and then verifies that with this set, the result is deterministic. When the value of the hash seed is 1, 5, 6, 13, or 15 you get the |
I think this is what's going on: In short, this issue arises because you have |
This is numpy/numpy#7242 |
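That NumPy issue is about np.dtype comparing equal to objects whose hashes differ. A quick illustration of the contract violation (the results noted in comments are typical, not guaranteed):

```python
import numpy as np

dt = np.dtype("float64")

# np.dtype compares equal both to the scalar type and to its string name...
print(dt == np.float64, dt == "float64")  # True True

# ...but the hashes need not agree, violating the rule that
# x == y should imply hash(x) == hash(y).
print(hash(dt) == hash(np.float64))  # typically False
print(hash(dt) == hash("float64"))   # varies with PYTHONHASHSEED
```

A hash table locates a bucket from the hash before it ever calls ==, so whether the unequal hashes happen to land in the same bucket (which depends on the seeded string hash) is what makes isin() intermittently find a match.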
I don't fully understand the issue, but even if the root cause is a problem with NumPy, can something be done in pandas to avoid the non-deterministic behaviour of isin()? As a bare minimum, a warning message could be added to the isin() docstring. |
This isn't a phenomenon unique to NumPy. Anytime you violate the rule that objects which compare equal must also hash equal, you can get behavior like this. I don't believe we should special-case the logic for NumPy dtypes. To have consistent behavior in pandas, we would then need to special-case every place that puts values into a hashmap (indexing, groupby). This would hurt performance and maintainability. While it leaks the internals of
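A tiny pure-Python sketch of the same failure mode (the class is hypothetical, purely for illustration):

```python
# Hypothetical class that breaks the __eq__/__hash__ contract on purpose.
class Sloppy:
    def __eq__(self, other):
        return other == "key"

    def __hash__(self):
        # id()-based, so it almost never matches hash("key").
        return id(self)

d = {"key": 1}
# Compares equal to "key", yet the lookup almost always misses, because
# the dict locates a bucket by hash before it ever calls __eq__.
print(Sloppy() == "key")  # True
print(Sloppy() in d)      # False (in practice)
```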
|
Code Sample, a copy-pastable example if possible
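The original sample did not survive formatting; the following is a minimal sketch of the pattern described in the thread (the column contents are an assumption; the name myDf follows the snippets quoted above):

```python
import pandas as pd

# Hypothetical data; any mixed int/float frame shows the issue.
myDf = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Across interpreter sessions this sometimes marks the dtypes True
# and sometimes False, even though nothing in the code changes.
print(myDf.dtypes.isin(["int64", "float64"]))
```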
Problem description
The problem is that I am getting inconsistent results from running the same code:
The first run gives me:
and a second run (sometimes it takes up to 10 attempts) gives me:
The second output is the correct one and matches myDf.dtypes:
Again, to reiterate: I am not changing the code, yet when I keep running it I sometimes get the first output and sometimes the second. The input, code, and interpreter stay the same. I have tried different machines and interpreters and still get the same inconsistent result.
Python version: 3.7
pandas version: 1.0
Expected Output
A consistent result (the correct output already appears on some runs).
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None