The .isin() method is inconsistent when asked to compare numerical strings to ints, such that (pd.Series([1,2,3]).isin(pd.Series(['1','2','3']))) != (pd.Series(['1','2','3']).isin(pd.Series([1,2,3])))
There should be consistent handling of these cases.
This bug has been touched upon in issues #16938 and #24918, but this report is distinct in identifying a smaller subset of the problem, one that should be solvable in a way that those broader issues have not been.
Expected Output
Two options
For backwards compatibility, I would suggest that pandas now needs to offer .isin() matching between str and int, as some users will have relied on this in their code.
Alternatively, isin could raise an error if the input dtypes are not the same. This would be easier to achieve after pd.NA is fully integrated and locked down in pandas. (Why? Because people will not like it if they can't compare superficially equivalent columns because the existence of an NA value has changed the dtype of a column, as in pandas < 1.0.0.)
Mainly we need to take a call on the desired behaviour for isin; the current inconsistency is not valid behaviour, but it is not clear what behaviour we should have instead.
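To make the second option and its NA caveat concrete, here is a rough sketch. `strict_isin` is a hypothetical helper written for illustration, not a pandas function, and it assumes that "the same dtype" means exact dtype equality.

```python
import pandas as pd


def strict_isin(values: pd.Series, candidates: pd.Series) -> pd.Series:
    """Hypothetical helper: like Series.isin, but refuses mismatched dtypes."""
    if values.dtype != candidates.dtype:
        raise TypeError(
            f"cannot compare dtype {values.dtype} with dtype {candidates.dtype}"
        )
    return values.isin(candidates)


ints = pd.Series([1, 2, 3])
strs = pd.Series(["1", "2", "3"])

# Both directions now fail the same way instead of disagreeing silently.
for left, right in [(ints, strs), (strs, ints)]:
    try:
        strict_isin(left, right)
    except TypeError as exc:
        print(exc)

# The NA caveat: a missing value turns a default integer column into float64,
# so an exact dtype check would reject two columns that look equivalent ...
with_missing = pd.Series([1, 2, None])
print(with_missing.dtype)  # float64

# ... whereas the nullable integer dtype keeps both columns comparable.
nullable_ints = pd.Series([1, 2, 3], dtype="Int64")
nullable_with_missing = pd.Series([1, 2, None], dtype="Int64")
print(strict_isin(nullable_ints, nullable_with_missing).tolist())
```

Exact dtype equality is the bluntest possible rule; a real implementation would more likely need some notion of "comparable" dtypes.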
Correction: I think one of the datetime fixes for isin() between 1.1.5 and 1.2.0 may have altered this issue by forcing the result to False whenever different dtypes are compared in an isin()?
If that is deliberate behaviour, it is an okay resolution, although wouldn't it be better to raise a warning when an attempt is made to compare incomparable dtypes, since that comparison will always return False?
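The warning suggested here could be sketched as a thin wrapper around Series.isin. `isin_with_warning` is a hypothetical name, and the rule used (warn whenever the two dtypes are not identical) is an assumption made for illustration; pandas itself is not claimed to behave this way.

```python
import warnings

import pandas as pd


def isin_with_warning(values: pd.Series, candidates: pd.Series) -> pd.Series:
    """Hypothetical helper: call Series.isin, warning when the dtypes differ."""
    if values.dtype != candidates.dtype:
        warnings.warn(
            f"isin called with dtype {values.dtype} against dtype "
            f"{candidates.dtype}; values of different types may never match",
            UserWarning,
            stacklevel=2,
        )
    return values.isin(candidates)


# Record warnings so the effect is visible without relying on filter settings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = isin_with_warning(pd.Series([1, 2, 3]), pd.Series(["1", "2", "3"]))

print(len(caught))
print(result.tolist())
```

As written, the check would also warn for int64 against float64, which are comparable, so a usable version would need a more careful test than plain dtype inequality.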
Closing this, as the code sample does give the expected output on 1.2.0. See also #38781.
Feel free to open a new issue to highlight any of the other points raised here.
simonjayhawkins added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and isin (isin method) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on Jan 19, 2021.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
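The sample itself is missing from this copy of the report. A minimal sketch, assuming it did no more than evaluate the expression quoted in the problem description in both directions (the variable names are illustrative):

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
strs = pd.Series(["1", "2", "3"])

# Same values on both sides; only the direction of the lookup differs.
ints_in_strs = ints.isin(strs)
strs_in_ints = strs.isin(ints)

print(pd.__version__)
print(ints_in_strs.tolist())
print(strs_in_ints.tolist())

# True when both directions agree, which is what the report asks for.
print(ints_in_strs.equals(strs_in_ints))
```

The result depends on the installed version, so printing `pd.__version__` alongside the two results makes it easy to tell which behaviour a given install shows.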
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Tue Nov 10 00:07:31 PST 2020; root:xnu-4903.278.51~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.2.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.2.0
Cython : None
pytest : 3.10.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 0.6.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2