How should isin behave with pd.NA? #31990
Comments
can show a similar example with a float array with np.nan
Yes, I think we should indeed propagate the NA here, similarly to how we do it with other operations that result in booleans (like comparisons). And the result here should indeed be a nullable boolean anyway (even when not propagating the NA), as we should try as much as possible to return nullable dtypes from operations on columns with nullable dtypes.
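To illustrate the comparison behaviour referred to above, here is a minimal sketch. It only shows how `==` on a nullable integer column propagates `pd.NA` into a nullable boolean result; the variable names are illustrative, and it makes no claim about what `isin` currently returns.

```python
import pandas as pd

# A nullable integer column with one missing value.
s = pd.Series([1, 2, None], dtype="Int64")

# Comparisons on nullable dtypes propagate pd.NA and return the
# nullable "boolean" dtype rather than a plain NumPy bool array.
eq = s == 1

print(eq.dtype)     # boolean
print(eq.tolist())  # [True, False, <NA>]
```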
Jumping into this a bit late in the game ... but some questions about pd.NA came up on gitter and I see it is still experimental. If there is some place for discussion about pd.NA please point me there. This use case suggests that allowing pd.NA to change the underlying NumPy array dtype to object could be pretty bad. If it were simply implemented as a mask attribute (optionally sparse or dense), everyone might have a better time down the road. For example, isin could be lightly modified with no significant performance hit if a pd.NA mask is present. If I understand the origins, I imagine this is largely to allow for missing integers and booleans?
For integers and booleans, it already uses a mask (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/masked.py). The string dtype does not (or not yet). For general questions, we don't really have a good place to ask those; discussing it further on gitter is fine, or otherwise the mailing list. And if you have specific issues (bug reports, enhancement requests), new issues are fine!
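The mask-based idea discussed above can be sketched roughly as follows. This is not the pandas implementation: `masked_isin` and its `test_has_na` parameter are hypothetical names, and the sketch assumes the data and mask are available as two separate NumPy arrays, obtained here through the public `to_numpy` and `isna` methods.

```python
import numpy as np
import pandas as pd


def masked_isin(data, mask, test_values, test_has_na=False):
    """Hypothetical helper: isin on (data, mask) that propagates NA."""
    data = np.asarray(data)
    mask = np.asarray(mask, dtype=bool)

    # Plain vectorised membership test on the underlying values.
    found = np.isin(data, test_values)

    # Masked inputs are unknown, so their result is unknown too.
    result_mask = mask.copy()
    if test_has_na:
        # A non-match is also unknown: the NA could be that value.
        result_mask |= ~found

    # Clear the placeholder values hidden under the mask.
    found = found & ~result_mask
    return pd.arrays.BooleanArray(found, result_mask)


arr = pd.array([1, 2, None], dtype="Int64")
res = masked_isin(arr.to_numpy(dtype="int64", na_value=0), arr.isna(), [1, 3])
print(res.tolist())  # [True, False, <NA>]
```

The only extra work compared with a plain `np.isin` call is a few boolean operations on the mask.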
I think a case could be made that the actual output is not correct in these cases, and that both should return the nullable boolean `pd.Series(pd.array([True, True, pd.NA]))`. In the first case we don't know that `"c"` is not in `arr`, and in the second case we don't know if `pd.NA` happens to be `"a"` or `"b"`, so again we should have `pd.NA`.

This is obviously an edge case but may be worth considering for the sake of consistency with the other three-valued logic operations (since `isin` is essentially an "or" statement).
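The "or" reading of `isin` can be made concrete with a small sketch. `kleene_isin` is a hypothetical helper, not a pandas function, and the inputs below are an assumption reconstructed from the description above (values `"a"`, `"b"`, `"c"`, with `pd.NA` standing in for an unknown string). It relies only on the documented behaviour that `pd.NA == x` is `pd.NA`, that `True | pd.NA` is `True`, and that `False | pd.NA` is `pd.NA`.

```python
import pandas as pd


def kleene_isin(values, test_values):
    """Hypothetical helper: membership as an 'or' of equality checks."""
    out = []
    for v in values:
        result = False
        for t in test_values:
            # v == t is pd.NA when either side is pd.NA; `|` on pd.NA
            # follows Kleene logic, so a definite True still wins.
            result = result | (v == t)
        out.append(result)
    return pd.array(out, dtype="boolean")


# Case 1: the test values contain an unknown, so "c" might be in them.
res1 = kleene_isin(["a", "b", "c"], ["a", "b", pd.NA])
# Case 2: the unknown value might be "a" or "b".
res2 = kleene_isin(["a", "b", pd.NA], ["a", "b"])

print(res1.tolist())  # [True, True, <NA>]
print(res2.tolist())  # [True, True, <NA>]
```

Under this reading both cases give `[True, True, <NA>]`, matching the result proposed above.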