How should isin behave with pd.NA? #31990

dsaxton · 2020-02-15T05:50:08Z

import pandas as pd

arr = pd.array(["a", "b", pd.NA], dtype="string")
s = pd.Series(["a", "b", "c"])

print(s.isin(arr))
# 0     True
# 1     True
# 2    False
# dtype: bool

print(pd.Series(arr).isin(["a", "b"]))
# 0     True
# 1     True
# 2    False
# dtype: bool

I think a case could be made that the actual output is not correct in these cases, and that both should return the nullable boolean pd.Series(pd.array([True, True, pd.NA])). In the first case we don't know that "c" is not in arr, and in the second case we don't know if pd.NA happens to be "a" or "b", so again we should have pd.NA.

This is obviously an edge case but may be worth considering for the sake of consistency with the other three-valued logic operations (since isin is essentially an "or" statement).
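To make the "isin is essentially an or statement" reading concrete, here is a minimal pure-Python sketch of the proposed semantics. It is not how pandas implements isin: `NA` is a hypothetical stand-in for pd.NA, and `kleene_or`, `kleene_eq`, and `kleene_isin` are names invented for illustration. Each result is the three-valued "or" of the element-wise equality comparisons.

```python
class _NAType:
    """Hypothetical stand-in for pd.NA (assumption: stdlib-only model)."""

    def __repr__(self):
        return "NA"


NA = _NAType()


def kleene_or(a, b):
    """Three-valued 'or': True wins, then NA; only False | False is False."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_eq(a, b):
    """Equality that is unknown whenever either operand is missing."""
    if a is NA or b is NA:
        return NA
    return a == b


def kleene_isin(values, candidates):
    """Fold kleene_or over the comparisons of each value with every candidate."""
    out = []
    for v in values:
        result = False
        for c in candidates:
            result = kleene_or(result, kleene_eq(v, c))
        out.append(result)
    return out


print(kleene_isin(["a", "b", "c"], ["a", "b", NA]))  # [True, True, NA]
print(kleene_isin(["a", "b", NA], ["a", "b"]))       # [True, True, NA]
```

Under this model both examples from the report yield [True, True, NA]: a definite match still gives True even when the candidates contain a missing value, and only the undecidable positions become NA.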


jreback · 2020-02-15T05:52:30Z

A similar example can be shown with a float array containing np.nan.

jorisvandenbossche · 2020-02-15T08:50:51Z

Yes, I think we should indeed propagate the NA here, similarly to how we do it with other operations that result in booleans (like comparisons).

And the result here should indeed also be a nullable boolean anyway (even when not propagating the NA), as we should try as much as possible to return nullable dtypes from operations on columns with nullable dtypes.

cottrell · 2020-04-12T08:36:21Z

Jumping into this a bit late in the game ... but some questions about pd.NA came up on gitter and I see it is still experimental. If there is some place for discussion about pd.NA please point me there.

This use case suggests that allowing pd.NA to change the underlying numpy array dtype to object could be pretty bad.

If it were simply implemented as a mask attribute (sparse or dense, optionally), everyone might have a better time down the road. For example, isin could be lightly modified with no significant performance hit if a pd.NA mask is present.
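A rough sketch of that idea, assuming a hypothetical container that stores plain values next to a parallel boolean mask (True meaning missing). `MaskedStrings` and `MaskedBools` are invented names, not pandas classes, and carrying the mask over to the result is only one possible choice for how missing values should behave.

```python
from dataclasses import dataclass


@dataclass
class MaskedBools:
    """Hypothetical nullable boolean result: values plus a missing-value mask."""

    data: list
    mask: list

    def to_list(self):
        # Show masked positions as None so missing entries are visible.
        return [None if missing else value
                for value, missing in zip(self.data, self.mask)]


@dataclass
class MaskedStrings:
    """Hypothetical string container: plain values plus a parallel mask."""

    data: list
    mask: list

    def isin(self, candidates):
        lookup = set(candidates)  # ordinary hash lookup on the raw values
        hits = [value in lookup for value in self.data]
        # The only extra work is copying the mask onto the result.
        return MaskedBools(hits, list(self.mask))


# The third slot holds a placeholder value and is flagged as missing.
arr = MaskedStrings(["a", "b", ""], [False, False, True])
result = arr.isin(["a", "b"])
print(result.to_list())  # [True, True, None]
```

The membership test itself runs on the raw values as usual; the mask is handled in a separate, cheap step, which is the sense in which the change to isin would be light.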

If I understand the origins, I'm imagining this is largely to allow for missing integers and booleans?

jorisvandenbossche · 2020-04-12T13:15:44Z

For integers and booleans, it does use a mask (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/masked.py). But the string dtype does not (or not yet).

For general questions, we don't really have a good place to ask them. Continuing the discussion on gitter is fine, or otherwise the mailing list. And if you have specific issues (bug reports, enhancement requests), new issues are fine!

jorisvandenbossche added the "NA - MaskedArrays" (Related to pd.NA and nullable extension arrays) label Feb 15, 2020

jorisvandenbossche added this to the 1.1 milestone Feb 15, 2020

dsaxton mentioned this issue Apr 12, 2020

Implement pd.NA checking function #33490

Closed

4 tasks

mroeschke added the Bug label Apr 28, 2020

jreback modified the milestones: 1.1, Contributions Welcome Jul 11, 2020

jbrockmendel added the "isin" (isin method) label Oct 30, 2020

mzeitlin11 mentioned this issue Sep 5, 2021

API: ExtensionArray.isin treatment of missing values #42545

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
