REGR: isin with nullable types with missing values raising #42473
pandas/_libs/lib.pyx
Outdated
        if arr[i] is C_NA:
            return True

    return False
This may be overkill, but fits with some other cython methods like has_infs and would hopefully be useful in other locations
could live in _libs.missing?
have moved

    # For now, NA does not propagate so set result according to presence of NA,
    # see https://github.com/pandas-dev/pandas/pull/38379 for some discussion
    result[self._mask] = values_have_NA
This line should be equivalent to the deleted if/else AFAICT, but I think makes logic easier to follow
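For readers following along, a minimal sketch of what that single assignment does, written against plain numpy rather than the actual masked-array internals. The helper name isin_masked and the (data, mask) layout are assumptions for illustration only; the real method lives on the masked array class and is not reproduced here.

```python
import numpy as np


def isin_masked(data, mask, values, values_have_na):
    """Membership test for a masked array given as (data, mask).

    Hypothetical helper: `data` holds the raw values and `mask` is True
    wherever the element is missing.
    """
    result = np.isin(data, values)
    # Slots under the mask hold arbitrary placeholders, so whatever
    # np.isin computed there is meaningless. NA does not propagate:
    # a missing element counts as a match only if NA was searched for.
    result[mask] = values_have_na
    return result


data = np.array([1, 1, 0])  # the last slot is only a placeholder
mask = np.array([False, False, True])
print(isin_masked(data, mask, [0], values_have_na=False))
print(isin_masked(data, mask, [0], values_have_na=True))
```

Assigning the boolean directly covers both branches of an if/else on values_have_na, which is why the one-liner reads as equivalent.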
asv_bench/benchmarks/algos/isin.py
Outdated
    from pandas.compat import np_version_under1p20
except ImportError:
    from pandas.compat.numpy import _np_version_under1p20 as np_version_under1p20
from pandas.core.dtypes.cast import is_extension_array_dtype
get this from dtypes.common
pandas/_libs/lib.pyx
Outdated
@@ -538,6 +538,24 @@ def has_infs(floating[:] arr) -> bool:
    return ret


@cython.wraparound(False)
@cython.boundscheck(False)
def has_NA(ndarray arr) -> bool:
ndarray[object]?
When I try that (or object[:]), I end up getting errors like ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'
yah, this should only be called on object-dtype, in other cases we already know the result is False
Ah got it thanks, have changed
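A rough Python-space sketch of the point made in this thread: a non-object array cannot hold pd.NA, so the scan is only needed for object dtype. The name has_na is hypothetical and this is not the Cython routine from the diff, just an illustration of the guard.

```python
import numpy as np
import pandas as pd


def has_na(arr: np.ndarray) -> bool:
    """Return True if an ndarray contains pd.NA (hypothetical helper)."""
    # Numeric buffers such as float64 cannot store the pd.NA singleton,
    # so the answer is known without looking at the elements.
    if arr.dtype != object:
        return False
    # Identity check: only the pd.NA singleton counts.
    return any(x is pd.NA for x in arr)


print(has_na(np.array([1.0, 2.0])))
print(has_na(np.array([1, pd.NA], dtype=object)))
```

With the guard in place, the typed-buffer version never sees a 'double' buffer, which is where the dtype mismatch error came from.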
asv_bench/benchmarks/algos/isin.py
Outdated
        self.values = np.arange(MaxNumber).astype(dtype)

        if is_extension_array_dtype(dtype):
            vals_dtype = dtype.type
should this be more specific to Int64Dtype/Float64Dtype? if we had e.g. PeriodDtype then the astype on L308 would break
Sounds good, have limited to those to make clearer what can pass through here in current form
asv_bench/benchmarks/algos/isin.py
Outdated
        "float64",
        "float32",
        "object",
        Int64Dtype(),
this is very odd, why doesn't Int64 work here?
This was so the type attribute could be accessed to get the underlying numpy compatible type. I'll try to reorganize to avoid this
asv_bench/benchmarks/algos/isin.py
Outdated
@@ -297,6 +298,10 @@ def setup(self, dtype, MaxNumber, series_type):
        array = np.arange(N) + MaxNumber

        self.series = Series(array).astype(dtype)

        if isinstance(dtype, (Int64Dtype, Float64Dtype)):
see above, this is a problem if you need to do this
pandas/_libs/missing.pyx
Outdated
@@ -334,6 +334,22 @@ def isnaobj2d_old(arr: ndarray) -> ndarray:
    return result.view(np.bool_)


@cython.wraparound(False)
this is the same as: cpdef ndarray[uint8_t] isnaobj(ndarray arr), no?
it might be a very tiny bit faster but likely not worth it
The issue here was that we need to be able to explicitly check for NA (and not other missing values) to maintain the existing behavior
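To make the distinction concrete, here is a small sketch contrasting a general null check with an identity check against pd.NA. The public pd.isna is used as a stand-in for the internal isnaobj routine; that substitution is an assumption made so the example runs without touching pandas internals.

```python
import numpy as np
import pandas as pd

values = np.array([None, np.nan, pd.NA], dtype=object)

# General null check: None, nan and NA are all reported as missing.
any_null = pd.isna(values)

# Identity check: only the pd.NA singleton is reported.
only_na = np.array([v is pd.NA for v in values])

print(any_null)
print(only_na)
```

A routine with the first semantics would treat isin([np.nan]) and isin([None]) the same as isin([pd.NA]), which is the change in behavior being avoided here.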
do we actually test this? i think we accept multiple types of nulls here (e.g. None is accepted as well as np.nan for strings). I would just use our existing routine unless you have tests that show this is not sufficient.
This is not tested (hence the regression :)
On 1.2.x the behavior is:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser = pd.Series([1, 1, pd.NA], dtype="Int64")

In [4]: ser.isin([pd.NA])
Out[4]:
0    False
1    False
2     True
dtype: bool

In [5]: ser.isin([np.nan])
Out[5]:
0    False
1    False
2    False
dtype: bool

This isin behavior singling out pd.NA also matches strings:

In [7]: ser = pd.Series(["a", "b", None], dtype="string")

In [8]: ser
Out[8]:
0       a
1       b
2    <NA>
dtype: string

In [9]: ser.isin([np.nan])
Out[9]:
0    False
1    False
2    False
dtype: bool

In [10]: ser.isin([pd.NA])
Out[10]:
0    False
1    False
2     True
dtype: bool

So this matches existing behavior and how StringDtype handles NA for isin, even if the behavior itself may be questionable
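The Int64 session above could be pinned down with a check along these lines. This is only a sketch: the function name is hypothetical, it is not the test added in this PR, and it compares values rather than the result dtype since that may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def check_isin_nullable_int():
    """Assert the non-propagating isin behavior for Int64 (hypothetical check)."""
    ser = pd.Series([1, 1, pd.NA], dtype="Int64")

    # Searching for pd.NA matches only the missing element.
    assert ser.isin([pd.NA]).tolist() == [False, False, True]
    # np.nan is not treated as pd.NA, so nothing matches.
    assert ser.isin([np.nan]).tolist() == [False, False, False]
    # Ordinary values never match the missing element.
    assert ser.isin([1]).tolist() == [True, True, False]


check_isin_nullable_int()
```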
> did you add an explicit test for this?
The added test has parameterizations which hit this explicitly.
> also -1 on the expansion of yet another null checker. if we want to keep the existing then do this via a flag to isnaobj.
If we're uncertain about this behavior, could also just turn the added cython routine into something similar done in python space just to fix the regression. Then that could easily be removed for 1.4 if we decide to change the behavior.
sure. this should match anything we allow for nulls in the _from_sequence which is what isnaobj does
i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype
I think we're going to at some point need a vectorized is_matching_na, which this would be a special case of
> i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype
Not sure if along the lines of what you were thinking, but pushed a different solution just checking for presence of a valid missing value using dtype.na_value
Have fixed the issue in benchmark for nullable types in a different way. If still looks a bit kludgy let me know, and we can just leave that out of this pr
also rebase
@meeseeksdev backport 1.3.x
this is going to need a manual backport, cc @mzeitlin11
…sing values raising
Didn't look like there was a consensus reached on NA propagation in #39844, this should maintain the behavior from 1.2.x of not propagating and just fix the regression