-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
fix series.isin slow issue with Dtype IntegerArray #38379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
109c0e7
e9f96ea
a6be9c8
415b590
f3e5afb
14579fc
562c918
1449d3c
3ccc917
98a0683
6e2917e
a4b6503
f95fde9
2348a60
c963183
c151102
ef99e86
f23ba94
13dc64f
cc38088
94846cc
f4cb5ce
2ee8b05
7763b18
87dfff9
d2d32d1
90ef57a
a48e00b
2279519
c726d4a
1134ad6
bce0e3e
bf788e5
40950be
570d640
0f89578
199c11c
9f35b5b
2238dc5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( | ||
values.dtype | ||
): | ||
if type(comps).__name__ == "IntegerArray": | ||
comps = comps._data # type: ignore[attr-defined, assignment] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is going to lead to incorrect answers, e.g.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @jbrockmendel, thanks for pointing out this. If there is pd.NA in the array, the result could be incorrect, but we can multiply the result with ~comps._mask, which is a boolean numpy array for pd.NA values: def isin_for_masked_array(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values) * np.invert(comps._mask)
return result
return isin(comps, values) I've tested the result is correct and the runtime is not bad. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And if the case is def isin_for_masked_array2(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values) * np.invert(comps._mask)
if any(x is pd.NA for x in values):
result += comps._mask
return result
return isin(comps, values) We have to be careful when 2nd array contains 1, because MaskArray's NA value will be 1 in self._data. def wrong_solution(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values)
if any(x is pd.NA for x in values):
pass
else:
result *= np.invert(comps._mask)
return result
return isin(comps, values) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Have to be a little bit careful about which null values you allow, e.g. if values contains As for whether to set NA locations to False or NA, best to check with @jorisvandenbossche for his thoughts on the desired behavior. On the code organization, #38422 is a sketch of where I suggest you put the implementation. |
||
return isin(np.asarray(comps), np.asarray(values)) | ||
|
||
# GH16012 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to check isinstance(comps, numpy.ma.MaskedArray) I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not the numpy function but check for the masked EA class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we could use
here.