Skip to content

fix series.isin slow issue with Dtype IntegerArray #38379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jan 20, 2021
Merged
Changes from 8 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
109c0e7
fix series.isin slow issue with Dtype IntegerArray
tushushu Dec 9, 2020
e9f96ea
Move isinstance(comps, IntegerArray) to algo.isin
tushushu Dec 9, 2020
a6be9c8
cannot import IntegerArray due to circular import
tushushu Dec 9, 2020
415b590
fix bug in pandas (Linux py38_np_dev)
tushushu Dec 9, 2020
f3e5afb
fix pre commit issue.
tushushu Dec 9, 2020
14579fc
fix the code style issue.
tushushu Dec 9, 2020
562c918
move the logic to elif block.
tushushu Dec 11, 2020
1449d3c
remove blank line.
tushushu Dec 11, 2020
3ccc917
copy codes from #38422
tushushu Dec 27, 2020
98a0683
make `isin` correct for pd.NA
tushushu Dec 27, 2020
6e2917e
sort imports
tushushu Dec 27, 2020
a4b6503
Avoiding import pandas as pd.
tushushu Dec 31, 2020
f95fde9
fix cannot import NA issue.
tushushu Dec 31, 2020
2348a60
Merge pull request #1 from pandas-dev/master
tushushu Jan 2, 2021
c963183
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 7, 2021
c151102
Merge pull request #2 from pandas-dev/master
tushushu Jan 7, 2021
ef99e86
Merge branch 'ENH-implement-fast-isin' of github.com:tushushu/pandas …
tushushu Jan 7, 2021
f23ba94
Merge remote-tracking branch 'origin/master' into ENH-implement-fast-…
tushushu Jan 9, 2021
13dc64f
Adding Int64 and Float64 for benchmarks.
tushushu Jan 9, 2021
cc38088
Adding isin benchmarks for Boolean array
tushushu Jan 9, 2021
94846cc
Adding what's new note.
tushushu Jan 9, 2021
f4cb5ce
fix IsInFloat64 benchmarks
tushushu Jan 9, 2021
2ee8b05
always return false for null values.
tushushu Jan 9, 2021
7763b18
fix flake8 error.
tushushu Jan 9, 2021
87dfff9
remove unused lines.
tushushu Jan 10, 2021
d2d32d1
refactors for series benchmarks.
tushushu Jan 10, 2021
90ef57a
fix flake8 errors.
tushushu Jan 10, 2021
a48e00b
Change back to see if can pass the tests.
tushushu Jan 10, 2021
2279519
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 14, 2021
c726d4a
makes NA isin [NA] return True.
tushushu Jan 14, 2021
1134ad6
remove redundant codes.
tushushu Jan 14, 2021
bce0e3e
makes performance better.
tushushu Jan 14, 2021
bf788e5
fix flake8 errors.
tushushu Jan 14, 2021
40950be
polish codes
tushushu Jan 16, 2021
570d640
not import NA
tushushu Jan 16, 2021
0f89578
fix code style
tushushu Jan 16, 2021
199c11c
fix black error.
tushushu Jan 16, 2021
9f35b5b
fix CI
tushushu Jan 17, 2021
2238dc5
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 19, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
if type(comps).__name__ == "IntegerArray":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to check isinstance(comps, numpy.ma.MaskedArray) I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not the numpy function but check for the masked EA class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we could use

if isinstance(comps, BaseMaskedArray):

here.

comps = comps._data # type: ignore[attr-defined, assignment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to lead to incorrect answers, e.g.

import pandas as pd
from pandas.core.algorithms import isin

arr = pd.array([2, 3, pd.NA, 5])

>>> isin(arr, [1])
array([False, False, False, False])
>>> isin(arr._data, [1])
array([False, False,  True, False])

Copy link
Contributor Author

@tushushu tushushu Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jbrockmendel, thanks for pointing out this. If there is pd.NA in the array, the result could be incorrect, but we can multiply the result with ~comps._mask, which is a boolean numpy array for pd.NA values:

def isin_for_masked_array(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values) * np.invert(comps._mask)
        return result
    return isin(comps, values)

I've tested the result is correct and the runtime is not bad.

Copy link
Contributor Author

@tushushu tushushu Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if the case is isin([2, 3, pd.NA, 5], [1, pd.NA]), then above solution is not correct. We have to check if there is null value in the second array, then add the comps._mask to make null value's place become True in the first array. The solution could be:

def isin_for_masked_array2(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values) * np.invert(comps._mask)
        if any(x is pd.NA for x in values):
            result += comps._mask
        return result
    return isin(comps, values)

We have to be careful when 2nd array contains 1, because MaskArray's NA value will be 1 in self._data.
The solution below could be wrong:

def wrong_solution(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values)
        if any(x is pd.NA for x in values):
            pass
        else:
             result *= np.invert(comps._mask)
        return result
    return isin(comps, values)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to check if there is null value in the second array

Have to be a little bit careful about which null values you allow, e.g. if values contains pd.NaT and comps is an IntegerArray, we probably dont want to count it.

As for whether to set NA locations to False or NA, best to check with @jorisvandenbossche for his thoughts on the desired behavior.

On the code organization, #38422 is a sketch of where I suggest you put the implementation.

return isin(np.asarray(comps), np.asarray(values))

# GH16012
Expand Down