Skip to content

WIP: reduce unwanted type conversions of pd.Series.isin #19508

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

jdoepfert
Copy link
Contributor

@jdoepfert jdoepfert commented Feb 2, 2018

Addresses the bug in pd.Series.isin(values) that sometimes undesirably converts the dtype of values to the the dtype of the series before the isin comparison. Here is an example where values is converted to int, and hence the output (True, True) is yielded, while (True, False) is expected:

>>>  values = [1, 0.5]
>>> pd.Series([1, 0]).isin(values)
0    True
1    True
dtype: bool

In my understanding, the bug has two sources:

1) L407-408 in pandas/core/algorithms.py

comps, dtype, _ = _ensure_data(comps)
values, _, _ = _ensure_data(values, dtype=dtype)

Here, first the dtype of the series (comps) is determined, and then values is forcefully converted to that dtype. This e.g. leads to the undesired conversion of the example above.

2) L417-434 in pandas/core/algorithms.py

    elif is_integer_dtype(comps):
        try:
            values = values.astype('int64', copy=False)
            comps = comps.astype('int64', copy=False)
            f = lambda x, y: htable.ismember_int64(x, y)
        except (TypeError, ValueError):
            values = values.astype(object)
            comps = comps.astype(object)

    elif is_float_dtype(comps):
        try:
            values = values.astype('float64', copy=False)
            comps = comps.astype('float64', copy=False)
            checknull = isna(values).any()
            f = lambda x, y: htable.ismember_float64(x, y, checknull)
        except (TypeError, ValueError):
            values = values.astype(object)
            comps = comps.astype(object)

Here, if the dtype of the series (comps) is either float or int, a forced conversion of values takes place again.

The fix:
1) L408ff
Only force the dtype of comps to values if both are int-like, float-like, or datetime-like. If not, use _ensure_data(values) without the dtype argument. I tried to keep the cost of the dtype checks low by exploiting lazy evaluation.

2) L432ff
Add and statements such that the conversion only takes place when both comps and values are either float or int. If they are none of the above is true, both are converted to object type in the else block. This ensures that return f(comps, values) works (since f = lambda x, y: htable.ismember_object(x, values) from L426).

@pep8speaks
Copy link

pep8speaks commented Feb 2, 2018

Hello @jdoepfert! Thanks for updating the PR.

Line 419:13: E265 block comment should start with '# '

Comment last updated on February 05, 2018 at 01:05 Hours UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a release note.

this absolutely needs a performance check. needs to run the asvs for algorithms

@@ -405,7 +406,21 @@ def isin(comps, values):
values = construct_1d_object_array_from_listlike(list(values))

comps, dtype, _ = _ensure_data(comps)
values, _, _ = _ensure_data(values, dtype=dtype)
# Convert `values` to `dtype` if dtype of `values` is like `dtype`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to check all these conditions, rather you can just do

In [14]: l = [1.0, 1]

In [15]: set([type(v) for v in l])
Out[15]: {float, int}

If you have a set of len(1) then you can go ahead and pass the dtype (else you can use object).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thx for the comment, not sure if I got you right though. Imagine I have

pd.Series([1, 0]).isin([1.0, 0.5])

Then the len of the set would be one (I only have floats), but the conversion to int would still not be desirable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats a different case which is actually simple. the dtype that ensure_data returns is different from the dtype that the set check returns.

NONE of this checking can be done here, rather it is already done in ensure_data

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Bug labels Feb 2, 2018
@codecov
Copy link

codecov bot commented Feb 2, 2018

Codecov Report

Merging #19508 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19508      +/-   ##
==========================================
- Coverage   91.67%   91.65%   -0.02%     
==========================================
  Files         148      148              
  Lines       48553    48569      +16     
==========================================
+ Hits        44513    44518       +5     
- Misses       4040     4051      +11
Flag Coverage Δ
#multiple 90.02% <100%> (-0.02%) ⬇️
#single 41.72% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 92.9% <100%> (-1.25%) ⬇️
pandas/plotting/_core.py 82.23% <0%> (-0.18%) ⬇️
pandas/tseries/offsets.py 97% <0%> (-0.09%) ⬇️
pandas/core/indexes/datetimes.py 95.25% <0%> (+0.01%) ⬆️
pandas/core/ops.py 95.74% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6b260b...80d9c3d. Read the comment docs.

@jdoepfert
Copy link
Contributor Author

jdoepfert commented Feb 2, 2018

@jreback Added a description. Agree, this definitely needs a performance check, will work on that tomorrow

EDIT: can't run asv at the moment, see #19518

@@ -405,7 +406,21 @@ def isin(comps, values):
values = construct_1d_object_array_from_listlike(list(values))

comps, dtype, _ = _ensure_data(comps)
values, _, _ = _ensure_data(values, dtype=dtype)
# Convert `values` to `dtype` if dtype of `values` is like `dtype`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats a different case which is actually simple. the dtype that ensure_data returns is different from the dtype that the set check returns.

NONE of this checking can be done here, rather it is already done in ensure_data

@jdoepfert
Copy link
Contributor Author

Still not sure I got it. Tried to implement the set check approach, but due to this test (L502 in test_algos.py)

    def test_i8(self):

        arr = pd.date_range('20130101', periods=3).values
        result = algos.isin(arr, [arr[0]])
        expected = np.array([True, False, False])

I still had to to some datetime-like type checking:
Here, the dtype of comps is initially datetime64[ns], but ensure_data makes it int64 (<M8[ns]). The dtype of the items in values is different, so no conversion is enforced.
Hence I do not only check for exact equality of the dtypes (that ensure_data and the set check return) but I additionally use is_time_like().

Some tests are still failing, but if that's the way to go, I'll fix this up

@jdoepfert
Copy link
Contributor Author

Won't have time to work on this for the next 2 months, so if anyone wants to take over, give it a go! The last commit with all tests passing was cd14c56.

@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

closing as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.isin implicitly converts data types
3 participants