-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
WIP: reduce unwanted type conversions of pd.Series.isin #19508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @jdoepfert! Thanks for updating the PR.
Comment last updated on February 05, 2018 at 01:05 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a release note.
this absolutely needs a performance check. needs to run the asvs for algorithms
pandas/core/algorithms.py
Outdated
@@ -405,7 +406,21 @@ def isin(comps, values): | |||
values = construct_1d_object_array_from_listlike(list(values)) | |||
|
|||
comps, dtype, _ = _ensure_data(comps) | |||
values, _, _ = _ensure_data(values, dtype=dtype) | |||
# Convert `values` to `dtype` if dtype of `values` is like `dtype` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to check all these conditions, rather you can just do
In [14]: l = [1.0, 1]
In [15]: set([type(v) for v in l])
Out[15]: {float, int}
If you have a set of len(1) then you can go ahead and pass the dtype (else you can use object
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thx for the comment, not sure if I got you right though. Imagine I have
pd.Series([1, 0]).isin([1.0, 0.5])
Then the len of the set would be one (I only have floats), but the conversion to int
would still not be desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats a different case which is actually simple. the dtype that ensure_data returns is different from the dtype that the set check returns.
NONE of this checking can be done here, rather it is already done in ensure_data
Codecov Report
@@ Coverage Diff @@
## master #19508 +/- ##
==========================================
- Coverage 91.67% 91.65% -0.02%
==========================================
Files 148 148
Lines 48553 48569 +16
==========================================
+ Hits 44513 44518 +5
- Misses 4040 4051 +11
Continue to review full report at Codecov.
|
pandas/core/algorithms.py
Outdated
@@ -405,7 +406,21 @@ def isin(comps, values): | |||
values = construct_1d_object_array_from_listlike(list(values)) | |||
|
|||
comps, dtype, _ = _ensure_data(comps) | |||
values, _, _ = _ensure_data(values, dtype=dtype) | |||
# Convert `values` to `dtype` if dtype of `values` is like `dtype` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats a different case which is actually simple. the dtype that ensure_data returns is different from the dtype that the set check returns.
NONE of this checking can be done here, rather it is already done in ensure_data
Still not sure I got it. Tried to implement the set check approach, but due to this test (L502 in def test_i8(self):
arr = pd.date_range('20130101', periods=3).values
result = algos.isin(arr, [arr[0]])
expected = np.array([True, False, False]) I still had to to some datetime-like type checking: Some tests are still failing, but if that's the way to go, I'll fix this up |
Won't have time to work on this for the next 2 months, so if anyone wants to take over, give it a go! The last commit with all tests passing was cd14c56. |
closing as stale. |
Addresses the bug in pd.Series.isin(values) that sometimes undesirably converts the dtype of
values
to the the dtype of the series before theisin
comparison. Here is an example wherevalues
is converted toint
, and hence the output(True, True)
is yielded, while(True, False)
is expected:In my understanding, the bug has two sources:
1) L407-408 in
pandas/core/algorithms.py
Here, first the dtype of the series (
comps
) is determined, and thenvalues
is forcefully converted to that dtype. This e.g. leads to the undesired conversion of the example above.2) L417-434 in
pandas/core/algorithms.py
Here, if the dtype of the series (
comps
) is eitherfloat
orint
, a forced conversion ofvalues
takes place again.The fix:
1) L408ff
Only force the dtype of
comps
tovalues
if both are int-like, float-like, or datetime-like. If not, use_ensure_data(values)
without thedtype
argument. I tried to keep the cost of the dtype checks low by exploiting lazy evaluation.2) L432ff
Add
and
statements such that the conversion only takes place when bothcomps
andvalues
are eitherfloat
orint
. If they are none of the above is true, both are converted toobject
type in theelse
block. This ensures thatreturn f(comps, values)
works (sincef = lambda x, y: htable.ismember_object(x, values)
from L426).git diff upstream/master -u -- "*.py" | flake8 --diff