BUG: Fixes unwanted casting in .isin (GH21804) #21893

KalyanGokhale · 2018-07-13T13:49:03Z

closes BUG: unwanted casting in .isin #21804
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Results from running asvs on algorithms are:

       before           after         ratio
     [365eac4d]       [ee66578f]
+        1.91±0ms      2.28±0.02ms     1.20  algorithms.Hashing.time_series_timedeltas
-     2.34±0.02ms         1.88±0ms     0.80  algorithms.Hashing.time_series_int

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

codecov · 2018-07-13T16:11:48Z

Codecov Report

Merging #21893 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21893      +/-   ##
==========================================
- Coverage   91.96%   91.96%   -0.01%     
==========================================
  Files         166      166              
  Lines       50334    50337       +3     
==========================================
+ Hits        46292    46293       +1     
- Misses       4042     4044       +2

Flag	Coverage Δ
#multiple	`90.36% <100%> (-0.01%)`	⬇️
#single	`42.23% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.4% <100%> (-0.29%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0a0b2b9...82386c3.

gfyoung · 2018-07-13T16:25:44Z

pandas/tests/test_algos.py

+    @pytest.mark.parametrize("comps,values,expected", [
+        ([1, 2], [1], [True, False]),
+        ([1, 0], [1, 0.5], [True, False]),
+        ([1.0, 0], [1, 0.5], [True, False]),


To reviewers: Just for reference, here are the new tests.

@KalyanGokhale : Nice refactoring!

Thanks @gfyoung

gfyoung · 2018-07-13T16:27:00Z

pandas/core/algorithms.py

+                values = values.astype('float64', copy=False)
+                comps = comps.astype('float64', copy=False)
+                checknull = isna(values).any()
+                f = lambda x, y: htable.ismember_float64(x, y, checknull)


How come you were able to remove the try-except blocks from before?

In the earlier code, only the dtype of comps was checked for being int or float, and the values were then force-cast to match, which necessitated the try-except.
Now both comps and values are explicitly checked to be int or float before their conversion to int64 or float64, so the try-except is no longer needed.
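
The idea of checking both sides before converting can be sketched with plain Python lists. This is not the PR's code: `common_kind` and `isin_sketch` are hypothetical names, and the element-wise checks stand in for the array dtype checks that pandas performs. It only illustrates why no try-except is needed once both inputs are known to be numeric before the cast.

```python
def common_kind(comps, values):
    """Return 'int64', 'float64', or 'object' for a pair of plain lists."""
    def kind(seq):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if all(isinstance(v, int) and not isinstance(v, bool) for v in seq):
            return "int64"
        if all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in seq):
            return "float64"
        return "object"

    kinds = {kind(comps), kind(values)}
    if kinds == {"int64"}:
        return "int64"            # both sides are integers
    if kinds <= {"int64", "float64"}:
        return "float64"          # at least one float: compare as floats
    return "object"               # anything else: no numeric cast at all


def isin_sketch(comps, values):
    """Membership test that only casts when both sides are known numeric."""
    kind = common_kind(comps, values)
    if kind == "int64":
        lookup = set(int(v) for v in values)
        return [int(c) in lookup for c in comps]
    if kind == "float64":
        lookup = set(float(v) for v in values)
        return [float(c) in lookup for c in comps]
    # Object fallback: plain list membership, so unhashable items still work.
    lookup_list = list(values)
    return [c in lookup_list for c in comps]
```

With the inputs from the new tests, `isin_sketch([1, 0], [1, 0.5])` compares as floats, so `0` is not matched against `0.5`.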

KalyanGokhale · 2018-07-13T16:46:53Z

Revised benchmarks after 84be606

       before           after         ratio
     [365eac4d]       [84be606b]
+        1.82±0ms         2.23±0ms     1.23  algorithms.Hashing.time_series_int
-        2.24±0ms         1.83±0ms     0.82  algorithms.Hashing.time_series_float

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback · 2018-07-14T15:04:13Z

pandas/core/algorithms.py

-    # faster for larger cases to use np.in1d
-    f = lambda x, y: htable.ismember_object(x, values)
+    is_int = lambda x: ((x == np.int64) or (x == int))
+


woa, what are you doing? this is way less understandable than before. too many more if/thens here. pls fit this into the existing structure.

Have edited to retain most of the existing structure

Revised asv benchmarks on algorithms after 5416711

       before           after         ratio
     [365eac4d]       [54167114]
-     2.25±0.01ms      1.87±0.02ms     0.83  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback thanks for the review - any other edits needed?
Have tried to retain the existing structure and reduced if-blocks, though have removed the redundant try-except blocks. Have also tried to cluster the blocks logically.

jreback · 2018-07-20T12:59:57Z

doc/source/whatsnew/v0.23.4.txt

@@ -31,8 +31,7 @@ Bug Fixes

 **Conversion**

-
-
+- Unwanted casting of float to int in :func:`isin` (:issue:`21804`)


can you be more specific here, as a user I have no idea what this means.

jreback · 2018-07-20T13:01:27Z

pandas/core/algorithms.py

+    f = lambda x, y: htable.ismember_object(x.astype(object), y.astype(object))
+
+    comps_types = set(type(v) for v in comps)
+    values_types = set(type(v) for v in values)


you are going to great lengths to circumvent the structure below. This needs to be an extension of the if/then. I don't want to see int_flg or anything like that. This is also not performant. The reason we check dtypes in this way is to avoid conversions and materialization.
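
The performance point can be illustrated by analogy with the standard-library `array` module, used here in place of NumPy arrays purely so the sketch is self-contained. The helper names are hypothetical. A typed container already knows its element kind, so a dtype-style check is a single attribute lookup, whereas scanning element types visits every item and materializes a Python object for each.

```python
from array import array

INT_CODES = set("bBhHiIlLqQ")   # integer typecodes of array.array
FLOAT_CODES = set("fd")         # floating-point typecodes


def kind_from_typecode(arr):
    """Constant-time check: read the container's type tag only."""
    if arr.typecode in INT_CODES:
        return "int"
    if arr.typecode in FLOAT_CODES:
        return "float"
    return "other"


def kind_from_elements(arr):
    """Linear-time check: build a Python object for every element."""
    types = set(type(v) for v in arr)
    if types <= {int}:
        return "int"
    if types <= {int, float}:
        return "float"
    return "other"


ints = array("q", range(5))
floats = array("d", [0.5, 1.0])
```

Both helpers agree on the answer; only the second one pays a cost proportional to the length of the data.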

Thanks - done
ASV results on algorithms after 82386c3 are similar to the most recent one 5416711

       before           after         ratio
     [537b65cb]       [82386c31]
-     2.36±0.02ms      1.92±0.02ms     0.82  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

KalyanGokhale · 2018-07-28T17:05:03Z

Any other edits needed? Thanks

jreback · 2018-07-28T17:38:52Z

pandas/core/algorithms.py

@@ -415,33 +417,40 @@ def isin(comps, values):
    comps = com._values_from_object(comps)

    comps, dtype, _ = _ensure_data(comps)
-    values, _, _ = _ensure_data(values, dtype=dtype)
+
+    is_time_like = lambda x: (is_datetime_or_timedelta_dtype(x)


i already indicated that this is not the path forward

this is much slower than the existing

you need to keep it along the current structure

and post benchmarks

@jreback Thanks - have removed the flags int_flg / float_flg as suggested per your earlier review and posted the benchmarks after each commit.
Re-pasting the benchmarks after the last commit

ASV results on algorithms after 82386c3 are similar to the most recent one 5416711

       before           after         ratio
     [537b65cb]       [82386c31]
-     2.36±0.02ms      1.92±0.02ms     0.82  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Also, could not get around writing a custom function is_time_like - had tried it in 9fca52c

Please feel free to close this PR, since I am not sure I understand what the expectation is in terms of maintaining the existing structure

there are many benchmarks

the point is that this code is highly sensitive to changes and requires a lot of benchmark running to avoid regressions

KalyanGokhale · 2018-07-28T18:30:27Z

Should I run the complete ASV suite? Earlier I only ran it for algorithms...
If the results are not promising, we can just close this PR. If they are, we can rethink the approach - any suggestions are welcome, as I think I am stuck in my thought pattern. Thanks

jreback · 2018-07-28T19:15:18Z

run a battery of tests: anything relating to value_counts, unique, factorize

KalyanGokhale · 2018-07-30T16:17:29Z

@jreback you are correct - indeed many benchmarks have worsened (I actually didn't know where to start, so ran the full asv suite). Let me rebase and run the asv again and then will rethink the approach.

Current results below (pasting only the ones where it worsened):

       before           after         ratio
     [537b65cb]       [82386c31]
+        1.46±0ms      15.9±0.08ms    10.86  series_methods.IsIn.time_isin('int64')
+          9.20ms           78.3ms     8.51  categoricals.Isin.time_isin_categorical('int64')
+      9.56±0.4ms       79.7±0.4ms     8.33  categoricals.Isin.time_isin_categorical('object')
+     2.47±0.02ms       15.0±0.1ms     6.05  series_methods.IsIn.time_isin('object')
+     8.21±0.08ms       20.5±0.2ms     2.50  indexing.MultiIndexing.time_index_slice
+     2.89±0.01ms       4.49±0.1ms     1.56  binary_ops.Ops.time_frame_comparison(True, 1)
+        77.3±2ms          120±3ms     1.56  binary_ops.Ops.time_frame_comparison(False, 1)
+      6.74±0.4ms       9.97±0.4ms     1.48  groupby.Categories.time_groupby_nosort
+      78.4±0.7ms         115±10ms     1.47  binary_ops.Ops.time_frame_comparison(False, 'default')
+     2.93±0.02ms      4.29±0.01ms     1.46  binary_ops.Ops.time_frame_add(False, 'default')
+     2.94±0.05ms       4.06±0.2ms     1.38  binary_ops.Ops.time_frame_add(False, 1)
+     3.98±0.07ms       5.07±0.2ms     1.28  gil.ParallelRolling.time_rolling('mean')
+          86.0μs            108μs     1.25  frame_methods.XS.time_frame_xs(0)
+     2.32±0.05ms      2.87±0.01ms     1.24  rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'float', 'count')
+      69.6±0.3μs       85.7±0.2μs     1.23  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
+     4.40±0.04ms      5.32±0.08ms     1.21  rolling.Methods.time_rolling('Series', 1000, 'float', 'std')
+     2.49±0.01ms      2.98±0.04ms     1.20  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db1e0>, True)
+          12.9ms           15.1ms     1.17  categoricals.CategoricalSlicing.time_getitem_bool_array('non_monotonic')
+     1.93±0.07ms       2.27±0.1ms     1.17  frame_methods.NSort.time_nsmallest('last')
+     2.07±0.09ms         2.35±0ms     1.13  algorithms.Hashing.time_series_dates
+      7.29±0.2μs           8.26μs     1.13  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
+     1.94±0.04ms       2.19±0.1ms     1.13  frame_methods.NSort.time_nlargest('all')
+      57.2±0.5ms       64.5±0.4ms     1.13  groupby.Groups.time_series_groups('object_small')
+           313ms            347ms     1.11  gil.ParallelReadCSV.time_read_csv('float')
+     35.7±0.06μs         39.5±1μs     1.11  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db2f0>, True)
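
For reading tables like the one above: the ratio column is the after timing divided by the before timing, a leading + marks a slowdown and a leading - marks a speedup. A minimal sketch of that classification follows. The function name is hypothetical, and the threshold of 1.1 is an assumption (it is consistent with the smallest ratio listed above, 1.11, but the factor actually used for this run is not stated in the thread).

```python
def asv_style_row(before_ms, after_ms, factor=1.1):
    """Classify one benchmark the way the tables in this thread read.

    ratio = after / before; '+' means slower than the threshold,
    '-' means faster than the inverse threshold, ' ' means unchanged.
    The default factor of 1.1 is an assumption, not taken from the PR.
    """
    ratio = after_ms / before_ms
    if ratio > factor:
        mark = "+"
    elif ratio < 1 / factor:
        mark = "-"
    else:
        mark = " "
    return mark, round(ratio, 2)
```

Recomputing a ratio from the printed timings can differ slightly from the printed ratio, since the table shows rounded values.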

jreback · 2018-07-31T13:23:42Z

thanks @KalyanGokhale the code here looks really simple, but quickly gets into perf issues.

KalyanGokhale · 2018-07-31T16:05:21Z

@jreback @gfyoung

Did some further tinkering today, and with the original code for .isin the results are as follows:

>>> import pandas as pd
>>> import numpy as np
>>> import pandas.core.algorithms as algos
>>> comps=[1,0]
>>> values=[1,0.5]
>>> algos.isin(comps, values)
array([ True, False])

Above is as expected, whereas the result below is not...

>>> pd.Series(comps).isin(values)
0    True
1    True
dtype: bool

Now thinking about it, it's not an isolated casting issue in .isin (which actually seems to be working fine) - but rather a fundamental issue of casting with Series in general, or something else upstream?
e.g.

>>> s = pd.Series([])
>>> s[0]=100
>>> s
0    100
dtype: int64
>>> s[0]=100.589
>>> s
0    100
dtype: int64

(was also thinking of the issue #21881 and related ones) - not sure of whether my thinking is on the correct lines...

The current issue can be addressed (probably even improving the perf) - but now seems rather like a band-aid, unless we address the casting issue(s) for Series....(have not yet checked if this was already discussed and the consensus is to not do it for some valid reason...)
Thoughts?

p.s:
This means that the current test cases I had written certainly need to be updated specifically to include Series - though had tested it specifically for Series on command line with the original test case :)
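
One plausible reading of the Series result, sketched in plain Python rather than pandas: the pre-PR isin shown in the diff earlier passes the dtype of comps when preparing values, so for integer comps the lookup values would be cast to integers and 0.5 would be truncated to 0. The snippet below only demonstrates that truncation effect; it does not call pandas and is not a trace of its actual code path.

```python
comps = [1, 0]
values = [1, 0.5]

# Casting the lookup values to the integer type of comps truncates 0.5 to 0,
# so 0 appears to be a member.
values_as_int = [int(v) for v in values]
cast_result = [c in set(values_as_int) for c in comps]

# Comparing as floats instead keeps 0.5 distinct from 0.
values_as_float = [float(v) for v in values]
float_result = [float(c) in set(values_as_float) for c in comps]

print(values_as_int)   # [1, 0]
print(cast_result)     # [True, True]
print(float_result)    # [True, False]
```

The cast result matches the unexpected Series output reported above, and the float comparison matches the expected one.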

jreback · 2018-08-01T22:20:02Z

@KalyanGokhale there are very thorny casting issues involved, so certainly add the other test cases

jreback · 2018-09-25T16:51:04Z

can you rebase and update to comments

KalyanGokhale · 2018-09-27T15:50:51Z

can you rebase and update to comments

@jreback yes - will do over the coming few days

jreback · 2018-11-23T03:31:06Z

closing as stale. if you'd like to continue, pls ping.

KalyanGokhale added 12 commits May 17, 2018 22:55

Merge pull request #1 from pandas-dev/master

d0c7ebc

Updating to 0.23.0

Merge pull request #3 from pandas-dev/master

143566a

Update 18 May

Merge pull request #4 from pandas-dev/master

dd60b4e

22May

Revert "22May"

3209172

Merge pull request #5 from KalyanGokhale/revert-4-master

18751ba

Revert "22May"

Merge pull request #6 from pandas-dev/master

031616d

26MAY18

For Rebasing

9d92a76

Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…

d8b5242

…ndas-dev-master

Merge branch 'pandas-dev-master'

050f80a

Update 13Jul18

3aa8561

Initial commit

ee66578

Replaced list with generator

84be606

gfyoung added the Bug and Dtype Conversions (Unexpected or buggy dtype conversions) labels Jul 13, 2018

gfyoung reviewed Jul 13, 2018

jreback requested changes Jul 14, 2018

KalyanGokhale added 5 commits July 15, 2018 08:40

Code restructure

f8cc271

Code restructing, included missing element

b71dad6

Cleaned removed ifs for int_flg and float_flg

5416711

Using existing type check functions

9fca52c

Revert "Using existing type check functions"

dd37f9c

This reverts commit 9fca52c.

jreback requested changes Jul 20, 2018

KalyanGokhale added 4 commits July 20, 2018 21:35

Remove flgs and retain try-except

52f8131

Remove flgs

63a5f15

Remove flg, fix whatsnew

2e0bb49

Whatsnew updated

82386c3

jreback requested changes Jul 28, 2018

jreback closed this Nov 23, 2018

KalyanGokhale deleted the ISINv3 branch November 25, 2018 05:50

avinashpancham mentioned this pull request Nov 15, 2020

BUG: Prevent Series.isin from unwantedly casting isin values from float to integer (GH21804) #37861

Closed

5 tasks
