Skip to content

BUG: Fixes unwanted casting in .isin (GH21804) #21893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 21 commits into from
Closed

BUG: Fixes unwanted casting in .isin (GH21804) #21893

wants to merge 21 commits into from

Conversation

KalyanGokhale
Copy link
Contributor

Results from running asvs on algorithms are:

       before           after         ratio
     [365eac4d]       [ee66578f]
+        1.91±0ms      2.28±0.02ms     1.20  algorithms.Hashing.time_series_timedeltas
-     2.34±0.02ms         1.88±0ms     0.80  algorithms.Hashing.time_series_int

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@codecov
Copy link

codecov bot commented Jul 13, 2018

Codecov Report

Merging #21893 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21893      +/-   ##
==========================================
- Coverage   91.96%   91.96%   -0.01%     
==========================================
  Files         166      166              
  Lines       50334    50337       +3     
==========================================
+ Hits        46292    46293       +1     
- Misses       4042     4044       +2
Flag Coverage Δ
#multiple 90.36% <100%> (-0.01%) ⬇️
#single 42.23% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.4% <100%> (-0.29%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a0b2b9...82386c3. Read the comment docs.

@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Jul 13, 2018
@pytest.mark.parametrize("comps,values,expected", [
([1, 2], [1], [True, False]),
([1, 0], [1, 0.5], [True, False]),
([1.0, 0], [1, 0.5], [True, False]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reviewers: Just for reference, here are the new tests.

@KalyanGokhale : Nice refactoring!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gfyoung

values = values.astype('float64', copy=False)
comps = comps.astype('float64', copy=False)
checknull = isna(values).any()
f = lambda x, y: htable.ismember_float64(x, y, checknull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you were able to remove the try-except blocks from before?

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the earlier code only dtype of comps was being checked for being either an int or float, and then the values were force casted, which would have necessitated a try-except
Now, both comps and values are being explicitly checked to be either int or float before their conversion to int64 or float64

@KalyanGokhale
Copy link
Contributor Author

Revised benchmarks after 84be606

       before           after         ratio
     [365eac4d]       [84be606b]
+        1.82±0ms         2.23±0ms     1.23  algorithms.Hashing.time_series_int
-        2.24±0ms         1.83±0ms     0.82  algorithms.Hashing.time_series_float

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

# faster for larger cases to use np.in1d
f = lambda x, y: htable.ismember_object(x, values)
is_int = lambda x: ((x == np.int64) or (x == int))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, what are you doing? this is way less understandable that before. too many more if/thens here. pls fit this into the existing structure.

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale Jul 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have edited to retain most of the existing structure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised asv benchmarks on algorithms after 5416711

       before           after         ratio
     [365eac4d]       [54167114]
-     2.25±0.01ms      1.87±0.02ms     0.83  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for the review - any other edits needed?
Have tried to retain the existing structure and reduced if-blocks, though have removed the redundant try-except blocks. Have also tried to cluster the blocks logically.

@@ -31,8 +31,7 @@ Bug Fixes

**Conversion**

-
-
- Unwanted casting of float to int in :func:`isin` (:issue:`21804`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be more specific here, as a user I have no idea what this means.

f = lambda x, y: htable.ismember_object(x.astype(object), y.astype(object))

comps_types = set(type(v) for v in comps)
values_types = set(type(v) for v in values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are going to great lengths to circument the structure below. This needs to be an extension of the if/then. I don't want to see int_flg or anything like that. This is also not performant. The reason we check dtypes in this way is to avoid conversions and materialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done
ASV results on algorithms after 82386c3 are similar to the most recent one 5416711

       before           after         ratio
     [537b65cb]       [82386c31]
-     2.36±0.02ms      1.92±0.02ms     0.82  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@KalyanGokhale
Copy link
Contributor Author

Any other edits needed? Thanks

@@ -415,33 +417,40 @@ def isin(comps, values):
comps = com._values_from_object(comps)

comps, dtype, _ = _ensure_data(comps)
values, _, _ = _ensure_data(values, dtype=dtype)

is_time_like = lambda x: (is_datetime_or_timedelta_dtype(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i already indicate that this is not the path forward

this is much slower than the existing

you need to keep it along the current structure

and post benchmarks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Thanks - have removed the flags int_flg / float_flg as suggested per your earlier review and posted the benchmarks after each commit.
Re-pasting the benchmarks after the last commit

ASV results on algorithms after 82386c3 are similar to the most recent one 5416711

       before           after         ratio
     [537b65cb]       [82386c31]
-     2.36±0.02ms      1.92±0.02ms     0.82  algorithms.Hashing.time_series_timedeltas

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Also, could not get around writing a custom function is_time_like - had tried it in
9fca52c

Please feel free to close this PR - since I think I am not sure I understand what is the expectation in terms of maintaining the existing structure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are many benchmarks

the point is that this code is highly sensistive to changes and requires a lot of benchmark running to avoid regressions

@KalyanGokhale
Copy link
Contributor Author

Should I run the complete ASV suite? Earlier I only ran it for algorithms...
If the results are not promising, we can just close this PR. If they are we can re-think the approach - any suggestions are welcome as I think I am stuck in my thought pattern. Thanks

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

run a battery of tests anything relating to value_counts, unique, factorize

@KalyanGokhale
Copy link
Contributor Author

@jreback you are correct - indeed many benchmarks have worsened (I actually didn't know where to start, so ran the full asv suite). Let me rebase and run the asv again and then will rethink the approach.

Current results below (pasting only the ones where it worsened):

       before           after         ratio
     [537b65cb]       [82386c31]
+        1.46±0ms      15.9±0.08ms    10.86  series_methods.IsIn.time_isin('int64')
+          9.20ms           78.3ms     8.51  categoricals.Isin.time_isin_categorical('int64')
+      9.56±0.4ms       79.7±0.4ms     8.33  categoricals.Isin.time_isin_categorical('object')
+     2.47±0.02ms       15.0±0.1ms     6.05  series_methods.IsIn.time_isin('object')
+     8.21±0.08ms       20.5±0.2ms     2.50  indexing.MultiIndexing.time_index_slice
+     2.89±0.01ms       4.49±0.1ms     1.56  binary_ops.Ops.time_frame_comparison(True, 1)
+        77.3±2ms          120±3ms     1.56  binary_ops.Ops.time_frame_comparison(False, 1)
+      6.74±0.4ms       9.97±0.4ms     1.48  groupby.Categories.time_groupby_nosort
+      78.4±0.7ms         115±10ms     1.47  binary_ops.Ops.time_frame_comparison(False, 'default')
+     2.93±0.02ms      4.29±0.01ms     1.46  binary_ops.Ops.time_frame_add(False, 'default')
+     2.94±0.05ms       4.06±0.2ms     1.38  binary_ops.Ops.time_frame_add(False, 1)
+     3.98±0.07ms       5.07±0.2ms     1.28  gil.ParallelRolling.time_rolling('mean')
+          86.0μs            108μs     1.25  frame_methods.XS.time_frame_xs(0)
+     2.32±0.05ms      2.87±0.01ms     1.24  rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'float', 'count')
+      69.6±0.3μs       85.7±0.2μs     1.23  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
+     4.40±0.04ms      5.32±0.08ms     1.21  rolling.Methods.time_rolling('Series', 1000, 'float', 'std')
+     2.49±0.01ms      2.98±0.04ms     1.20  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db1e0>, True)
+          12.9ms           15.1ms     1.17  categoricals.CategoricalSlicing.time_getitem_bool_array('non_monotonic')
+     1.93±0.07ms       2.27±0.1ms     1.17  frame_methods.NSort.time_nsmallest('last')
+     2.07±0.09ms         2.35±0ms     1.13  algorithms.Hashing.time_series_dates
+      7.29±0.2μs           8.26μs     1.13  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
+     1.94±0.04ms       2.19±0.1ms     1.13  frame_methods.NSort.time_nlargest('all')
+      57.2±0.5ms       64.5±0.4ms     1.13  groupby.Groups.time_series_groups('object_small')
+           313ms            347ms     1.11  gil.ParallelReadCSV.time_read_csv('float')
+     35.7±0.06μs         39.5±1μs     1.11  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db2f0>, True)

@jreback
Copy link
Contributor

jreback commented Jul 31, 2018

thanks @KalyanGokhale the code here looks really simple, but quickly get into perf issues.

@KalyanGokhale
Copy link
Contributor Author

@jreback @gfyoung

Did some further tinkering today, and with the original code for .isin the results are as follows:

>>> import pandas as pd
>>> import numpy as np
>>> import pandas.core.algorithms as algos
>>> comps=[1,0]
>>> values=[1,0.5]
>>> algos.isin(comps, values)
array([ True, False])

Above is as expected, whereas the result below is not...

>>> pd.Series(comps).isin(values)
0    True
1    True
dtype: bool

Now thinking about it, its not an isolated casting issue in .isin (which actually seems to be working fine) - but rather a fundamental issue of casting with Series in general or something else upstream?
e.g.

>>> s = pd.Series([])
>>> s[0]=100
>>> s
0    100
dtype: int64
>>> s[0]=100.589
>>> s
0    100
dtype: int64

(was also thinking of the issue #21881 and related ones) - not sure of whether my thinking is on the correct lines...

The current issue can be addressed (probably even improving the perf) - but now seems rather like a band-aid, unless we address the casting issue(s) for Series....(have not yet checked if this was already discussed and the consensus is to not do it for some valid reason...)
Thoughts?

p.s:
This means that the current test cases I had written certainly need to be updated specifically to include Series - though had tested it specifically for Series on command line with the original test case :)

@jreback
Copy link
Contributor

jreback commented Aug 1, 2018

@KalyanGokhale there are very thorny casting issues involved, so certainly add the other test cases

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

can you rebase and update to comments

@KalyanGokhale
Copy link
Contributor Author

can you rebase and update to comments

@jreback yes - will do over coming few days

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

closing as stale. if you'd like to continue, pls ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: unwanted casting in .isin
3 participants