-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fixes unwanted casting in .isin (GH21804) #21893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updating to 0.23.0
Update 18 May
Revert "22May"
Codecov Report
@@ Coverage Diff @@
## master #21893 +/- ##
==========================================
- Coverage 91.96% 91.96% -0.01%
==========================================
Files 166 166
Lines 50334 50337 +3
==========================================
+ Hits 46292 46293 +1
- Misses 4042 4044 +2
Continue to review full report at Codecov.
|
@pytest.mark.parametrize("comps,values,expected", [ | ||
([1, 2], [1], [True, False]), | ||
([1, 0], [1, 0.5], [True, False]), | ||
([1.0, 0], [1, 0.5], [True, False]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reviewers: Just for reference, here are the new tests.
@KalyanGokhale : Nice refactoring!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gfyoung
pandas/core/algorithms.py
Outdated
values = values.astype('float64', copy=False) | ||
comps = comps.astype('float64', copy=False) | ||
checknull = isna(values).any() | ||
f = lambda x, y: htable.ismember_float64(x, y, checknull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come you were able to remove the try-except
blocks from before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the earlier code only dtype of comps
was being checked for being either an int or float, and then the values
were force casted, which would have necessitated a try-except
Now, both comps
and values
are being explicitly checked to be either int or float before their conversion to int64
or float64
Revised benchmarks after 84be606
|
pandas/core/algorithms.py
Outdated
# faster for larger cases to use np.in1d | ||
f = lambda x, y: htable.ismember_object(x, values) | ||
is_int = lambda x: ((x == np.int64) or (x == int)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, what are you doing? this is way less understandable that before. too many more if/thens here. pls fit this into the existing structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have edited to retain most of the existing structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised asv benchmarks on algorithms after 5416711
before after ratio
[365eac4d] [54167114]
- 2.25±0.01ms 1.87±0.02ms 0.83 algorithms.Hashing.time_series_timedeltas
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for the review - any other edits needed?
Have tried to retain the existing structure and reduced if-blocks, though have removed the redundant try-except blocks. Have also tried to cluster the blocks logically.
doc/source/whatsnew/v0.23.4.txt
Outdated
@@ -31,8 +31,7 @@ Bug Fixes | |||
|
|||
**Conversion** | |||
|
|||
- | |||
- | |||
- Unwanted casting of float to int in :func:`isin` (:issue:`21804`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you be more specific here, as a user I have no idea what this means.
pandas/core/algorithms.py
Outdated
f = lambda x, y: htable.ismember_object(x.astype(object), y.astype(object)) | ||
|
||
comps_types = set(type(v) for v in comps) | ||
values_types = set(type(v) for v in values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are going to great lengths to circument the structure below. This needs to be an extension of the if/then. I don't want to see int_flg
or anything like that. This is also not performant. The reason we check dtypes in this way is to avoid conversions and materialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - done
ASV results on algorithms after 82386c3 are similar to the most recent one 5416711
before after ratio
[537b65cb] [82386c31]
- 2.36±0.02ms 1.92±0.02ms 0.82 algorithms.Hashing.time_series_timedeltas
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
Any other edits needed? Thanks |
@@ -415,33 +417,40 @@ def isin(comps, values): | |||
comps = com._values_from_object(comps) | |||
|
|||
comps, dtype, _ = _ensure_data(comps) | |||
values, _, _ = _ensure_data(values, dtype=dtype) | |||
|
|||
is_time_like = lambda x: (is_datetime_or_timedelta_dtype(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i already indicate that this is not the path forward
this is much slower than the existing
you need to keep it along the current structure
and post benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Thanks - have removed the flags int_flg
/ float_flg
as suggested per your earlier review and posted the benchmarks after each commit.
Re-pasting the benchmarks after the last commit
ASV results on algorithms after 82386c3 are similar to the most recent one 5416711
before after ratio
[537b65cb] [82386c31]
- 2.36±0.02ms 1.92±0.02ms 0.82 algorithms.Hashing.time_series_timedeltas
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
Also, could not get around writing a custom function is_time_like
- had tried it in
9fca52c
Please feel free to close this PR - since I think I am not sure I understand what is the expectation in terms of maintaining the existing structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are many benchmarks
the point is that this code is highly sensistive to changes and requires a lot of benchmark running to avoid regressions
Should I run the complete ASV suite? Earlier I only ran it for algorithms... |
run a battery of tests anything relating to value_counts, unique, factorize |
@jreback you are correct - indeed many benchmarks have worsened (I actually didn't know where to start, so ran the full asv suite). Let me rebase and run the asv again and then will rethink the approach. Current results below (pasting only the ones where it worsened): before after ratio
[537b65cb] [82386c31]
+ 1.46±0ms 15.9±0.08ms 10.86 series_methods.IsIn.time_isin('int64')
+ 9.20ms 78.3ms 8.51 categoricals.Isin.time_isin_categorical('int64')
+ 9.56±0.4ms 79.7±0.4ms 8.33 categoricals.Isin.time_isin_categorical('object')
+ 2.47±0.02ms 15.0±0.1ms 6.05 series_methods.IsIn.time_isin('object')
+ 8.21±0.08ms 20.5±0.2ms 2.50 indexing.MultiIndexing.time_index_slice
+ 2.89±0.01ms 4.49±0.1ms 1.56 binary_ops.Ops.time_frame_comparison(True, 1)
+ 77.3±2ms 120±3ms 1.56 binary_ops.Ops.time_frame_comparison(False, 1)
+ 6.74±0.4ms 9.97±0.4ms 1.48 groupby.Categories.time_groupby_nosort
+ 78.4±0.7ms 115±10ms 1.47 binary_ops.Ops.time_frame_comparison(False, 'default')
+ 2.93±0.02ms 4.29±0.01ms 1.46 binary_ops.Ops.time_frame_add(False, 'default')
+ 2.94±0.05ms 4.06±0.2ms 1.38 binary_ops.Ops.time_frame_add(False, 1)
+ 3.98±0.07ms 5.07±0.2ms 1.28 gil.ParallelRolling.time_rolling('mean')
+ 86.0μs 108μs 1.25 frame_methods.XS.time_frame_xs(0)
+ 2.32±0.05ms 2.87±0.01ms 1.24 rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'float', 'count')
+ 69.6±0.3μs 85.7±0.2μs 1.23 groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
+ 4.40±0.04ms 5.32±0.08ms 1.21 rolling.Methods.time_rolling('Series', 1000, 'float', 'std')
+ 2.49±0.01ms 2.98±0.04ms 1.20 ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db1e0>, True)
+ 12.9ms 15.1ms 1.17 categoricals.CategoricalSlicing.time_getitem_bool_array('non_monotonic')
+ 1.93±0.07ms 2.27±0.1ms 1.17 frame_methods.NSort.time_nsmallest('last')
+ 2.07±0.09ms 2.35±0ms 1.13 algorithms.Hashing.time_series_dates
+ 7.29±0.2μs 8.26μs 1.13 categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
+ 1.94±0.04ms 2.19±0.1ms 1.13 frame_methods.NSort.time_nlargest('all')
+ 57.2±0.5ms 64.5±0.4ms 1.13 groupby.Groups.time_series_groups('object_small')
+ 313ms 347ms 1.11 gil.ParallelReadCSV.time_read_csv('float')
+ 35.7±0.06μs 39.5±1μs 1.11 ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x7f06c40db2f0>, True) |
thanks @KalyanGokhale the code here looks really simple, but quickly get into perf issues. |
Did some further tinkering today, and with the original code for >>> import pandas as pd
>>> import numpy as np
>>> import pandas.core.algorithms as algos
>>> comps=[1,0]
>>> values=[1,0.5]
>>> algos.isin(comps, values)
array([ True, False]) Above is as expected, whereas the result below is not... >>> pd.Series(comps).isin(values)
0 True
1 True
dtype: bool Now thinking about it, its not an isolated casting issue in >>> s = pd.Series([])
>>> s[0]=100
>>> s
0 100
dtype: int64
>>> s[0]=100.589
>>> s
0 100
dtype: int64 (was also thinking of the issue #21881 and related ones) - not sure of whether my thinking is on the correct lines... The current issue can be addressed (probably even improving the perf) - but now seems rather like a band-aid, unless we address the casting issue(s) for Series....(have not yet checked if this was already discussed and the consensus is to not do it for some valid reason...) p.s: |
@KalyanGokhale there are very thorny casting issues involved, so certainly add the other test cases |
can you rebase and update to comments |
@jreback yes - will do over coming few days |
closing as stale. if you'd like to continue, pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff
Results from running asvs on algorithms are: