-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix numpy boolean subtraction error in Series.diff #28251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 48 commits
8be97f8
7c982d2
45f9c60
83907a2
6c3875e
ecf1c04
dd8ff98
e123486
269d148
3993206
b610dd9
92075fb
1b90657
182cbc2
3ba0958
47feb30
d425cdf
050b8cb
e028aef
9d947db
b115c9e
865e913
6f8986b
081c2c8
96bb1b4
91ebae5
74a474f
40df511
18a2e35
716a70e
e879dd7
850e315
8d2547a
e6f1890
3231d8f
3e0dbc7
5f101fe
f98cbf2
2127651
a174239
b83135d
b14f7c6
d92d6c7
b96bde8
6a72f2c
030d305
649042f
d1c1755
1e4a7ae
951556e
d1c6b65
8323e71
bbd2550
21defe6
6227564
88cea85
acc8f87
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1915,6 +1915,8 @@ def diff(arr, n, axis=0): | |
dtype = arr.dtype | ||
|
||
is_timedelta = False | ||
is_bool = False | ||
is_obj_bool = False | ||
if needs_i8_conversion(arr): | ||
dtype = np.float64 | ||
arr = arr.view("i8") | ||
|
@@ -1923,10 +1925,15 @@ def diff(arr, n, axis=0): | |
|
||
elif is_bool_dtype(dtype): | ||
dtype = np.object_ | ||
is_bool = True | ||
|
||
elif is_integer_dtype(dtype): | ||
dtype = np.float64 | ||
|
||
elif is_object_dtype(dtype): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not a fan of doing anything anything with object arrays here at all. |
||
if np.all(np.isin(arr, (True, False), False) | isna(arr)): | ||
Unprocessable marked this conversation as resolved.
Show resolved
Hide resolved
|
||
is_obj_bool = True | ||
|
||
dtype = np.dtype(dtype) | ||
out_arr = np.empty(arr.shape, dtype=dtype) | ||
|
||
|
@@ -1962,9 +1969,18 @@ def diff(arr, n, axis=0): | |
result = res - lag | ||
result[mask] = na | ||
out_arr[res_indexer] = result | ||
elif is_bool: | ||
out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer] | ||
else: | ||
out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer] | ||
|
||
if is_obj_bool: | ||
# converting numbers to bool | ||
na_index = isna(out_arr) | ||
out_arr = out_arr.astype(bool).astype(object) | ||
# resetting nan previously converted to True | ||
out_arr[na_index] = np.nan | ||
|
||
if is_timedelta: | ||
out_arr = out_arr.astype("int64").view("timedelta64[ns]") | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,7 @@ | |
from pandas.api.types import is_scalar | ||
from pandas.core.index import MultiIndex | ||
from pandas.core.indexes.datetimes import Timestamp | ||
from pandas.core.indexes.timedeltas import TimedeltaIndex | ||
import pandas.util.testing as tm | ||
from pandas.util.testing import ( | ||
assert_almost_equal, | ||
|
@@ -237,6 +238,60 @@ def test_npdiff(self): | |
r = np.diff(s) | ||
assert_series_equal(Series([nan, 0, 0, 0, nan]), r) | ||
|
||
def test_diff(self): | ||
# Combined datetime diff, normal diff and boolean diff test | ||
ts = tm.makeTimeSeries(name="ts") | ||
ts.diff() | ||
|
||
# int dtype | ||
a = 10000000000000000 | ||
b = a + 1 | ||
s = Series([a, b]) | ||
|
||
result = s.diff() | ||
assert result[1] == 1 | ||
|
||
# neg n | ||
result = ts.diff(-1) | ||
expected = ts - ts.shift(-1) | ||
assert_series_equal(result, expected) | ||
|
||
# 0 | ||
result = ts.diff(0) | ||
expected = ts - ts | ||
assert_series_equal(result, expected) | ||
|
||
# datetime diff (GH3100) | ||
s = Series(date_range("20130102", periods=5)) | ||
result = s - s.shift(1) | ||
expected = s.diff() | ||
assert_series_equal(result, expected) | ||
|
||
# timedelta diff | ||
result = result - result.shift(1) # previous result | ||
expected = expected.diff() # previously expected | ||
assert_series_equal(result, expected) | ||
|
||
# with tz | ||
s = Series( | ||
date_range("2000-01-01 09:00:00", periods=5, tz="US/Eastern"), name="foo" | ||
) | ||
result = s.diff() | ||
expected = Series(TimedeltaIndex(["NaT"] + ["1 days"] * 4), name="foo") | ||
assert_series_equal(result, expected) | ||
|
||
# boolean series (test for fixing #17294) | ||
s = Series([False, True, True, False, False]) | ||
result = s.diff() | ||
expected = Series([nan, True, False, True, False]) | ||
assert_series_equal(result, expected) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is x - x.shift() also true here (as it is for obj_diff)? can you add that assert as well (do another result= and expected=), otherwise this PR lgtm. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, it is not and will not with the current implementation. We all agreed on this, please re-read the thread if you have the time. @TomAugspurger specifically wanted a boolean Series instead of a list of integers, as shown in the quote below:
Edit: This is also the reason the diff on a boolean object series is different from a diff on a series with only booleans. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok fair enough, do we have a followup issue on this? @TomAugspurger ? (which likely will block on a real BoolType EA) |
||
|
||
# boolean nan series | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't agree that this should differ from the previous case. IMO, both of these are boolean dtype. Pandas just doesn't have a nullable boolean dtype (yet) but we generally try to treat a mix of bool and NA the same as bool. So I think that this should also be object dtype, but the values should be booleans instead of integers. Is that difficult to support? I suppose it'll require a scan of the values for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, it should not differ. However, when I previously tested my code I did notice a difference, hence why I added the tests as such. If you are sure that pandas does not treat it any different and testing that assumption is not required now and in the future, I will gladly remove the first test. In regards to your second point, I could definitely change the code to create only a boolean series instead, but why? The same could be done with logic instead of diff. I think the point of diff is to see the difference between the current value and the previous one (e.g. True - False = 1 and False - True = -1). This is also in line with the subtraction done by diff on other non-boolean series and what happens if you subtract booleans in Python. I found this bug looking for ways to detect these changes. Do you not agree or did you mean something different? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm having a hard time following, but IMO these two should have the same output In [3]: pd.Series([True, False, True]).diff()
Out[3]:
0 NaN
1 True
2 True
dtype: object
In [4]: pd.Series([True, False, True]).astype(object).diff()
Out[4]:
0 NaN
1 -1
2 1
dtype: object I think both of those should be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I seem to be missing something, when I run np.diff on the Series it results in this:
Which is pretty close to the current pandas implementation. Do we want to remove the first item in the Series (always np.nan) to make it fit better with np.diff? That would mean changing the other code in the diff function as well. Ignoring that, the results are the same as my code. As the dtypes are the problem, we could either convert the Series to always be of the object dtype or do the same numpy does. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to make sure, I also tested np.diff on a numpy array with the same results:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we want the same behavior as Because of the missing values, we'll always return an object-dtype Series where the values are either NaN, True, or False. cc @jbrockmendel if you have thoughts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It sounds like this would be a lot easier if we had a BoolNA in place. Is that accurate? Since bool-dtype is a thin wrapper over uint8, I wonder how hard it would be to adapt IntegerNA to this purpose (@WillAyd you had a PR about this a while back, right?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this would be easier with BoolNA. I think we want to achieve the BoolNA behavior, but with object dtype today. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. am -1 on doing anything with object dtype as it makes the code much too complex. |
||
s = Series([False, True, nan, False, False]) | ||
result = s.diff() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this routine is way too long can you parameterize There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback Shall I rewrite all other tests in pandas as well? I did not write the tests the way they are, I only added a few lines... I would rather do this in a refactor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
@ WillAyd Done. |
||
expected = Series([nan, 1, nan, nan, 0], dtype="object") | ||
assert_series_equal(result, expected) | ||
|
||
def _check_accum_op(self, name, datetime_series_, check_dtype=True): | ||
func = getattr(np, name) | ||
tm.assert_numpy_array_equal( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug in :meth:
Series.diff
where a boolean series would incorrectly raiseTypeError
(:issue:....)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick question: if people start looking for that bug ('the - operator is deprecated'), how would they know to update pandas if I remove the second part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I removed the second part