-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix numpy boolean subtraction error in Series.diff #28251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Numpy no longer allows subtraction of boolean values. Not sure if Numpy version checking should be done before this code... Note: I haven't actually run this code, feel free to check it, but should be correct. --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-46-3da3b949c6bd> in <module> 1 data = pd.Series([0,-1,-2,-3,-4,-3,-2,-1,0,-1,-1,0,-1,-2,-3,-2,0]) 2 filtered = data.between(-2,0, inclusive = True) ----> 3 filtered.diff() 4 print(filtered) ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in diff(self, periods) 2191 dtype: float64 2192 """ -> 2193 result = algorithms.diff(com.values_from_object(self), periods) 2194 return self._constructor(result, index=self.index).__finalize__(self) 2195 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis) 1817 out_arr[res_indexer] = result 1818 else: -> 1819 out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer] 1820 1821 if is_timedelta: TypeError: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
I copied the data over to the other file, diff tests should be in one file and the boolean test cannot be added to this one as it is not a time series.
As requested...
As far as I know, this should work. It is changed back to the same code it was several tens of commits ago. If it does not work, I will fix it later. If everything checks out, feel free to merge with master. |
|
||
# boolean nan series | ||
s = Series([False, True, nan, False, False]) | ||
result = s.diff() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd I did create a new test previously. There are several reasons it is like this now:
- because @jbrockmendel asked me to
- because it does not make sense to have several tests in different files all testing diff separately
- because the name implies a general diff test, not a time series specific one
- because the code that is tested is all contained in the diff function, so it would be better for it be tested in a contained test, it makes debugging a failed test easier
- the general test for diff is also included in those tests (a normal integer test), so it looks like it was never a time series specific test in the first place
I understand the need for different tests as well, so if you get a consensus of three more people agreeing it should be in different tests, I will gladly change it.
@TomAugspurger @jreback |
Agreed with your fix in the method, and @WillAyd's comment about the test. I'd prefer a focused, new test like def test_diff_bool(self):
s = Series([False, True, True, False, False])
result = s.diff()
expected = Series([nan, True, False, True, False])
assert_series_equal(result, expected) And leaving out the object-dtype Series.diff test. |
I just tested a diff with a combination of datetime values and floats, this will crash as well. As long as there is a combination of integers, floats (nan included) or booleans, the object diff will work. While testing that specific case might not be a good idea, we do not want to introduce a patch that will make booleans from a float/int object series. For that reason I think it is better to include a test for that. @TomAugspurger How would you test that a diff on a combined boolean and float object series results in floats instead of booleans? I am pretty certain that someone doing a diff on [0, 1.0, True] expects [nan, -1.0, 0] instead of [nan, True, False]. The test reflects that. |
I'd like this diff to be as small as possible. I'd only like to fix the original issue: calling Let's leave the rest for another issue. |
Looks like I will be parametrizing and splitting the function into several functions. If there are no bugs, this should do.
I have been fixing the tests as well as there was a need for that from you (among others). I would rather not (re)write any tests, but I get the need for them. The issue is now fixed and the boolean diff test is both parametrized and split from the rest.
I agree. If anyone does not agree with how it was written, it will be possible to change it later. The code works, does what it should do and should probably be merged before 1.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requests should be implemented.
|
||
# boolean nan series | ||
s = Series([False, True, nan, False, False]) | ||
result = s.diff() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is
@ WillAyd Done.
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -211,6 +211,7 @@ Other | |||
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`) | |||
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`) | |||
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`) | |||
- Bug in :meth:`Series.diff` where a boolean series would cause a TypeError (the - operator is deprecated) when using NumPy >= 1.13.0 (:issue:`17294`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I removed the second part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine, small doc comment, CI is currently failing however, so wait a little for that to be updated
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -240,6 +240,7 @@ Other | |||
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`) | |||
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`) | |||
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`) | |||
- Bug in :meth:`Series.diff` where a boolean series would incorrectly raise a TypeError (the - operator is deprecated) (:issue:`17294`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks around TypeError
, I would remove the (the - operator is deprecated)
, not obvious what this is in reference to a casual reader
Changing the description again. The numpy version and 'the - operator is deprecated' are now both removed.
s = Series(input) | ||
result = s.diff() | ||
expected = Series(output) | ||
assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is x - x.shift() also true here (as it is for obj_diff)? can you add that assert as well (do another result= and expected=), otherwise this PR lgtm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is not and will not with the current implementation. We all agreed on this, please re-read the thread if you have the time. @TomAugspurger specifically wanted a boolean Series instead of a list of integers, as shown in the quote below:
I'm having a hard time following, but IMO these two should have the same output
In [3]: pd.Series([True, False, True]).diff() Out[3]: 0 NaN 1 True 2 True dtype: object In [4]: pd.Series([True, False, True]).astype(object).diff() Out[4]: 0 NaN 1 -1 2 1 dtype: objectI think both of those should be
Out[3]
, the same result asnp.diff
with the missing values handled appropriately.
Edit: This is also the reason the diff on a boolean object series is different from a diff on a series with only booleans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fair enough, do we have a followup issue on this? @TomAugspurger ? (which likely will block on a real BoolType EA)
thanks @Unprocessable sorry this took a while. sometimes PRs trigger a lot of discussion about the original issue (which is what happened here). |
No worries, it is good to see a simple question/PR causing a review of the core code. At least it means it was due for some commenting. This is also my first contribution to an open source project, so there is that. Thanks to all for keeping pandas updated and working! |
A new version of the pull request (#27755), the other one was a bit behind. Original text below.
This fixes #17294, for more than three years now NumPy has not allowed the subtraction of boolean series.
TypeError Traceback (most recent call last)
in
1 data = pd.Series([0,-1,-2,-3,-4,-3,-2,-1,0,-1,-1,0,-1,-2,-3,-2,0])
2 filtered = data.between(-2,0, inclusive = True)
----> 3 filtered.diff()
4 print(filtered)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in diff(self, periods)
2191 dtype: float64
2192 """
-> 2193 result = algorithms.diff(com.values_from_object(self), periods)
2194 return self._constructor(result, index=self.index).finalize(self)
2195
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis)
1817 out_arr[res_indexer] = result
1818 else:
-> 1819 out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
1820
1821 if is_timedelta:
TypeError: numpy boolean subtract, the
-
operator, is deprecated, use the bitwise_xor, the^
operator, or the logical_xor function instead.black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff