Skip to content

BUG: Fix numpy boolean subtraction error in Series.diff #28251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 57 commits into from
Oct 1, 2019

Conversation

Unprocessable
Copy link
Contributor

@Unprocessable Unprocessable commented Sep 2, 2019

A new version of the pull request (#27755), the other one was a bit behind. Original text below.

This fixes #17294, for more than three years now NumPy has not allowed the subtraction of boolean series.

TypeError Traceback (most recent call last)
in
1 data = pd.Series([0,-1,-2,-3,-4,-3,-2,-1,0,-1,-1,0,-1,-2,-3,-2,0])
2 filtered = data.between(-2,0, inclusive = True)
----> 3 filtered.diff()
4 print(filtered)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in diff(self, periods)
2191 dtype: float64
2192 """
-> 2193 result = algorithms.diff(com.values_from_object(self), periods)
2194 return self._constructor(result, index=self.index).finalize(self)
2195

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis)
1817 out_arr[res_indexer] = result
1818 else:
-> 1819 out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
1820
1821 if is_timedelta:

TypeError: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.

Numpy no longer allows subtraction of boolean values. Not sure if Numpy version checking should be done before this code...
Note: I haven't actually run this code, feel free to check it, but should be correct.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-3da3b949c6bd> in <module>
      1 data = pd.Series([0,-1,-2,-3,-4,-3,-2,-1,0,-1,-1,0,-1,-2,-3,-2,0])
      2 filtered = data.between(-2,0, inclusive = True)
----> 3 filtered.diff()
      4 print(filtered)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in diff(self, periods)
   2191         dtype: float64
   2192         """
-> 2193         result = algorithms.diff(com.values_from_object(self), periods)
   2194         return self._constructor(result, index=self.index).__finalize__(self)
   2195 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis)
   1817             out_arr[res_indexer] = result
   1818         else:
-> 1819             out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
   1820 
   1821     if is_timedelta:

TypeError: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
I copied the data over to the other file, diff tests should be in one file and the boolean test cannot be added to this one as it is not a time series.
@Unprocessable
Copy link
Contributor Author

As far as I know, this should work. It is changed back to the same code it was several tens of commits ago. If it does not work, I will fix it later. If everything checks out, feel free to merge with master.


# boolean nan series
s = Series([False, True, nan, False, False])
result = s.diff()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

Copy link
Contributor Author

@Unprocessable Unprocessable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd I did create a new test previously. There are several reasons it is like this now:

  1. because @jbrockmendel asked me to
  2. because it does not make sense to have several tests in different files all testing diff separately
  3. because the name implies a general diff test, not a time series specific one
  4. because the code that is tested is all contained in the diff function, so it would be better for it be tested in a contained test, it makes debugging a failed test easier
  5. the general test for diff is also included in those tests (a normal integer test), so it looks like it was never a time series specific test in the first place

I understand the need for different tests as well, so if you get a consensus of three more people agreeing it should be in different tests, I will gladly change it.

@Unprocessable
Copy link
Contributor Author

@TomAugspurger @jreback
Do you currently agree that we ignore the object nan Series for now and only focus on the boolean-only series? The current commit reflects that change at the moment.

@TomAugspurger
Copy link
Contributor

Agreed with your fix in the method, and @WillAyd's comment about the test. I'd prefer a focused, new test like

def test_diff_bool(self):
        s = Series([False, True, True, False, False])
        result = s.diff()
        expected = Series([nan, True, False, True, False])
        assert_series_equal(result, expected)

And leaving out the object-dtype Series.diff test.

@Unprocessable
Copy link
Contributor Author

Unprocessable commented Sep 17, 2019

And leaving out the object-dtype Series.diff test.

I just tested a diff with a combination of datetime values and floats, this will crash as well. As long as there is a combination of integers, floats (nan included) or booleans, the object diff will work. While testing that specific case might not be a good idea, we do not want to introduce a patch that will make booleans from a float/int object series. For that reason I think it is better to include a test for that.

@TomAugspurger How would you test that a diff on a combined boolean and float object series results in floats instead of booleans? I am pretty certain that someone doing a diff on [0, 1.0, True] expects [nan, -1.0, 0] instead of [nan, True, False]. The test reflects that.

@TomAugspurger
Copy link
Contributor

I'd like this diff to be as small as possible. I'd only like to fix the original issue: calling - on a boolean ndarray.

Let's leave the rest for another issue.

Looks like I will be parametrizing and splitting the function into several functions. If there are no bugs, this should do.
@Unprocessable
Copy link
Contributor Author

Unprocessable commented Sep 24, 2019

I have been fixing the tests as well as there was a need for that from you (among others). I would rather not (re)write any tests, but I get the need for them. The issue is now fixed and the boolean diff test is both parametrized and split from the rest.

Let's leave the rest for another issue.

I agree. If anyone does not agree with how it was written, it will be possible to change it later. The code works, does what it should do and should probably be merged before 1.0.

Copy link
Contributor Author

@Unprocessable Unprocessable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requests should be implemented.


# boolean nan series
s = Series([False, True, nan, False, False])
result = s.diff()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

@ WillAyd Done.

@@ -211,6 +211,7 @@ Other
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`)
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`)
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`)
- Bug in :meth:`Series.diff` where a boolean series would cause a TypeError (the - operator is deprecated) when using NumPy >= 1.13.0 (:issue:`17294`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I removed the second part

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, small doc comment, CI is currently failing however, so wait a little for that to be updated

@@ -240,6 +240,7 @@ Other
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`)
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`)
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`)
- Bug in :meth:`Series.diff` where a boolean series would incorrectly raise a TypeError (the - operator is deprecated) (:issue:`17294`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around TypeError, I would remove the (the - operator is deprecated), not obvious what this is in reference to a casual reader

Changing the description again. The numpy version and 'the - operator is deprecated' are now both removed.
@Unprocessable Unprocessable requested a review from jreback October 1, 2019 09:43
@jreback jreback added this to the 1.0 milestone Oct 1, 2019
s = Series(input)
result = s.diff()
expected = Series(output)
assert_series_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is x - x.shift() also true here (as it is for obj_diff)? can you add that assert as well (do another result= and expected=), otherwise this PR lgtm.

Copy link
Contributor Author

@Unprocessable Unprocessable Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not and will not with the current implementation. We all agreed on this, please re-read the thread if you have the time. @TomAugspurger specifically wanted a boolean Series instead of a list of integers, as shown in the quote below:

I'm having a hard time following, but IMO these two should have the same output

In [3]: pd.Series([True, False, True]).diff()
Out[3]:
0     NaN
1    True
2    True
dtype: object

In [4]: pd.Series([True, False, True]).astype(object).diff()
Out[4]:
0    NaN
1     -1
2      1
dtype: object

I think both of those should be Out[3], the same result as np.diff with the missing values handled appropriately.

Edit: This is also the reason the diff on a boolean object series is different from a diff on a series with only booleans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair enough, do we have a followup issue on this? @TomAugspurger ? (which likely will block on a real BoolType EA)

@jreback jreback merged commit 65815e6 into pandas-dev:master Oct 1, 2019
@jreback
Copy link
Contributor

jreback commented Oct 1, 2019

thanks @Unprocessable

sorry this took a while. sometimes PRs trigger a lot of discussion about the original issue (which is what happened here).

@Unprocessable
Copy link
Contributor Author

Unprocessable commented Oct 1, 2019

No worries, it is good to see a simple question/PR causing a review of the core code. At least it means it was due for some commenting. This is also my first contribution to an open source project, so there is that. Thanks to all for keeping pandas updated and working!

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.Series.diff() on boolean values
7 participants