Skip to content

BUG: Fix numpy boolean subtraction error in Series.diff #28251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 57 commits into from
Oct 1, 2019
Merged
Show file tree
Hide file tree
Changes from 48 commits
Commits
Show all changes
57 commits
Select commit Hold shift + click to select a range
8be97f8
Fix numpy boolean subtraction error in Series.diff
Unprocessable Aug 5, 2019
7c982d2
Added is_bool outside if function
Unprocessable Aug 5, 2019
45f9c60
Added test_diff function
Unprocessable Aug 5, 2019
83907a2
Update test_diff.py
Unprocessable Aug 5, 2019
6c3875e
Changing style to match PEP
Unprocessable Aug 5, 2019
ecf1c04
Update test_diff.py
Unprocessable Aug 5, 2019
dd8ff98
Using tm to assert series differences
Unprocessable Aug 5, 2019
e123486
Added diff bug description to changelog
Unprocessable Aug 6, 2019
269d148
Added issue number
Unprocessable Aug 6, 2019
3993206
Adding timeseries tests as well
Unprocessable Aug 6, 2019
b610dd9
Removed test_diff, it is added to test_diff.py
Unprocessable Aug 6, 2019
92075fb
Removed too much code, importing nan again
Unprocessable Aug 6, 2019
1b90657
STYLE: formatting...
Unprocessable Aug 6, 2019
182cbc2
Splitting into two functions
Unprocessable Aug 6, 2019
3ba0958
Split into 3 functions
Unprocessable Aug 6, 2019
47feb30
Update test_diff.py
Unprocessable Aug 6, 2019
d425cdf
Adding nan test
Unprocessable Aug 6, 2019
050b8cb
STYLE: Black formatting
Unprocessable Aug 6, 2019
e028aef
Import order changed
Unprocessable Aug 6, 2019
9d947db
Update test_diff.py
Unprocessable Aug 7, 2019
b115c9e
Update test_diff.py
Unprocessable Aug 7, 2019
865e913
Moving test_diff.py to test_analytics.py
Unprocessable Aug 10, 2019
6f8986b
Removed test_diff.py, added to test_analytics
Unprocessable Aug 10, 2019
081c2c8
Removing tailing whitespaces
Unprocessable Aug 10, 2019
96bb1b4
STYLE: Black formatting
Unprocessable Aug 10, 2019
91ebae5
Update test_analytics.py
Unprocessable Aug 10, 2019
74a474f
Update test_analytics.py
Unprocessable Aug 10, 2019
40df511
Removing self.ts from code
Unprocessable Aug 10, 2019
18a2e35
Update test_analytics.py
Unprocessable Aug 10, 2019
716a70e
Adding tm.makeTimeSeries, as in the original
Unprocessable Aug 10, 2019
e879dd7
STYLE: Black
Unprocessable Aug 10, 2019
850e315
Adding updates
Sep 2, 2019
8d2547a
BUG: Fix numpy boolean subtraction error in Series.diff (#27755)
Sep 2, 2019
e6f1890
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 2, 2019
3231d8f
Resolved merge conflict added changes to older whatsnew.
Sep 2, 2019
3e0dbc7
Update v0.25.2.rst
Unprocessable Sep 2, 2019
5f101fe
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 2, 2019
f98cbf2
Update v0.25.1.rst
Unprocessable Sep 2, 2019
2127651
Refactored code and added version number
Sep 3, 2019
a174239
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 3, 2019
b83135d
Fixing typo
Sep 3, 2019
b14f7c6
Moving comment
Sep 3, 2019
d92d6c7
Removed a space
Sep 3, 2019
b96bde8
Fixed typo
Sep 3, 2019
6a72f2c
Added spaces
Sep 3, 2019
030d305
Fixed typo
Sep 3, 2019
649042f
Adding requirements
Sep 9, 2019
d1c1755
Updating
Sep 10, 2019
1e4a7ae
Updating documentation
Sep 11, 2019
951556e
Merge branch 'master' into patch-2
Unprocessable Sep 11, 2019
d1c6b65
Adding object series test
Sep 12, 2019
8323e71
Merging with master
Sep 13, 2019
bbd2550
Removing the extra object NaN code
Unprocessable Sep 15, 2019
21defe6
Adding requirements
Unprocessable Sep 23, 2019
6227564
BLACK formatting
Unprocessable Sep 23, 2019
88cea85
Adding comma
Unprocessable Sep 24, 2019
acc8f87
Changing bug description
Unprocessable Sep 30, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -211,6 +211,7 @@ Other
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`)
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`)
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`)
- Bug in :meth:`Series.diff` where a boolean series would cause a TypeError (the - operator is deprecated) when using NumPy >= 1.13.0 (:issue:`17294`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug in :meth:Series.diff where a boolean series would incorrectly raise TypeError (:issue:....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: if people start looking for that bug ('the - operator is deprecated'), how would they know to update pandas if I remove the second part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I removed the second part



.. _whatsnew_1000.contributors:
Expand Down
16 changes: 16 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1915,6 +1915,8 @@ def diff(arr, n, axis=0):
dtype = arr.dtype

is_timedelta = False
is_bool = False
is_obj_bool = False
if needs_i8_conversion(arr):
dtype = np.float64
arr = arr.view("i8")
Expand All @@ -1923,10 +1925,15 @@ def diff(arr, n, axis=0):

elif is_bool_dtype(dtype):
dtype = np.object_
is_bool = True

elif is_integer_dtype(dtype):
dtype = np.float64

elif is_object_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of doing anything anything with object arrays here at all.

if np.all(np.isin(arr, (True, False), False) | isna(arr)):
is_obj_bool = True

dtype = np.dtype(dtype)
out_arr = np.empty(arr.shape, dtype=dtype)

Expand Down Expand Up @@ -1962,9 +1969,18 @@ def diff(arr, n, axis=0):
result = res - lag
result[mask] = na
out_arr[res_indexer] = result
elif is_bool:
out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]
else:
out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

if is_obj_bool:
# converting numbers to bool
na_index = isna(out_arr)
out_arr = out_arr.astype(bool).astype(object)
# resetting nan previously converted to True
out_arr[na_index] = np.nan

if is_timedelta:
out_arr = out_arr.astype("int64").view("timedelta64[ns]")

Expand Down
55 changes: 55 additions & 0 deletions pandas/tests/series/test_analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
from pandas.api.types import is_scalar
from pandas.core.index import MultiIndex
from pandas.core.indexes.datetimes import Timestamp
from pandas.core.indexes.timedeltas import TimedeltaIndex
import pandas.util.testing as tm
from pandas.util.testing import (
assert_almost_equal,
Expand Down Expand Up @@ -237,6 +238,60 @@ def test_npdiff(self):
r = np.diff(s)
assert_series_equal(Series([nan, 0, 0, 0, nan]), r)

def test_diff(self):
# Combined datetime diff, normal diff and boolean diff test
ts = tm.makeTimeSeries(name="ts")
ts.diff()

# int dtype
a = 10000000000000000
b = a + 1
s = Series([a, b])

result = s.diff()
assert result[1] == 1

# neg n
result = ts.diff(-1)
expected = ts - ts.shift(-1)
assert_series_equal(result, expected)

# 0
result = ts.diff(0)
expected = ts - ts
assert_series_equal(result, expected)

# datetime diff (GH3100)
s = Series(date_range("20130102", periods=5))
result = s - s.shift(1)
expected = s.diff()
assert_series_equal(result, expected)

# timedelta diff
result = result - result.shift(1) # previous result
expected = expected.diff() # previously expected
assert_series_equal(result, expected)

# with tz
s = Series(
date_range("2000-01-01 09:00:00", periods=5, tz="US/Eastern"), name="foo"
)
result = s.diff()
expected = Series(TimedeltaIndex(["NaT"] + ["1 days"] * 4), name="foo")
assert_series_equal(result, expected)

# boolean series (test for fixing #17294)
s = Series([False, True, True, False, False])
result = s.diff()
expected = Series([nan, True, False, True, False])
assert_series_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is x - x.shift() also true here (as it is for obj_diff)? can you add that assert as well (do another result= and expected=), otherwise this PR lgtm.

Copy link
Contributor Author

@Unprocessable Unprocessable Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not and will not with the current implementation. We all agreed on this, please re-read the thread if you have the time. @TomAugspurger specifically wanted a boolean Series instead of a list of integers, as shown in the quote below:

I'm having a hard time following, but IMO these two should have the same output

In [3]: pd.Series([True, False, True]).diff()
Out[3]:
0     NaN
1    True
2    True
dtype: object

In [4]: pd.Series([True, False, True]).astype(object).diff()
Out[4]:
0    NaN
1     -1
2      1
dtype: object

I think both of those should be Out[3], the same result as np.diff with the missing values handled appropriately.

Edit: This is also the reason the diff on a boolean object series is different from a diff on a series with only booleans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair enough, do we have a followup issue on this? @TomAugspurger ? (which likely will block on a real BoolType EA)


# boolean nan series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree that this should differ from the previous case. IMO, both of these are boolean dtype. Pandas just doesn't have a nullable boolean dtype (yet) but we generally try to treat a mix of bool and NA the same as bool.

So I think that this should also be object dtype, but the values should be booleans instead of integers. Is that difficult to support? I suppose it'll require a scan of the values for the object-dtype case, to ensure that everything is boolean or NA?

Copy link
Contributor Author

@Unprocessable Unprocessable Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it should not differ. However, when I previously tested my code I did notice a difference, hence why I added the tests as such. If you are sure that pandas does not treat it any different and testing that assumption is not required now and in the future, I will gladly remove the first test.

In regards to your second point, I could definitely change the code to create only a boolean series instead, but why? The same could be done with logic instead of diff. I think the point of diff is to see the difference between the current value and the previous one (e.g. True - False = 1 and False - True = -1). This is also in line with the subtraction done by diff on other non-boolean series and what happens if you subtract booleans in Python. I found this bug looking for ways to detect these changes. Do you not agree or did you mean something different?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time following, but IMO these two should have the same output

In [3]: pd.Series([True, False, True]).diff()
Out[3]:
0     NaN
1    True
2    True
dtype: object

In [4]: pd.Series([True, False, True]).astype(object).diff()
Out[4]:
0    NaN
1     -1
2      1
dtype: object

I think both of those should be Out[3], the same result as np.diff with the missing values handled appropriately.

Copy link
Contributor Author

@Unprocessable Unprocessable Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to be missing something, when I run np.diff on the Series it results in this:

In [2]: np.diff(pd.Series([False, True]))
Out[2]: array([ True])

In [3]: np.diff(pd.Series([np.nan, False, True]))
Out[3]: array([nan, 1], dtype=object)

Which is pretty close to the current pandas implementation. Do we want to remove the first item in the Series (always np.nan) to make it fit better with np.diff? That would mean changing the other code in the diff function as well. Ignoring that, the results are the same as my code. As the dtypes are the problem, we could either convert the Series to always be of the object dtype or do the same numpy does.

Copy link
Contributor Author

@Unprocessable Unprocessable Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, I also tested np.diff on a numpy array with the same results:

In [4]: np.diff(np.array([True, False]))  # dtype = bool
Out[4]: array([ True])

In [5]: np.diff(np.array([True, False, np.nan]))  # dtype = object
Out[5]: array([-1., nan])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want the same behavior as np.diff on a boolean-dtype ndarray.

Because of the missing values, we'll always return an object-dtype Series where the values are either NaN, True, or False.

cc @jbrockmendel if you have thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this would be a lot easier if we had a BoolNA in place. Is that accurate? Since bool-dtype is a thin wrapper over uint8, I wonder how hard it would be to adapt IntegerNA to this purpose (@WillAyd you had a PR about this a while back, right?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long time ago. @jreback had something more modern in #25415

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be easier with BoolNA.

I think we want to achieve the BoolNA behavior, but with object dtype today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am -1 on doing anything with object dtype as it makes the code much too complex.

s = Series([False, True, nan, False, False])
result = s.diff()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this routine is way too long can you parameterize

Copy link
Contributor Author

@Unprocessable Unprocessable Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Shall I rewrite all other tests in pandas as well? I did not write the tests the way they are, I only added a few lines... I would rather do this in a refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

@ WillAyd Done.

expected = Series([nan, 1, nan, nan, 0], dtype="object")
assert_series_equal(result, expected)

def _check_accum_op(self, name, datetime_series_, check_dtype=True):
func = getattr(np, name)
tm.assert_numpy_array_equal(
Expand Down
42 changes: 0 additions & 42 deletions pandas/tests/series/test_timeseries.py
Original file line number Diff line number Diff line change
Expand Up @@ -355,48 +355,6 @@ def test_asfreq_datetimeindex_empty_series(self):
)
tm.assert_index_equal(expected.index, result.index)

def test_diff(self):
# Just run the function
self.ts.diff()

# int dtype
a = 10000000000000000
b = a + 1
s = Series([a, b])

rs = s.diff()
assert rs[1] == 1

# neg n
rs = self.ts.diff(-1)
xp = self.ts - self.ts.shift(-1)
assert_series_equal(rs, xp)

# 0
rs = self.ts.diff(0)
xp = self.ts - self.ts
assert_series_equal(rs, xp)

# datetime diff (GH3100)
s = Series(date_range("20130102", periods=5))
rs = s - s.shift(1)
xp = s.diff()
assert_series_equal(rs, xp)

# timedelta diff
nrs = rs - rs.shift(1)
nxp = xp.diff()
assert_series_equal(nrs, nxp)

# with tz
s = Series(
date_range("2000-01-01 09:00:00", periods=5, tz="US/Eastern"), name="foo"
)
result = s.diff()
assert_series_equal(
result, Series(TimedeltaIndex(["NaT"] + ["1 days"] * 4), name="foo")
)

def test_pct_change(self):
rs = self.ts.pct_change(fill_method=None)
assert_series_equal(rs, self.ts / self.ts.shift(1) - 1)
Expand Down