Skip to content

BUG: Fix numpy boolean subtraction error in Series.diff #28251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 57 commits into from
Oct 1, 2019
Merged
Show file tree
Hide file tree
Changes from 56 commits
Commits
Show all changes
57 commits
Select commit Hold shift + click to select a range
8be97f8
Fix numpy boolean subtraction error in Series.diff
Unprocessable Aug 5, 2019
7c982d2
Added is_bool outside if function
Unprocessable Aug 5, 2019
45f9c60
Added test_diff function
Unprocessable Aug 5, 2019
83907a2
Update test_diff.py
Unprocessable Aug 5, 2019
6c3875e
Changing style to match PEP
Unprocessable Aug 5, 2019
ecf1c04
Update test_diff.py
Unprocessable Aug 5, 2019
dd8ff98
Using tm to assert series differences
Unprocessable Aug 5, 2019
e123486
Added diff bug description to changelog
Unprocessable Aug 6, 2019
269d148
Added issue number
Unprocessable Aug 6, 2019
3993206
Adding timeseries tests as well
Unprocessable Aug 6, 2019
b610dd9
Removed test_diff, it is added to test_diff.py
Unprocessable Aug 6, 2019
92075fb
Removed too much code, importing nan again
Unprocessable Aug 6, 2019
1b90657
STYLE: formatting...
Unprocessable Aug 6, 2019
182cbc2
Splitting into two functions
Unprocessable Aug 6, 2019
3ba0958
Split into 3 functions
Unprocessable Aug 6, 2019
47feb30
Update test_diff.py
Unprocessable Aug 6, 2019
d425cdf
Adding nan test
Unprocessable Aug 6, 2019
050b8cb
STYLE: Black formatting
Unprocessable Aug 6, 2019
e028aef
Import order changed
Unprocessable Aug 6, 2019
9d947db
Update test_diff.py
Unprocessable Aug 7, 2019
b115c9e
Update test_diff.py
Unprocessable Aug 7, 2019
865e913
Moving test_diff.py to test_analytics.py
Unprocessable Aug 10, 2019
6f8986b
Removed test_diff.py, added to test_analytics
Unprocessable Aug 10, 2019
081c2c8
Removing tailing whitespaces
Unprocessable Aug 10, 2019
96bb1b4
STYLE: Black formatting
Unprocessable Aug 10, 2019
91ebae5
Update test_analytics.py
Unprocessable Aug 10, 2019
74a474f
Update test_analytics.py
Unprocessable Aug 10, 2019
40df511
Removing self.ts from code
Unprocessable Aug 10, 2019
18a2e35
Update test_analytics.py
Unprocessable Aug 10, 2019
716a70e
Adding tm.makeTimeSeries, as in the original
Unprocessable Aug 10, 2019
e879dd7
STYLE: Black
Unprocessable Aug 10, 2019
850e315
Adding updates
Sep 2, 2019
8d2547a
BUG: Fix numpy boolean subtraction error in Series.diff (#27755)
Sep 2, 2019
e6f1890
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 2, 2019
3231d8f
Resolved merge conflict added changes to older whatsnew.
Sep 2, 2019
3e0dbc7
Update v0.25.2.rst
Unprocessable Sep 2, 2019
5f101fe
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 2, 2019
f98cbf2
Update v0.25.1.rst
Unprocessable Sep 2, 2019
2127651
Refactored code and added version number
Sep 3, 2019
a174239
Merge branch 'patch-2' of https://github.com/Unprocessable/pandas int…
Sep 3, 2019
b83135d
Fixing typo
Sep 3, 2019
b14f7c6
Moving comment
Sep 3, 2019
d92d6c7
Removed a space
Sep 3, 2019
b96bde8
Fixed typo
Sep 3, 2019
6a72f2c
Added spaces
Sep 3, 2019
030d305
Fixed typo
Sep 3, 2019
649042f
Adding requirements
Sep 9, 2019
d1c1755
Updating
Sep 10, 2019
1e4a7ae
Updating documentation
Sep 11, 2019
951556e
Merge branch 'master' into patch-2
Unprocessable Sep 11, 2019
d1c6b65
Adding object series test
Sep 12, 2019
8323e71
Merging with master
Sep 13, 2019
bbd2550
Removing the extra object NaN code
Unprocessable Sep 15, 2019
21defe6
Adding requirements
Unprocessable Sep 23, 2019
6227564
BLACK formatting
Unprocessable Sep 23, 2019
88cea85
Adding comma
Unprocessable Sep 24, 2019
acc8f87
Changing bug description
Unprocessable Sep 30, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -240,6 +240,7 @@ Other
- Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`)
- Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`)
- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:`26023`)
- Bug in :meth:`Series.diff` where a boolean series would incorrectly raise a TypeError (the - operator is deprecated) (:issue:`17294`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around TypeError, I would remove the (the - operator is deprecated), not obvious what this is in reference to a casual reader

- :meth:`Series.append` will no longer raise a ``TypeError`` when passed a tuple of ``Series`` (:issue:`28410`)

.. _whatsnew_1000.contributors:
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1915,6 +1915,7 @@ def diff(arr, n, axis=0):
dtype = arr.dtype

is_timedelta = False
is_bool = False
if needs_i8_conversion(arr):
dtype = np.float64
arr = arr.view("i8")
Expand All @@ -1923,6 +1924,7 @@ def diff(arr, n, axis=0):

elif is_bool_dtype(dtype):
dtype = np.object_
is_bool = True

elif is_integer_dtype(dtype):
dtype = np.float64
Expand Down Expand Up @@ -1962,6 +1964,8 @@ def diff(arr, n, axis=0):
result = res - lag
result[mask] = na
out_arr[res_indexer] = result
elif is_bool:
out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]
else:
out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

Expand Down
64 changes: 63 additions & 1 deletion pandas/tests/series/test_analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
from pandas.api.types import is_scalar
from pandas.core.index import MultiIndex
from pandas.core.indexes.datetimes import Timestamp
from pandas.core.indexes.timedeltas import TimedeltaIndex
import pandas.util.testing as tm
from pandas.util.testing import (
assert_almost_equal,
Expand Down Expand Up @@ -228,7 +229,7 @@ def test_cummax_timedelta64(self):
result = s.cummax(skipna=False)
tm.assert_series_equal(expected, result)

def test_npdiff(self):
def test_np_diff(self):
pytest.skip("skipping due to Series no longer being an ndarray")

# no longer works as the return type of np.diff is now nd.array
Expand All @@ -237,6 +238,67 @@ def test_npdiff(self):
r = np.diff(s)
assert_series_equal(Series([nan, 0, 0, 0, nan]), r)

def test_int_diff(self):
# int dtype
a = 10000000000000000
b = a + 1
s = Series([a, b])

result = s.diff()
assert result[1] == 1

def test_tz_diff(self):
# Combined datetime diff, normal diff and boolean diff test
ts = tm.makeTimeSeries(name="ts")
ts.diff()

# neg n
result = ts.diff(-1)
expected = ts - ts.shift(-1)
assert_series_equal(result, expected)

# 0
result = ts.diff(0)
expected = ts - ts
assert_series_equal(result, expected)

# datetime diff (GH3100)
s = Series(date_range("20130102", periods=5))
result = s.diff()
expected = s - s.shift(1)
assert_series_equal(result, expected)

# timedelta diff
result = result - result.shift(1) # previous result
expected = expected.diff() # previously expected
assert_series_equal(result, expected)

# with tz
s = Series(
date_range("2000-01-01 09:00:00", periods=5, tz="US/Eastern"), name="foo"
)
result = s.diff()
expected = Series(TimedeltaIndex(["NaT"] + ["1 days"] * 4), name="foo")
assert_series_equal(result, expected)

@pytest.mark.parametrize(
"input,output,diff",
[([False, True, True, False, False], [nan, True, False, True, False], 1)],
)
def test_bool_diff(self, input, output, diff):
# boolean series (test for fixing #17294)
s = Series(input)
result = s.diff()
expected = Series(output)
assert_series_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is x - x.shift() also true here (as it is for obj_diff)? can you add that assert as well (do another result= and expected=), otherwise this PR lgtm.

Copy link
Contributor Author

@Unprocessable Unprocessable Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not and will not with the current implementation. We all agreed on this, please re-read the thread if you have the time. @TomAugspurger specifically wanted a boolean Series instead of a list of integers, as shown in the quote below:

I'm having a hard time following, but IMO these two should have the same output

In [3]: pd.Series([True, False, True]).diff()
Out[3]:
0     NaN
1    True
2    True
dtype: object

In [4]: pd.Series([True, False, True]).astype(object).diff()
Out[4]:
0    NaN
1     -1
2      1
dtype: object

I think both of those should be Out[3], the same result as np.diff with the missing values handled appropriately.

Edit: This is also the reason the diff on a boolean object series is different from a diff on a series with only booleans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair enough, do we have a followup issue on this? @TomAugspurger ? (which likely will block on a real BoolType EA)


def test_obj_diff(self):
# object series
s = Series([False, True, 5.0, nan, True, False])
result = s.diff()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this routine is way too long can you parameterize

Copy link
Contributor Author

@Unprocessable Unprocessable Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Shall I rewrite all other tests in pandas as well? I did not write the tests the way they are, I only added a few lines... I would rather do this in a refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is

@ WillAyd Done.

expected = s - s.shift(1)
assert_series_equal(result, expected)

def _check_accum_op(self, name, datetime_series_, check_dtype=True):
func = getattr(np, name)
tm.assert_numpy_array_equal(
Expand Down
42 changes: 0 additions & 42 deletions pandas/tests/series/test_timeseries.py
Original file line number Diff line number Diff line change
Expand Up @@ -355,48 +355,6 @@ def test_asfreq_datetimeindex_empty_series(self):
)
tm.assert_index_equal(expected.index, result.index)

def test_diff(self):
# Just run the function
self.ts.diff()

# int dtype
a = 10000000000000000
b = a + 1
s = Series([a, b])

rs = s.diff()
assert rs[1] == 1

# neg n
rs = self.ts.diff(-1)
xp = self.ts - self.ts.shift(-1)
assert_series_equal(rs, xp)

# 0
rs = self.ts.diff(0)
xp = self.ts - self.ts
assert_series_equal(rs, xp)

# datetime diff (GH3100)
s = Series(date_range("20130102", periods=5))
rs = s - s.shift(1)
xp = s.diff()
assert_series_equal(rs, xp)

# timedelta diff
nrs = rs - rs.shift(1)
nxp = xp.diff()
assert_series_equal(nrs, nxp)

# with tz
s = Series(
date_range("2000-01-01 09:00:00", periods=5, tz="US/Eastern"), name="foo"
)
result = s.diff()
assert_series_equal(
result, Series(TimedeltaIndex(["NaT"] + ["1 days"] * 4), name="foo")
)

def test_pct_change(self):
rs = self.ts.pct_change(fill_method=None)
assert_series_equal(rs, self.ts / self.ts.shift(1) - 1)
Expand Down