Breaking changes for sum / prod of empty / all-NA #18921
Changes from 8 commits
790330d
541a362
a267c2f
8c06739
df7c69a
66a3ab6
fb5937c
4c65c9c
52e4e6f
d6a6c22
fcd57f4
05b44f9
cdf5692
a97e133
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,12 +3,211 @@ | |
v0.22.0 | ||
------- | ||
|
||
This is a major release from 0.21.1 and includes a number of API changes, | ||
deprecations, new features, enhancements, and performance improvements along | ||
with a large number of bug fixes. We recommend that all users upgrade to this | ||
version. | ||
This is a major release from 0.21.1 and includes a single, API-breaking change. | ||
We recommend that all users upgrade to this version after carefully reading the | ||
release note (singular!). | ||
|
||
.. _whatsnew_0220.api_breaking: | ||
|
||
Backwards incompatible API changes | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The | ||
summary is that | ||
|
||
* The sum of an all-*NA* or empty ``Series`` is now ``0`` | ||
* The product of an all-*NA* or empty ``Series`` is now ``1`` | ||
* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` to control | ||
the minimum number of valid values for the result to be valid. If fewer than | ||
``min_count`` valid values are present, the result is NA. The default is | ||
``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``. | ||
|
||
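The ``min_count`` rule in the list above can be modeled in a few lines of plain Python. This is only an illustrative sketch: ``sum_with_min_count`` and ``prod_with_min_count`` are hypothetical helpers, not pandas functions, and ``None`` / ``NaN`` stand in for NA values.

```python
import math


def _valid(values):
    """Keep only the non-NA entries (None and NaN are treated as NA here)."""
    return [v for v in values
            if v is not None and not (isinstance(v, float) and math.isnan(v))]


def sum_with_min_count(values, min_count=0):
    """Hypothetical model of ``.sum(min_count=...)`` with NA values skipped."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan      # too few valid values: the result is NA
    return sum(valid)        # the sum of no values is 0


def prod_with_min_count(values, min_count=0):
    """Hypothetical model of ``.prod(min_count=...)`` with NA values skipped."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan      # too few valid values: the result is NA
    return math.prod(valid)  # the product of no values is 1


print(sum_with_min_count([]), sum_with_min_count([math.nan]))
print(sum_with_min_count([], min_count=1))
print(prod_with_min_count([]), prod_with_min_count([math.nan]))
```

With the default ``min_count=0`` the helpers return the identity element (``0`` or ``1``); with ``min_count=1`` an empty or all-NA input yields ``NaN``, mirroring the 0.21 behavior described above.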
Some background: In pandas 0.21, we fixed a long-standing inconsistency | ||
in the return value of all-*NA* series depending on whether or not bottleneck | ||
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same | ||
time, we changed the sum and prod of an empty Series to also be ``NaN``. | ||
|
||
Based on feedback, we've partially reverted those changes. | ||
|
||
Arithmetic Operations | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The default sum for all-*NA* and empty series is now ``0``. | ||
|
||
*pandas 0.21.x* | ||
Review comment: make from here to Grouping a sub-section as well, maybe |
||
|
||
.. code-block:: ipython | ||
|
||
In [3]: pd.Series([]).sum() | ||
Out[3]: nan | ||
|
||
In [4]: pd.Series([np.nan]).sum() | ||
Out[4]: nan | ||
|
||
*pandas 0.22.0* | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([]).sum() | ||
pd.Series([np.nan]).sum() | ||
|
||
The default behavior is the same as pandas 0.20.3 with bottleneck installed. It | ||
Review comment: say matches numpy? (I know you are saying |
||
also matches the behavior of ``np.nansum`` on empty and all-*NA* arrays. | ||
|
||
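For comparison, the NumPy behavior referred to above can be checked directly. This is a small sketch that assumes NumPy is installed; ``np.nanprod`` is included as the product counterpart, although the note itself only mentions ``np.nansum``.

```python
import numpy as np

empty = np.array([], dtype=float)
all_na = np.array([np.nan, np.nan])

# nansum treats NaN as zero, so both cases reduce to the additive identity.
empty_sum = np.nansum(empty)
all_na_sum = np.nansum(all_na)

# nanprod treats NaN as one, so both cases reduce to the multiplicative identity.
empty_prod = np.nanprod(empty)
all_na_prod = np.nanprod(all_na)

print(empty_sum, all_na_sum, empty_prod, all_na_prod)
```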
To have the sum of an empty series return ``NaN``, use the ``min_count`` | ||
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA* | ||
series is conceptually the same as on an empty one. The ``min_count`` parameter | ||
refers to the minimum number of *valid* values required for a non-NA sum | ||
Review comment: valid -> not-null |
||
or product. | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([]).sum(min_count=1) | ||
pd.Series([np.nan]).sum(min_count=1) | ||
|
||
Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck | ||
installed. | ||
|
||
:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`, | ||
returning ``1`` instead. | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([]).prod() | ||
pd.Series([np.nan]).prod() | ||
pd.Series([]).prod(min_count=1) | ||
|
||
These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well. | ||
Finally, a few less obvious places in pandas are affected by this change. | ||
|
||
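As noted above, :meth:`DataFrame.sum` and :meth:`DataFrame.prod` follow the same rules column by column. A minimal sketch, assuming pandas 0.22 or later; the column names ``all_na`` and ``mixed`` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "all_na": [np.nan, np.nan],  # no valid values
    "mixed": [1.0, np.nan],      # exactly one valid value
})

default_sum = df.sum()             # all_na -> 0.0, mixed -> 1.0
default_prod = df.prod()           # all_na -> 1.0, mixed -> 1.0
strict_sum = df.sum(min_count=1)   # all_na -> NaN, mixed -> 1.0
stricter_sum = df.sum(min_count=2) # both columns -> NaN

print(default_sum, default_prod, strict_sum, stricter_sum, sep="\n\n")
```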
Grouping by a Categorical | ||
^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Grouping by a ``Categorical`` with some unobserved categories and computing the | ||
``sum`` / ``prod`` will behave differently. | ||
|
||
*pandas 0.21.x* | ||
|
||
.. code-block:: ipython | ||
|
||
In [5]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b']) | ||
|
||
In [6]: pd.Series([1, 2]).groupby(grouper).sum() | ||
Out[6]: | ||
a 3.0 | ||
b NaN | ||
dtype: float64 | ||
|
||
*pandas 0.22.0* | ||
|
||
.. ipython:: python | ||
|
||
grouper = pd.Categorical(['a', 'a'], categories=['a', 'b']) | ||
pd.Series([1, 2]).groupby(grouper).sum() | ||
|
||
To restore the 0.21 behavior of returning ``NaN`` for unobserved groups, | ||
use ``min_count>=1``. | ||
Review comment: do we actually test min_count > 1? |
Reply: Yeah, I added a few more in |
||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 2]).groupby(grouper).sum(min_count=1) | ||
|
||
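Values of ``min_count`` greater than ``1`` work the same way: a group needs at least that many valid values to produce a non-NA sum. The sketch below is an illustration rather than part of the release note. It passes ``observed=False`` explicitly, which is an assumption added here: that keyword does not exist in pandas 0.22 (it was introduced later), and it is included so that unobserved categories are kept on recent pandas versions whose default may differ.

```python
import numpy as np
import pandas as pd

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
s = pd.Series([1, 2])

# observed=False keeps the unobserved category "b" in the result.
grouped = s.groupby(grouper, observed=False)

totals = grouped.sum()                    # a -> 3, b -> 0
at_least_one = grouped.sum(min_count=1)   # a -> 3.0, b -> NaN
at_least_two = grouped.sum(min_count=2)   # a -> 3.0 (two valid values), b -> NaN
at_least_three = grouped.sum(min_count=3) # a -> NaN, b -> NaN

print(totals, at_least_one, at_least_two, at_least_three, sep="\n\n")
```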
Resample | ||
^^^^^^^^ | ||
|
||
The sum and product of all-*NA* bins will change: | ||
|
||
*pandas 0.21.x* | ||
|
||
.. code-block:: ipython | ||
|
||
In [7]: s = pd.Series([1, 1, np.nan, np.nan], | ||
...: index=pd.date_range('2017', periods=4)) | ||
...: | ||
|
||
In [8]: s | ||
Out[8]: | ||
2017-01-01 1.0 | ||
2017-01-02 1.0 | ||
2017-01-03 NaN | ||
2017-01-04 NaN | ||
Freq: D, dtype: float64 | ||
|
||
In [9]: s.resample('2d').sum() | ||
Out[9]: | ||
2017-01-01 2.0 | ||
2017-01-03 NaN | ||
Freq: 2D, dtype: float64 | ||
|
||
*pandas 0.22.0* | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series([1, 1, np.nan, np.nan], | ||
index=pd.date_range('2017', periods=4)) | ||
s.resample('2d').sum() | ||
|
||
To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``. | ||
|
||
.. ipython:: python | ||
|
||
s.resample('2d').sum(min_count=1) | ||
|
||
In particular, upsampling and taking the sum or product is affected, as | ||
upsampling introduces all-*NA* bins even if the original series was entirely valid. | ||
|
||
*pandas 0.21.x* | ||
|
||
.. code-block:: ipython | ||
|
||
In [10]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02']) | ||
|
||
In [10]: pd.Series([1, 2], index=idx).resample('12H').sum() | ||
Out[10]: | ||
2017-01-01 00:00:00 1.0 | ||
2017-01-01 12:00:00 NaN | ||
2017-01-02 00:00:00 2.0 | ||
Freq: 12H, dtype: float64 | ||
|
||
*pandas 0.22.0* | ||
|
||
.. ipython:: python | ||
|
||
idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02']) | ||
pd.Series([1, 2], index=idx).resample("12H").sum() | ||
|
||
Once again, the ``min_count`` keyword is available to restore the 0.21 behavior. | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1) | ||
Review comment: TODO: Add |
||
|
||
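The upsampling case above can be reproduced as a self-contained script. This is a sketch, not the note's own example: it passes a ``pd.Timedelta`` to ``resample`` instead of the ``'12H'`` string, on the assumption that this sidesteps differences in frequency-alias spelling between pandas versions.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(["2017-01-01", "2017-01-02"])
s = pd.Series([1, 2], index=idx)

# Upsampling from daily to 12-hourly creates an empty bin at 2017-01-01 12:00.
twelve_hours = pd.Timedelta(hours=12)

upsampled = s.resample(twelve_hours).sum()            # the empty bin sums to 0
restored = s.resample(twelve_hours).sum(min_count=1)  # the empty bin becomes NaN

print(upsampled)
print(restored)
```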
Rolling and Expanding | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Rolling and expanding already have a ``min_periods`` keyword that behaves | ||
similarly to ``min_count``. The only case that changes is when doing a rolling | ||
Review comment: similar |
||
or expanding sum on an all-*NA* series with ``min_periods=0``. Previously this | ||
returned ``NaN``; now it returns ``0``. | ||
|
||
*pandas 0.21.1* | ||
|
||
.. code-block:: ipython | ||
|
||
In [11]: s = pd.Series([np.nan, np.nan]) | ||
|
||
In [12]: s.rolling(2, min_periods=0).sum() | ||
Out[12]: | ||
0 NaN | ||
1 NaN | ||
dtype: float64 | ||
|
||
*pandas 0.22.0* | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series([np.nan, np.nan]) | ||
s.rolling(2, min_periods=0).sum() | ||
|
||
The default behavior of ``min_periods=None``, implying that ``min_periods`` | ||
equals the window size, is unchanged. |
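
The two ``min_periods`` cases can be compared side by side. A minimal sketch, assuming pandas 0.22 or later; the ``expanding`` call is added here for illustration and relies on the note's statement that expanding sums are affected in the same way.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# min_periods=0: no valid observations are required, so the sum is 0.
rolling_zero = s.rolling(2, min_periods=0).sum()
expanding_zero = s.expanding(min_periods=0).sum()

# Default min_periods=None: the full window must be valid, so the result is NaN.
rolling_default = s.rolling(2).sum()

print(rolling_zero, expanding_zero, rolling_default, sep="\n\n")
```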
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer): | |
right_closed: bint | ||
right endpoint closedness | ||
True if the right endpoint is closed, False if open | ||
|
||
floor: optional | ||
unit for flooring the unit | ||
""" | ||
def __init__(self, ndarray input, int64_t win, int64_t minp, | ||
bint left_closed, bint right_closed, ndarray index): | ||
bint left_closed, bint right_closed, ndarray index, | ||
object floor=None): | ||
|
||
self.is_variable = 1 | ||
self.N = len(index) | ||
self.minp = _check_minp(win, minp, self.N) | ||
self.minp = _check_minp(win, minp, self.N, floor=floor) | ||
|
||
self.start = np.empty(self.N, dtype='int64') | ||
self.start.fill(-1) | ||
|
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed, | |
|
||
if index is not None: | ||
indexer = VariableWindowIndexer(input, win, minp, left_closed, | ||
right_closed, index) | ||
right_closed, index, floor) | ||
elif use_mock: | ||
indexer = MockFixedWindowIndexer(input, win, minp, left_closed, | ||
right_closed, index, floor) | ||
|
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp, | |
object index, object closed): | ||
cdef: | ||
double val, prev_x, sum_x = 0 | ||
int64_t s, e | ||
int64_t s, e, range_endpoint | ||
int64_t nobs = 0, i, j, N | ||
bint is_variable | ||
ndarray[int64_t] start, end | ||
ndarray[double_t] output | ||
|
||
start, end, N, win, minp, is_variable = get_window_indexer(input, win, | ||
minp, index, | ||
closed) | ||
closed, | ||
floor=0) | ||
output = np.empty(N, dtype=float) | ||
|
||
# for performance we are going to iterate | ||
|
@@ -489,13 +492,15 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp, | |
|
||
# fixed window | ||
|
||
range_endpoint = int_max(minp, 1) - 1 | ||
Review comment: nice! |
||
|
||
with nogil: | ||
|
||
for i in range(0, minp - 1): | ||
for i in range(0, range_endpoint): | ||
add_sum(input[i], &nobs, &sum_x) | ||
output[i] = NaN | ||
|
||
for i in range(minp - 1, N): | ||
for i in range(range_endpoint, N): | ||
val = input[i] | ||
add_sum(val, &nobs, &sum_x) | ||
|
||
|
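The effect of ``range_endpoint`` in the fixed-window branch can be illustrated with a pure-Python sketch. This is not the Cython implementation: ``roll_sum_fixed`` is a hypothetical name, and the part of the loop after ``add_sum`` (removing the value that leaves the window and comparing ``nobs`` with ``minp``) is not visible in the diff above and is assumed here.

```python
import math


def roll_sum_fixed(values, win, minp):
    """Pure-Python sketch of a fixed-window rolling sum (not the Cython code)."""
    n = len(values)
    output = [math.nan] * n
    nobs = 0     # number of non-NaN observations currently in the window
    sum_x = 0.0  # running sum of those observations

    # With minp == 0, ``minp - 1`` would be -1 and the main loop would start
    # at index -1. Clamping with max(minp, 1) makes it start at 0 instead.
    # The extra min(..., n) only keeps this sketch safe for short inputs.
    range_endpoint = min(max(minp, 1) - 1, n)

    # Warm-up: accumulate values; these positions stay NaN.
    for i in range(0, range_endpoint):
        if values[i] == values[i]:  # NaN != NaN, so missing values are skipped
            nobs += 1
            sum_x += values[i]

    for i in range(range_endpoint, n):
        val = values[i]
        if val == val:
            nobs += 1
            sum_x += val
        # Assumed continuation: drop the value that has left the window.
        if i > win - 1:
            prev = values[i - win]
            if prev == prev:
                nobs -= 1
                sum_x -= prev
        # Assumed rule: emit the sum only if enough observations are present.
        output[i] = sum_x if nobs >= minp else math.nan
    return output


print(roll_sum_fixed([math.nan, math.nan], win=2, minp=0))
print(roll_sum_fixed([1.0, 2.0, 3.0, 4.0], win=2, minp=2))
```

Under these assumptions, an all-``NaN`` input with ``minp=0`` produces zeros at every position, which matches the rolling behavior described in the release note.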
Review comment: double backticks around Series.