Breaking changes for sum / prod of empty / all-NA #18921

Merged (14 commits) on Dec 29, 2017
209 changes: 204 additions & 5 deletions doc/source/whatsnew/v0.22.0.txt
@@ -1,14 +1,213 @@
.. _whatsnew_0220:

v0.22.0
-------

This is a major release from 0.21.1 and includes a number of API changes,
deprecations, new features, enhancements, and performance improvements along
with a large number of bug fixes. We recommend that all users upgrade to this
version.
This is a major release from 0.21.1 and includes a single, API-breaking change.
We recommend that all users upgrade to this version after carefully reading the
release note (singular!).

.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
summary is that

* The sum of an all-*NA* or empty ``Series`` is now ``0``
* The product of an all-*NA* or empty ``Series`` is now ``1``
* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` to control
  the minimum number of valid values required for a valid result. If fewer than
  ``min_count`` valid values are present, the result is NA. The default is
  ``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.
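Concretely, the new defaults can be sketched as follows (a minimal sketch assuming pandas >= 0.22; ``dtype=float`` is passed only to pin down the dtype of the empty series):

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype=float)
all_na = pd.Series([np.nan])

print(empty.sum())              # 0.0 -- was nan in 0.21
print(all_na.prod())            # 1.0 -- was nan in 0.21
print(all_na.sum(min_count=1))  # nan -- opts back into the 0.21 behavior
```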

Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.

Based on feedback, we've partially reverted those changes.

Arithmetic Operations
^^^^^^^^^^^^^^^^^^^^^

The default sum for all-*NA* and empty series is now ``0``.

*pandas 0.21.x*

.. code-block:: ipython

In [3]: pd.Series([]).sum()
Out[3]: nan

In [4]: pd.Series([np.nan]).sum()
Out[4]: nan

*pandas 0.22.0*

.. ipython:: python

pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as the ``.sum`` of an empty one with
``skipna=True`` (the default). The ``min_count`` parameter refers to the
minimum number of *non-null* values required for a non-NA sum or product.

.. ipython:: python

pd.Series([]).sum(min_count=1)
pd.Series([np.nan]).sum(min_count=1)

Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck
installed.
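Because ``min_count`` counts the valid (non-NA) values remaining after ``skipna`` is applied, it also affects partially-valid series, not only empty or all-*NA* ones. A small sketch, assuming pandas >= 0.22:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

print(s.sum())             # 1.0 -- one valid value; default min_count=0
print(s.sum(min_count=1))  # 1.0 -- one valid value satisfies min_count=1
print(s.sum(min_count=2))  # nan -- fewer than two valid values
```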

:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
returning ``1`` for an empty or all-*NA* series instead of ``NaN``.

.. ipython:: python

pd.Series([]).prod()
pd.Series([np.nan]).prod()
pd.Series([]).prod(min_count=1)

These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
Finally, a few less obvious places in pandas are affected by this change.

Grouping by a Categorical
^^^^^^^^^^^^^^^^^^^^^^^^^

Grouping by a ``Categorical`` with some unobserved categories and computing the
``sum`` / ``prod`` will behave differently.

*pandas 0.21.x*

.. code-block:: ipython

In [5]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [6]: pd.Series([1, 2]).groupby(grouper).sum()
Out[6]:
a 3.0
b NaN
dtype: float64

*pandas 0.22*

.. ipython:: python

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)

Resample
^^^^^^^^

The sum and product of all-*NA* bins will change:

*pandas 0.21.x*

.. code-block:: ipython

In [7]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...:

In [8]: s
Out[8]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64

In [9]: s.resample('2d').sum()
Out[9]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([1, 1, np.nan, np.nan],
index=pd.date_range('2017', periods=4))
s.resample('2d').sum()

To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.

.. ipython:: python

s.resample('2d').sum(min_count=1)

In particular, upsampling and taking the sum or product is affected, as
upsampling introduces missing values even when the original series was
entirely valid.

*pandas 0.21.x*

.. code-block:: ipython

In [10]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [11]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[11]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64

*pandas 0.22.0*

.. ipython:: python

idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.

.. ipython:: python

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

Rolling and Expanding
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is when doing a rolling
or expanding sum on an all-*NA* series with ``min_periods=0``. Previously this
returned ``NaN``; now it returns ``0``.

*pandas 0.21.x*

.. code-block:: ipython

In [11]: s = pd.Series([np.nan, np.nan])

In [12]: s.rolling(2, min_periods=0).sum()
Out[12]:
0 NaN
1 NaN
dtype: float64

*pandas 0.22.0*

.. ipython:: python

s = pd.Series([np.nan, np.nan])
s.rolling(2, min_periods=0).sum()

The default behavior of ``min_periods=None``, implying that ``min_periods``
equals the window size, is unchanged.
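The ``min_periods`` cases for an all-*NA* input can be sketched like this (assuming pandas >= 0.22):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# min_periods=0: all-NA windows now sum to 0 (the 0.22 change)
print(s.rolling(2, min_periods=0).sum())

# min_periods=None (the default, meaning the window size): not enough
# valid observations, so the result stays NaN -- unchanged behavior
print(s.rolling(2).sum())
```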
4 changes: 2 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
ndarray[int64_t] counts,
ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
Py_ssize_t min_count=1):
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
21 changes: 13 additions & 8 deletions pandas/_libs/window.pyx
@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer):
right_closed: bint
right endpoint closedness
True if the right endpoint is closed, False if open

floor: optional
unit for flooring the minimum number of periods
"""
def __init__(self, ndarray input, int64_t win, int64_t minp,
bint left_closed, bint right_closed, ndarray index):
bint left_closed, bint right_closed, ndarray index,
object floor=None):

self.is_variable = 1
self.N = len(index)
self.minp = _check_minp(win, minp, self.N)
self.minp = _check_minp(win, minp, self.N, floor=floor)

self.start = np.empty(self.N, dtype='int64')
self.start.fill(-1)
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed,

if index is not None:
indexer = VariableWindowIndexer(input, win, minp, left_closed,
right_closed, index)
right_closed, index, floor)
elif use_mock:
indexer = MockFixedWindowIndexer(input, win, minp, left_closed,
right_closed, index, floor)
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
object index, object closed):
cdef:
double val, prev_x, sum_x = 0
int64_t s, e
int64_t s, e, range_endpoint
int64_t nobs = 0, i, j, N
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

start, end, N, win, minp, is_variable = get_window_indexer(input, win,
minp, index,
closed)
closed,
floor=0)
output = np.empty(N, dtype=float)

# for performance we are going to iterate
@@ -489,13 +492,15 @@

# fixed window

range_endpoint = int_max(minp, 1) - 1

with nogil:

for i in range(0, minp - 1):
for i in range(0, range_endpoint):
add_sum(input[i], &nobs, &sum_x)
output[i] = NaN

for i in range(minp - 1, N):
for i in range(range_endpoint, N):
val = input[i]
add_sum(val, &nobs, &sum_x)

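The ``range_endpoint = int_max(minp, 1) - 1`` change above makes the fixed-window loop emit output starting at index 0 when ``min_periods=0``; previously the first ``minp - 1`` positions were skipped unconditionally. A pure-Python sketch of the loop, with names of our choosing:

```python
import math

def fixed_roll_sum(vals, win, minp):
    """Sketch of roll_sum's fixed-window path: no output before
    position max(minp, 1) - 1, then emit the windowed sum whenever
    at least ``minp`` valid observations are in the window."""
    out = [math.nan] * len(vals)
    range_endpoint = max(minp, 1) - 1
    nobs, acc = 0, 0.0
    for i in range(range_endpoint):            # warm-up: accumulate only
        if not math.isnan(vals[i]):
            nobs += 1
            acc += vals[i]
    for i in range(range_endpoint, len(vals)):
        if not math.isnan(vals[i]):
            nobs += 1
            acc += vals[i]
        if i >= win and not math.isnan(vals[i - win]):  # value leaves window
            nobs -= 1
            acc -= vals[i - win]
        out[i] = acc if nobs >= minp else math.nan
    return out

print(fixed_roll_sum([math.nan, math.nan], win=2, minp=0))  # [0.0, 0.0]
```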
34 changes: 17 additions & 17 deletions pandas/core/generic.py
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``NaN``.
By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([]).sum() # min_count=1 is the default
nan
>>> pd.Series([]).sum() # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([]).sum(min_count=0)
0.0
>>> pd.Series([]).sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=0)
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
"""

_prod_examples = """\
Examples
--------
By default, the product of an empty series is ``NaN``
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([]).prod()
nan
1.0

This can be controlled with the ``min_count`` parameter.

>>> pd.Series([]).prod(min_count=0)
1.0
>>> pd.Series([]).prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=0)
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
"""


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
examples=examples)
@Appender(_num_doc)
def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
min_count=1,
min_count=0,
**kwargs):
nv.validate_stat_func(tuple(), kwargs, fname=name)
if skipna is None:
4 changes: 2 additions & 2 deletions pandas/core/groupby.py
@@ -1363,8 +1363,8 @@ def last(x):
else:
return last(x)

cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
cls.first = groupby_function('first', 'first', first_compat,