Commit dedfce9

Breaking changes for sum / prod of empty / all-NA (#18921)
* API: Change the sum of all-NA / all-Empty sum / prod
* Max, not min
* Update whatsnew
* Parametrize test
* Minor cleanups
* Refactor skipna_alternative
* Split test
* Added issue
* More updates
* linting
* linting
* Added skips
* Doc fixup
* DOC: More whatsnew
1 parent fae7920 commit dedfce9

20 files changed: +547 -159 lines changed

doc/source/whatsnew/v0.22.0.txt

+210 -4

@@ -3,12 +3,218 @@
 v0.22.0
 -------

-This is a major release from 0.21.1 and includes a number of API changes,
-deprecations, new features, enhancements, and performance improvements along
-with a large number of bug fixes. We recommend that all users upgrade to this
-version.
+This is a major release from 0.21.1 and includes a single, API-breaking change.
+We recommend that all users upgrade to this version after carefully reading the
+release note (singular!).

 .. _whatsnew_0220.api_breaking:

 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
+summary is that
+
+* The sum of an empty or all-*NA* ``Series`` is now ``0``
+* The product of an empty or all-*NA* ``Series`` is now ``1``
+* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` controlling
+  the minimum number of valid values for the result to be valid. If fewer than
+  ``min_count`` non-*NA* values are present, the result is *NA*. The default is
+  ``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.
+
+Some background: In pandas 0.21, we fixed a long-standing inconsistency
+in the return value of all-*NA* series depending on whether or not bottleneck
+was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
+time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.
+
+Based on feedback, we've partially reverted those changes.
+
+Arithmetic Operations
+^^^^^^^^^^^^^^^^^^^^^
+
+The default sum for empty or all-*NA* ``Series`` is now ``0``.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [1]: pd.Series([]).sum()
+   Out[1]: nan
+
+   In [2]: pd.Series([np.nan]).sum()
+   Out[2]: nan
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   pd.Series([]).sum()
+   pd.Series([np.nan]).sum()
+
+The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
+also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.
+
+To have the sum of an empty series return ``NaN`` (the default behavior of
+pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the ``min_count``
+keyword.
+
+.. ipython:: python
+
+   pd.Series([]).sum(min_count=1)
+
+Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
+series is conceptually the same as the ``.sum`` of an empty one with
+``skipna=True`` (the default).
+
+.. ipython:: python
+
+   pd.Series([np.nan]).sum(min_count=1)  # skipna=True by default
+
+The ``min_count`` parameter refers to the minimum number of *non-null* values
+required for a non-NA sum or product.
+
+:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
+returning ``1`` instead.
+
+.. ipython:: python
+
+   pd.Series([]).prod()
+   pd.Series([np.nan]).prod()
+   pd.Series([]).prod(min_count=1)
+
+These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
+Finally, a few less obvious places in pandas are affected by this change.
+
+Grouping by a Categorical
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Grouping by a ``Categorical`` and summing now returns ``0`` instead of
+``NaN`` for categories with no observations. The product now returns ``1``
+instead of ``NaN``.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+
+   In [9]: pd.Series([1, 2]).groupby(grouper).sum()
+   Out[9]:
+   a    3.0
+   b    NaN
+   dtype: float64
+
+*pandas 0.22*
+
+.. ipython:: python
+
+   grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+   pd.Series([1, 2]).groupby(grouper).sum()
+
+To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
+use ``min_count>=1``.
+
+.. ipython:: python
+
+   pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
+
+Resample
+^^^^^^^^
+
+The sum and product of all-*NA* bins has changed from ``NaN`` to ``0`` for
+sum and ``1`` for product.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [11]: s = pd.Series([1, 1, np.nan, np.nan],
+      ...:                index=pd.date_range('2017', periods=4))
+      ...: s
+   Out[11]:
+   2017-01-01    1.0
+   2017-01-02    1.0
+   2017-01-03    NaN
+   2017-01-04    NaN
+   Freq: D, dtype: float64
+
+   In [12]: s.resample('2d').sum()
+   Out[12]:
+   2017-01-01    2.0
+   2017-01-03    NaN
+   Freq: 2D, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   s = pd.Series([1, 1, np.nan, np.nan],
+                 index=pd.date_range('2017', periods=4))
+   s.resample('2d').sum()
+
+To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.
+
+.. ipython:: python
+
+   s.resample('2d').sum(min_count=1)
+
+In particular, upsampling and taking the sum or product is affected, as
+upsampling introduces missing values even if the original series was
+entirely valid.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+
+   In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
+   Out[15]:
+   2017-01-01 00:00:00    1.0
+   2017-01-01 12:00:00    NaN
+   2017-01-02 00:00:00    2.0
+   Freq: 12H, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+   pd.Series([1, 2], index=idx).resample("12H").sum()
+
+Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.
+
+.. ipython:: python
+
+   pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
+
+Rolling and Expanding
+^^^^^^^^^^^^^^^^^^^^^
+
+Rolling and expanding already have a ``min_periods`` keyword that behaves
+similar to ``min_count``. The only case that changes is when doing a rolling
+or expanding sum with ``min_periods=0``. Previously this returned ``NaN``,
+when fewer than ``min_periods`` non-*NA* values were in the window. Now it
+returns ``0``.
+
+*pandas 0.21.1*
+
+.. code-block:: ipython
+
+   In [17]: s = pd.Series([np.nan, np.nan])
+
+   In [18]: s.rolling(2, min_periods=0).sum()
+   Out[18]:
+   0   NaN
+   1   NaN
+   dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   s = pd.Series([np.nan, np.nan])
+   s.rolling(2, min_periods=0).sum()
+
+The default behavior of ``min_periods=None``, implying that ``min_periods``
+equals the window size, is unchanged.
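
The resample cases in the release note above have no matching code hunk on this page, so a small stand-alone model may help. This is a simplified sketch in plain Python, not pandas' resampler: `bin_sums` is a hypothetical helper that takes `(bin_key, value)` pairs plus the full list of bin keys and sums the non-NaN values per bin. A bin holding fewer than `min_count` valid values yields NaN, which is why upsampling (which creates bins with no observations) gives `0.0` by default and NaN with `min_count=1`.

```python
import math


def bin_sums(observations, bins, min_count=0):
    """Sum non-NaN values per bin; NaN if a bin has < min_count valid values.

    Hypothetical helper modelling the release note, not a pandas API.
    """
    totals = {b: 0.0 for b in bins}
    valid = {b: 0 for b in bins}
    for key, value in observations:
        if not math.isnan(value):
            totals[key] += value
            valid[key] += 1
    return {b: totals[b] if valid[b] >= min_count else math.nan for b in bins}


# Downsampling: four daily values into two 2-day bins; the second bin is all-NaN.
daily = [("01-01", 1.0), ("01-01", 1.0), ("01-03", math.nan), ("01-03", math.nan)]
two_day_bins = ["01-01", "01-03"]
print(bin_sums(daily, two_day_bins))               # 0.22-style default
print(bin_sums(daily, two_day_bins, min_count=1))  # 0.21-style result

# Upsampling: two daily values into 12-hour bins leaves one bin empty.
half_day_bins = ["01-01 00:00", "01-01 12:00", "01-02 00:00"]
sparse = [("01-01 00:00", 1.0), ("01-02 00:00", 2.0)]
print(bin_sums(sparse, half_day_bins))
print(bin_sums(sparse, half_day_bins, min_count=1))
```

The bin keys are plain strings only to keep the sketch free of date handling; the point is the per-bin count check, which is the same idea the `min_count` keyword expresses.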

pandas/_libs/groupby_helper.pxi.in

+2 -2

@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
                        ndarray[int64_t] labels,
-                       Py_ssize_t min_count=1):
+                       Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                         ndarray[int64_t] counts,
                         ndarray[{{c_type}}, ndim=2] values,
                         ndarray[int64_t] labels,
-                        Py_ssize_t min_count=1):
+                        Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """

pandas/_libs/window.pyx

+13 -8

@@ -220,14 +220,16 @@ cdef class VariableWindowIndexer(WindowIndexer):
     right_closed: bint
         right endpoint closedness
         True if the right endpoint is closed, False if open
-
+    floor: optional
+        unit for flooring the unit
     """
     def __init__(self, ndarray input, int64_t win, int64_t minp,
-                 bint left_closed, bint right_closed, ndarray index):
+                 bint left_closed, bint right_closed, ndarray index,
+                 object floor=None):

         self.is_variable = 1
         self.N = len(index)
-        self.minp = _check_minp(win, minp, self.N)
+        self.minp = _check_minp(win, minp, self.N, floor=floor)

         self.start = np.empty(self.N, dtype='int64')
         self.start.fill(-1)
@@ -342,7 +344,7 @@ def get_window_indexer(input, win, minp, index, closed,

     if index is not None:
         indexer = VariableWindowIndexer(input, win, minp, left_closed,
-                                        right_closed, index)
+                                        right_closed, index, floor)
     elif use_mock:
         indexer = MockFixedWindowIndexer(input, win, minp, left_closed,
                                          right_closed, index, floor)
@@ -441,15 +443,16 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
              object index, object closed):
     cdef:
         double val, prev_x, sum_x = 0
-        int64_t s, e
+        int64_t s, e, range_endpoint
         int64_t nobs = 0, i, j, N
         bint is_variable
         ndarray[int64_t] start, end
         ndarray[double_t] output

     start, end, N, win, minp, is_variable = get_window_indexer(input, win,
                                                                minp, index,
-                                                               closed)
+                                                               closed,
+                                                               floor=0)
     output = np.empty(N, dtype=float)

     # for performance we are going to iterate
@@ -489,13 +492,15 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,

         # fixed window

+        range_endpoint = int_max(minp, 1) - 1
+
         with nogil:

-            for i in range(0, minp - 1):
+            for i in range(0, range_endpoint):
                 add_sum(input[i], &nobs, &sum_x)
                 output[i] = NaN

-            for i in range(minp - 1, N):
+            for i in range(range_endpoint, N):
                 val = input[i]
                 add_sum(val, &nobs, &sum_x)
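
The fixed-window change above is easier to follow in a plain-Python model. The sketch below is an assumption-laden simplification, not the Cython code: `roll_sum_fixed` is a hypothetical name, it works on a list of floats, and it folds the two loops of the diff into one. What it keeps is the clamp `max(minp, 1) - 1`, which makes the first computed position index 0 when `minp` is 0 (without the clamp, `minp - 1` would be -1), and the rule that a window with fewer than `minp` valid observations yields NaN.

```python
import math


def roll_sum_fixed(values, win, minp):
    """Fixed-window rolling sum over non-NaN values (simplified model).

    Positions before ``range_endpoint`` are always NaN; later positions
    are NaN only when the window holds fewer than ``minp`` valid values.
    """
    n = len(values)
    out = [math.nan] * n
    nobs, total = 0, 0.0
    range_endpoint = max(minp, 1) - 1  # clamp so minp=0 starts at index 0

    for i in range(n):
        if not math.isnan(values[i]):  # value entering the window
            nobs += 1
            total += values[i]
        if i >= win and not math.isnan(values[i - win]):  # value leaving
            nobs -= 1
            total -= values[i - win]
        if i >= range_endpoint and nobs >= minp:
            out[i] = total
    return out


nan = math.nan
print(roll_sum_fixed([nan, nan], win=2, minp=0))       # [0.0, 0.0]
print(roll_sum_fixed([nan, nan], win=2, minp=2))       # [nan, nan]
print(roll_sum_fixed([1.0, nan, 2.0], win=2, minp=1))  # [1.0, 1.0, 2.0]
```

The first call corresponds to the `min_periods=0` case in the release note; the second shows that a window-sized `minp` still yields NaN for all-NaN input.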
501506

pandas/core/generic.py

+17 -17

@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
 _sum_examples = """\
 Examples
 --------
-By default, the sum of an empty series is ``NaN``.
+By default, the sum of an empty or all-NA Series is ``0``.

->>> pd.Series([]).sum()  # min_count=1 is the default
-nan
+>>> pd.Series([]).sum()  # min_count=0 is the default
+0.0

 This can be controlled with the ``min_count`` parameter. For example, if
-you'd like the sum of an empty series to be 0, pass ``min_count=0``.
+you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

->>> pd.Series([]).sum(min_count=0)
-0.0
+>>> pd.Series([]).sum(min_count=1)
+nan

 Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
 empty series identically.

 >>> pd.Series([np.nan]).sum()
-nan
-
->>> pd.Series([np.nan]).sum(min_count=0)
 0.0
+
+>>> pd.Series([np.nan]).sum(min_count=1)
+nan
 """

 _prod_examples = """\
 Examples
 --------
-By default, the product of an empty series is ``NaN``
+By default, the product of an empty or all-NA Series is ``1``

 >>> pd.Series([]).prod()
-nan
+1.0

 This can be controlled with the ``min_count`` parameter

->>> pd.Series([]).prod(min_count=0)
-1.0
+>>> pd.Series([]).prod(min_count=1)
+nan

 Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
 empty series identically.

 >>> pd.Series([np.nan]).prod()
-nan
-
->>> pd.Series([np.nan]).sum(min_count=0)
 1.0
+
+>>> pd.Series([np.nan]).sum(min_count=1)
+nan
 """


@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
                   examples=examples)
     @Appender(_num_doc)
     def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
-                  min_count=1,
+                  min_count=0,
                   **kwargs):
         nv.validate_stat_func(tuple(), kwargs, fname=name)
         if skipna is None:
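
The docstring examples in this file reduce to one rule: drop NA values when `skipna` is true, then return NaN if fewer than `min_count` values remain, otherwise fold the remaining values starting from the identity (0 for sum, 1 for product). A minimal sketch of that rule follows. `make_min_count_stat` is a hypothetical factory that only loosely echoes `_make_min_count_stat_function`; it works on plain lists and is not pandas code. Incidentally, the last `_prod_examples` doctest in the diff calls `.sum(min_count=1)`, which looks like a copy-paste slip; the sketch uses the product form.

```python
import math
import operator
from functools import reduce


def make_min_count_stat(op, identity):
    """Build a reduction with the skipna / min_count semantics described above.

    Hypothetical factory for illustration only.
    """
    def stat_func(values, skipna=None, min_count=0):
        if skipna is None:
            skipna = True
        valid = [v for v in values if not math.isnan(v)]
        if not skipna and len(valid) < len(values):
            return math.nan  # an unskipped NaN propagates
        if len(valid) < min_count:
            return math.nan  # too few valid values for a valid result
        return reduce(op, valid, identity)
    return stat_func


na_sum = make_min_count_stat(operator.add, 0.0)
na_prod = make_min_count_stat(operator.mul, 1.0)

print(na_sum([]), na_sum([math.nan]))      # 0.0 0.0
print(na_prod([]), na_prod([math.nan]))    # 1.0 1.0
print(na_sum([], min_count=1), na_prod([math.nan], min_count=1))  # nan nan
```

Because the NA values are dropped before the count check, an empty input and an all-NA input take the same path, which is the point the docstring makes about `skipna`.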

pandas/core/groupby.py

+2 -2

@@ -1363,8 +1363,8 @@ def last(x):
             else:
                 return last(x)

-        cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
-        cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
+        cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
+        cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
         cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
         cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
         cls.first = groupby_function('first', 'first', first_compat,
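
With `min_count=0` now the default both here and in the `group_add_{{name}}` / `group_prod_{{name}}` kernels earlier in the commit, an unobserved group sums to 0 rather than NaN. The pure-Python sketch below illustrates the idea under stated assumptions: `group_add_model` and its list-based signature are invented for this example, whereas the real kernel is templated Cython operating on 2-D ndarrays. It accumulates per-group sums and counts of non-NaN values, then emits NaN only where the count falls below `min_count`.

```python
import math


def group_add_model(values, labels, ngroups, min_count=0):
    """Per-group sum of non-NaN values (illustrative model, not the Cython kernel).

    labels[i] is the group index of values[i]; a negative label means
    the row belongs to no group and is skipped.
    """
    sums = [0.0] * ngroups
    nobs = [0] * ngroups
    for value, label in zip(values, labels):
        if label < 0 or math.isnan(value):
            continue
        sums[label] += value
        nobs[label] += 1
    return [s if n >= min_count else math.nan for s, n in zip(sums, nobs)]


# Two rows in category 'a' (group 0); category 'b' (group 1) is unobserved.
print(group_add_model([1.0, 2.0], [0, 0], ngroups=2))               # [3.0, 0.0]
print(group_add_model([1.0, 2.0], [0, 0], ngroups=2, min_count=1))  # [3.0, nan]
```

This mirrors the `Categorical` example in the release note: the second call corresponds to passing `min_count=1` to restore the 0.21 result for the unobserved category.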
