Commit d5e475f

API: Change the sum of all-NA / all-Empty sum / prod
1 parent b9decb6 commit d5e475f

19 files changed: +387 -99 lines changed

doc/source/whatsnew/v0.22.0.txt

+177 -4
@@ -3,12 +3,185 @@
 v0.22.0
 -------
 
-This is a major release from 0.21.1 and includes a number of API changes,
-deprecations, new features, enhancements, and performance improvements along
-with a large number of bug fixes. We recommend that all users upgrade to this
-version.
+This is a major release from 0.21.1 and includes a single, API-breaking change.
+We recommend that all users upgrade to this version after carefully reading the
+release note (singular!).
 
 .. _whatsnew_0220.api_breaking:
 
 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The
+summary is that
+
+* The sum of an all-NA or empty series is now 0
+* The product of an all-NA or empty series is now 1
+* We've added a ``min_count`` parameter to ``.sum`` and ``.prod`` to control
+  the minimum number of valid values for the result to be valid. If fewer than
+  ``min_count`` valid values are present, the result is NA. The default is
+  ``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.
+
+Some background: In pandas 0.21, we fixed a long-standing inconsistency
+in the return value of all-NA series depending on whether or not bottleneck
+was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
+time, we changed the sum and prod of an empty Series to also be ``NaN``.
+
+Based on feedback, we've partially reverted those changes. The default sum for
+all-NA and empty series is now 0 (1 for ``prod``).
+
+*pandas 0.21*
+
+.. code-block:: ipython
+
+   In [3]: pd.Series([]).sum()
+   Out[3]: nan
+
+   In [4]: pd.Series([np.nan]).sum()
+   Out[4]: nan
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   pd.Series([]).sum()
+   pd.Series([np.nan]).sum()
+
+The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
+also matches the behavior of ``np.nansum`` and ``np.nanprod`` on empty and
+all-NA arrays.
+
+To have the sum of an empty series return ``NaN``, use the ``min_count``
+keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-NA
+series is conceptually the same as on an empty one. The ``min_count`` parameter
+refers to the minimum number of *valid* values required for a non-NA sum
+or product.
+
+.. ipython:: python
+
+   pd.Series([]).sum(min_count=1)
+   pd.Series([np.nan]).sum(min_count=1)
+
+Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck
+installed.
+
+Note that this affects some other places in the library:
+
+1. Grouping by a Categorical with some unobserved categories and computing the
+   ``sum`` / ``prod``.
+
+*pandas 0.21*
+
+.. code-block:: ipython
+
+   In [5]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+
+   In [6]: pd.Series([1, 2]).groupby(grouper).sum()
+   Out[6]:
+   a    3.0
+   b    NaN
+   dtype: float64
+
+*pandas 0.22*
+
+.. ipython:: python
+
+   grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+   pd.Series([1, 2]).groupby(grouper).sum()
+
+   pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
+
+2. Resampling and taking the ``sum`` / ``prod``.
+
+The output for an all-NaN bin will change:
+
+*pandas 0.21.0*
+
+.. code-block:: ipython
+
+   In [7]: s = pd.Series([1, 1, np.nan, np.nan],
+      ...:               index=pd.date_range('2017', periods=4))
+      ...:
+
+   In [8]: s
+   Out[8]:
+   2017-01-01    1.0
+   2017-01-02    1.0
+   2017-01-03    NaN
+   2017-01-04    NaN
+   Freq: D, dtype: float64
+
+   In [9]: s.resample('2d').sum()
+   Out[9]:
+   2017-01-01    2.0
+   2017-01-03    NaN
+   Freq: 2D, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   s = pd.Series([1, 1, np.nan, np.nan],
+                 index=pd.date_range('2017', periods=4))
+   s.resample('2d').sum()
+
+To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.
+
+.. ipython:: python
+
+   s.resample('2d').sum(min_count=1)
+
+In particular, upsampling and taking the sum or product is affected, as
+upsampling introduces all-NaN bins even if your original series was
+entirely valid.
+
+*pandas 0.21.0*
+
+.. code-block:: ipython
+
+   In [10]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+
+   In [10]: pd.Series([1, 2], index=idx).resample('12H').sum()
+   Out[10]:
+   2017-01-01 00:00:00    1.0
+   2017-01-01 12:00:00    NaN
+   2017-01-02 00:00:00    2.0
+   Freq: 12H, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+   pd.Series([1, 2], index=idx).resample("12H").sum()
+
+Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.
+
+   pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
+
+3. Rolling / Expanding window operations and taking the ``sum`` / ``prod``.
+
+Rolling and expanding already have a ``min_periods`` keyword that behaves
+similarly to ``min_count``. The only case that changes is when doing a rolling
+or expanding sum on an all-NaN series with ``min_periods=0``. Previously this
+returned ``NaN``; now it returns ``0`` (or ``1`` for ``prod``).
+
+*pandas 0.21.1*
+
+.. code-block:: ipython
+
+   In [11]: s = pd.Series([np.nan, np.nan])
+
+   In [12]: s.rolling(2, min_periods=0).sum()
+   Out[12]:
+   0   NaN
+   1   NaN
+   dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   s = pd.Series([np.nan, np.nan])
+
+   s.rolling(2, min_periods=0).sum()
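
The min_count rule described in the release note above can be sketched in plain Python. This is only an illustration of the documented semantics, not the pandas implementation: the helper names sum_with_min_count and prod_with_min_count are hypothetical, and NaN stands in for every kind of missing value.

```python
import math


def _valid(values):
    """Return the values that are not NaN (the ones skipna=True would keep)."""
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


def sum_with_min_count(values, min_count=0):
    """Sum the non-NaN values; give NaN when fewer than min_count are valid."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))  # the sum of no values is 0.0


def prod_with_min_count(values, min_count=0):
    """Multiply the non-NaN values; give NaN when fewer than min_count are valid."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan
    return float(math.prod(valid))  # the product of no values is 1.0
```

Under this sketch an empty list and a list holding only NaN behave identically, because both have zero valid values: min_count=0 gives 0.0 (or 1.0 for the product) and min_count=1 gives NaN.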

pandas/_libs/groupby_helper.pxi.in

+2 -2
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
                        ndarray[int64_t] labels,
-                       Py_ssize_t min_count=1):
+                       Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                         ndarray[int64_t] counts,
                         ndarray[{{c_type}}, ndim=2] values,
                         ndarray[int64_t] labels,
-                        Py_ssize_t min_count=1):
+                        Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """
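
To see what the new default means for a grouped sum, here is a rough pure-Python sketch. It assumes integer group labels with negative labels skipped; the function group_sum and its signature are hypothetical and far simpler than the templated Cython group_add_{{name}}.

```python
import math


def group_sum(values, labels, ngroups, min_count=0):
    """Sum values per integer label; groups with too few valid values give NaN."""
    sums = [0.0] * ngroups
    nobs = [0] * ngroups  # number of non-NaN values seen per group
    for value, label in zip(values, labels):
        if label < 0 or math.isnan(value):
            continue  # skip unlabeled rows and missing values
        sums[label] += value
        nobs[label] += 1
    return [s if n >= min_count else math.nan for s, n in zip(sums, nobs)]


# Two rows in group 0 ('a'), none in group 1 (the unobserved category 'b').
print(group_sum([1.0, 2.0], [0, 0], ngroups=2))               # [3.0, 0.0]
print(group_sum([1.0, 2.0], [0, 0], ngroups=2, min_count=1))  # [3.0, nan]
```

With min_count=0 the unobserved group reports 0.0, matching the 0.22 release note; min_count=1 brings back the NaN that 0.21 returned.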

pandas/_libs/window.pyx

+10 -5
@@ -409,10 +409,11 @@ def roll_count(ndarray[double_t] input, int64_t win, int64_t minp,
 # Rolling sum
 
 
-cdef inline double calc_sum(int64_t minp, int64_t nobs, double sum_x) nogil:
+cdef inline double calc_sum(int64_t minp, int64_t nobs, double sum_x,
+                            bint no_min=False) nogil:
     cdef double result
 
-    if nobs >= minp:
+    if no_min or nobs >= minp:
         result = sum_x
     else:
         result = NaN
@@ -443,10 +444,14 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
         double val, prev_x, sum_x = 0
         int64_t s, e
         int64_t nobs = 0, i, j, N
-        bint is_variable
+        bint is_variable, no_min
         ndarray[int64_t] start, end
         ndarray[double_t] output
 
+    if minp == 0:
+        no_min = True
+    else:
+        no_min = False
     start, end, N, win, minp, is_variable = get_window_indexer(input, win,
                                                                minp, index,
                                                                closed)
@@ -483,7 +488,7 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
                     for j in range(end[i - 1], e):
                         add_sum(input[j], &nobs, &sum_x)
 
-                output[i] = calc_sum(minp, nobs, sum_x)
+                output[i] = calc_sum(minp, nobs, sum_x, no_min)
 
     else:
 
@@ -503,7 +508,7 @@ def roll_sum(ndarray[double_t] input, int64_t win, int64_t minp,
                     prev_x = input[i - win]
                     remove_sum(prev_x, &nobs, &sum_x)
 
-                output[i] = calc_sum(minp, nobs, sum_x)
+                output[i] = calc_sum(minp, nobs, sum_x, no_min)
 
     return output
 
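
In the hunks above, no_min is recorded before minp is reassigned by get_window_indexer, and calc_sum then skips the nobs >= minp check whenever no_min is set. A minimal pure-Python sketch of that rule follows. It assumes a fixed window and recomputes each window from scratch instead of using the incremental add_sum / remove_sum bookkeeping; the function rolling_sum is hypothetical.

```python
import math


def rolling_sum(values, window, min_periods):
    """Fixed-window rolling sum that skips NaN values.

    A window with fewer than min_periods valid values yields NaN, except that
    min_periods == 0 disables the check entirely (the role of no_min).
    """
    no_min = min_periods == 0
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        valid = [v for v in chunk if not math.isnan(v)]
        if no_min or len(valid) >= min_periods:
            out.append(float(sum(valid)))  # an all-NaN window sums to 0.0
        else:
            out.append(math.nan)
    return out
```

For an all-NaN input such as [nan, nan] with window=2, this sketch returns 0.0 in every position when min_periods=0 and NaN in every position when min_periods=1, which mirrors the before/after shown in the release note.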

pandas/core/generic.py

+17 -17
@@ -7619,48 +7619,48 @@ def _doc_parms(cls):
 _sum_examples = """\
 Examples
 --------
-By default, the sum of an empty series is ``NaN``.
+By default, the sum of an empty series is ``0``.
 
->>> pd.Series([]).sum()  # min_count=1 is the default
-nan
+>>> pd.Series([]).sum()  # min_count=0 is the default
+0.0
 
 This can be controlled with the ``min_count`` parameter. For example, if
-you'd like the sum of an empty series to be 0, pass ``min_count=0``.
+you'd like the sum of an empty series to be NaN, pass ``min_count=1``.
 
->>> pd.Series([]).sum(min_count=0)
-0.0
+>>> pd.Series([]).sum(min_count=1)
+nan
 
 Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
 empty series identically.
 
 >>> pd.Series([np.nan]).sum()
-nan
-
->>> pd.Series([np.nan]).sum(min_count=0)
 0.0
+
+>>> pd.Series([np.nan]).sum(min_count=1)
+nan
 """
 
 _prod_examples = """\
 Examples
 --------
-By default, the product of an empty series is ``NaN``
+By default, the product of an empty series is ``1``
 
 >>> pd.Series([]).prod()
-nan
+1.0
 
 This can be controlled with the ``min_count`` parameter
 
->>> pd.Series([]).prod(min_count=0)
-1.0
+>>> pd.Series([]).prod(min_count=1)
+nan
 
 Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
 empty series identically.
 
 >>> pd.Series([np.nan]).prod()
-nan
-
->>> pd.Series([np.nan]).sum(min_count=0)
 1.0
+
+>>> pd.Series([np.nan]).prod(min_count=1)
+nan
 """
 
 
@@ -7683,7 +7683,7 @@ def _make_min_count_stat_function(cls, name, name1, name2, axis_descr, desc,
                   examples=examples)
     @Appender(_num_doc)
     def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
-                  min_count=1,
+                  min_count=0,
                   **kwargs):
         nv.validate_stat_func(tuple(), kwargs, fname=name)
         if skipna is None:

pandas/core/groupby.py

+2 -2
@@ -1286,8 +1286,8 @@ def last(x):
             else:
                 return last(x)
 
-        cls.sum = groupby_function('sum', 'add', np.sum, min_count=1)
-        cls.prod = groupby_function('prod', 'prod', np.prod, min_count=1)
+        cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)
+        cls.prod = groupby_function('prod', 'prod', np.prod, min_count=0)
         cls.min = groupby_function('min', 'min', np.min, numeric_only=False)
         cls.max = groupby_function('max', 'max', np.max, numeric_only=False)
         cls.first = groupby_function('first', 'first', first_compat,

pandas/core/nanops.py

+2 -2
@@ -308,7 +308,7 @@ def nanall(values, axis=None, skipna=True):
 
 @disallow('M8')
 @bottleneck_switch()
-def nansum(values, axis=None, skipna=True, min_count=1):
+def nansum(values, axis=None, skipna=True, min_count=0):
     values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
     dtype_sum = dtype_max
     if is_float_dtype(dtype):
@@ -645,7 +645,7 @@ def nankurt(values, axis=None, skipna=True):
 
 
 @disallow('M8', 'm8')
-def nanprod(values, axis=None, skipna=True, min_count=1):
+def nanprod(values, axis=None, skipna=True, min_count=0):
     mask = isna(values)
     if skipna and not is_any_int_dtype(values):
         values = values.copy()

pandas/core/resample.py

+1 -1
@@ -605,7 +605,7 @@ def size(self):
 # downsample methods
 for method in ['sum', 'prod']:
 
-    def f(self, _method=method, min_count=1, *args, **kwargs):
+    def f(self, _method=method, min_count=0, *args, **kwargs):
         nv.validate_resampler_func(_method, args, kwargs)
         return self._downsample(_method, min_count=min_count)
     f.__doc__ = getattr(GroupBy, method).__doc__
