Commit 4662a3c

Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
* API: Change the sum of all-NA / all-Empty sum / prod
* Max, not min
* Update whatsnew
* Parametrize test
* Minor cleanups
* Refactor skipna_alternative
* Split test
* Added issue
* More updates
* linting
* linting
* Added skips
* Doc fixup
* DOC: More whatsnew

(cherry picked from commit dedfce9)
1 parent 7174f0f commit 4662a3c

20 files changed, +640 -258 lines changed

doc/source/whatsnew/v0.22.0.txt

+168 -104
@@ -3,154 +3,218 @@
 v0.22.0
 -------

-This is a major release from 0.21.1 and includes a number of API changes,
-deprecations, new features, enhancements, and performance improvements along
-with a large number of bug fixes. We recommend that all users upgrade to this
-version.
+This is a major release from 0.21.1 and includes a single, API-breaking change.
+We recommend that all users upgrade to this version after carefully reading the
+release note (singular!).

-.. _whatsnew_0220.enhancements:
+.. _whatsnew_0220.api_breaking:

-New features
-~~~~~~~~~~~~
+Backwards incompatible API changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
--
--
+Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
+summary is that

-.. _whatsnew_0220.enhancements.other:
+* The sum of an empty or all-*NA* ``Series`` is now ``0``
+* The product of an empty or all-*NA* ``Series`` is now ``1``
+* We've added a ``min_count`` parameter to ``.sum()`` and ``.prod()`` controlling
+  the minimum number of valid values for the result to be valid. If fewer than
+  ``min_count`` non-*NA* values are present, the result is *NA*. The default is
+  ``0``. To return ``NaN``, the 0.21 behavior, use ``min_count=1``.

-Other Enhancements
-^^^^^^^^^^^^^^^^^^
+Some background: In pandas 0.21, we fixed a long-standing inconsistency
+in the return value of all-*NA* series depending on whether or not bottleneck
+was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`. At the same
+time, we changed the sum and prod of an empty ``Series`` to also be ``NaN``.

--
--
--
+Based on feedback, we've partially reverted those changes.

-.. _whatsnew_0220.api_breaking:
+Arithmetic Operations
+^^^^^^^^^^^^^^^^^^^^^

-Backwards incompatible API changes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The default sum for empty or all-*NA* ``Series`` is now ``0``.

--
--
--
+*pandas 0.21.x*

-.. _whatsnew_0220.api:
+.. code-block:: ipython

-Other API Changes
-^^^^^^^^^^^^^^^^^
+   In [1]: pd.Series([]).sum()
+   Out[1]: nan

--
--
--
+   In [2]: pd.Series([np.nan]).sum()
+   Out[2]: nan

-.. _whatsnew_0220.deprecations:
+*pandas 0.22.0*

-Deprecations
-~~~~~~~~~~~~
+.. ipython:: python

--
--
--
+   pd.Series([]).sum()
+   pd.Series([np.nan]).sum()

-.. _whatsnew_0220.prior_deprecations:
+The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
+also matches the behavior of NumPy's ``np.nansum`` on empty and all-*NA* arrays.

-Removal of prior version deprecations/changes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To have the sum of an empty series return ``NaN`` (the default behavior of
+pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the ``min_count``
+keyword.

--
--
--
+.. ipython:: python

-.. _whatsnew_0220.performance:
+   pd.Series([]).sum(min_count=1)

-Performance Improvements
-~~~~~~~~~~~~~~~~~~~~~~~~
+Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
+series is conceptually the same as the ``.sum`` of an empty one with
+``skipna=True`` (the default).

--
--
--
+.. ipython:: python

-.. _whatsnew_0220.docs:
+   pd.Series([np.nan]).sum(min_count=1) # skipna=True by default

-Documentation Changes
-~~~~~~~~~~~~~~~~~~~~~
+The ``min_count`` parameter refers to the minimum number of *non-null* values
+required for a non-NA sum or product.

--
--
--
+:meth:`Series.prod` has been updated to behave the same as :meth:`Series.sum`,
+returning ``1`` instead.

-.. _whatsnew_0220.bug_fixes:
+.. ipython:: python

-Bug Fixes
-~~~~~~~~~
+   pd.Series([]).prod()
+   pd.Series([np.nan]).prod()
+   pd.Series([]).prod(min_count=1)

-Conversion
-^^^^^^^^^^
+These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
+Finally, a few less obvious places in pandas are affected by this change.

--
--
--
+Grouping by a Categorical
+^^^^^^^^^^^^^^^^^^^^^^^^^

-Indexing
-^^^^^^^^
+Grouping by a ``Categorical`` and summing now returns ``0`` instead of
+``NaN`` for categories with no observations. The product now returns ``1``
+instead of ``NaN``.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython

--
--
--
+   In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

-I/O
-^^^
+   In [9]: pd.Series([1, 2]).groupby(grouper).sum()
+   Out[9]:
+   a    3.0
+   b    NaN
+   dtype: float64

--
--
--
+*pandas 0.22*

-Plotting
+.. ipython:: python
+
+   grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+   pd.Series([1, 2]).groupby(grouper).sum()
+
+To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
+use ``min_count>=1``.
+
+.. ipython:: python
+
+   pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
+
+Resample
 ^^^^^^^^

--
--
--
+The sum and product of all-*NA* bins has changed from ``NaN`` to ``0`` for
+sum and ``1`` for product.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [11]: s = pd.Series([1, 1, np.nan, np.nan],
+       ...:               index=pd.date_range('2017', periods=4))
+       ...: s
+   Out[11]:
+   2017-01-01    1.0
+   2017-01-02    1.0
+   2017-01-03    NaN
+   2017-01-04    NaN
+   Freq: D, dtype: float64
+
+   In [12]: s.resample('2d').sum()
+   Out[12]:
+   2017-01-01    2.0
+   2017-01-03    NaN
+   Freq: 2D, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   s = pd.Series([1, 1, np.nan, np.nan],
+                 index=pd.date_range('2017', periods=4))
+   s.resample('2d').sum()
+
+To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.
+
+.. ipython:: python
+
+   s.resample('2d').sum(min_count=1)
+
+In particular, upsampling and taking the sum or product is affected, as
+upsampling introduces missing values even if the original series was
+entirely valid.
+
+*pandas 0.21.x*
+
+.. code-block:: ipython
+
+   In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+
+   In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
+   Out[15]:
+   2017-01-01 00:00:00    1.0
+   2017-01-01 12:00:00    NaN
+   2017-01-02 00:00:00    2.0
+   Freq: 12H, dtype: float64
+
+*pandas 0.22.0*
+
+.. ipython:: python
+
+   idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+   pd.Series([1, 2], index=idx).resample("12H").sum()
+
+Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.

-Groupby/Resample/Rolling
-^^^^^^^^^^^^^^^^^^^^^^^^
+.. ipython:: python

--
--
--
+   pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)

-Sparse
-^^^^^^
+Rolling and Expanding
+^^^^^^^^^^^^^^^^^^^^^

--
--
--
+Rolling and expanding already have a ``min_periods`` keyword that behaves
+similar to ``min_count``. The only case that changes is when doing a rolling
+or expanding sum with ``min_periods=0``. Previously this returned ``NaN``,
+when fewer than ``min_periods`` non-*NA* values were in the window. Now it
+returns ``0``.

-Reshaping
-^^^^^^^^^
+*pandas 0.21.1*

--
--
--
+.. code-block:: ipython

-Numeric
-^^^^^^^
+   In [17]: s = pd.Series([np.nan, np.nan])

--
--
--
+   In [18]: s.rolling(2, min_periods=0).sum()
+   Out[18]:
+   0   NaN
+   1   NaN
+   dtype: float64

-Categorical
-^^^^^^^^^^^
+*pandas 0.22.0*

--
--
--
+.. ipython:: python

-Other
-^^^^^
+   s = pd.Series([np.nan, np.nan])
+   s.rolling(2, min_periods=0).sum()

--
--
--
+The default behavior of ``min_periods=None``, implying that ``min_periods``
+equals the window size, is unchanged.
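
The rules in the release note above (skip *NA* values, then compare the number of remaining values with ``min_count``) can be sketched in plain Python. This is only an illustration of the described semantics, not the pandas implementation: the helper names ``nan_sum``, ``nan_prod`` and ``rolling_sum`` are hypothetical, and ``rolling_sum`` assumes a simple trailing window.

```python
import math


def _valid(values):
    """Drop NaN entries, mimicking skipna=True."""
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


def nan_sum(values, min_count=0):
    """Sum of the non-NaN values; NaN if fewer than min_count are present."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan
    return sum(valid)  # sum([]) is 0, the additive identity


def nan_prod(values, min_count=0):
    """Product of the non-NaN values; NaN if fewer than min_count are present."""
    valid = _valid(values)
    if len(valid) < min_count:
        return math.nan
    result = 1  # the empty product is 1, the multiplicative identity
    for v in valid:
        result *= v
    return result


def rolling_sum(values, window, min_periods=None):
    """Trailing-window sum; min_periods=None means the full window size."""
    if min_periods is None:
        min_periods = window
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(nan_sum(chunk, min_count=min_periods))
    return out


print(nan_sum([]))                       # 0   (0.22 default)
print(nan_sum([math.nan]))               # 0   (all-NA behaves like empty)
print(nan_sum([], min_count=1))          # nan (0.21 behavior)
print(nan_prod([math.nan]))              # 1
print(rolling_sum([math.nan, math.nan], 2, min_periods=0))  # [0, 0]
```

With ``min_count=0`` nothing is ever masked, so the result falls back to the identity element of the operation; ``min_count=1`` reproduces the 0.21 result for empty and all-*NA* input.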

pandas/_libs/groupby_helper.pxi.in

+2 -2
@@ -37,7 +37,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
                        ndarray[int64_t] labels,
-                       Py_ssize_t min_count=1):
+                       Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """
@@ -101,7 +101,7 @@ def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                         ndarray[int64_t] counts,
                         ndarray[{{c_type}}, ndim=2] values,
                         ndarray[int64_t] labels,
-                        Py_ssize_t min_count=1):
+                        Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """
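
To see why flipping the default ``min_count`` from ``1`` to ``0`` changes the result for unobserved groups, here is a rough pure-Python sketch of a grouped sum. It assumes one value column and integer group labels; the function below is a hypothetical stand-in written for illustration and does not reproduce the Cython template's body, which is not shown in this diff.

```python
import math


def group_add(values, labels, ngroups, min_count=0):
    """Sum values per group label; groups with too few valid values become NaN."""
    sums = [0.0] * ngroups
    nobs = [0] * ngroups  # number of non-NaN observations per group
    for value, label in zip(values, labels):
        if label < 0:  # assumption: a negative label marks a row in no group
            continue
        if value == value:  # NaN is the only value not equal to itself
            nobs[label] += 1
            sums[label] += value
    return [s if n >= min_count else math.nan for s, n in zip(sums, nobs)]


# Two rows in group 0, none in group 1 (an unobserved category).
print(group_add([1.0, 2.0], [0, 0], ngroups=2))               # [3.0, 0.0]
print(group_add([1.0, 2.0], [0, 0], ngroups=2, min_count=1))  # [3.0, nan]
```

The first call mirrors the new default (unobserved group sums to ``0``); the second mirrors the old default, matching the ``Categorical`` groupby example in the whatsnew.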
