Breaking changes for sum / prod of empty / all-NA #18921
Conversation
doc/source/whatsnew/v0.22.0.txt
idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
TODO: Add .rolling and .expanding with all-NA and min_count=0.
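As a rough illustration of what that TODO would exercise, here is a toy pure-Python model of a NaN-aware rolling sum with a min_periods cutoff (an illustrative sketch, not the pandas Cython implementation; rolling_nansum is a made-up name):

```python
import math

def rolling_nansum(values, window, min_periods):
    # NaN-aware rolling sum: each window skips NaNs, and the result is
    # NaN unless at least `min_periods` non-NA values were observed.
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        valid = [v for v in chunk if not math.isnan(v)]
        out.append(float(sum(valid)) if len(valid) >= min_periods else math.nan)
    return out

nan = math.nan
# On all-NA input, min_periods=0 yields 0.0 while min_periods=1 yields NaN.
print(rolling_nansum([nan, nan, nan], window=2, min_periods=0))  # [0.0, 0.0, 0.0]
print(rolling_nansum([nan, nan, nan], window=2, min_periods=1))  # [nan, nan, nan]
```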
pandas/tests/frame/test_analytics.py
def wrapper(x):
    return alternative(x.values)

if skipna_alternative:
This seemed like the easiest way of getting these tests to pass. skipna_alternative is only used in sum and prod, and is nansum or nanprod.
this is ok
@@ -813,8 +813,6 @@ def test__cython_agg_general(self):
('mean', np.mean),
('median', lambda x: np.median(x) if len(x) > 0 else np.nan),
('var', lambda x: np.var(x, ddof=1)),
('add', lambda x: np.sum(x) if len(x) > 0 else np.nan),
I moved these to their own tests in e390d1c#diff-1c9535bf6ad763b307a780e37cc4185cR833 due to #18869
@@ -147,7 +147,7 @@ def test_empty_multi(self, method, unit):
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize(
    "method", ['sum', 'mean', 'median', 'std', 'var'])
TODO: make sure I have coverage for this elsewhere... I don't remember why I removed it here.
These are handled up above in test_empty.
pandas/tests/test_base.py
def test_nanops(self):
@pytest.mark.parametrize('klass', [Index, Series])
@pytest.mark.parametrize('op', ['max', 'min'])
def test_nanops_max_min(self, op, klass):
Hmm, I thought I did this parametrization in #18876. I can move it there if you want.
Actually, these seem to be a remnant of an earlier version where I was adding min_count everywhere. I'll just remove it.
pandas/tests/test_nanops.py
for axis in list(range(targarval.ndim)) + [None]:
    for skipna in [False, True]:
        targartempval = targarval if skipna else targarnanval
        try:
            if skipna and empty_targfunc and pd.isna(targartempval).all():
I'm reasonably sure I refactored these correctly, but would appreciate a close review here. empty_targfunc should be like nansum or nanprod, in which case we want to call it on targartempval. I should double check why just passing nansum for targfunc didn't work though...
use the imported isna
pandas/tests/test_window.py
@@ -1358,9 +1390,10 @@ def get_result(arr, window, min_periods=None, center=False):
assert notna(result[4])

# min_periods=0
result0 = get_result(arr, 20, min_periods=0)
I had to disable this subset of tests, since I don't think it's true that min_periods=0 is the same as min_periods=1.
so make new ones that make this diff clear (i agree they should be different).
Done, added a keyword to control this check, since it is true for most operations. sum is tested up above in test_missing_minp_zero.
e390d1c to 8cbd59a
pandas/_libs/window.pyx
bint is_variable
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
    # in get_window_indexer, we ensure that minp >= 1. That's fine for
    # all cases except nobs = 0 (all missing values) and minp=0. For
all of the minp logic is handled externally to window.pyx (in window.py)
Are you saying to move this there? I don't think that's possible without rewriting roll_sum. All of its logic assumes that minp is at least 1. The only thing I need the user-provided minp for is passing to calc_sum later.
then it should be changed. This is way too hacky. ALL logic related to minp is passed in.
pandas/core/nanops.py
result.fill(fill_value)
return result
if values.size == 0 and kwds.get('min_count') is None:
    # We are empty, returning NA for our type
isn’t min_count always passed?
No, just when the caller is sum or prod. Methods like mean don't pass it.
I find this very error prone. The internal functions can always pass min_count I think, so it will be not-None. Or at the very least comment on this.
The issue here is passing through the correct arguments to the nanop. pandas.core.nanops.sum takes min_count, while mean doesn't. The function calling f knows that, but f doesn't. So I would either

1. refactor f to not do this special casing, and have each nanop properly handle size-0 arrays
2. add an unused min_count to each nanop, and always pass it through
3. have f add min_count to kwds when it's not None.

I think 1 is the best option, but will take a bit of time. 3 doesn't strike me as any better than the current method.
I think you could just pass it thru in make_stat_function? e.g. you always just pass

# not sum/prod (I know you have a separate function for this)
min_count=-1
return self._reduce(f, name, axis=axis, skipna=skipna,
                    numeric_only=numeric_only, min_count=min_count)
That'd be option 2. I'd have to add a min_count keyword to each of our methods in pandas.core.nanops, which I don't think we want.
pandas/tests/frame/test_analytics.py
('prod', 1),
])
def test_sum_prod_nanops(self, method, unit):
    idx = ['a', 'b', 'c']
you could add tests for Timedelta here as well (should work)
("sum", 0.0),
("prod", 1.0)
])
def test_empty(self, method, unit, use_bottleneck):
    with pd.option_context("use_bottleneck", use_bottleneck):
        # GH 9422
add this issue number
pandas/tests/test_resample.py
pad = DataFrame([[np.nan, np.nan, np.nan, np.nan]], index=[3],

if func == 'sum':
    fill_value = 0
can u parametrize on func name and fill_value
This turned up a slight inconsistency between a) groupby([grouper, key]).agg and b) groupby(key).resample().agg: the first doesn't upsample to include "unobserved" values, while the latter does. An example:

df = pd.DataFrame({
    'A': [1] * 6 + [2] * 6,
    'B': pd.to_datetime(['2017'] * 3 +
                        ['2019'] * 3 +
                        ['2017'] * 3 +
                        ['2019'] * 3),
    'C': [1] * 12
})

a = df.groupby([pd.Grouper(key='B', freq='AS'), 'A']).C.count()
b = df.set_index("B").sort_index().groupby("A").resample("AS").C.count()

In [53]: a  # groupby([grouper, key]).agg
Out[53]:
B           A
2017-01-01  1    3
            2    3
2019-01-01  1    3
            2    3
Name: C, dtype: int64

In [54]: b  # groupby(key).resample().agg
Out[54]:
A  B
1  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
2  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
Name: C, dtype: int64

I've use
8cbd59a to 1902af6
Hello @TomAugspurger! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 29, 2017 at 12:25 UTC
1902af6 to c213aeb
RE: Grouper/resample: that does look inconsistent. I think the resample behavior should be used for both?
Somehow I could give an explanation for the difference in behaviour between the groupby and resample example (resample: conform to a new index given the frequency -> full time series; groupby: makes groups of my data based on the given frequency -> only those groups contained in the data). But of course there are also good reasons to have this consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed only the whatsnew note.
In general, for the "changes" part of the whatsnew, should we also note somehow that whether this changed for you compared to pre-0.21 behaviour depends on whether bottleneck was installed or not (although that is maybe only for the main sum and not for groupby or resample)?
doc/source/whatsnew/v0.22.0.txt
* We've added a ``min_count`` parameter to ``.sum`` and ``.prod`` to control
  the minimum number of valid values for the result to be valid. If fewer than
  ``min_count`` valid values are present, the result is NA. The default is
  ``0``. To restore the 0.21 behavior, use ``min_count=1``.
I would explicitly say in the summary here that this result is np.nan
doc/source/whatsnew/v0.22.0.txt
``min_count`` valid values are present, the result is NA. The default is
``0``. To restore the 0.21 behavior, use ``min_count=1``.

Some background: In pandas 0.21.1, we fixed a long-standing inconsistency
0.21.1 -> 0.21.0 ?
doc/source/whatsnew/v0.22.0.txt
grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

pd.Series([1, 2]).groupby(groupuer).sum(min_count=1)
gruupeur -> grouper
doc/source/whatsnew/v0.22.0.txt
.. code-block:: ipython

   In [1]: import pandas as pd; import numpy as np;
this line is not needed
doc/source/whatsnew/v0.22.0.txt
s.resample('2d').sum(min_count=1)

Upsampling in particular is affected, as this will introduce all-NaN series even
Should we emphasize here somewhere this change is only for sum/prod, and not for all other aggregation functions? (I know it is already said above, but I think we have to stress enough it that those examples only apply for specific cases)
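To make that point concrete: only sum/prod gained the identity-on-empty default; reductions like mean still have nothing sensible to return. A pure-Python sketch of the contrast (an illustrative model, not pandas code; the function names are made up):

```python
import math

def nansum0(values):
    # 0.22 rule for sum: skip NaNs; empty and all-NA inputs give 0.
    valid = [v for v in values if not math.isnan(v)]
    return float(sum(valid))

def nanmean(values):
    # mean is unchanged: with no valid values there is nothing to average.
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

all_na = [math.nan, math.nan]
print(nansum0(all_na))  # 0.0 (changed behavior)
print(nanmean(all_na))  # nan (unchanged)
```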
I'm not sure. We have two tests that relied on that equivalence to construct the expected result.
46590dd to d5e475f
pandas/_libs/window.pyx
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
@jreback thoughts on this approach? no_min is a boolean that indicates whether or not to consider min_p when choosing if the result is valid in calc_sum.
I really don't like this. Why can't you just dispatch on minp==0 above, e.g. line 416: if nobs >= minp or minp == 0?
That's what I had earlier with the minp2 stuff. The problem there was that in roll_sum, minp gets set to min(minp, 1), so we don't know if the user specified min_periods=0.
hmm I don't see that. maybe can just remove that?
It looks bigger than it is because I had to ensure VariableWindowIndexer passes through floor. The main change is just using floor=0 in roll_sum and iterating like range(min(1, minp), ...) instead of just range(minp, ...).
a034a70 to b8d1f2b
The only failing tests are the equivalence tests between

@jorisvandenbossche could you explain a bit further? I don't quite understand. IMO, saying
doc/source/whatsnew/v0.22.0.txt
Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The
summary is that

* The sum of an all-NA or empty series is now 0
maybe capitalize Series and add double-backticks.
possibly also use double-backticks around all-NA (or maybe italics)
doc/source/whatsnew/v0.22.0.txt
Based on feedback, we've partially reverted those changes. The default sum for
all-NA and empty series is now 0 (1 for ``prod``).

*pandas 0.21*
pandas 0.21.x
doc/source/whatsnew/v0.22.0.txt
Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck
installed.

Note that this affects some other places in the library:
I would break out these to separate sub-sections I think (e.g the numbered ones)
pandas/core/generic.py
_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``0``.
add all-NaN as well
pandas/core/nanops.py
@@ -292,6 +282,22 @@ def _wrap_results(result, dtype):
    return result


def _na_for_min_count(values, axis):
    # we either return np.nan or pd.NaT
can you add doc-string
@@ -273,6 +273,21 @@ def test_timegrouper_with_reg_groups(self):
    'whole_cost'].sum()
assert_series_equal(result2, expected)

def test_groupby_resample_equivalence(self):
    # Currently failing
this is failing?
It's a simplification of test_timegrouper_with_reg_groups; they're failing for the same reason (#18921 (comment)).
pandas/tests/test_resample.py
columns=['A', 'B', 'C', 'D'])
normal_result = getattr(normal_grouped, func)()
dt_result = getattr(dt_grouped, func)()
was this test moved above?
I don't think so, just parametrized.
Returning to #18921 (comment), it's only an issue when there are multiple groupers:

In [14]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})

In [15]: df.groupby(pd.Grouper(key="A", freq="AS")).B.count()
Out[15]:
A
2015-01-01    1
2016-01-01    0
2017-01-01    1
Freq: AS-JAN, Name: B, dtype: int64

In [16]: df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]]).B.count()
Out[16]:
A
2015-01-01  0    1
2017-01-01  0    1
Name: B, dtype: int64

Still trying to track down the appropriate spot for the fix. Will we want to include this bugfix in 0.22, or save it for 0.23.0?
Fixing this is turning into quite the rabbit hole :/
IIUC, the issue is in how the new groups are determined:

In [9]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})
In [10]: gr1 = df.groupby(pd.Grouper(key="A", freq="AS")).B
In [11]: gr2 = df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]])
In [12]: gr1.grouper.result_index
Out[12]: DatetimeIndex(['2015-01-01', '2016-01-01', '2017-01-01'], dtype='datetime64[ns]', name='A', freq='AS-JAN')
In [13]: gr2.grouper.result_index
Out[13]:
MultiIndex(levels=[[2015-01-01 00:00:00, 2016-01-01 00:00:00, 2017-01-01 00:00:00], [0]],
           labels=[[0, 2], [0, 0]],
           names=['A', None])

So:

In [21]: gr2.grouper.groups
Out[21]:
{(Timestamp('2015-01-01 00:00:00'), 0): Int64Index([0], dtype='int64'),
 (Timestamp('2017-01-01 00:00:00'), 0): Int64Index([1], dtype='int64')}

I don't know if that's the root problem yet, or just a symptom.
Codecov Report

@@            Coverage Diff             @@
##           master   #18921      +/-   ##
==========================================
- Coverage    91.6%   91.56%   -0.05%
==========================================
  Files         150      150
  Lines       48939    48977      +38
==========================================
+ Hits        44833    44845      +12
- Misses       4106     4132      +26

Continue to review full report at Codecov.
just a couple of tiny comments
doc/source/whatsnew/v0.22.0.txt
Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`_. At the same
time, we changed the sum and prod of an empty Series to also be ``NaN``.
double backticks around Series.
pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
say matches numpy? (I know you are saying np.nansum, but can't hurt to actually say numpy)
doc/source/whatsnew/v0.22.0.txt
also matches the behavior of ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
skipna=True
doc/source/whatsnew/v0.22.0.txt
To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as on an empty. The ``min_count`` parameter
refers to the minimum number of *valid* values required for a non-NA sum
valid -> not-null
doc/source/whatsnew/v0.22.0.txt
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` of unobserved groups,
use ``min_count>=1``.
do we actually test min_count > 1?
Yeah, I added a few more in frame/test_analytics.py, test_resample.py, and test_groupby.py.
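For reference, the min_count > 1 cases being tested follow mechanically from the rule's definition; a minimal pure-Python model of the check (illustrative only, not the pandas implementation; nansum_min_count is a made-up name):

```python
import math

def nansum_min_count(values, min_count=0):
    # The sum is NA unless at least `min_count` non-NA values contributed.
    valid = [v for v in values if not math.isnan(v)]
    return float(sum(valid)) if len(valid) >= min_count else math.nan

print(nansum_min_count([1.0, math.nan], min_count=1))  # 1.0
print(nansum_min_count([1.0, math.nan], min_count=2))  # nan
```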
doc/source/whatsnew/v0.22.0.txt
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is when doing a rolling
similar
doc/source/whatsnew/v0.22.0.txt
@@ -1,14 +1,213 @@
.. _whatsnew_0220:
don't u need this?
tiny comment. ok merging on green.
OK, all green. I'm going to
All green. Here we go.
* API: Change the sum of all-NA / all-Empty sum / prod * Max, not min * Update whatsnew * Parametrize test * Minor cleanups * Refactor skipna_alternative * Split test * Added issue * More updates * linting * linting * Added skips * Doc fixup * DOC: More whatsnew (cherry picked from commit dedfce9)
I'm not sure I understand the rationale for making this change:

>>> s = pd.Series([0, 0.05, 0.12])
>>> (1+s).prod() - 1
0.17600000000000016  # ok
>>> s = pd.Series([0, 0.0, 0.0])
>>> (1+s).prod() - 1
0.0  # ok
>>> s = pd.Series([np.nan, np.nan, np.nan])
>>> (1+s).prod() - 1
0.0  # how do you interpret this?

IMO percent calculation will be difficult to interpret when the product of all-NaNs behaves like this. Is there a write-up on this, in case I'm misunderstanding this?
The rationale is that this is consistent with how sum() works: 0 is the additive identity, and 1 is the multiplicative identity. It would be confusing to use a different rule. There are absolutely use cases where these definitions for sum/prod are not desired, but it's impossible for the default behavior to work in all cases. Your use case of calculating percent change sounds like a good use for
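The identity argument mirrors Python's own empty reductions, which behave the same way:

```python
import math

# An empty sum is the additive identity; an empty product is the
# multiplicative identity (math.prod requires Python 3.8+).
print(sum([]))        # 0
print(math.prod([]))  # 1
```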
* API: Change the sum of all-NA / all-Empty sum / prod * Max, not min * Update whatsnew * Parametrize test * Minor cleanups * Refactor skipna_alternative * Split test * Added issue * More updates * linting * linting * Added skips * Doc fixup * DOC: More whatsnew
Version 0.22.0 * tag 'v0.22.0': (777 commits) RLS: v0.22.0 DOC: Fix min_count docstring (pandas-dev#19005) DOC: More 0.22.0 updates (pandas-dev#19002) TST: Remove pow test in expressions COMPAT: Avoid td.skip decorator DOC: 0.22.0 release docs (pandas-dev#18983) DOC: Include 0.22.0 whatsnew Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921) ENH: Added a min_count keyword to stat funcs (pandas-dev#18876) RLS: v0.21.1 DOC: Add date to whatsnew (pandas-dev#18740) DOC: Include 0.21.1 whatsnew DOC: Update relase notes (pandas-dev#18739) CFG: Ignore W503 DOC: fix options table (pandas-dev#18730) ENH: support non default indexes in writing to Parquet (pandas-dev#18629) BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960) Parquet: Add error message for no engine found (pandas-dev#18717) BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652) DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690) ...
Changes the defaults for min_count so that sum([]) and sum([np.nan]) are 0 by default, and NaN with min_count>=1.

I'd recommend looking at only the latest commit until #18876 is merged. I'll probably force push changes here to keep all the relevant changes in the last commit until #18876 is in, rebase on that, and then start pushing changes regularly.
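The cases spelled out in the description can be modeled in a few lines of plain Python (a sketch of the intended semantics, not pandas internals; sum_with_min_count is a hypothetical name):

```python
import math

def sum_with_min_count(values, min_count=0):
    # NaNs are skipped; the result is NaN only when fewer than
    # `min_count` non-NA values remain. The default min_count=0 makes
    # empty and all-NA sums 0.
    valid = [v for v in values if not math.isnan(v)]
    return float(sum(valid)) if len(valid) >= min_count else math.nan

print(sum_with_min_count([]))                       # 0.0
print(sum_with_min_count([math.nan]))               # 0.0
print(sum_with_min_count([], min_count=1))          # nan
print(sum_with_min_count([math.nan], min_count=1))  # nan
```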
cc @jreback @jorisvandenbossche @shoyer @wesm