Sum of all NA and empty should be 0 #18678
Comments
I've been following this conversation a bit and remain confused on one point. Options 1 and 2 both appear to rely on the premise that

Here's fine. I think the shortest answer is that it's nice to have

I would keep the more general discussion on the mailing list, and here more the details of the changes we then want to make.

This is not happening for 0.21.1.

I think the question is rather: do we want to release this quickly, or only in a few months? (And once that is answered, we can still discuss whether we call it 0.21.1, 0.21.2, or 0.22.0.) But let's keep that discussion in #18244 (comment).

Just to add a data point: I tried to update to 0.21 and a large number of our unit tests failed, so I came looking to see if this had been reported as a bug yet. A lot of our code assumes pd.Series([]).sum() == 0 for simple things like dollar-weighted sums. We won't be upgrading until this is fixed, and the fact that a breaking change like this was made has led us to seriously consider prohibiting pandas use within critical sections of our code base.

@ryanbressler that discussion is happening on the mailing list: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html We're using this issue to design the fix.
I've started a branch for this at master...TomAugspurger:na-sum. I'll have a PR tomorrow, but briefly, my approach was to add a parameter to sum:

```python
In [3]: pd.Series().sum()
Out[3]: 0.0

In [4]: pd.Series().sum(empty_is_na=True)
Out[4]: nan
```
@ryanbressler would you like to voice this experience on the mailing list? |
@TomAugspurger and @jorisvandenbossche it looks like I need to subscribe to that list to post? I'll do that after some meetings I have this morning. Feel free to delete my comments if they aren't appropriate on this thread, but I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

> I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

Not to report issues; that's already been done (this issue :). We're trying to keep the discussion from fragmenting: the mailing list for discussing what behavior we want, this issue for designing that behavior. Usually most discussion happens on GitHub; this one happens to be on the mailing list since it's of broad interest, and not everybody can keep up with the traffic on the GitHub repository.
@TomAugspurger regarding your proposal above, that would only deal with the empty case? So that would not yet provide an option for returning NaN on an all-NaN series?
Yep, so far just handling the empty case. My preference is for additional keywords. For the all-NA case, the signature would become:

```
Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, all_na_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.
all_na_is_na : bool, default False
    The result of operating on an all-NA array should be NA. The default
    behavior is for the sum of an all-NA array to be 0, and the product
    of an all-NA array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File: ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type: method
```

Are there any problematic interactions between these new keywords and skipna? So for an all-NA series:

Now:

Cases 1 and 3 concern me a bit. I'll think about it a bit more. For an empty series, the only thing that matters is
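The interaction between skipna and the two proposed keywords can be sketched as a plain function. To be clear, `empty_is_na` and `all_na_is_na` are the hypothetical keywords from the proposal above, not an existing pandas API, and `proposed_sum` is an illustrative helper only:

```python
import numpy as np

def proposed_sum(values, skipna=True, empty_is_na=False, all_na_is_na=False):
    """Sketch of the proposed Series.sum semantics.

    ``empty_is_na`` and ``all_na_is_na`` are the hypothetical keywords
    discussed in this thread; this is not real pandas API.
    """
    arr = np.asarray(values, dtype=float)
    mask = np.isnan(arr)
    if arr.size == 0:
        # Empty input: identity value 0 unless the user opts into NA.
        return np.nan if empty_is_na else 0.0
    if mask.all():
        # All-NA input: NA when opted in, or when NaNs are not skipped.
        return np.nan if (all_na_is_na or not skipna) else 0.0
    if not skipna and mask.any():
        # skipna=False propagates any NaN, as today.
        return np.nan
    return float(arr[~mask].sum())

proposed_sum([])                           # 0.0
proposed_sum([np.nan])                     # 0.0
proposed_sum([], empty_is_na=True)         # nan
proposed_sum([np.nan], all_na_is_na=True)  # nan
```

Under this sketch the defaults preserve the identity values, and the two keywords independently opt the empty and all-NA cases into NA.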
I am personally also in favor of additional keyword(s) instead of a new method. I don't really like the

This one concerns me as well. I think the

I think I talked myself out of the need for two keywords. I'm not fond of

Indeed, do we really need two new keywords? I think

The potentially compelling argument in favor of two different keyword arguments is that it gives you a way to use the legacy sum from pandas <0.21 (without bottleneck). It would be an easy switch to use when migrating code, but even then I suspect there are only a very small number of applications that need different behavior between an empty and all-NaN sum. If need be, I would prefer

I would just add

then this would all work and be quite simple.

This seems strange, right? And that differs from pandas 0.20.3 with or without bottleneck, which always returned 0 for

EDIT: sorry, I misread. Ignore this.

@jreback would your
For reference, here's the 0.20.3 behavior on two series:

```python
import itertools

import numpy as np
import pandas as pd

use_bn = [True, False]
skipna = [True, False]
series = [pd.Series(), pd.Series([np.nan])]

values = []
for bn, skip, s in itertools.product(use_bn, skipna, series):
    with pd.option_context("compute.use_bottleneck", bn):
        result = s.sum(skipna=skip)
    values.append((bn, skip, len(s), result))

r = pd.DataFrame(values, columns=['use_bn', 'skipna', 'length', 'result'])
r.index.name = 'case'
```
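For contrast, here is a quick check of the behavior that a modern pandas (≥ 0.22, i.e. after this discussion was resolved) gives for the same base cases; note this is the post-resolution behavior, not the 0.20.3 behavior tabulated above:

```python
import math

import numpy as np
import pandas as pd

# Post-0.22 behavior, independent of bottleneck:
assert pd.Series(dtype=float).sum() == 0.0               # empty, skipna=True -> 0
assert pd.Series([np.nan]).sum() == 0.0                  # all-NaN, skipna=True -> 0
assert pd.Series(dtype=float).sum(skipna=False) == 0.0   # empty: nothing to skip
assert math.isnan(pd.Series([np.nan]).sum(skipna=False)) # NaN propagates
```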
Could people give thoughts on this table for the desired behavior? I think if we're OK with

With case D=

@TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN". To me, this implies D=

The other missing entry in your table is case E=

I'm confused. Isn't that exactly what D=

Sorry, I think I was staring at that table for too long and started going crazy :)
+1 @jreback. In the end, a

I agree too.

So that would provide the option of all-NA giving NA (so you are signaled that you have a full-NA column), but keep the empty sum 0.

Something else, regarding "extra keyword vs new method": the advantage of a separate method is that for people who consistently want to use the non-default option, it could be much easier / less verbose to have an alternative method instead of having to add

In all the user feedback on this issue, I do not recall any requests for this mixed behavior. So I think we can safely leave it out.

Yes, I was thinking that this morning as well...
Turning to groupby:

```python
In [32]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'],
    ...:                                        categories=['a', 'b', 'c']),
    ...:                    'B': [1, 2, 1]})
    ...: result = df.groupby("A").B.sum()

In [33]: result
Out[33]:
A
a    2
b    2
c    0
Name: B, dtype: int64
```

Currently, aggregations like np.nansum behave differently:

```python
In [3]: pd.__version__
Out[3]: '0.20.3'

In [8]: df.groupby("A").B.agg(lambda x: np.nansum(x))
Out[8]:
A
a    2.0
b    2.0
c    NaN
Name: B, dtype: float64
```

Perhaps custom aggregation functions aren't called on an empty array to see what the return type is?
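For what it's worth, a current pandas reproduces the built-in groupby result above (0 for the unobserved category); the only adjustment is passing observed=False explicitly, since newer pandas otherwise warns or drops unobserved categories. A sketch assuming pandas ≥ 0.23, where the observed keyword exists:

```python
import pandas as pd

df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'],
                                       categories=['a', 'b', 'c']),
                   'B': [1, 2, 1]})

# observed=False keeps the unobserved category 'c' in the result;
# its group is empty, so the default sum (min_count=0) yields 0.
result = df.groupby("A", observed=False)["B"].sum()
print(result)
```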
Another (large) place this affects is upsampling:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.20.3'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
2017-01-01 01:30:00    NaN
2017-01-01 02:00:00    2.0
2017-01-01 02:30:00    NaN
2017-01-01 03:00:00    3.0
Freq: 30T, dtype: float64
```

On my branch:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.22.0.dev0+364.g70e9bf903.dirty'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
2017-01-01 01:30:00    0
2017-01-01 02:00:00    2
2017-01-01 02:30:00    0
2017-01-01 03:00:00    3
Freq: 30T, dtype: int64
```

which I suspect is what we want, but another thing I'll make sure to document.
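If the NaNs in the empty upsampled bins are wanted, the min_count keyword that pandas ultimately shipped in 0.22 provides the opt-out. A sketch using the newer '1h'/'30min' frequency aliases (modern pandas deprecates '1H'/'30T'):

```python
import math

import pandas as pd

s = pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1h'))

# Default: empty 30-minute bins sum to 0.
filled = s.resample('30min').sum()

# min_count=1 requires at least one non-NA value per bin, so empty bins become NaN.
sparse = s.resample('30min').sum(min_count=1)

print(filled.iloc[1], sparse.iloc[1])
```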
Thoughts on this? Do we want

I am -1 on changing anything for groupby / resample. Sum of NaN is NaN; anything else is completely confusing.

A sum of empty/all-NaN values in resample should be no different than a sum of empty/all-NaN values elsewhere. So that means if we make the sum of empty/all-NaN be 0, then it should be 0 for resample, too. The fact that it's not entirely clear whether resampling is over all-NaN or empty values is another indication that it is important to keep the empty sum and all-NaN sum consistent. It would be extremely confusing to use different rules for resample/groupby than elsewhere.

Well, the sum of all-NaN should be NaN, as discussed many times.

I think the majority opinion in https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html is that the sum of all-NaN should be 0.
I agree we should be consistent, so if we choose for option empty/all-NaN to be 0 for normal sum, it should also be the case for resample/groupby. Although I kind of liked the NaNs in the upsampling, and this could be one of the cases where this would actually pop up quite often in user code and have a possibly larger impact on users when we change it. We *could* maybe consider raising a warning first that it will change (and can already pass the keyword to silence it).

> I agree we should be consistent, so if we choose for option empty/all-NaN to be 0 for normal sum, it should also be the case for resample/groupby. Although I kind of liked the NaNs in the upsampling

I am similarly conflicted on this one, which is why I re-raised it before the call tomorrow.
Dipping my toes into this: I personally would be in favor of a separate function, along similar lines to what @jorisvandenbossche said above. In addition, the difference between the two sum / prod functions would be the handling of these zero/NaN cases discussed in the issue, while the "core" (private) sum / prod functionality would handle all other cases (that's my current thinking ATM). One might consider calling these functions

For example, if you want all-

@TomAugspurger to avoid confusion, we should rename v0.22.0 in the whatsnew to v0.23.0 and add a new v0.22.0.

I just came across this behaviour today, and I was surprised/annoyed I didn't get 0s: because (as I maybe repeated too much already) that's what I expected, but also because I needed to build a histogram, so I had to manually

@toobaz Yes, I think we decided anyway that it should be consistent, so once we change to have 0 as the default, it will also apply to resample.
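For the record, the design that eventually shipped in pandas 0.22 keeps 0 and 1 as the default sum and prod identities and exposes the opt-out through a single min_count keyword, rather than the two keywords discussed earlier in this thread. A quick check, assuming pandas ≥ 0.22:

```python
import math

import numpy as np
import pandas as pd

# Defaults: identity values for empty / all-NA reductions.
assert pd.Series(dtype=float).sum() == 0.0
assert pd.Series([np.nan]).sum() == 0.0
assert pd.Series(dtype=float).prod() == 1.0

# min_count is the minimum number of valid values needed for a non-NA result.
assert math.isnan(pd.Series(dtype=float).sum(min_count=1))
assert math.isnan(pd.Series([np.nan]).sum(min_count=1))
```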
And the prod of those should be 1.

xref: #9422, #17630, https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We need to implement that, and design and implement the alternative (either a new method, or a keyword argument to sum and prod).