Sum of all NA and empty should be 0 #18678

Closed · TomAugspurger opened this issue Dec 7, 2017 · 39 comments · Fixed by #18876
Labels: API Design, Numeric Operations

@TomAugspurger (Contributor)

And the prod of those should be 1.

xref: #9422, #17630, https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We need to implement that, and design and implement the alternative (either a new method or a keyword argument to sum and prod).

TomAugspurger added the API Design, Blocker, and Numeric Operations labels on Dec 7, 2017
TomAugspurger added this to the 0.21.1 milestone on Dec 7, 2017
@jbrockmendel (Member)

I've been following this conversation a bit and remain confused on one point. Options 1 and 2 both appear to rely on the premise that Series([nan, nan]).sum() should equal Series([]).sum().
Without an intuition for why this is desirable behavior, the merits of those two options are hard to evaluate. Is it just consistency with [other software]? LMK if this question belongs elsewhere.

@TomAugspurger (Contributor, Author)

Here's fine. I think the shortest answer is that it's nice to have s.sum(skipna=True) equal s.dropna().sum(skipna=True/False).
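
To make that identity concrete, here's a quick illustration in plain pandas (no new keywords involved):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
# The identity being argued for: skipping NAs should give the same answer
# as dropping them first. Under option 1 it also holds for the all-NA and
# empty cases, where both sides are 0.
assert s.sum(skipna=True) == s.dropna().sum()
```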

@jorisvandenbossche (Member)

I would keep the more general discussion on the mailing list, and use this issue for the details of the changes we then want to make.

jreback removed the Blocker label on Dec 8, 2017
jreback modified the milestones: 0.21.1, 0.22.0 on Dec 8, 2017
@jreback (Contributor) commented Dec 8, 2017

this is not happening for 0.21.1

jorisvandenbossche removed this from the 0.22.0 milestone on Dec 8, 2017
@jorisvandenbossche (Member)

> this is not happening for 0.21.1

I think the question is rather: do we want to release this quickly, or only in a few months? (and then once answered, we can still discuss whether we call that 0.21.1, 0.21.2 or 0.22.0). But let's keep that discussion in #18244 (comment)

TomAugspurger mentioned this issue on Dec 11, 2017
@ryanbressler

Just to add a data point: I tried to upgrade to 0.21 and a large number of our unit tests failed, so I came looking to see whether this had been reported as a bug yet. A lot of our code assumes pd.Series([]).sum() == 0 for simple things like dollar-weighted sums.

We won't be upgrading until this is fixed, and the fact that a breaking change like this was made has led us to seriously consider prohibiting pandas use within critical sections of our code base.
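
A toy version of the kind of code being described (the names here are illustrative, not taken from any actual code base):

```python
import pandas as pd

# A dollar-weighted sum over a position set that may be empty.
positions = pd.DataFrame({'price': [], 'qty': []})
exposure = (positions['price'] * positions['qty']).sum()
# pandas <= 0.20.x returns 0.0 here; 0.21.0 returns NaN, which is the
# behavior change that broke the unit tests described above.
print(exposure)
```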

@TomAugspurger (Contributor, Author)

@ryanbressler that discussion is happening on the mailing list: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We're using the issue to design the fix.

@TomAugspurger (Contributor, Author) commented Dec 12, 2017

I've started a branch for this at master...TomAugspurger:na-sum. I'll have a PR tomorrow, but briefly my approach was to add a parameter to sum and prod: empty_is_na. Here's the docstring for sum:

Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method
In [3]: pd.Series().sum()
Out[3]: 0.0

In [4]: pd.Series().sum(empty_is_na=True)
Out[4]: nan
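
For readers following along, the proposed semantics can be sketched in a few lines of plain NumPy (an illustration only, not the branch's actual implementation):

```python
import numpy as np

def sum_empty_is_na(values, skipna=True, empty_is_na=False):
    """Sketch of the proposed empty_is_na semantics (illustrative only)."""
    arr = np.asarray(values, dtype=float)
    # empty_is_na looks only at the length of the original input.
    if empty_is_na and arr.size == 0:
        return np.nan
    if skipna:
        return np.nansum(arr)  # an all-NaN array still sums to 0.0
    return arr.sum()           # NaN propagates when skipna=False
```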

@jorisvandenbossche (Member)

@ryanbressler would you like to voice this experience on the mailing list?

@ryanbressler

@TomAugspurger and @jorisvandenbossche it looks like I need to subscribe to that list to post? I'll do that after some meetings I have this morning. Feel free to delete my comments if they aren't appropriate on this thread, but I must say that expecting users to join the dev discussion list to report issues feels a bit unfriendly.

@TomAugspurger (Contributor, Author) commented Dec 12, 2017 via email

@jorisvandenbossche (Member)

@TomAugspurger regarding your proposal above, that would only deal with the empty case? So that would not yet provide an option for returning NaN on an all-NaN series?

@TomAugspurger (Contributor, Author)

Yep, so far just handling the empty case. My preference is for additional keywords to prod and sum, rather than a new method.

For the all-NA case,

Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, all_na_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.
all_na_is_na : bool, default False
    The result of operating on an all-NA array should be NA. The default
    behavior is for the sum of an all-NA array to be 0, and the product
    of an all-NA array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method

Are there any problematic interactions between these new keywords and skipna? Does empty_is_na apply to the series being empty before or after skipping the NA values? I'd propose that it applies before skipping, so that empty_is_na only has an effect when the original series is length-0.

So for an all-NA series, s = pd.Series([np.nan])

  1. s.sum(skipna=True, empty_is_na=False, all_na_is_na=False) -> 0
  2. s.sum(skipna=True, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=True, empty_is_na=True, all_na_is_na=False) -> 0
  4. s.sum(skipna=True, empty_is_na=False, all_na_is_na=True) -> NaN

Now skipna=False. How should we handle all_na_is_na?

  1. s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)
  2. s.sum(skipna=False, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=False, empty_is_na=True, all_na_is_na=False) -> 0 (this differs from np.sum)
  4. s.sum(skipna=False, empty_is_na=False, all_na_is_na=True) -> NaN

1 and 3 concern me a bit. I'll think about it a bit more.

For an empty series, the only thing that matters is empty_is_na. With empty_is_na=False, the return value is 0. With empty_is_na=True, the return value is NA, regardless of skipna and all_na_is_na.

@jorisvandenbossche (Member)

> My preference is for additional keywords to prod and sum, rather than a new method.

I am personally also in favor of additional keyword(s) instead of a new method. I don't really like the empty_is_na and all_na_is_na names, or the fact that there are two additional keywords, but at least they are explicit and allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)

> s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)

This one concerns me as well. I think all_na_is_na only makes sense if you are skipping NAs. I can't really imagine where you would want Series([1, NA]).sum() to return NA but Series([NA, NA]).sum() to return 0.
If skipna=False, any occurrence of an NA should turn the result into NA. Also, the deviation from numpy would be unfortunate (as now skipna=False basically gives you numpy's sum).

@TomAugspurger (Contributor, Author) commented Dec 13, 2017

> allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)

I think I talked myself out of the need for two keywords. My empty_is_na keyword should only have an effect if len(s) is 0, in which case there aren't even any NaNs to skip.

I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count. Though making Series([np.nan]).sum(skipna=False) return NaN (which we want) will complicate the explanation of all_na_is_na, so perhaps they're not easier to explain after all.

@shoyer (Member) commented Dec 13, 2017

Indeed, do we really need two new keywords?

I think empty_is_na should suffice, to satisfy both:

  1. pd.Series([np.nan]).sum(skipna=True, empty_is_na=True) -> NaN
  2. pd.Series([]).sum(skipna=True, empty_is_na=True) -> NaN

For skipna=False, which we only need for completeness:
3. pd.Series([np.nan]).sum(skipna=False, empty_is_na=True) -> NaN
4. pd.Series([]).sum(skipna=False, empty_is_na=True) -> NaN

(empty_is_na=False would always have an empty or all-NaN sum be 0.)

The potentially compelling argument in favor of two different keyword arguments is that it gives you a way to use the legacy sum from pandas <0.21 (without bottleneck). It would be an easy switch to use when migrating code, but even then I suspect there are only a very small number of applications that need different behavior between an empty and an all-NaN sum. If need be, I would prefer empty_is_na='legacy' to adding a separate all_na_is_na argument.

@jreback (Contributor) commented Dec 13, 2017

I would just add

fill_value=0 (if we really want option 1), or fill_value=None for option 2

then this would all work and be quite simple.
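
For concreteness, a rough sketch of how that suggestion could behave; note that fill_value is not an existing sum() parameter, this is purely illustrative:

```python
import numpy as np

def sum_with_fill_value(s, skipna=True, fill_value=0):
    """Sketch of the fill_value idea: the value returned when nothing
    remains to sum. fill_value=None would stand for option 2 (NaN)."""
    if not skipna and s.isna().any():
        return np.nan                    # NaN propagates with skipna=False
    vals = s.dropna() if skipna else s
    if len(vals) == 0:                   # empty, or all-NA after skipping
        return np.nan if fill_value is None else fill_value
    return vals.values.sum()
```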

@TomAugspurger (Contributor, Author) commented Dec 13, 2017

> (empty_is_na=False would always have an empty or all-NaN sum be 0.)

This seems strange, right? And it differs from pandas 0.20.3, which, with or without bottleneck, always returned 0 for pd.Series([]).sum(skipna=False).

EDIT: sorry, I misread. Ignore this.

@TomAugspurger (Contributor, Author)

@jreback would your fill_value only apply when skipna=True? Otherwise, we'd break pd.Series([np.nan]).sum(skipna=False) -> NaN.

@TomAugspurger (Contributor, Author) commented Dec 13, 2017

For reference, here's the 0.20.3 behavior on two series, pd.Series([]) and pd.Series([np.nan]):

| case | use_bn | skipna | length | result |
|------|--------|--------|--------|--------|
| 0    | True   | True   | 0      | 0.0    |
| 1    | True   | True   | 1      | 0.0    |
| 2    | True   | False  | 0      | 0.0    |
| 3    | True   | False  | 1      | NaN    |
| 4    | False  | True   | 0      | 0.0    |
| 5    | False  | True   | 1      | NaN    |
| 6    | False  | False  | 0      | 0.0    |
| 7    | False  | False  | 1      | NaN    |
import itertools

import numpy as np
import pandas as pd

use_bn = [True, False]
skipna = [True, False]
series = [pd.Series(), pd.Series([np.nan])]

values = []
for bn, skip, s in itertools.product(use_bn, skipna, series):
    with pd.option_context("compute.use_bottleneck", bn):
        result = s.sum(skipna=skip)
        values.append((bn, skip, len(s), result))

r = pd.DataFrame(values, columns=['use_bn', 'skipna', 'length', 'result'])
r.index.name = 'case'

@TomAugspurger (Contributor, Author) commented Dec 13, 2017

Could people give thoughts on this table for the desired behavior?

| case | skipna | length | empty_is_na | result |
|------|--------|--------|-------------|--------|
| A    | True   | empty  | False       | 0      |
| B    | True   | empty  | True        | NaN    |
| C    | True   | all-NA | False       | 0      |
| D    | True   | all-NA | True        | NaN?   |
| E    | False  | empty  | False       | 0?     |
| F    | False  | empty  | True        | NaN    |
| G    | False  | all-NA | False       | NaN    |
| H    | False  | all-NA | True        | NaN    |

I think if we're OK with D being NaN, then we only need the one keyword (whatever it's called). The semantics for empty_is_na=True with skipna=True are then "Is the array empty after removing NaNs? If so, the result is NaN". If we want D to be 0, then we'll need another keyword, or a 'legacy' option.

With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.
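
Spelled out as code, the single-keyword rule looks roughly like this (a sketch of the semantics, not pandas internals; the letters refer to the table's cases):

```python
import numpy as np

def sum_one_keyword(values, skipna=True, empty_is_na=False):
    """Illustration of the table above (sketch only)."""
    arr = np.asarray(values, dtype=float)
    if not skipna and np.isnan(arr).any():
        return np.nan                    # cases G, H
    kept = arr[~np.isnan(arr)]
    if empty_is_na and kept.size == 0:
        return np.nan                    # cases B, D, F
    return kept.sum()                    # cases A, C, E -> 0.0
```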

@shoyer (Member) commented Dec 13, 2017

@TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"

To me, this implies D=NaN, exactly as you reasoned.

The other missing entry in your table is Case E=0. This is the same as the sum of an empty array in NumPy (np.sum([])).

> With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.

I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician, and want a rule that makes sense "most" of the time. (Of course, that's the worst sort of consistency for APIs, because it suggests a behavior that doesn't hold in all edge cases.)

@TomAugspurger (Contributor, Author)

> I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

Sorry, I think I was staring at that table for too long and started going crazy :)

@jorisvandenbossche (Member)

> I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count

+1

@jreback In the end, an empty_is_na=True/False or fill_value=np.nan/0 would be equivalent in behaviour I think, so it is a question of which name we like most.
But I think the "easier to explain" argument also holds against fill_value IMO: fill_value would actually be the value to return in case of empty data (after skipping the NAs or not), so it would rather be something like "empty_result_value".

> @TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"
>
> To me, this implies D=NaN, exactly as you reasoned.

I agree too.

> empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician

So that would provide the option of all-NA giving NA (so you are signaled that you have a full-NA column), but keep the empty sum 0.
Do we think this is a valid option to want? Because if so, we shouldn't call it "legacy" but something like "mixed" (although that certainly doesn't 'tell' you what it does either).


Something else, regarding "extra keyword vs new method": the advantage of a separate method is that, for people who consistently want the non-default option, it could be much easier / less verbose to have an alternative method instead of having to add empty_is_na=True all over their code base (of course they can make such an alternative function themselves, as sketched below, but then it is not a method on the dataframe).
(One problem with this, however, is that a possible name like .total() would be very confusing if we choose option 1 as the default, as then total() would be option 2, which is the other way around from SQL.)
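
A sketch of such a user-level helper (the name sum_or_na is made up; the empty_is_na keyword it alludes to is the proposal above, not a released API):

```python
import numpy as np

def sum_or_na(s):
    # Bake the non-default choice in once, instead of repeating it at
    # every call site. Written against plain pandas so it runs today;
    # with the proposed keyword it would just be s.sum(empty_is_na=True).
    return np.nan if s.dropna().empty else s.sum()
```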

@shoyer (Member) commented Dec 13, 2017

> So that would provide the option of all-NA giving NA (so you are signaled that you have a full-NA column), but keep the empty sum 0.
> Do we think this is a valid option to want?

In all the user feedback on this issue, I do not recall any requests for this mixed behavior. So I think we can safely leave it out.

@TomAugspurger (Contributor, Author)

> The advantage of a separate method is that for people who consistently want to use the non-default option, it could be much easier / less verbose to have an alternative method instead of having to add empty_is_na=True all over their code base

Yes, I was thinking that this morning as well...

@TomAugspurger (Contributor, Author)

Turning to groupby with categoricals: if we implement option 1, we'll want categories with no observations to return 0 for sum.

In [32]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'],
    ...:                                         categories=['a', 'b', 'c']),
    ...:                    'B': [1, 2, 1]})
    ...: result = df.groupby("A").B.sum()
    ...:

In [33]: result
Out[33]:
A
a    2
b    2
c    0
Name: B, dtype: int64

Currently, aggregations like np.nansum don't behave as I would expect:

In [3]: pd.__version__
Out[3]: '0.20.3'

In [8]: df.groupby("A").B.agg(lambda x: np.nansum(x))
Out[8]:
A
a    2.0
b    2.0
c    NaN
Name: B, dtype: float64

Perhaps custom aggregation functions aren't called on an empty array to see what the return type is?

@TomAugspurger (Contributor, Author)

Another (large) place this affects is upsampling:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.20.3'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
2017-01-01 01:30:00    NaN
2017-01-01 02:00:00    2.0
2017-01-01 02:30:00    NaN
2017-01-01 03:00:00    3.0
Freq: 30T, dtype: float64

On my branch:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.22.0.dev0+364.g70e9bf903.dirty'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
2017-01-01 01:30:00    0
2017-01-01 02:00:00    2
2017-01-01 02:30:00    0
2017-01-01 03:00:00    3
Freq: 30T, dtype: int64

which I suspect is what we want, but it's another thing I'll make sure to document.

@TomAugspurger (Contributor, Author)

> which I suspect is what we want

Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

@jreback (Contributor) commented Dec 19, 2017

I am -1 on changing anything in groupby / resample.

sum of NaN is NaN; anything else is completely confusing

@shoyer (Member) commented Dec 19, 2017

A sum of empty/all-NaN values in resample should be no different than a sum of empty/all-NaN values elsewhere. So that means if we make sum of empty/all-NaN be 0, then it should be 0 for resample, too.

The fact that it's not entirely clear whether resampling is over all-NaN or empty values is another indication that it is important to keep the empty sum and all-NaN sum consistent.

It would be extremely confusing to use different rules for resample/groupby than elsewhere.

@jreback (Contributor) commented Dec 19, 2017

Well, sum of all-NaN should be NaN, as discussed many times.
Sure, sum of empty can be 0.

@TomAugspurger (Contributor, Author)

> Well, sum of all-NaN should be NaN, as discussed many times.

I think the majority opinion in https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html is that the sum of all-NaN should be 0.

@jorisvandenbossche (Member)

I agree we should be consistent, so if we choose the option where empty/all-NaN sums to 0 for the normal sum, it should also be the case for resample/groupby.
Although I kind of liked the NaNs in the upsampling, and this could be one of the cases where the change would actually pop up quite often in user code and have a larger impact on users. We could maybe consider raising a warning first that it will change (with the keyword already available to silence it).

@TomAugspurger (Contributor, Author) commented Dec 19, 2017 via email

@gfyoung (Member) commented Dec 20, 2017

Dipping my toes into this: I personally would be in favor of a separate function along similar lines to what @jorisvandenbossche said above.

In addition, the difference between the two sum / prod functions would be the handling of the zero/NaN cases discussed in this issue, while the "core" (private) sum / prod functionality would handle all other cases (that's my current thinking ATM).

One might consider calling these functions np_sum and np_prod because the results would mirror those from calling np.array([]).sum() and np.array([np.nan]).sum(), depending on whether you want to skip over NaN or not.

For example, if you want all-NaN to sum to NaN, pass in skipna=False. Otherwise, the sum is zero, as the element set that you're summing over would technically be empty.
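
A minimal sketch of what such an np_sum could look like, assuming it simply defers to NumPy (illustrative; no such pandas function exists):

```python
import numpy as np

def np_sum(values, skipna=True):
    # Mirror NumPy directly: np.nansum for skipna=True, np.sum otherwise.
    # Then np.array([]) and np.array([np.nan]) both sum to 0.0 with
    # skipna=True, while NaN propagates with skipna=False.
    arr = np.asarray(values, dtype=float)
    return np.nansum(arr) if skipna else np.sum(arr)
```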

@jreback (Contributor) commented Dec 21, 2017

@TomAugspurger to avoid confusion, we should rename v0.22.0 in the whatsnew to v0.23.0 and add a new v0.22.0.

@toobaz (Member) commented Dec 22, 2017

> Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

I just came across this behaviour today, and I was surprised/annoyed I didn't get 0s: partly because (as I have maybe repeated too much already) that's what I expected, but also because I needed to build a histogram, so I had to manually fillna() (and astype(int), as I was working with integers). If I had wanted NaNs, it would have been more natural to reindex.
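
The manual workaround described there looks something like this (a sketch, reusing the toy series from the resample example earlier in the thread):

```python
import pandas as pd

s = pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H'))
# Upsample, then undo the NaNs that 0.20.x introduces and restore the
# integer dtype before building the histogram.
counts = s.resample('30T').sum().fillna(0).astype(int)
```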

@jorisvandenbossche (Member)

@toobaz Yes, I think we decided anyway that it should be consistent, so once we change the default to 0, it will also hold for resample.
