Sum of an empty Series is now nan, should be 0 (regression in 0.21 from 0.20.3) #18200

Closed
sam-s opened this issue Nov 9, 2017 · 11 comments

Comments

@sam-s

sam-s commented Nov 9, 2017

Code Sample, a copy-pastable example if possible

Current (0.21) incorrect behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.21.0'
>>> pd.Series().sum()
nan

Old (0.20.3) correct behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> pd.Series().sum()
0

Problem description

Sum of an empty series should be 0, not nan, because otherwise the following invariant is violated:

pd.concat([s1,s2]).sum() == s1.sum() + s2.sum()
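
For illustration, a minimal sketch of the invariant breaking under 0.21.0 (where pd.Series().sum() is nan, as shown above):

>>> import pandas as pd
>>> s1, s2 = pd.Series([1.0, 2.0]), pd.Series()
>>> pd.concat([s1, s2]).sum()
3.0
>>> s1.sum() + s2.sum()   # 3.0 + nan
nan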

Expected Output

pd.Series().sum() should return 0, as it did in 0.20.3.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@max-sixty
Contributor

#17630

@jreback
Contributor

jreback commented Nov 9, 2017

@jreback jreback closed this as completed Nov 9, 2017
@jreback
Contributor

jreback commented Nov 9, 2017

Note that your invariant holds with sum of empty = NaN; otherwise you lose information.

@sam-s
Author

sam-s commented Nov 9, 2017

The invariant no longer holds:

>>> pd.concat([pd.Series([1]),pd.Series()]).equals(pd.Series([1]))
True

thus the following should hold:

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 and is now False.

@jorisvandenbossche
Member

pandas has always had this behavior

@jreback this is not true, pandas always had the 0 behaviour for empty series (we broke this behaviour on purpose, for sure, but it was a breaking change)

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 and is now False.

This only evaluates to False because + between scalars (0 + np.nan) does not skip NaNs the way the sum of a pandas Series does (if you have a Series with [0, np.nan] and sum it, you get 0).
So I am not sure your comparison really holds.
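
A minimal sketch of that distinction (assuming Series.sum's default skipna=True):

>>> import numpy as np, pandas as pd
>>> 0 + np.nan
nan
>>> pd.Series([0, np.nan]).sum()
0.0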

@sam-s we had long discussions about it, and both 0 and NaN behaviours have pros and cons, but in the end we needed to make a decision, which became NaN. You can read this discussion in #9422

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 9, 2017
@sam-s
Author

sam-s commented Nov 9, 2017

Okay, I am sure it's too late for me to weep and scream.
However, how do I check that there are no NaNs in a DataFrame?
I used df.isnull().sum().sum() == 0 before.
What do I do now?

@jorisvandenbossche
Member

I used df.isnull().sum().sum() == 0 before.

Doesn't that still work?

In [30]: pd.Series([1, 2, 3]).isnull().sum()
Out[30]: 0

In [31]: pd.Series([1, 2, np.nan]).isnull().sum()
Out[31]: 1

Or can you give a concrete example?

@jorisvandenbossche
Member

Okay, I am sure it's too late for me to weep and scream.

Probably yes, but you can still raise your voice in #9422. However, I think it is mainly interesting to hear how it affects code (what do you need to do to work around the change) and how that can be made easier.

@sam-s
Author

sam-s commented Nov 12, 2017

+ vs sum

@jorisvandenbossche : you appear to be saying that + for scalars is somehow a different beast than sum for collections (lists/sets/series &c). This is so, well, unexpected, that I neglected to address it until you wrote in #9422 "... two different kinds of sums ...". I cannot comment there anymore, so I will reply here.
Again, I understand that math is not your only rationale, but I beg you to remember that the bulk of your customers are applied mathematicians. What I am about to say is the way we think about these issues, and it is why we will be screaming bloody murder about sum([]) == null till the end of the world and back.

When we have an associative binary operation f, such as +, we can define it on lists with at least two elements like this:

f(a,b,c,d,...z) := f(...(f(f(f(a,b),c),d),...),z)

or, in a more familiar infix notation

a+b+c+d+...+z := (((((a+b)+c)+d)+...)+z)

When the operation has a unit u (e.g., 0 for + or 1 for *), i.e., f(u,x) = f(x,u) = x for any x, we can extend the list operation to lists of any length, because

f(a,b,c,...z) = f(...(f(f(f(u,a),b),c),...),z)

and now f(a)=f(u,a)=a and f()=u.
(Note, parenthetically, that, strictly speaking, + is a binary operation, yet you are not arguing that the sum of a one-element list is undefined. Why? How is the sum of an empty list any different from this point of view?)

This definition is natural for any associative binary operation OP with a neutral element UNIT, and it works like this (this is a mathematical definition identical to the above):

from operator import add  # '+' as a function

class Collection(list):
  def apply(self, OP, UNIT):
    result = UNIT
    for x in self:
      result = OP(result, x)
    return result

Clearly Collection([x]).apply(add, 0) == x and Collection([]).apply(add, 0) == 0.

This is fully applicable to any such associative binary operation with a neutral element, e.g.:

  • + and 0
  • * and 1
  • max and -inf
  • min and inf
  • and/all and True
  • or/any and False

The bottom line: OP([])=null is wrong because it breaks associativity and your customers expect associativity the same way you expect your C compiler to be able to compile a+=0 so that a does not change.
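
For what it's worth, Python's own built-ins already follow this convention for several of the operations above (a quick sketch; max's default argument needs Python 3.4+ and math.prod needs 3.8+):

>>> import math
>>> sum([])
0
>>> math.prod([])
1
>>> all([])
True
>>> any([])
False
>>> max([], default=float('-inf'))
-inf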

@sam-s
Author

sam-s commented Nov 12, 2017

Any vs All

The NaN/null behavior should be defined for collections that contain some bad (NaN/null) data.
Whether the collection also contains some valid data is irrelevant.
E.g., in the presence of null, the above definition becomes

import pandas as pd  # pd.isnull handles both NaN and None

class Collection(list):
  def apply(self, OP, UNIT, propagate_null=True):
    result = UNIT
    for x in self:
      if pd.isnull(x) and propagate_null:
        return float('nan')  # the null propagates
      result = OP(result, x)
    return result

Clearly apply will return null if there IS a null, not if there are NO non-nulls.
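
For comparison, pandas already exposes roughly this propagate_null switch as the skipna flag of sum (a quick sketch, assuming the default skipna=True):

>>> import numpy as np, pandas as pd
>>> pd.Series([1.0, np.nan]).sum()               # skipna=True: ignore nulls
1.0
>>> pd.Series([1.0, np.nan]).sum(skipna=False)   # propagate nulls
nan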

@shoyer writes in #9422: "This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction."

The choice made by databases is a mistake made by the designers of SQL, as explained by @kenahoo. There is no reason for pandas to repeat the mistake.

Please do not perpetuate a mistake made by others.
You are not beholden to them.

@sam-s
Author

sam-s commented Nov 12, 2017

Finally, note that the above argument is applicable only to associative operations.
std and mean are not such operations.
You can make an argument for both

  • mean([]) = std([]) = std([x]) = NaN

and

  • mean([]), std([]), std([x]) --> exception

Both are legitimate design choices, and while I personally prefer the second one, I defer to you.
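
For reference, pandas currently takes the NaN route here (a quick sketch):

>>> import pandas as pd
>>> pd.Series().mean()
nan
>>> pd.Series([1.0]).std()   # sample std needs at least two points
nan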

However, sum([])=nan is not a legitimate design decision because it breaks the contract of addition (yes, associativity is a part of the contract).

@pandas-dev pandas-dev locked and limited conversation to collaborators Nov 12, 2017

4 participants