Sum of an empty Series is now nan, should be 0 (regression in 0.21 from 0.20.3) #18200

Closed
sam-s opened this issue Nov 9, 2017 · 11 comments

Comments

@sam-s

sam-s commented Nov 9, 2017

Code Sample, a copy-pastable example if possible

Current (0.21) incorrect behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.21.0'
>>> pd.Series().sum()
nan

Old (0.20.3) correct behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> pd.Series().sum()
0

Problem description

Sum of an empty series should be 0, not nan, because otherwise the following invariant is violated:

pd.concat([s1,s2]).sum() == s1.sum() + s2.sum()
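
For illustration, a minimal sketch of the invariant breaking under 0.21.0 (where pd.Series().sum() is nan, as shown above):

>>> import pandas as pd
>>> s1, s2 = pd.Series([1.0, 2.0]), pd.Series()
>>> pd.concat([s1, s2]).sum()
3.0
>>> s1.sum() + s2.sum()   # 3.0 + nan
nan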

Expected Output

pd.Series().sum() should return 0, as it did in 0.20.3.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@max-sixty
Contributor

#17630

@jreback
Contributor

jreback commented Nov 9, 2017

@jreback jreback closed this as completed Nov 9, 2017
@jreback
Contributor

jreback commented Nov 9, 2017

Note that your invariant holds with sum of empty = NaN; otherwise you lose information.

@sam-s
Author

sam-s commented Nov 9, 2017

The invariant no longer holds:

>>> pd.concat([pd.Series([1]),pd.Series()]).equals(pd.Series([1]))
True

thus the following should hold:

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 and is now False.

@jorisvandenbossche
Member

pandas has always had this behavior

@jreback this is not true, pandas always had the 0 behaviour for empty series (we broke this behaviour on purpose, for sure, but it was a breaking change)

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 and is now False.

This only evaluates to False because + between scalars (0 + np.nan) does not skip NaNs the way the sum of a pandas Series does (if you have a Series with [0, np.nan] and sum it, you get 0).
So I am not sure your comparison really holds.
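
A minimal sketch of that distinction (assuming Series.sum's default skipna=True):

>>> import numpy as np, pandas as pd
>>> 0 + np.nan
nan
>>> pd.Series([0, np.nan]).sum()
0.0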

@sam-s we had long discussions about it, and both 0 and NaN behaviours have pros and cons, but in the end we needed to make a decision, which became NaN. You can read this discussion in #9422

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 9, 2017
@sam-s
Author

sam-s commented Nov 9, 2017

Okay, I am sure it's too late for me to weep and scream.
However, how do I check that there are no NaNs in a DataFrame?
I used df.isnull().sum().sum() == 0 before.
What do I do now?

@jorisvandenbossche
Member

I used df.isnull().sum().sum() == 0 before.

Doesn't that still work?

In [30]: pd.Series([1, 2, 3]).isnull().sum()
Out[30]: 0

In [31]: pd.Series([1, 2, np.nan]).isnull().sum()
Out[31]: 1

Or can you give a concrete example?

@jorisvandenbossche
Member

Okay, I am sure it's too late for me to weep and scream.

Probably yes, but you can still raise your voice in #9422. However, I think it is mainly interesting to hear how it affects code (what do you need to do to work around the change) and how that can be made easier.

@sam-s
Author

sam-s commented Nov 12, 2017

+ vs sum

@jorisvandenbossche : you appear to be saying that + for scalars is somehow a different beast than sum for collections (lists/sets/series &c). This is so, well, unexpected, that I neglected to address it until you wrote in #9422 "... two different kinds of sums ...". I cannot comment there anymore, so I will reply here.
Again, I understand that math is not your only rationale, but I beg you to remember that the bulk of your customers are applied mathematicians. What I am about to say is the way we think about these issues, and it is why we will be screaming bloody murder about sum([]) == null till the end of the world and back.

When we have an associative binary operation f, such as +, we can define it on lists with at least two elements like this:

f(a,b,c,d,...z) := f(...(f(f(f(a,b),c),d),...),z)

or, in a more familiar infix notation

a+b+c+d+...+z := (((((a+b)+c)+d)+...)+z)

When the operation has a unit u (e.g., 0 for + or 1 for *), i.e., f(u,x) = f(x,u) = x for any x, we can extend the list operation to lists of any length, because

f(a,b,c,...z) = f(...(f(f(f(u,a),b),c),...),z)

and now f(a)=f(u,a)=a and f()=u.
(Note, parenthetically, that, strictly speaking, + is a binary operation, yet you are not arguing that the sum of a one-element list is undefined. Why? How is the sum of an empty list any different from this point of view?)

This definition is natural for any associative binary operation OP with a neutral element UNIT, and it works like this (this is a mathematical definition identical to the above):

from operator import add  # '+' as a function

class Collection(list):
  def apply(self, OP, UNIT):
    result = UNIT
    for x in self:
      result = OP(result, x)
    return result

Clearly Collection([x]).apply(add, 0) == x and Collection([]).apply(add, 0) == 0.

This is fully applicable to any such associative binary operation with a neutral element, e.g.:

  • + and 0
  • * and 1
  • max and -inf
  • min and inf
  • and/all and True
  • or/any and False

The bottom line: OP([])=null is wrong because it breaks associativity and your customers expect associativity the same way you expect your C compiler to be able to compile a+=0 so that a does not change.
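
For what it's worth, Python's own built-ins already follow this convention for several of the operations above (a quick sketch; max's default argument needs Python 3.4+ and math.prod needs 3.8+):

>>> import math
>>> sum([])
0
>>> math.prod([])
1
>>> all([])
True
>>> any([])
False
>>> max([], default=float('-inf'))
-inf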

@sam-s
Author

sam-s commented Nov 12, 2017

Any vs All

The NaN/null behavior should be defined for collections that contain some bad (NaN/null) data.
Whether the collection also contains some valid data is irrelevant.
E.g., in the presence of null, the above definition becomes

import pandas as pd  # pd.isnull handles both NaN and None

class Collection(list):
  def apply(self, OP, UNIT, propagate_null=True):
    result = UNIT
    for x in self:
      if pd.isnull(x) and propagate_null:
        return float('nan')  # the null propagates
      result = OP(result, x)
    return result

Clearly apply will return null if there IS a null, not if there are NO non-nulls.
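
For comparison, pandas already exposes roughly this propagate_null switch as the skipna flag of sum (a quick sketch, assuming the default skipna=True):

>>> import numpy as np, pandas as pd
>>> pd.Series([1.0, np.nan]).sum()               # skipna=True: ignore nulls
1.0
>>> pd.Series([1.0, np.nan]).sum(skipna=False)   # propagate nulls
nan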

@shoyer writes in #9422: "This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction."

The choice made by databases is a mistake made by the designers of SQL, as explained by @kenahoo. There is no reason for pandas to repeat the mistake.

Please do not perpetuate a mistake made by others.
You are not beholden to them.

@sam-s
Author

sam-s commented Nov 12, 2017

Finally, note that the above argument is applicable only to associative operations.
std and mean are not such operations.
You can make an argument for both

  • mean([]) = std([]) = std([x]) = NaN

and

  • mean([]), std([]), std([x]) --> exception

Both are legitimate design choices, and while I personally prefer the second one, I defer to you.
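
For reference, pandas currently takes the NaN route here (a quick sketch):

>>> import pandas as pd
>>> pd.Series().mean()
nan
>>> pd.Series([1.0]).std()   # sample std needs at least two points
nan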

However, sum([])=nan is not a legitimate design decision because it breaks the contract of addition (yes, associativity is a part of the contract).

@pandas-dev pandas-dev locked and limited conversation to collaborators Nov 12, 2017

4 participants