Sum of an empty Series is now nan, should be 0 (regression in 0.21 from 0.20.3) #18200
Comments
pls read the what’s new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#sum-prod-of-all-nan-series-dataframes-is-now-consistently-nan. pandas has always had this behavior. |
note that your invariant holds with sum of empty = NaN |
The invariant no longer holds:
thus should be
which was |
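One way to read the disagreement above is as additivity under concatenation: the sum of two collections joined together should equal the sum of their individual sums. The sketch below uses plain Python lists rather than pandas, and `total` with its `empty_value` parameter is a hypothetical helper invented for illustration. The exact invariant from the report is not reproduced in this thread, so this form is an assumption.

```python
import math


def total(values, empty_value):
    """Sum a list, returning `empty_value` when the list is empty.

    Hypothetical helper: `empty_value` stands in for a library's
    choice of result for an empty input (0 or NaN).
    """
    if not values:
        return empty_value
    return sum(values)


def additive(a, b, empty_value):
    """Check total(a + b) == total(a) + total(b) for one pair of lists."""
    lhs = total(a + b, empty_value)
    rhs = total(a, empty_value) + total(b, empty_value)
    return lhs == rhs


# With 0 as the empty sum, concatenating with an empty list changes nothing.
zero_ok = additive([1.0, 2.0], [], 0.0)

# With NaN as the empty sum, the right-hand side becomes 3.0 + nan = nan,
# and nan compares unequal to every value, so the check fails.
nan_ok = additive([1.0, 2.0], [], math.nan)

print(zero_ok, nan_ok)
```

Under these assumptions the check passes with 0 and fails with NaN, which is the kind of breakage the reporter describes; whether it matches the reporter's own example cannot be confirmed from the text here.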
@jreback this is not true, pandas always had the 0 behaviour for empty series (we broke this behaviour on purpose, for sure, but it was a breaking change)
This only evaluates to False because the
@sam-s we had long discussions about it, and both 0 and NaN behaviours have pros and cons, but in the end we needed to make a decision, which became NaN. You can read this discussion in #9422 |
Okay, I am sure it's too late for me to weep and scream. |
Doesn't that still work?
Or can you give a concrete example? |
Probably yes, but you can still raise your voice in #9422. However, I think it is mainly interesting to hear how it affects code (what you need to do to work around the change) and how this can be made easier. |
|
Any vs All
The NaN/null behavior should be defined for collections which have some bad (NaN/null) data.
Clearly
@shoyer writes in #9422: This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction.
The choice made by databases is a mistake made by the designers of SQL, as explained by @kenahoo. There is no reason for pandas to repeat the mistake. Please do not perpetuate a mistake made by others. |
Finally, note that the above argument is applicable only to associative operations.
and
Both are legitimate design choices and, while, personally, I prefer the second one, I defer to you. However, |
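The remark about associative operations can be made concrete with a fold: when an operation has an identity element, folding an empty sequence returns that identity, which is why 0 is the conventional empty sum and 1 the conventional empty product. This is a minimal sketch using only the standard library; `fold` is a hypothetical name, and it illustrates the mathematical convention rather than how pandas implements its reductions.

```python
from functools import reduce
import operator


def fold(op, identity, values):
    """Left-fold `values` with `op`, starting from `identity`."""
    return reduce(op, values, identity)


empty_sum = fold(operator.add, 0, [])      # identity of addition
empty_prod = fold(operator.mul, 1, [])     # identity of multiplication
empty_all = fold(operator.and_, True, [])  # same answer as all([])
empty_any = fold(operator.or_, False, [])  # same answer as any([])

# Splitting a sequence anywhere, including at an empty piece, gives the
# same result, because the identity contributes nothing to the total.
xs = [2, 3, 4]
whole = fold(operator.add, 0, xs)
parts = fold(operator.add, 0, xs[:0]) + fold(operator.add, 0, xs[0:])
split_ok = whole == parts
```

The same reasoning is what makes the built-in `all([])` return `True` and `any([])` return `False`.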
Code Sample, a copy-pastable example if possible
Current (0.21) incorrect behavior:
Old (0.20.3) correct behavior:
Problem description
Sum of an empty series should be 0, not nan, because otherwise the following invariant is violated:
Expected Output
Output of pd.show_versions()
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None