
BUG: Summation of NaT in a DataFrame with axis=1 does not return NaT #17235


Closed
s-celles opened this issue Aug 12, 2017 · 13 comments · Fixed by #37148
Labels
Bug Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations Reduction Operations sum, mean, min, max, etc.

@s-celles
Contributor

s-celles commented Aug 12, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

dat = """s1;s2
1000;2000
3000;1500
500"""

df = pd.read_csv(io.StringIO(dat), sep=";")
df.index = df.index + 1
df = df.apply(lambda x: pd.to_timedelta(x, unit='ms'))

df['laptime'] = df.sum(axis=1, skipna=False)

print(df)

Problem description

This code displays

               s1              s2                       laptime
1        00:00:01        00:00:02               0 days 00:00:03
2        00:00:03 00:00:01.500000        0 days 00:00:04.500000
3 00:00:00.500000             NaT -106752 days +00:12:43.645223
In [81]: df['laptime'][3]
Out[81]: Timedelta('-106752 days +00:12:43.645223')

Expected Output

               s1              s2                       laptime
1        00:00:01        00:00:02               0 days 00:00:03
2        00:00:03 00:00:01.500000        0 days 00:00:04.500000
3 00:00:00.500000             NaT                           NaT
In [81]: df['laptime'][3]
Out[81]: NaT

That's very strange, because pd.to_timedelta(500, unit='ms') + pd.NaT returns NaT (which is expected), but it doesn't work as expected when summing Series.

Output of pd.show_versions()

: pd.show_versions()
/Users/scls/anaconda/lib/python3.6/site-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
from pandas.tslib import OutOfBoundsDatetime

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.20.1
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0

@gfyoung gfyoung added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 12, 2017
@jreback
Contributor

jreback commented Aug 12, 2017

you are specifying skipna=False, which is not the default, thus it is NOT skipping nans.

@jreback jreback closed this as completed Aug 12, 2017
@jreback jreback added this to the No action milestone Aug 12, 2017
@s-celles
Contributor Author

00:00:00.500000 + NaT should return NaT (not -106752 days +00:12:43.645223)

@s-celles
Contributor Author

with skipna=True

In [110]: df.sum(axis=1, skipna=True)
Out[110]:
1          00:00:03
2   00:00:04.500000
3   00:00:00.500000
dtype: timedelta64[ns]

So whatever the value of skipna is, I never get NaT, which is the expected result.

@jreback
Contributor

jreback commented Aug 12, 2017

In [8]: pd.Timedelta('00:00:00.500000') + pd.NaT 
Out[8]: NaT

again you are specifying skipna=True. This by-definition skips NA (and NaT)

@s-celles
Contributor Author

And if I don't skip NA (and NaT), i.e. with skipna=False, I should get NaT for laptime 3, not -106752 days +00:12:43.645223.

What does -106752 days +00:12:43.645223 mean?

According to the docs:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

So with skipna=False I expect NA/null values not to be excluded and, as you point out, I expect a kind of "absorptivity" of NaT (see row 3):

               s1              s2                       laptime
   ...
3 00:00:00.500000             NaT -106752 days +00:12:43.645223

@s-celles
Contributor Author

df.sum(axis=1, skipna=False) should return the same result as df['s1'] + df['s2'], which is not the case.
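
A minimal sketch of the comparison being asked for here, assuming a reasonably recent pandas install. The frame is rebuilt directly instead of going through read_csv (an illustrative shortcut, not the reporter's code). Element-wise addition propagates NaT; the row-wise reduction is printed without an expected value because its result depends on whether the installed version includes the fix linked to this issue (#37148).

```python
import pandas as pd

# Same values as the report, built directly as timedelta columns.
df = pd.DataFrame(
    {
        "s1": pd.to_timedelta([1000, 3000, 500], unit="ms"),
        "s2": pd.to_timedelta(
            [pd.Timedelta(milliseconds=2000), pd.Timedelta(milliseconds=1500), pd.NaT]
        ),
    },
    index=[1, 2, 3],
)

# Element-wise addition: NaT is absorbing, so row 3 is NaT.
elementwise = df["s1"] + df["s2"]
print(elementwise)

# Row-wise reduction without skipping missing values. On affected versions
# row 3 came back as a large negative Timedelta instead of NaT.
rowwise = df.sum(axis=1, skipna=False)
print(rowwise)
```

If the two printed Series differ in row 3, the installed version still shows the behaviour reported above.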

@chris-b1
Contributor

@jreback - I think this is buggy: we're not respecting NaT semantics, but instead including the sentinel value in the sum.

In [3]: df['s2']
Out[3]: 
1          00:00:02
2   00:00:01.500000
3               NaT
Name: s2, dtype: timedelta64[ns]

In [4]: df['s2'].sum()
Out[4]: Timedelta('0 days 00:00:03.500000')

In [5]: df['s2'].sum(skipna=False)
Out[5]: Timedelta('-106752 days +00:12:46.645224')

In [6]: df['s2'].values.sum()
Out[6]: numpy.timedelta64('NaT','ns')
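
The odd values above can be reproduced with plain integer arithmetic, which may help answer the earlier question about what -106752 days means. This sketch assumes only that NaT is stored as the minimum int64 value and that timedeltas are stored as nanoseconds; `describe` is a hypothetical helper written for this illustration, not a pandas function. The float64 variant is an assumption based on the later comment in this thread that nansum used np.float64 for timedelta data.

```python
NAT = -2**63                     # int64 minimum, the sentinel that encodes NaT
NS_PER_DAY = 86_400 * 10**9      # nanoseconds in one day


def describe(ns):
    """Render a nanosecond count in a 'D days +HH:MM:SS.nnnnnnnnn' layout."""
    days, rem = divmod(ns, NS_PER_DAY)      # floor division, so rem >= 0
    secs, nanos = divmod(rem, 10**9)
    hours, rest = divmod(secs, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days} days +{hours:02d}:{minutes:02d}:{seconds:02d}.{nanos:09d}"


# df['s2'].sum(skipna=False): 2 s + 1.5 s + the NaT sentinel, added as integers.
series_sum = NAT + 2_000_000_000 + 1_500_000_000
print(describe(series_sum))      # -106752 days +00:12:46.645224192

# Row 3 of the DataFrame: 0.5 s + the NaT sentinel, accumulated in float64,
# which rounds to the nearest representable value (spacing 1024 ns here).
row_sum_float = int(float(NAT) + float(500_000_000))
print(describe(row_sum_float))   # -106752 days +00:12:43.645223936
```

Truncated to microseconds, these match the two values reported in this thread (+00:12:46.645224 for the Series and +00:12:43.645223 for the row-wise sum), which is consistent with the sentinel being added as if it were a real duration.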

@chris-b1 chris-b1 reopened this Aug 13, 2017
@jreback
Contributor

jreback commented Aug 13, 2017

These are converted to i8 before adding (and then masked, which is not happening, I guess).
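
A rough sketch of the masking step described here, in pure Python. `nat_aware_sum` is a hypothetical name used only for illustration; it is not pandas' implementation, and it assumes the values are int64 nanosecond counts with the int64 minimum standing in for NaT.

```python
NAT = -2**63  # int64 sentinel that encodes NaT


def nat_aware_sum(values_ns, skipna=True):
    """Sum nanosecond integers while treating the sentinel as missing."""
    has_nat = any(v == NAT for v in values_ns)
    if has_nat and not skipna:
        # Missing values are not skipped, so the result itself is missing.
        return NAT
    # Mask out the sentinel so it never enters the arithmetic.
    return sum(v for v in values_ns if v != NAT)


row3 = [500_000_000, NAT]                 # 0.5 s and NaT, as in row 3
print(nat_aware_sum(row3, skipna=True))   # 500000000
print(nat_aware_sum(row3, skipna=False) == NAT)  # True
```

The point is that the sentinel is checked before any addition, so it can only ever be skipped or propagated, never summed.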

@mroeschke
Member

Looks to be fixed on master now. Could use a regression test.

In [20]: df['s2'].sum(skipna=False)
Out[20]: NaT

In [22]: pd.__version__
Out[22]: '0.26.0.dev0+519.gdf2e0813e'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question labels Oct 9, 2019
@gfyoung gfyoung removed this from the No action milestone Oct 24, 2019
@ainsleyto

df["s2"].sum(skipna=False)
NaT

df["s1"] + df["s2"]
1          00:00:03
2   00:00:04.500000
3               NaT

df[["s1", "s2"]].sum(axis=1, skipna=False)
1                 0 days 00:00:03
2          0 days 00:00:04.500000
3   -106752 days +00:12:43.645223

pd.__version__
'0.26.0.dev0+734.g0de99558b'

The sum of s2 gives NaT, but sum(axis=1, skipna=False) still does not give NaT. I believe this still needs a fix?

@mroeschke
Member

Ah, good catch @ainsleyto. Yes, it looks like the axis=1 case still needs a fix.

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations Datetime Datetime data dtype and removed Needs Tests Unit test(s) needed to prevent regressions good first issue labels Nov 3, 2019
@mroeschke mroeschke changed the title Odd summation of NaT BUG: Summation of NaT in a DataFrame with axis=1 does not return NaT Apr 1, 2020
@mroeschke mroeschke added the Bug label Apr 1, 2020
@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 22, 2020
@jbrockmendel
Member

df['s2'] works, I think, because that dispatches to the TimedeltaArray implementation, which I'm pretty sure the DataFrame version does not do (for either axis=0 or axis=1).

From poking at this, it can be fixed by removing the following from nanops.nansum:

    if is_float_dtype(dtype):
        dtype_sum = dtype
-    elif is_timedelta64_dtype(dtype):
-        dtype_sum = np.float64

and adjusting nanops._wrap_results to not expect float64. The trouble is that doing so disables overflow checking.
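
One way to see the trade-off mentioned here, as a sketch under stated assumptions: fixed-width int64 addition wraps silently, while a float64 accumulator can exceed the int64 range and be checked afterwards. `wrap_int64` and `checked_float_sum` are hypothetical helpers for illustration and are not taken from nanops; the wraparound is simulated with modular arithmetic because Python integers do not overflow.

```python
I8_MIN, I8_MAX = -2**63, 2**63 - 1


def wrap_int64(value):
    """Return what a signed 64-bit integer would hold after wraparound."""
    return (value - I8_MIN) % 2**64 + I8_MIN


def checked_float_sum(values_ns):
    """Accumulate in float64 and reject totals outside the int64 range."""
    total = sum(float(v) for v in values_ns)
    if abs(total) > I8_MAX:
        raise OverflowError("sum does not fit in int64 nanoseconds")
    return total


big = [I8_MAX, I8_MAX]             # two maximal durations

wrapped = wrap_int64(sum(big))     # integer path: wraps without any error
print(wrapped)                     # -2

try:
    checked_float_sum(big)         # float path: the overflow is detectable
except OverflowError as exc:
    print("detected:", exc)
```

The float path buys the range check at the cost of precision, which is also why the sentinel-contaminated sum earlier in the thread was off by a few hundred nanoseconds.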

@s-celles
Contributor Author

Thanks @jbrockmendel and all for this fix
