dt.total_seconds() stores float with appending 0.00000000000001 #34290

Closed
sjvdm opened this issue May 21, 2020 · 8 comments · Fixed by #48218
Labels
good first issue · Needs Tests (Unit test(s) needed to prevent regressions) · Timedelta (Timedelta data type)

Comments

@sjvdm

sjvdm commented May 21, 2020

A simple test showing that when I take the difference of two datetime columns and use dt.total_seconds() to calculate the difference in seconds, the values are stored with a spurious offset of 0.00000000000001.

print(pd.__version__)
print(pd.show_versions())
1.0.3

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.0.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1062.18.1.el7.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.4.0
Cython           : 0.29.16
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.13.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.1
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

IPython code:

import pandas as pd
import datetime

data = {'start':datetime.datetime(2020,1,1,12),
       'end':datetime.datetime(2020,1,1,12,2) 
       }

df = pd.DataFrame(data,index=[0])

#try to parse to pd.to_datetime
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])

print('print calc differences')
print((df['end'] - df['start']).dt.total_seconds())
print((df['end'] - df['start']).dt.total_seconds().values[0])
print('')

print('testing on normal float')
print(float(120))
print('')

Output:

print calc differences
0    120.0
dtype: float64
120.00000000000001

testing on normal float
120.0
@jbrockmendel
Member

I can't reproduce this on OSX (py37) or Ubuntu (py36). @sjvdm can you reproduce this in any other Python versions?

@mroeschke added the "Needs Info" (Clarification about behavior needed to assess issue) label on May 22, 2020
@sjvdm
Author

sjvdm commented May 22, 2020

OK, so I reproduced this on two other versions:

The result of total_seconds() seems to be stored as 120.00000000000001

Python 2.7.5 (default, Apr  2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import datetime
>>>
>>> data = {'start': datetime.datetime(2020, 1, 1, 12),
...         'end': datetime.datetime(2020, 1, 1, 12, 2)}
>>>
>>> df = pd.DataFrame(data, index=[0])
>>>
>>> # try to parse with pd.to_datetime
... df['end'] = pd.to_datetime(df['end'])
>>> df['start'] = pd.to_datetime(df['start'])
>>>
>>> print('print calc differences')
print calc differences
>>> print((df['end'] - df['start']).dt.total_seconds())
0    120.0
dtype: float64
>>> print((df['end'] - df['start']).dt.total_seconds().values[0])
120.00000000000001
>>>
>>> print('testing on normal float')
testing on normal float
>>> print(float(120))
120.0
@sjvdm
Author

sjvdm commented May 22, 2020

May also be related to floating point precision as a colleague pointed out: https://www.python.org/dev/peps/pep-0485/
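As an aside (a sketch, not from the thread): PEP 485 added math.isclose (Python 3.5+), which is the standard way to compare floats that carry this kind of rounding noise:

```python
import math

# The two values from the report differ only by floating-point noise,
# so they compare equal within math.isclose's default relative
# tolerance of 1e-9, even though strict equality fails.
print(math.isclose(120.00000000000001, 120.0))  # True
print(120.00000000000001 == 120.0)              # False
```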

@jorisvandenbossche
Member

I can "reproduce" this (on linux). Using a simpler example: with a Timedelta scalar it doesn't show the issue, but with a TimedeltaArray it does:

In [27]: td = pd.Timedelta("2 min")  

In [28]: td.total_seconds()  
Out[28]: 120.0

In [29]: print(td.total_seconds()) 
120.0

In [30]: arr = pd.array([pd.Timedelta("2 min")])  

In [31]: arr.total_seconds()   
Out[31]: array([120.])

In [32]: print(arr.total_seconds()[0])
120.00000000000001

But it is indeed quite probably a floating-point precision issue that can be ignored. total_seconds is implemented roughly like:

In [33]: 1e-9 * arr.asi8   
Out[33]: array([120.])

In [34]: (1e-9 * arr.asi8)[0] 
Out[34]: 120.00000000000001

@jorisvandenbossche
Member

The Timedelta scalar version (inherited from datetime.timedelta) is implemented with a division instead of a multiplication. That seems to make the difference:

# how it is implemented for Timedelta under the hood
In [43]: td.value / 1e9  
Out[43]: 120.0

# what basically happens in TimedeltaArray
In [44]: td.value * 1e-9 
Out[44]: 120.00000000000001
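For context (an aside, not from the thread): 1e9 is an integer and exactly representable as a binary float, while 1e-9 is not, which is plausibly why the two formulations round differently:

```python
from decimal import Decimal

ns = 120 * 10**9   # 2 minutes expressed in nanoseconds

# Dividing by the exactly representable 1e9 rounds to the exact answer.
print(ns / 1e9)    # 120.0

# 1e-9 has no exact binary representation; the value actually stored
# is slightly off from 10**-9, and the error surfaces in the product.
print(Decimal(1e-9))
print(ns * 1e-9)   # 120.00000000000001
```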

@mroeschke added the "Bug" and "Timedelta" (Timedelta data type) labels and removed the "Needs Info" (Clarification about behavior needed to assess issue) label on Mar 26, 2021
@fpavogt

fpavogt commented Jul 12, 2022

Looks like I just hit the same issue with pandas 1.4.3 under OSX 12.4 and Anaconda. Below is an MWE adapted from the case that led me to notice this.

Looks like this issue is still relevant!

Sample code:

import pandas as pd

out = pd.Series([pd.Timestamp('2022-03-16 08:32:26'), pd.Timestamp('2022-03-16 08:32:41')])

# Apply dt.total_seconds, and extract the second value
(out - out.iloc[0]).dt.total_seconds()[1]
# Returns 15.000000000000002

# Extract the Timedelta, and apply its total_seconds() method 
((out - out.iloc[0])[1]).total_seconds()
# Returns 15.0

# Same workaround, but via apply
(out - out.iloc[0]).apply(lambda x: x.total_seconds())[1]
# Returns 15.0

@jbrockmendel
Member

I get 15.0 in main. IIRC there was a PR last week that touched TimedeltaArray.total_seconds that might have fixed this.

@mroeschke
Member

I guess this could use a unit test to confirm.
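A regression test along these lines might look like the following sketch (the test name is hypothetical, and it assumes a pandas version that includes the fix from #48218):

```python
import pandas as pd

def test_timedelta_total_seconds_no_rounding_noise():
    # GH#34290: Series.dt.total_seconds() should agree exactly with
    # the scalar Timedelta.total_seconds() for whole-second deltas.
    out = pd.Series([pd.Timestamp('2022-03-16 08:32:26'),
                     pd.Timestamp('2022-03-16 08:32:41')])
    deltas = out - out.iloc[0]
    assert deltas.dt.total_seconds()[1] == 15.0
    assert deltas.dt.total_seconds()[1] == deltas[1].total_seconds()
```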
