Skip to content

BUG: Timedelta created by to_timedelta does not add correctly to datetime in Pandas 2 #53643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
Flix6x opened this issue Jun 13, 2023 · 10 comments
Open
3 tasks done
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution Numeric Operations Arithmetic, Comparison, and Logical operations Timedelta Timedelta data type

Comments

@Flix6x
Copy link
Contributor

Flix6x commented Jun 13, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from datetime import datetime
import pandas as pd

# Create a datetime object (midnight)
dt = datetime(2015, 1, 1, 0)

# Create a pd.Timedelta object (1 hour)
delta_1 = pd.Timedelta("1H")

# Create a pd.Timedelta object from a DateTimeIndex frequency (1 hour)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1H").freq)

# First check that the pd.Timedelta objects are equal
assert(delta_1 == delta_2)  # True

# Then check whether adding each pd.Timedelta to the datetime object moves the datetime by one hour: this does not behave correctly since Pandas 2.0
assert(dt + delta_1 == dt + delta_2)  # False -> AssertionError

# Left-hand side looks right: the hour was added
dt + delta_1  # datetime.datetime(2015, 1, 1, 1, 0)

# Right-hand side appears to be wrong: the hour was not added
dt + delta_2  # datetime.datetime(2015, 1, 1, 0, 0)

Issue Description

It seems that the pd.Timedelta object produced by calling pd.to_timedelta on a pd.DateTimeIndex frequency can no longer be added to a datetime.datetime object since pandas>1.5.3.

It's not evident to me why the two pd.Timedelta objects (appearing to be equal) are behaving differently.

Expected Behavior

Running the above example in pandas==1.5.3 (and also 1.4.4) works as expected.

Installed Versions

INSTALLED VERSIONS

commit : 4149f32
python : 3.9.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.11-76051511-generic
Version : #202112220937164018548121.04~b3a2c21-Ubuntu SMP Mon Jan 3 16:5
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+956.g4149f32388
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.2.0
pip : 23.1.2
Cython : 0.29.30
pytest : 7.0.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : 0.56.0
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.31
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2022.5
qtpy : None
pyqt5 : None

@Flix6x Flix6x added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 13, 2023
@jbrockmendel
Copy link
Member

I have an explanation but not a solution.

delta_2 here has unit="s". Timedeltas with unit="s" can go outside the range supported by datetime.timedelta, so the way they are initialized passes seconds=0 to the parent class (datetime.timedelta) constructor. So the pytimedelta struct (which pandas ignores) looks like timedelta(0). When you do dt + delta_2, that calls the pydatetime's __add__ method, which treats delta_2 as a regular pytimedelta, not a pd.Timedelta. So it uses the 0, giving the incorrect result.

Reversing the operation and doing delta_2 + dt gives the correct result, as does pd.Timestamp(dt) + delta_2.

A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented

@jbrockmendel jbrockmendel added Timedelta Timedelta data type Non-Nano datetime64/timedelta64 with non-nanosecond resolution labels Nov 1, 2023
@miccoli
Copy link
Contributor

miccoli commented May 25, 2024

A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented

No way! pd.Timedelta claims to be a datetime.timedelta subclass:

>>> import datetime
>>> import pandas as pd
>>> issubclass(pd.Timedelta, datetime.timedelta)
True

The Liskov substitution principle requires that datetime.__add__(self, other) is semantically equivalent, whenever other is a datetime.timedelta subclass.

By the way, why is pd.Timedelta a datetime.timedelta subclass? They are mostly incompatible. The only way forward seems to me to completely drop the claim that pd.Timedelta should be a datetime.timedelta subclass.

@mroeschke mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 17, 2024
@GiorgioBalestrieri
Copy link

Just caught a bug caused by this while trying to upgrade to pandas 2.0. This is both quite concerning (as mentioned above, pd.Timedelta should behave like dt.timedelta, since it claims to be a subclass), and potentially a blocker, because we'll have to go and ensure that each single pd.Timedelta is converted to dt.timedelta before doing any arithmetics on it.

I really don't want to sound like I'm demanding a fix for a bug/issue in an open source library/project, but I think solving this should be priority.

@Flix6x
Copy link
Contributor Author

Flix6x commented Mar 31, 2025

Maybe a silly suggestion, but could we not simply have pd.Timedelta.__add__ conditionally switch the order of self and other?

A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented

Could you reformulate this solution as an issue with the stdlib? As it stands, I don't understand what would be wrong there.

They are mostly incompatible.

Since pd.Timedelta is subclassing datetime.timedelta I would actually expect they are mostly compatible. Could you elaborate on this?

@miccoli
Copy link
Contributor

miccoli commented Mar 31, 2025

There is no easy solution, right now: the problem is with datetime.datetime.__add__ (and not not with pd.Timedelta.__add__).

See this example:

import datetime
import pandas as pd

dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
assert isinstance(delta_2, datetime.timedelta)

print(dt.__add__(delta_2))  # 2015-01-01 00:00:00
print(delta_2.__add__(dt))  # 2015-01-01 01:00:00

So python stdlib datetime.datetime.__add__ method will see that delta_2 is a datetime.timedelta instance and assume that it behaves like datetime.timedelta(seconds=3600), which is not true.

As a general comment, maybe datetime.datetime.__add__ is a little too optimistic, since for efficiency it assumes that the pytimedelta struct holds a valid value, while of course it could also query the higher level property pd.Timedelta.seconds...

By the way, there is a generale feature of Python that allows subclasses to override their ancestors’ operations, see the note at the end of https://docs.python.org/3/reference/datamodel.html#emulating-numeric-types.

In fact

delta_1 = datetime.timedelta(seconds=3600)
print(delta_1 + delta_2)  # 0 days 02:00:00

works as expected, because delta_1 + delta_2 will call the reflected delta_2.__radd__(delta_1) first, even if delta_1.__add__(delta_2) does not return NotImplemented

repr(delta_1 + delta_2)  # "Timedelta('0 days 02:00:00')"
repr(delta_1.__add__(delta_2))  # 'datetime.timedelta(seconds=3600)'
repr(delta_2.__radd__(delta_1))  # "Timedelta('0 days 02:00:00')"

So maybe a way forward is to convince the CPython developers that special code should be added to datetime.datetime.__add__(x) to check first for a reflected operator if type(x) is a subclass of datetime.timedelta... but I think that it is highly improbable that a fix will be accepted form the side of CPython.

From the side of pandas, the obvious solution would be to ensure that issubclass(pandas._libs.tslibs.timedeltas.Timedelta, datetime.timedelta) returns false, so that CPython is forced to use the reflected

delta_2.__radd__(dt)  # Timestamp('2015-01-01 01:00:00')

which works as expected.

Bottom line: do not mix python datetime.* objects with pandas time objects, because they are largely incompatible, and could give raise to strange behaviour.

@giuliobeseghi
Copy link

giuliobeseghi commented Apr 11, 2025

Do we understand why this wasn't an issue in pandas<2.0?

@GiorgioBalestrieri
Copy link

GiorgioBalestrieri commented Apr 11, 2025

Timedeltas with unit="s" can go outside the range supported by datetime.timedelta

curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway, wouldn't it be reasonable to constraint pd.Timedelta to the same range of values of datetime.timedelta, and represent the timedelta object "correctly" internally, which would effectively solve this issue?

Code in question:

# For millisecond and second resos, we cannot actually pass int(value) because
# many cases would fall outside of the pytimedelta implementation bounds.
# We pass 0 instead, and override seconds, microseconds, days.
# In principle we could pass 0 for ns and us too.
if reso == NPY_FR_ns:
td_base = _Timedelta.__new__(cls, microseconds=int(value) // 1000)
elif reso == NPY_DATETIMEUNIT.NPY_FR_us:
td_base = _Timedelta.__new__(cls, microseconds=int(value))
elif reso == NPY_DATETIMEUNIT.NPY_FR_ms:
td_base = _Timedelta.__new__(cls, milliseconds=0)
elif reso == NPY_DATETIMEUNIT.NPY_FR_s:
td_base = _Timedelta.__new__(cls, seconds=0)

This was introduced in: #47641

@miccoli
Copy link
Contributor

miccoli commented Apr 11, 2025

curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway,

Pandas datetime arithmetic is guaranteed to work: it is mixed python and pandas arithmetic which is problematic. Simply put, the pytimedelta structure could overflow, so it is simply not initialised for resolutions grater than µs. Note that this structure is not needed for Pandas temporal arithmetic, but only to claim that pd.Timedelta is a datetime.timedelta subclass...

Here a comparison to numpy behaviour, which was the model for Pandas implementation.

>>> issubclass(np.timedelta64, datetime.timedelta)
False
>>> np.timedelta64(1, 'ns').astype(datetime.timedelta)
1
>>> np.timedelta64(1, 'ms').astype(datetime.timedelta)
datetime.timedelta(microseconds=1000)
>>> np.timedelta64(10**9*24*3600*1000-1, 'ms').astype(datetime.timedelta)
datetime.timedelta(days=999999999, seconds=86399, microseconds=999000)
>>> np.timedelta64(10**9*24*3600*1000, 'ms').astype(datetime.timedelta)
86400000000000000

Main takes:

  • np.timedelta64 does not claims to be a datetime.timedelta subclass
  • when an explicit cast to datetime.timedelta is possible without resolution loss, overflow, underflow, a datetime.timedelta object is returned, otherwise ant int without resolution.

Maybe a mitigation could be to initialise the pytimedelta structure when this is possible without overflow, even if the resolution is greater than µs?

Moreover, here an other problem, which shows up on the other end of the spectrum:

>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D'))
Traceback (most recent call last):
  File "timedeltas.pyx", line 1682, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
  File "np_datetime.pyx", line 683, in pandas._libs.tslibs.np_datetime.convert_reso
OverflowError: result would overflow

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "timedeltas.pyx", line 806, in pandas._libs.tslibs.timedeltas._binary_op_method_timedeltalike.f
  File "timedeltas.pyx", line 1685, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
pandas._libs.tslibs.np_datetime.OutOfBoundsTimedelta: Cannot cast 999999999 days 00:00:00 to unit='us' without overflow.

but

>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D')).to_pytimedelta()
datetime.timedelta(days=999999999)

In fact

>>> datetime.timedelta.max
datetime.timedelta(days=999999999, seconds=86399, microseconds=999999)

is 86399999999999999999 µs, which is 67 bit long.

And here we are again at the starting point:

  • python has a datetime.timedelta object with a fixed 1µs resolution, and about 67 (plus sign) bits range.
  • numpy.timedelta64 are 64 bits (with sign) with a user chosen resolution, from 1as (one attosecond) to 1Y (one year).

So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.

@GiorgioBalestrieri
Copy link

GiorgioBalestrieri commented Apr 14, 2025

So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.

I'm probably missing some context here, but it sounds like making pd.Timedelta not a subclass of dt.timedelta is the least bad option here, and actually reflects what's going on? This would be a breaking change (in terms of type checking, at least, and any logic using isinstance, I guess), but at least it should alleviate the concerns that there can be silent bugs in upgrading to pandas 2.0?

Could any devs chime in on the feasibility of this? Could it be rolled out as part of pandas 2.0, or would it have to wait for pandas 3.0?

One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead capping them at the largest possible value for dt.timedelta if they are out of range. My impression is that it would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).

@miccoli
Copy link
Contributor

miccoli commented Apr 15, 2025

One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead capping them at the largest possible value for dt.timedelta if they are out of range. My impression is that it would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).

I agree: silently clipping at 86399999999999999999 µs would be a bad option.

I fear that there is no easy way forward here, as also

>>> issubclass(pd.Timestamp, datetime.datetime)
True

so that a complete overhaul of the pandas datetime system should be implemented.
But here the opinion of a pandas developer is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution Numeric Operations Arithmetic, Comparison, and Logical operations Timedelta Timedelta data type
Projects
None yet
Development

No branches or pull requests

6 participants