Skip to content

BUG: inconsistent behavior of DateOffset #47953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
colomb8 opened this issue Aug 3, 2022 · 15 comments · Fixed by #53681
Closed
3 tasks done

BUG: inconsistent behavior of DateOffset #47953

colomb8 opened this issue Aug 3, 2022 · 15 comments · Fixed by #53681
Assignees
Labels
Bug good first issue Timedelta Timedelta data type Timestamp pd.Timestamp and associated methods

Comments

@colomb8
Copy link

colomb8 commented Aug 3, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
   'S':[pd.Timestamp('2019-04-30')],
   'A':[pd.DateOffset(months=1)]
   })

# FIRST TEST
>>> df['S'] + 26 * df['A']
0   2021-06-30
dtype: datetime64[ns]

# SECOND TEST
>>> df['S'].iloc[0] + 26 * df['A'].iloc[0]
Timestamp('2021-06-28 00:00:00')

Issue Description

It seems like multiplying a DateOffset by constant leads to an inconsistent behavior: the first test gives 30/6 while the second 28/6

Expected Behavior

In any case it should be 2021-06-30

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.10.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Italian_Italy.1252

pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@colomb8 colomb8 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2022
@pratyushsharan
Copy link
Contributor

While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021).
But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp.
This would require a bug fix and maybe some test cases.

@pratyushsharan
Copy link
Contributor

I had another look into this, looks like the problem is inside cython code. Applying offset todf['S'] does this:
months = (kwds.get("years", 0) * 12 + kwds.get("months", 0)) * self.n inside offsets.pyx, so in essence it calculates months=1*26=26months and then applies the offset as months=26 rather than applying months=1 offset 26 times (which is what happens when offset is applied to a Timestamp rather than DatetimeArray).

Summary: When applying multiple offsets to DatetimeArray, Cython tries to accumulate months/weeks/days first and then applies one single offset. But when offset is applied to a Timestamp object, the offsets are applied iteratively, which is more accurate when providing multiple offsets.

Solution: changes to offsets.pyx along with test cases.

Would like to know your views @jreback, @jbrockmendel

@colomb8
Copy link
Author

colomb8 commented Aug 9, 2022

While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021). But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp. This would require a bug fix and maybe some test cases.

hi, I would say that the correct result should be 2021-06-30, since it is coherent with datatime relativedelta.
Moreover, please consider the use case when you have a column with number of months and a column with a monthly dateoffset and let's imagine I have to obtain the 2021-06-30 effect; in such case I have to multiply the two columns and I have no other alternatives...(except creating different dateoffset rowwise, that is not so clean..)

@colomb8
Copy link
Author

colomb8 commented Aug 9, 2022

hi I would add that the dataframe gives 2021-06-30 if it has 1 row and 2021-06-28 in case of 2 identical rows...

there are several bugs here, however imho the right result should be 2021-06-30 (coherent with relativedelta). "2 * 1 month" intuitively should be equal to "2 months", no?

@pratyushsharan
Copy link
Contributor

It all comes down to how one interprets multiplying a offset/relativedelta by a scalar. I'll try to show by examples:

import pandas as pd
from dateutil.relativedelta import relativedelta

base_date = pd.Timestamp('2020-01-30')

offset_1month = pd.DateOffset(months=1)
offset_2month = pd.DateOffset(months=2)
delta_1month = relativedelta(months=1)
delta_2month = relativedelta(months=2)

base_date + 2 * offset_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + offset_1month + offset_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + offset_2month
>>> Timestamp('2020-03-30 00:00:00')

base_date + 2 * delta_1month
>>> Timestamp('2020-03-30 00:00:00')

base_date + delta_1month + delta_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + delta_2month
>>>Timestamp('2020-03-30 00:00:00')

As you can see, multiplying pd.DateOffset by scalar n basically means applying the pd.Offset n times. However, relativedelta operates in a different way, where if you multiply relativedelta by scalar n, they first re-calculate the relativedelta and then apply to the base_date, hence you see different values for 4th and 5th statements.

I think we need to first decide what we want to do when a pd.Offset is multiplied by a scalar n (apply pd.Offset n times or recalculate pd.Offset and then move the base_date).

@srotondo
Copy link
Contributor

The difference arises due to the fact that in the offset.pyx file, where the addition is done, the scalar case uses a loop with a variable n in the RelativeDeltaOffset function _apply to calculate with multiplication, and an array simply multiplies the whole offset value by n in the RelativeDeltaOffset function _apply_array before adding.

@srotondo
Copy link
Contributor

take

@srotondo
Copy link
Contributor

Other similar instances of this operation in other related classes to DateOffset use the method where they calculate the combined offset of all n multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.

@colomb8
Copy link
Author

colomb8 commented Aug 25, 2022

Other similar instances of this operation in other related classes to DateOffset use the method where they calculate the combined offset of all n multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.

great.

please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal

>>> df = pd.DataFrame({
   'S':[pd.Timestamp('2019-04-30')],
   'A':[pd.DateOffset(months=1)]
   })

>>> df['S'] + 26 * df['A']

0   2021-06-30
dtype: datetime64[ns]

>>> df2 = df.append(df)

>>> df2['S'] + 26 * df2['A']

0   2021-06-28
0   2021-06-28
dtype: datetime64[ns]

@jbrockmendel
Copy link
Member

please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal

I expect this is bc of a fastpath in _addsub_object_array. id be OK with removing that fastpath.

@jbrockmendel
Copy link
Member

for RelativeDeltaOffset._apply let's just multiply _offset by n instead of the loop. Will that be consistent with apply_array? If we can share anyhting with apply_array that'd be ideal

srotondo pushed a commit to srotondo/pandas that referenced this issue Aug 26, 2022
srotondo pushed a commit to srotondo/pandas that referenced this issue Aug 29, 2022
srotondo pushed a commit to srotondo/pandas that referenced this issue Oct 14, 2022
@gfyoung gfyoung added Timestamp pd.Timestamp and associated methods Timedelta Timedelta data type and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 7, 2023
@MarcoGorelli
Copy link
Member

this PR looked pretty close, if anyone wants to take it forward #50542

@rsm-23
Copy link
Contributor

rsm-23 commented Jun 13, 2023

take

@rsm-23
Copy link
Contributor

rsm-23 commented Jun 13, 2023

@MarcoGorelli this is my first open source contribution. Is it okay to follow the same steps as in #50542 and resolve the comments in my branch?

@MarcoGorelli
Copy link
Member

yeah probably

MarcoGorelli pushed a commit that referenced this issue Jun 23, 2023
* BUG: Fixed inconsistent multiplication #47953

* Fixed release note

* Fixed attribute name

* Changed offset apply logic

* Addressed #46877 re-occurence

* Removed dulicate code

* Addressed comments

* Added unit test cases

* Added mistakenly removed comment
Daquisu pushed a commit to Daquisu/pandas that referenced this issue Jul 8, 2023
…#53681)

* BUG: Fixed inconsistent multiplication pandas-dev#47953

* Fixed release note

* Fixed attribute name

* Changed offset apply logic

* Addressed pandas-dev#46877 re-occurence

* Removed dulicate code

* Addressed comments

* Added unit test cases

* Added mistakenly removed comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug good first issue Timedelta Timedelta data type Timestamp pd.Timestamp and associated methods
Projects
None yet
7 participants