-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG fixing module first() and DateOffset #52487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
I'm not super-familiar with offsets but I left a couple of comments, will get back to this
pandas/core/generic.py
Outdated
offset = to_offset(offset) | ||
if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]): | ||
if type(offset) is offsets.DateOffset and offset.is_on_offset(self.index[0]): | ||
end_date = self.index[0] - Timedelta(1, unit="d") + offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can hard-code Timedelta(1, unit='d')
, e.g. it would break DateOffset(minutes=2)
:
(Pdb) p self.index[0]
Timestamp('2018-04-09 00:00:00')
(Pdb) p self.index[0] - Timedelta(1, unit='d')
Timestamp('2018-04-08 00:00:00')
(Pdb) p self.index[0] - Timedelta(1, unit='d') + offset
Timestamp('2018-04-08 00:02:00')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review!
Yes, we should not hard-code like that, I did it as a draft. I wanted to show how different the behavior of DateOffset is from the "Tick" object.
The whole first()
was not implemented for DateOffset, there were no tests either.
Then there are bugs on it (on DateOffset itself).
In the end, I wanted to know if patching first()
was correct, if DateOffset will be deprecated or if it should be fixed.
Some of the issues of DateOffset:
- DateOffset.is_on_offset is different than the offset done with a string.
- DateOffset.name doesn't work
- At some point adding an offset to it, was going backwards! (as in self.index[0] + offset = earlier date)
do you want to start by making a docs PR to fix this? I think this would need doing regardless of whether |
Yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this further, would it work to just do
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index e16ef3580d..274ab41260 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -43,6 +43,7 @@ from pandas._libs.tslibs import (
Timestamp,
to_offset,
)
+from pandas._libs.tslibs.offsets import DateOffset
from pandas._typing import (
AlignJoin,
AnyArrayLike,
@@ -9073,6 +9074,13 @@ class NDFrame(PandasObject, indexing.IndexingMixin):
if len(self.index) == 0:
return self.copy(deep=False)
+ if isinstance(offset, DateOffset):
+ end_date = end = self.index[0] + offset
+ if end_date in self.index:
+ end = self.index.searchsorted(end_date, side="left")
+ return self.iloc[:end]
+ return self.loc[:end]
+
offset = to_offset(offset)
if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]):
# GH#29623 if first value is end of period, remove offset with n = 1
? (I'm showing the diff of generic.py
compared with upstream/main
)
No... look:
It doesn't need to start with an "end of the month" to fail.
If offset is string:
We need to substract the |
Isn't that fine though? Because In [4]: pd.Timestamp.now() + pd.DateOffset(months=1)
Out[4]: Timestamp('2023-05-11 16:26:01.139005')
In [5]: pd.Timestamp.now() + pd.tseries.offsets.MonthEnd(1)
Out[5]: Timestamp('2023-04-30 16:26:14.195128') I think it makes sense that
would take the rows within one calendar month of the first element, whereas |
Oh, OK!
Does that mean that both (string offset and Should I add examples with
I don't need to speak @ the meeting, yaaay! |
Sorry I was unclear, what I meant was that
Sounds good! |
No reason to apologise. Thanks for explaining. I will add examples in the docs on another PR if that's fine. |
Would somebody very kind review this "annoying" PR @MarcoGorelli ? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one, thanks! We can do a follow-up for last
as well after
My pleasure... Yes, we can do It's unclear to me why you only commented about using Before I removed @MarcoGorelli : it looks green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Just left a minor comment on the whatsnew note
I'll leave open a bit in case others (@jbrockmendel ? @phofl ?) have comments, else can merge
It's unclear to me why you only commented about using date_range on one test, but not the others. I mean, I use it on the other tests as well.
Just to not repeat myself :)
doc/source/whatsnew/v2.0.1.rst
Outdated
@@ -33,6 +33,7 @@ Bug fixes | |||
- Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`) | |||
- Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`) | |||
- Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`) | |||
- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this close both issues? if so:
- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`) | |
- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had done this, but changed again later (per later request)
Thanks @MarcoGorelli for the review (and the patience). |
doc/source/whatsnew/v2.0.1.rst
Outdated
@@ -33,6 +33,7 @@ Bug fixes | |||
- Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`) | |||
- Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`) | |||
- Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`) | |||
- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no pd.first. Maybe :meth:DataFrame.first
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change done
@@ -9070,7 +9071,15 @@ def first(self, offset) -> Self: | |||
if len(self.index) == 0: | |||
return self.copy(deep=False) | |||
|
|||
if isinstance(offset, offsets.DateOffset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't the offset = to_offset(offset)
below make this check unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue is that doing this check after to_offset
would make it impossible to distinguish between MonthEnd
, as that would also satisfy isinstance(offset, DateOffset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they need to be handled differently because something like DateOffset(days=15)
would always return True
for if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0])
, and so would be shifted backward by offset.base
(this is why in the linked issues, the result only has a single row for DateOffset
input)
pandas/core/generic.py
Outdated
if end_date in self.index: | ||
end = self.index.searchsorted(end_date, side="left") | ||
return self.iloc[:end] | ||
return self.loc[:end] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a lot of this looks really similar to what we have below. can any of this be de-duplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Friendly ping @MarcoGorelli @jbrockmendel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait sorry, I gave a bad review here - this introduces an inconsistency between first('M')
and first(MonthEnd())
(which should mean the same thing - reminder that 'M' means 'month end'):
In [1]: ser = pd.Series([1,2,3], index=pd.DatetimeIndex(['2020-01-31', '2020-02-01', '2020-03-01']))
In [2]: ser.first(MonthEnd())
Out[2]:
2020-01-31 1
2020-02-01 2
dtype: int64
In [3]: ser.first('M')
Out[3]:
2020-01-31 1
dtype: int64
@jbrockmendel is there a way to tell if an offset is DateOffset
but not MonthEnd
? I was a bit surprised to see
In [11]: isinstance(MonthEnd(), DateOffset)
Out[11]: True
. type(offset) is DateOffset
would technically work, but I don't see it being done anywhere else
(as an aside, the fact that the tests still pass indicates the relatively poor test coverage that this function had to begin with)
# GH#29623 if first value is end of period, remove offset with n = 1 | ||
# before adding the real offset | ||
end_date = end = self.index[0] - offset.base + offset | ||
else: | ||
end_date = end = self.index[0] + offset | ||
|
||
# Tick-like, e.g. 3 weeks | ||
if isinstance(offset, Tick) and end_date in self.index: | ||
if isinstance(offset, Tick) or input_is_offset and end_date in self.index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can never remember if a or b and c
is evaluated as (a or b) and c
or a or (b and c)
, could we add parantheses please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change that.
I think the original issue #29623 was not a bug but If
Adding for example 15 days means going 15 days forward
Here we go to mid February with a 15-day offset. In this example, the timespan covered by an offset of month end is thus longer than that of an offset of 15 days. It would be completely unexpected for the user if the time period returned by For example:
So I think the options should be the following :
I'm not sure if I have to (friendly) ping you :) @MarcoGorelli |
I'm inclined to agree, and am pretty tempted to suggest reverting the PR which changed it I'll add it to Wednesday's agenda |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
I will work on this shortly |
Closing this one because |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Also fixes part of #51284 (I have not worked on
last()
)Edge cases were only fixed for offsets passed as string.Trying to make it work forDateOffset
would make it more complicated. Should this be fixed or shouldDateOffset
be removed from the function (and docs)?WillDataOffset
be deprecated or fixed? it has bugs or the documentation is incorrect.As commented by @MarcoGorelli:
Functions
first()
andlast()
assume that theSeries
orDataFrame
have a sortedDatetimeIndex
. Onlylast()
has it specified in the documentation.first()
could be easily done as follows (in case it's deprecated):df[df.index < df.index[0] + DateOffset(days= 15)]
I also tried to use
first()
withrelativedelta
but I couldn’t make it work.Edit:
first()
doesn’t work forDateOffset
. With this PRDateOffset
works, but it now seems that there are inconsistencies when given a string orMonthEnd
.I think that since the PRs to fix #29623,
first()
with a string doesn’t work as normal offset arithmetic.