-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
REF/PERF: PeriodDtype decouple from DateOffset #34499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF/PERF: PeriodDtype decouple from DateOffset #34499
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable but need to look again
pandas/_libs/tslibs/timedeltas.pyx
Outdated
@@ -907,19 +907,19 @@ cdef class _Timedelta(ABCTimedelta): | |||
Examples | |||
-------- | |||
>>> td = pd.Timedelta('1 days 2 min 3 us 42 ns') | |||
>>> td.resolution | |||
>>> td.resolution_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a public method, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unrelated to the rest of the PR, just incorrect ATM. we changed the method name a while back and apparently didnt update the docstring
BTW i commented on one of your PRs to ask: where did the one-letter abbreviations currently used in this function come from? id like a name for them to distinguish them from the 2-3 letter abbrevs used elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not idea where the 1 letter names came from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
pandas/_libs/tslibs/period.pyx
Outdated
# Note: this is more performant than PeriodDtype.from_date_offset(freq) | ||
# because from_date_offset cannot be made a cdef method (until cython | ||
# supported cdef classmethods) | ||
self._dtype = PeriodDtype(freq._period_dtype_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you basically only use this as self._dtype.dtype_code
(from a quick look at the diff), why not storing the dtype_code
itself instead of wrapped in a PeriodDtype?
(btw, the PeriodDtype is also not a dtype, so we need to find another name for it, I would say)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not storing the dtype_code itself instead of wrapped in a PeriodDtype?
We could do this. My expectation is that it will get more attributes before long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other attributes are you thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we frequently split this integer code up into two pieces, corresponding to
def decompose(code):
head, anchor = divmod(code, 1000)
grp = head * 1000
return grp, anchor
im looking at storing grp and anchor instead of just the code. The upside of this is that grp
corresponds to the FreqGroup
class in libfrequencies, so we could re-use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I don't mind too much. I'll let it to you to judge whether it is already worth adding the full class instead of just the dtype_code
attribute.
But, when keeping it:
- The other methods of the PeriodDtype class don't seem used and/or tested?
- It needs another name IMO, since we already have another PeriodDtype class that is something else
pandas/_libs/tslibs/period.pyx
Outdated
# Note: this is more performant than PeriodDtype.from_date_offset(freq) | ||
# because from_date_offset cannot be made a cdef method (until cython | ||
# supported cdef classmethods) | ||
self._dtype = PeriodDtype(freq._period_dtype_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I don't mind too much. I'll let it to you to judge whether it is already worth adding the full class instead of just the dtype_code
attribute.
But, when keeping it:
- The other methods of the PeriodDtype class don't seem used and/or tested?
- It needs another name IMO, since we already have another PeriodDtype class that is something else
yah not yet. i can remove them until they are actually used i guess |
Any suggestions? (I plan to mix it into the core.dtypes PeriodDtype before long) |
updated with a placeholder name PeriodPseudoDtype |
Alternative could also be |
|
||
|
||
cdef class PeriodPseudoDtype: | ||
# cdef readonly: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a docstring explaining for what / where this is used?
updated with docstring. im increasingly confident this will (eventually, soon-ish) let us de-duplicate Resolution, FreqGroup, and some cdef constants we define in libperiod. Maybe even NPY_DATETIMEUNIT depending on what numpy does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. pls rebase and ping on green.
ping |
ATM we define PeriodDtype in terms of DateOffsets, but this is a misnomer. In fact, virtually every Period/PeriodArray method has to start off by taking its
.freq
and finding the corresponding integer code. This makes the integer code itself into a dtype. We lose a little bit of ground on the constructor, then make it back up in subsequent calls.The constructor perf I think we can improve by eventually cutting the DateOffset out of the process altogether.
We'll also be able to de-duplicate a bunch of other stuff: FreqGroup can be defined in terms of PeriodDypeCode, Resolution can be defined in terms of FreqGroup (xref #34462), we can avoid redundant definitions of the dtype codes in period.pyx, and a lot of the rest of libfrequencies becomes unnecessary.
In a follow-up I plan to mix the cython-space PeriodDtype into the core.dtypes PeriodDtype and we can get the same perf improvements in the PeriodArray methods.