Skip to content

REF/PERF: PeriodDtype decouple from DateOffset #34499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 4, 2020

Conversation

jbrockmendel
Copy link
Member

ATM we define PeriodDtype in terms of DateOffsets, but this is a misnomer. In fact, virtually every Period/PeriodArray method has to start off by taking its .freq and finding the corresponding integer code. This makes the integer code itself into a dtype. We lose a little bit of ground on the constructor, then make it back up in subsequent calls.

In [2]: per = pd.Period("2016Q1")                                                                                                                                                                   

In [3]: %timeit per.year
556 ns ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)    # <-- master
94.5 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)  # <-- PR

In [4]: %timeit pd.Period("2016Q1")                   
21.2 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
25.8 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

The constructor perf I think we can improve by eventually cutting the DateOffset out of the process altogether.

We'll also be able to de-duplicate a bunch of other stuff: FreqGroup can be defined in terms of PeriodDypeCode, Resolution can be defined in terms of FreqGroup (xref #34462), we can avoid redundant definitions of the dtype codes in period.pyx, and a lot of the rest of libfrequencies becomes unnecessary.

In a follow-up I plan to mix the cython-space PeriodDtype into the core.dtypes PeriodDtype and we can get the same perf improvements in the PeriodArray methods.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable but need to look again

@@ -907,19 +907,19 @@ cdef class _Timedelta(ABCTimedelta):
Examples
--------
>>> td = pd.Timedelta('1 days 2 min 3 us 42 ns')
>>> td.resolution
>>> td.resolution_string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a public method, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unrelated to the rest of the PR, just incorrect ATM. we changed the method name a while back and apparently didnt update the docstring

BTW i commented on one of your PRs to ask: where did the one-letter abbreviations currently used in this function come from? id like a name for them to distinguish them from the 2-3 letter abbrevs used elsewhere

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not idea where the 1 letter names came from

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

# Note: this is more performant than PeriodDtype.from_date_offset(freq)
# because from_date_offset cannot be made a cdef method (until cython
# supported cdef classmethods)
self._dtype = PeriodDtype(freq._period_dtype_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you basically only use this as self._dtype.dtype_code (from a quick look at the diff), why not storing the dtype_code itself instead of wrapped in a PeriodDtype?

(btw, the PeriodDtype is also not a dtype, so we need to find another name for it, I would say)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not storing the dtype_code itself instead of wrapped in a PeriodDtype?

We could do this. My expectation is that it will get more attributes before long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other attributes are you thinking of?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we frequently split this integer code up into two pieces, corresponding to

def decompose(code):
    head, anchor = divmod(code, 1000)
    grp = head * 1000
    return grp, anchor

im looking at storing grp and anchor instead of just the code. The upside of this is that grp corresponds to the FreqGroup class in libfrequencies, so we could re-use that.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't mind too much. I'll let it to you to judge whether it is already worth adding the full class instead of just the dtype_code attribute.

But, when keeping it:

  • The other methods of the PeriodDtype class don't seem used and/or tested?
  • It needs another name IMO, since we already have another PeriodDtype class that is something else

@jreback jreback added Frequency DateOffsets Period Period data type labels Jun 1, 2020
# Note: this is more performant than PeriodDtype.from_date_offset(freq)
# because from_date_offset cannot be made a cdef method (until cython
# supported cdef classmethods)
self._dtype = PeriodDtype(freq._period_dtype_code)
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't mind too much. I'll let it to you to judge whether it is already worth adding the full class instead of just the dtype_code attribute.

But, when keeping it:

  • The other methods of the PeriodDtype class don't seem used and/or tested?
  • It needs another name IMO, since we already have another PeriodDtype class that is something else

@jbrockmendel
Copy link
Member Author

The other methods of the PeriodDtype class don't seem used and/or tested?

yah not yet. i can remove them until they are actually used i guess

@jbrockmendel
Copy link
Member Author

It needs another name IMO, since we already have another PeriodDtype class that is something else

Any suggestions? (I plan to mix it into the core.dtypes PeriodDtype before long)

@jbrockmendel
Copy link
Member Author

updated with a placeholder name PeriodPseudoDtype

@jorisvandenbossche
Copy link
Member

Alternative could also be PeriodCode
(but if you are still going to change it, might also not matter too much)



cdef class PeriodPseudoDtype:
# cdef readonly:
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring explaining for what / where this is used?

@jbrockmendel
Copy link
Member Author

updated with docstring. im increasingly confident this will (eventually, soon-ish) let us de-duplicate Resolution, FreqGroup, and some cdef constants we define in libperiod. Maybe even NPY_DATETIMEUNIT depending on what numpy does

@jreback jreback added this to the 1.1 milestone Jun 3, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. pls rebase and ping on green.

@jbrockmendel
Copy link
Member Author

ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants