Skip to content

API: add is_beg_month/quarter/year, is_end_month/quarter/year accessors (#4565) #4823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 29, 2014

Conversation

mikebailey
Copy link
Contributor

closes #4565

@@ -1982,37 +1982,56 @@ def setUp(self):

def test_datetimeindex_accessors(self):
dti = DatetimeIndex(
freq='Q-JAN', start=datetime(1997, 12, 31), periods=100)
freq='D', start=datetime(1998, 1, 1), periods=365)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add basically a copy of this test with a different freq as well (maybe something like MonthEnd)?

@rockg
Copy link
Contributor

rockg commented Sep 12, 2013

We should add these accessors to the DatetimeIndex documentation I started with #4706

@jreback
Copy link
Contributor

jreback commented Sep 12, 2013

sure...you can add some docs on this (and for all accessors like this for that matter)

@rockg
Copy link
Contributor

rockg commented Sep 12, 2013

All the other accessors are already there. I can add these.

@cancan101
Copy link
Contributor

I don't think the quarter accessors are well defined. There can be various quarterly cycles other than the one in this commit.

if dtindex[i] == NPY_NAT: out[i] = -1; continue

pandas_datetime_to_datetimestruct(dtindex[i], PANDAS_FR_ns, &dts)
out[i] = (monthrange(dts.year, dts.month)[1] == dts.day and dts.month in [3, 6, 9, 12])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cancan101 is right, these months need to be a function of the frequency of the index

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also make the same argument for end of year which can have various meanings for other "years".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's right, need to look at the freq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cancan101 @jreback Just to make sure I understand, these accessors need to be frequency aware for the freq of the Timestamp/DatetimeIndex. e.g.

x = pd.Timestamp('2013-06-30', offset='Q-JUN')
x.is_end_year
1

x = pd.Timestamp('2013-06-28', offset='BQ-JUN')
x.is_end_year #29th, 30th are not business days
1 

Annual and quarterly frequencies define the year end month. Weekday freqs force beginning/end of periods to be on weekdays. Should I add is_beg/end_week accessors for weekly freqs? Do I need to worry about custom business days? Anything I am not covering?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your example looks right for biz days (if custom is easy do it, if not, just raise)

@mikebailey
Copy link
Contributor Author

Is this a bug?

from pandas.tseries.frequencies import get_offset
get_offset('Q').freqstr
'Q-DEC'
get_offset('QS').freqstr
'QS-JAN'

@cancan101
Copy link
Contributor

I don't think so. One is quarterly periods defined by year ending in
December and the other is quarterly periods defined by year starting in
January.
On Sep 18, 2013 3:38 AM, "Michael Bailey" [email protected] wrote:

Is this a bug?

from pandas.tseries.frequencies import get_offset
get_offset('Q').freqstr
'Q-DEC'
get_offset('QS').freqstr
'QS-JAN'


Reply to this email directly or view it on GitHubhttps://github.com//pull/4823#issuecomment-24646338
.

@mikebailey
Copy link
Contributor Author

In the docs it says:

(B)Q(S)-JAN quarterly frequency, year ends in January

So that is incorrect then, Q-JAN is different form QS-JAN in terms of ending year.

I was also thrown by the fact that QuarterBegin() and QuarterEnd() share the same keyword 'startingMonth' and assumed they both referenced the last month of the year. It also seems odd that we instantiate QuarterBegin() with startingMonth = 3?

x=pd.Timestamp('2012-01-01', offset='QS')
y=pd.Timestamp('2012-03-01', offset='QS')
print pd.tseries.offsets.QuarterBegin().onOffset(x)
False
print pd.tseries.offsets.QuarterBegin().onOffset(y)
True

Is there a reason we don't have 'staringMonth' and 'endingMonth' offsets to distinguish them? Or even better, use the 'endingMonth' offset for all offsets with end months.

@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

@mikebailey I think you pointed out some related bugs here...see #4069, #4109

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

@mikebailey see also #2885

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

ping!

@mikebailey
Copy link
Contributor Author

@jreback, I've looked into the bugs you've posted, I can wrap these all up in this PR without too much trouble I believe - I'll be off the grid for a couple of weeks and probably won't be able to finish before I leave, but will take this on when I get back.

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

gr8

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

pushing to 0.14

@mikebailey
Copy link
Contributor Author

@jreback @cancan101 working on these offset bugs now. What are your thoughts on changing the default behavior of these offsets? Ideally all quarter and annual offsets would default to beginning jan 1st - but that would break backwards compatibility, should I forge ahead with the change?

I also want to unify on a single month keyword rather than having some with startingMonth and others using month which is confusing. My preference would be deprecate the startingMonth keyword and have month always mean the ending month for the period. We could also implement startingMonth and endingMonth keywords for all relevant period offsets to avoid confusion and pick a default way to resolve month. thoughts?

@jreback
Copy link
Contributor

jreback commented Oct 24, 2013

@mikebailey first pls rebase rebase on master, a big PR by @cancan101 was recently merged which add some new offset stuff (and @jtratner did a lot of cleaning on how the offsets are created).

I believe @cancan101 has some thoughts on the naming scheme internal to offsets?

we don't want to break back compat unless have a really good reason.

@jtratner
Copy link
Contributor

@cancan101 was suggesting the same thing about changing the month kwarg
name. My preference is to push until the next release so we can improve the
documentation of DateOffset at the same time that we change its API.

It's trivial to make this change because all the constructors are single
positional + keyword argument. This should be clearer after you rebase.

Either you or @cancan101 should make the change, but it needs to be in a
separate PR from this one.

@cancan101
Copy link
Contributor

@mikebailey I agree that the startingMonth keyword is both misleading and actually wrong (see #5307). I am all for consistency. In this case I would value consistency with industry convention over consistency within Pandas. What I mean by that is not changing all periods to be specified based on when they end by either by what would make sense or what is an industry norm. See here #4591.

I would vote to having periods like YearEnd and QuarterEnd (and the new FY5253) having an endingMonth param and periods like YearStart keeping the startingMonth. I would not change everything to month since that is ambiguous and not self-documenting.

I definitely would not change existing the default behavior.

@jtratner
Copy link
Contributor

Would it be possible to just define a function that dispatches based upon
arguments, instead of redefining the constructors? Particularly in terms of
backwards compatibility, I'd rather add a new function that translates user
arguments into appropriate offsets than change the constructors for ten
different objects. Eventually we could deprecate direct access to the
existing offset classes. It's just much more awkward to maintain so many
constructors.

@cancan101
Copy link
Contributor

@jtratner I think thanks to your refactoring there are far fewer constructors than in the past.

@jtratner
Copy link
Contributor

@cancan101 there are fewer __init__ methods, but still a bunch of different entry points.

Btw - really trivial to change startingMonth if you think it's necessary, just make sure you accept but warn if startingMonth is passed to the constructor, and put in one really quick test for each class to check that it's actually warning.

@cancan101
Copy link
Contributor

@mikebailey Any interest in tackling that?

@mikebailey
Copy link
Contributor Author

Interest is high, but feel free to do this if you want it done sooner rather than later. I'll first fix up this PR with the recent changes.

So it seems like the consensus is to keep the startingMonth and endingMonth keywords and default behavior, but fix them so that endingMonth means the ending month and starting month means the starting month. Ideally this logic would be pushed into a function instead of separately into each constructor.

@mikebailey
Copy link
Contributor Author

I still need to make this work for FY5253 changes - if I get ambitious enough I'll tackle custom business dates as well.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@mikebailey can you review this and address some of the above comments....thanks

@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

@mikebailey you will need to refactor this a bit after I merge #6382 (and pay special attention that these could also be defined for PeriodIndex too)

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@mikebailey still working on this?

@mikebailey
Copy link
Contributor Author

@jreback, yes I am. I will get a diff in this week. Thanks for your
patience.

On Sun, Mar 9, 2014 at 7:58 AM, jreback [email protected] wrote:

@mikebailey https://github.com/mikebailey still working on this?

Reply to this email directly or view it on GitHubhttps://github.com//pull/4823#issuecomment-37128739
.

mikebailey

@jreback
Copy link
Contributor

jreback commented Mar 25, 2014

@mikebailey

all of these conversions need to be done in cython as well. Your current method is very inefficiet. Here's a hint, a month/quarter/year start must have a day=1. The startmonth must match for month,year as well. I think their are several cases but using year,month,day and the freq month number (and using the days_per_month_table / monthrange) functions you can do all of these computations.

@mikebailey
Copy link
Contributor Author

@jreback I have currently written it to support business frequencies so month/quarter/year may not necessarily be 1 for is_month_start, etc. This can definitely be done in Cython, but will take a little more work on my part as I won't be able to just call onOffset() to check everything.

@jreback
Copy link
Contributor

jreback commented Mar 25, 2014

these cases need to be discriminated (e.g. have an if-then depending on the frequency). The key is NOT to use OnOffset directly, rather do the computation using the year,month,day and the offset data.

@jreback
Copy link
Contributor

jreback commented Mar 25, 2014

You can write this in python for easy debugging, then move it to cython. This should be very much like get_date_field

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

any progress?

@mikebailey
Copy link
Contributor Author

@jreback uploaded recent progress - need to move this over to cython. Working on this now.

@mikebailey
Copy link
Contributor Author

@jreback can you check that my cython implementation that it looks reasonable? I don't have much experience with cython personally and copied heavily from get_date_field accessor.

I'm working on extending functionality to periods.

@jreback
Copy link
Contributor

jreback commented Apr 16, 2014

run with cython -a to produce an annotated output (it's a html file)

will show u if u r doing calls that are going back to python space (which if u can avoid is good)

best to setup a set of benchmarks
(in vb_suite) as well

u can then do a timeit/prun to see where it's spending time and see if it needs optimizing

also type all variable that u r using
even indicator variables (should be bint)

@mikebailey
Copy link
Contributor Author

@jreback thanks for the cython tips - after typing and refactoring I think it is pretty efficient now, at least compared to where it was before. It looks about the same cost as get_date_field besides the string comparisons that have to be done to determine whether there are business or anchored offsets.

I also added tests for the custom business days and a test to vb_suite as you suggested - though I haven't been able run vb_suite since there is a broken test in there right now I think.

One bug that I haven't been able to fix is that my accessor does not work for DateTimeIndex's that don't have a frequency. It will work for Timestamps without frequency and the raw call to tslib.get_start_end_field(idx, 'is_month_start', None, 12) works as well. But if try to do:

idx = pd.DatetimeIndex([pd.datetime(2012,1,1)])
getattr(idx, 'is_month_start')

TypeError Traceback (most recent call last)
in ()
----> 1 index._wrap_access_object(getattr(obj, 'is_month_start'))

I don't quite understand how IndexOpsMixIn works, but this error appears recursive to me. When I try gettatr(obj, 'is_month_start') it fails when it calls getattr(obj, 'is_month_start').... I can't yet determine what the frequency has to do with this error. Thoughts?

@jreback
Copy link
Contributor

jreback commented Apr 18, 2014

your vbench fails because 1/1/1 is not a valid Timestamp
see the FAQ for the ranges (min date is about year 1750 or so)

if the index is >= Len of 3 you can use infered_freq which will compute the freq if necessary

as far as the error with no freq goes
accessing is_month_start should call the function _is_month_start
so step they than line by line and see

@mikebailey
Copy link
Contributor Author

Also, the only other comment I haven't addressed is that it looks like one cannot use numpy boolean arrays in cython. If you have an implementation in mind to point me to I'd be happy to modify this.

@jreback
Copy link
Contributor

jreback commented Apr 18, 2014

use

arr=np.zeros(shape,dtype=np.uint8)

set are values to 1 as needed

return arr.view(np.bool_)

look in lib.pyx for examples (iirc is_member)

@mikebailey
Copy link
Contributor Author

@jreback - actually 1/1/1 is 2001-01-01. There was another test in vbench failing but after a rebase looks like it is working. I haven't gone too deep into the vbench code, is there an easy way to only generate performance of two revisions? when I run run_suite it looks like it is traversing a long git history. Is all the relevant data stored in benchmarks.db?

I tracked down my is_month_start error - I had to debug _is_month_start instead and the error became clear. This build is now passing. Let me know if there are any more issues.

It will probably take awhile for me to get up to speed on the periodindex stuff - will definitely add it but not sure how long it will take. I can do that in a separate PR if you want to wrap this one up. Thanks,

@mikebailey
Copy link
Contributor Author

Also, I haven't added anything to documentation yet. Let me know if I need to do anything beyond documenting the revision history.

@jorisvandenbossche
Copy link
Member

Regarding the docs, apart from adding it to the release notes, I think following would also be welcome:

  • add it to the API documentation (api.rst) (maybe add a section in the datetimeindex section: http://pandas.pydata.org/pandas-docs/dev/api.html#datetimeindex)
  • add some explanation to the timeseries docs (timeseries.rst). Although I don't directly see where it can be added, it looks like the other accessors like .date, .year, etc are also not yet documented.

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@mikebailey can you open an issue for docs needed (eg. which accessors in Index/DatetimeIndex) don't appear to have APi mentions / docs

can you also confirm that the quarter issues mentioned near the start of this thread are either: solved / need addressing?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@mikebailey ping when you have updated

@jreback
Copy link
Contributor

jreback commented Apr 28, 2014

@mikebailey can you rebase and respond to my aboe comments

@mikebailey
Copy link
Contributor Author

@jreback all the accessors have mentions in the API, but none of them have a mention in the timeseries section which seems like an oversight. I've opened this issue: #6998

I've rebased and added a note to the release notes and added entries for these accessors to api.rst.

RE: the quarter issues. I did not address them here. I like @cancan101 's suggestion that we instantiate all anchored offsets with 'startingMonth' and everything else with 'endingMonth' and translate the 'month' keyword to one of these two to not break compatibility. If we make this change, I can simplify these accessors a little bit.

@jorisvandenbossche
Copy link
Member

Adding a section about this new functionality in timeseries.rst would actually be appropriate in this PR (but a subsequent is also good). Are you able to do that?

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

@mikebailey we'll address the startingMonth issue separately, see #5307

@@ -465,6 +465,10 @@ API Changes
- ``DataFrame.apply`` will use the ``reduce`` argument to determine whether a
``Series`` or a ``DataFrame`` should be returned when the ``DataFrame`` is
empty (:issue:`6007`).
- Add ``is_month_start``, ``is_month_end``, ``is_quarter_start``, ``is_quarter_end``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add into v0.14.0.txt (put right below the section where I show #4551), where the index methods on series are defined, otherwise good to go

@mikebailey
Copy link
Contributor Author

Added to v0.14.0.txt.

@jorisvandenbossche, also added the date/time properties to timeseries.rst. Did not elaborate or provide any examples, just listed the properties so someone else can add on if it is not adequate.

Time/Date Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are several time/date properties that one can access from ``Timestamp`` or a collection of timestamps like a ``DateTimeIndex``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gr8!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe at least add that they are also available as Series/DataFrame attributes when it has a DatetimeIndex? And you copy paste relevant section of whatsnew to here I think: https://github.com/pydata/pandas/blob/master/doc/source/v0.14.0.txt#L72

(but can also for later PR)

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

gr8 ping me when green

@mikebailey
Copy link
Contributor Author

@jreback, we are green!

@jreback jreback merged commit bcf7f0d into pandas-dev:master Apr 29, 2014
@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

@mikebailey merged! thanks!

allows very nice things like:

may want to add somewhere

In [10]: s = Series(np.random.randn(50),index=date_range('20121215',freq='D',periods=50))

In [11]: s[s.is_quarter_start]
Out[11]: 
2013-01-01    0.226106
Freq: D, dtype: float64

In [12]: s[s.is_month_start]
Out[12]: 
2013-01-01    0.226106
2013-02-01   -0.459225
dtype: float64

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: add is_beg_month/year, is_end_month/year accessors for a DateTimeIndex/Timestamp
6 participants