-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: add is_beg_month/quarter/year, is_end_month/quarter/year accessors (#4565) #4823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1982,37 +1982,56 @@ def setUp(self): | |||
|
|||
def test_datetimeindex_accessors(self): | |||
dti = DatetimeIndex( | |||
freq='Q-JAN', start=datetime(1997, 12, 31), periods=100) | |||
freq='D', start=datetime(1998, 1, 1), periods=365) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add basically a copy of this test with a different freq as well (maybe something like MonthEnd)?
We should add these accessors to the DatetimeIndex documentation I started with #4706 |
sure...you can add some docs on this (and for all accessors like this for that matter) |
All the other accessors are already there. I can add these. |
I don't think the quarter accessors are well defined. There can be various quarterly cycles other than the one in this commit. |
if dtindex[i] == NPY_NAT: out[i] = -1; continue | ||
|
||
pandas_datetime_to_datetimestruct(dtindex[i], PANDAS_FR_ns, &dts) | ||
out[i] = (monthrange(dts.year, dts.month)[1] == dts.day and dts.month in [3, 6, 9, 12]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cancan101 is right, these months need to be a function of the frequency of the index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also make the same argument for end of year which can have various meanings for other "years".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's right, need to look at the freq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cancan101 @jreback Just to make sure I understand, these accessors need to be frequency aware for the freq of the Timestamp/DatetimeIndex. e.g.
x = pd.Timestamp('2013-06-30', offset='Q-JUN')
x.is_end_year
1
x = pd.Timestamp('2013-06-28', offset='BQ-JUN')
x.is_end_year #29th, 30th are not business days
1
Annual and quarterly frequencies define the year end month. Weekday freqs force beginning/end of periods to be on weekdays. Should I add is_beg/end_week accessors for weekly freqs? Do I need to worry about custom business days? Anything I am not covering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your example looks right for biz days (if custom is easy do it, if not, just raise)
Is this a bug?
|
I don't think so. One is quarterly periods defined by year ending in
|
In the docs it says: (B)Q(S)-JAN quarterly frequency, year ends in January So that is incorrect then, Q-JAN is different form QS-JAN in terms of ending year. I was also thrown by the fact that QuarterBegin() and QuarterEnd() share the same keyword 'startingMonth' and assumed they both referenced the last month of the year. It also seems odd that we instantiate QuarterBegin() with startingMonth = 3? x=pd.Timestamp('2012-01-01', offset='QS') Is there a reason we don't have 'staringMonth' and 'endingMonth' offsets to distinguish them? Or even better, use the 'endingMonth' offset for all offsets with end months. |
@mikebailey I think you pointed out some related bugs here...see #4069, #4109 |
@mikebailey see also #2885 |
ping! |
@jreback, I've looked into the bugs you've posted, I can wrap these all up in this PR without too much trouble I believe - I'll be off the grid for a couple of weeks and probably won't be able to finish before I leave, but will take this on when I get back. |
gr8 |
pushing to 0.14 |
@jreback @cancan101 working on these offset bugs now. What are your thoughts on changing the default behavior of these offsets? Ideally all quarter and annual offsets would default to beginning jan 1st - but that would break backwards compatibility, should I forge ahead with the change? I also want to unify on a single |
@mikebailey first pls rebase rebase on master, a big PR by @cancan101 was recently merged which add some new offset stuff (and @jtratner did a lot of cleaning on how the offsets are created). I believe @cancan101 has some thoughts on the naming scheme internal to offsets? we don't want to break back compat unless have a really good reason. |
@cancan101 was suggesting the same thing about changing the month kwarg It's trivial to make this change because all the constructors are single Either you or @cancan101 should make the change, but it needs to be in a |
@mikebailey I agree that the I would vote to having periods like I definitely would not change existing the default behavior. |
Would it be possible to just define a function that dispatches based upon |
@jtratner I think thanks to your refactoring there are far fewer constructors than in the past. |
@cancan101 there are fewer Btw - really trivial to change startingMonth if you think it's necessary, just make sure you accept but warn if startingMonth is passed to the constructor, and put in one really quick test for each class to check that it's actually warning. |
@mikebailey Any interest in tackling that? |
Interest is high, but feel free to do this if you want it done sooner rather than later. I'll first fix up this PR with the recent changes. So it seems like the consensus is to keep the startingMonth and endingMonth keywords and default behavior, but fix them so that endingMonth means the ending month and starting month means the starting month. Ideally this logic would be pushed into a function instead of separately into each constructor. |
I still need to make this work for FY5253 changes - if I get ambitious enough I'll tackle custom business dates as well. |
@mikebailey can you review this and address some of the above comments....thanks |
@mikebailey you will need to refactor this a bit after I merge #6382 (and pay special attention that these could also be defined for PeriodIndex too) |
@mikebailey still working on this? |
@jreback, yes I am. I will get a diff in this week. Thanks for your On Sun, Mar 9, 2014 at 7:58 AM, jreback [email protected] wrote:
mikebailey |
all of these conversions need to be done in cython as well. Your current method is very inefficiet. Here's a hint, a month/quarter/year start must have a day=1. The startmonth must match for month,year as well. I think their are several cases but using year,month,day and the freq month number (and using the |
@jreback I have currently written it to support business frequencies so month/quarter/year may not necessarily be 1 for is_month_start, etc. This can definitely be done in Cython, but will take a little more work on my part as I won't be able to just call onOffset() to check everything. |
these cases need to be discriminated (e.g. have an if-then depending on the frequency). The key is NOT to use OnOffset directly, rather do the computation using the year,month,day and the offset data. |
You can write this in python for easy debugging, then move it to cython. This should be very much like |
any progress? |
@jreback uploaded recent progress - need to move this over to cython. Working on this now. |
@jreback can you check that my cython implementation that it looks reasonable? I don't have much experience with cython personally and copied heavily from get_date_field accessor. I'm working on extending functionality to periods. |
run with cython -a to produce an annotated output (it's a html file) will show u if u r doing calls that are going back to python space (which if u can avoid is good) best to setup a set of benchmarks u can then do a timeit/prun to see where it's spending time and see if it needs optimizing also type all variable that u r using |
@jreback thanks for the cython tips - after typing and refactoring I think it is pretty efficient now, at least compared to where it was before. It looks about the same cost as get_date_field besides the string comparisons that have to be done to determine whether there are business or anchored offsets. I also added tests for the custom business days and a test to vb_suite as you suggested - though I haven't been able run vb_suite since there is a broken test in there right now I think. One bug that I haven't been able to fix is that my accessor does not work for DateTimeIndex's that don't have a frequency. It will work for Timestamps without frequency and the raw call to tslib.get_start_end_field(idx, 'is_month_start', None, 12) works as well. But if try to do: idx = pd.DatetimeIndex([pd.datetime(2012,1,1)]) TypeError Traceback (most recent call last) I don't quite understand how IndexOpsMixIn works, but this error appears recursive to me. When I try gettatr(obj, 'is_month_start') it fails when it calls getattr(obj, 'is_month_start').... I can't yet determine what the frequency has to do with this error. Thoughts? |
your vbench fails because 1/1/1 is not a valid Timestamp if the index is >= Len of 3 you can use infered_freq which will compute the freq if necessary as far as the error with no freq goes |
Also, the only other comment I haven't addressed is that it looks like one cannot use numpy boolean arrays in cython. If you have an implementation in mind to point me to I'd be happy to modify this. |
use arr=np.zeros(shape,dtype=np.uint8) set are values to 1 as needed return arr.view(np.bool_) look in lib.pyx for examples (iirc is_member) |
@jreback - actually 1/1/1 is 2001-01-01. There was another test in vbench failing but after a rebase looks like it is working. I haven't gone too deep into the vbench code, is there an easy way to only generate performance of two revisions? when I run run_suite it looks like it is traversing a long git history. Is all the relevant data stored in benchmarks.db? I tracked down my is_month_start error - I had to debug _is_month_start instead and the error became clear. This build is now passing. Let me know if there are any more issues. It will probably take awhile for me to get up to speed on the periodindex stuff - will definitely add it but not sure how long it will take. I can do that in a separate PR if you want to wrap this one up. Thanks, |
Also, I haven't added anything to documentation yet. Let me know if I need to do anything beyond documenting the revision history. |
Regarding the docs, apart from adding it to the release notes, I think following would also be welcome:
|
@mikebailey can you open an issue for docs needed (eg. which accessors in Index/DatetimeIndex) don't appear to have APi mentions / docs can you also confirm that the quarter issues mentioned near the start of this thread are either: solved / need addressing? |
@mikebailey ping when you have updated |
@mikebailey can you rebase and respond to my aboe comments |
@jreback all the accessors have mentions in the API, but none of them have a mention in the timeseries section which seems like an oversight. I've opened this issue: #6998 I've rebased and added a note to the release notes and added entries for these accessors to api.rst. RE: the quarter issues. I did not address them here. I like @cancan101 's suggestion that we instantiate all anchored offsets with 'startingMonth' and everything else with 'endingMonth' and translate the 'month' keyword to one of these two to not break compatibility. If we make this change, I can simplify these accessors a little bit. |
Adding a section about this new functionality in timeseries.rst would actually be appropriate in this PR (but a subsequent is also good). Are you able to do that? |
@mikebailey we'll address the |
@@ -465,6 +465,10 @@ API Changes | |||
- ``DataFrame.apply`` will use the ``reduce`` argument to determine whether a | |||
``Series`` or a ``DataFrame`` should be returned when the ``DataFrame`` is | |||
empty (:issue:`6007`). | |||
- Add ``is_month_start``, ``is_month_end``, ``is_quarter_start``, ``is_quarter_end``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add into v0.14.0.txt (put right below the section where I show #4551), where the index methods on series are defined, otherwise good to go
Added to v0.14.0.txt. @jorisvandenbossche, also added the date/time properties to timeseries.rst. Did not elaborate or provide any examples, just listed the properties so someone else can add on if it is not adequate. |
Time/Date Components | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
There are several time/date properties that one can access from ``Timestamp`` or a collection of timestamps like a ``DateTimeIndex``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gr8!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe at least add that they are also available as Series/DataFrame attributes when it has a DatetimeIndex? And you copy paste relevant section of whatsnew to here I think: https://github.com/pydata/pandas/blob/master/doc/source/v0.14.0.txt#L72
(but can also for later PR)
gr8 ping me when green |
@jreback, we are green! |
@mikebailey merged! thanks! allows very nice things like: may want to add somewhere
|
closes #4565