API: revisit adding datetime-like ops in Series #7207

jreback · 2014-05-22T15:15:41Z

related is #7146, #7206

maybe provide a to_index() function and/or a class-like accessor

.values_to_index (too long though)

The text was updated successfully, but these errors were encountered:

jreback · 2014-05-22T15:23:18Z

@cpcloud @jorisvandenbossche

cc @rosnfeld
cc @mtkni

continue the discussion here (but we'll consider for 0.14.1)

cpcloud · 2014-05-22T17:30:35Z

i think to_index good. never really use these that much. agree that the Index type dependence could be a bit a special case and possibly confusing to new users.

jorisvandenbossche · 2014-05-22T19:39:27Z

I agree with @mtkni that adding a to_index() method does not really solve the issue (but apart from this issue, it can be a good enhancement to add such a method).

The problem in mindset with the datetime accessors, when you have a datetime column in a dataframe, is that you have to first make an index of it before you can access datetime fields. If you don't really know the background, this is not logical. "Why should you make an index of it? If I only need the day values?" Whether this is with DatetimeIndex(s) or with s.to_index() does not really matter that much in this sense (although I agree it makes it a bit easier/shorter).

I also agree that you 'expect' at first sight that a series method acts on the values, not on the index (for most methods, you have of course index-specific methods). So I think it is good for now to disable this (as you did in #7206) until we decide on the API.

jorisvandenbossche · 2014-05-23T12:55:27Z

What would you actually think of having the attributes act on the Series values? (so having s.year and s.index.year) or would that be to dangerous to confuse them?

jreback · 2014-05-23T12:58:47Z

that's what I had originally; but it IS ambiguous if you have a datetime index AND datetime values on a Series (not that common but common enough I guess).

jorisvandenbossche · 2014-05-23T13:11:28Z

In that case it is indeed a bit ambiguous, but I think less ambiguous as the other way around (the attributes acting on the index). And it would also be useful functionality to have such attributes on the datetime64 values of a series/column.

jreback · 2014-05-23T13:19:34Z

ahh..you want them to act on the VALUES? hmm, don't liek the s.to_index().day approach? (which IMHO is pretty explicity)....maybe could make a property like s.vindex.day or something

rosnfeld · 2014-05-23T14:36:35Z

Just a comment on my motivation here (I can't speak for @mtkni, though curious to hear other use cases):

I have worked with various datasets with multiple datetime columns, where none of them should really be the "index", so I find this to be somewhat common. E.g. humanitarian funding data might have various "state transition" dates for each entry (date funding was pledged, date it was transferred, date it was deployed towards a project, etc). Or a big dataset of projects might have start and end dates. So it's not that I'm creating standalone Series with date values, but the columns of a DataFrame often fit the bill.

Maybe I want to extract the year or month from each of these columns, or maybe somebody would want to tz_localize/tz_convert. Writing projects.end_date.year seems natural and yet it doesn't work, and the error is confusing - why would that operation have anything to do with the index? It looks like it's an operation on the column.

Requiring conversion to an index is surprising, as I (and likely others who don't understand the code as well) don't understand what is "special" about an index (or special about datetimes).

I'd be curious to know how others view this.

jorisvandenbossche · 2014-05-23T14:46:02Z

@jreback yes indeed the values :-)

And that having to convert it to an index is strange, is what I wanted to say in my comment above (#7207 (comment)), as also @rosnfeld argues.

For me it is also rather common to have datetime values in columns, eg start and end times of measurements.

I agree that having a datelike index and a non-datelike values is probably by far the more common use case, for that case it is not that much more work (and more explicit) to type s.index.day

To make it concrete, this would give something like this:

In [1]: s = pd.Series(pd.date_range("2013-01-01 09:00:00", periods=3))

In [2]: s
Out[2]:
0   2013-01-01 09:00:00
1   2013-01-02 09:00:00
2   2013-01-03 09:00:00
dtype: datetime64[ns]

In [3]: s.day
0    1
1    2
2    3
dtype: int64

jreback · 2014-05-23T14:46:48Z

@rosnfeld ok, so you like the idea of using datetime-like ops (e.g. year,month,seconds...etc)

on the VALUES of a series (e.g. s.year)

rather that having them act on the index (which s.index.year will accomplish in any event)

so going back to common methods/properties for index/series (so they act similary, a good thing).

Note that their are currently several methods that DO NOT DO this, e.g. at_time/between_time/tz_localize/tz_convert which act on the index! (and the removed Series.weekday).

note also that tshift/asfreq acts on the index (but is more of a reshaping operation so that seems ok)

TomAugspurger · 2014-05-23T15:16:59Z

Just to confirm, with a Series with datetime values and a datetime Index, the ops will still operate on the values right?

If so then +1 from me on operating on the values.

shoyer · 2014-05-24T18:22:41Z

Instead of adding to_index() just so you can write s.to_index().day, what about creating some sort of "DatetimeArray" which separates out the part of DatetimeIndex which fixes np.datetime64 quirks from the Index part? This seems like the cleaner, more modular solution. DatetimeArray would implement properties like .day without the need to create an Index.

Not having looked at DatetimeIndex internals very carefully, I'm not sure how painful this would be. Of course, better would be to try to fix datetime64 upstream in numpy...

jreback · 2014-05-24T18:29:41Z

@shoyer not sure what u mean by datetime64 quirks?
DatetimeIndex already fixes everything (numpy quirks) and provides quite a lot of functionality
the issue is that Series and Index are really quite similar (and share a hierarchy to some extent in. 0.14.0)

only issue is how to unambiguously do things with the Series values. (eg Series.day) or the index
Series.index.day or have the user be more explicit by doing something like
Series.to_index().day or Series.values_as_index.day
(as Series.values returns the raw numpy data and cannot be used to this in a nice clean way) and relying on numpy for that functionality is not possible ATM

shoyer · 2014-05-24T18:51:46Z

I know DatetimeIndex fixes the numpy quirks (comparison, casting, NaT, etc.) and provides some nice extra functionality (like the .day properties). My point is that most of these features are actually entirely distinct from the index part, so you can imagine exposing those in something like a "DatetimeArray" which makes the index part optional. DatetimeIndex would then be a relatively shallow wrapper around DatetimeArray.

I guess this would be API breaking, but if s.values could be an nd-array-like DatetimeArray (possibly even an ndarray subclass), then it would simplify this design issue because it would be obvious that Series would just pass on the "day" property from "s.values", like how it already does for ndarray properties.

Maybe this is not terribly useful for pandas because creating an index larger than a certain size only bothers to create the index lookup table on demand anyways.

jreback · 2014-05-24T19:09:38Z

it's possible that I could simply have Series.values return a DatetimeIndex if the dtype is correct (in fact other dtypes eg Categorical and Sparse do this) This might work and is not really a breaking API change - not need to create another class for that reason

that said it IS natural to simply use Series.day which applies to the values of the Series
just like virtually any other function (with a couple of exceptions as noted above)

the only issue I have is that doing Series.day on say a float dtype series will have to raise a TypeError
but I think that's ok

shoyer · 2014-05-24T19:29:15Z

I guess the reason to (possibly) make another class would be so this same sort of thing works even with multi-dimensional datetime arrays, e.g., on a DataFrame with datetime values (df.day). That said, I'm 👍 on Series.values returning a DatetimeIndex as a step in the right direction. Bare np.datetime64 arrays are not terribly useful.

jorisvandenbossche · 2014-05-24T19:33:40Z

I am not too fond of returning a DatetimeIndex for s.values, as this is well known to return a numpy array with the values. That would also be strange I think, that this would be different for datetimes.

And if you want to have these things for a whole dataframe of datetimes, you can always use apply and use the (possible) series attribute.

jonblunt · 2014-05-26T01:23:06Z

FWIW how about s.values.asdatetime().day .

For the uniinitiated the "index" information is not interesting, and
confusing. We have column that is datetime, the values are a not very
useful datetime64 but we want to use thme as a pandas datetime.

I assume that if the underlying array is not datetime64 this throws an
error.

On Sat, May 24, 2014 at 2:33 PM, Joris Van den Bossche <
[email protected]> wrote:

I am not too fond of returning a DatetimeIndex for s.values, as this is
well known to return a numpy array with the values. That would also be
strange I think, that this would be different for datetimes.

And if you want to have these things for a whole dataframe of datetimes,
you can always use apply and use the (possible) series attribute.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/7207#issuecomment-44097276
.

shoyer · 2014-05-27T01:32:18Z

@jonblunt Unfortunately s.values.asdatetime() can't work unless s.values changes from being a raw numpy.ndarray -- it would need to be some sort of custom object for it to be possible to add a new method like asdatetime().

I agree that seeing a DatetimeIndex could be somewhat confusing, which is why I suggested a more generic DatetimeArray. However, pandas.Index objects do act almost exactly like ndarrays with a few extra methods, so from a functionality perspective it would be almost equivalent -- just with a slightly confusing repr when you print the values.

hayd · 2014-05-28T03:23:32Z

Would it make sense to make a Series with datetime values a subclass of Series (with DatetimeIndex date-y methods?)... Or perhaps a property as datetime? That way we tab complete the methods.

@property
def as_datetime(self):
    # assert it's a datetime dtype
    return pd.DatetimeIndex(self)

pd.Series.as_datetime = as_datetime

s.as_datetime.<tab>  # shows datetimeindex methods if datetime

jreback · 2014-06-10T15:42:18Z

pushing to 0.15; would be an API change

jankatins · 2014-06-13T09:38:06Z

Not sure if this is related, but #7217 just introduced Series.cat.<categorical functions> (if Series is of type Categorical).

jreback · 2014-08-06T16:08:47Z

http://stackoverflow.com/questions/25129144/pandas-return-hour-from-datetime-column-directly?noredirect=1#comment39179371_25129144

I think easiest / best way is simply add:

Series.to_index(), nice clean and consistent.

any takers for a PR?

jankatins · 2014-08-07T14:41:21Z

I like s.dt.{minute|year|...}, it's short and s.date would be not intuitive if you want a time.

There are now at least s.str and s.cat so I don't see why there shouldn't be a s.dt...

jreback · 2014-08-07T14:46:58Z

its also easy to make .dt (or other namespaces) raise if its actually accessed by an incorrect dtype (and not show up in tab completion)

jreback · 2014-08-07T16:59:43Z

ok, pls take a look at #7953 which implements the .dt namespace

jorisvandenbossche · 2014-08-07T17:07:05Z

s.cat is not really a namespace, it is just an attribute to access the Categorical (which has of course special attributes/methods).
And I don't think that this is exactly what we want in this case, providing a DatetimeIndex via an attribute. Because in this way, the argument of "grouping of methods/attributes" from above does not apply (you have just all index methods, not the specific datetime attributes), and then I don't see an advantage of this nesting above just direct attributes?

@jreback this is at once also my comment on the PR

rosnfeld · 2014-08-07T17:14:47Z

I had the same reaction - it would be nice if the tab-complete just showed the timeseries-specific attributes.

jreback · 2014-08-07T17:24:03Z

not hard to just show only datetimelike attributes....

jankatins · 2014-08-07T17:35:49Z

s.cat could become a namespace, if numpy starts providing a categorical datatype itself, that was at least my understanding when that was discussed. Maybe that should be more clear in the documentation, that only a few methods on that "namespace" are guaranteed to work in the future.

jankatins · 2014-08-07T17:45:00Z

Reading the tab completion in #7953, it would be nice if there would be a way to hide the "none-Api" methods. Kind of like a wrapper object which would created by passing in a list of attributes which are accessable (see this SO question) + adding these methods to tab completion (no idea how that works but someone mentioned it above)

jreback · 2014-08-07T18:15:04Z

This does look nicer (latest version of #7953)

In [1]: s = Series(date_range('20130101',periods=3))

In [2]: s.dt.
s.dt.date              s.dt.dayofweek         s.dt.hour              s.dt.is_month_start    s.dt.is_quarter_start  s.dt.is_year_start     s.dt.minute            s.dt.nanosecond        s.dt.second            s.dt.week              s.dt.year              
s.dt.day               s.dt.dayofyear         s.dt.is_month_end      s.dt.is_quarter_end    s.dt.is_year_end       s.dt.microsecond       s.dt.month             s.dt.quarter           s.dt.time              s.dt.weekofyear        

In [2]: s.dt.hour
Out[2]: array([0, 0, 0])

In [3]: s.dt.year
Out[3]: array([2013, 2013, 2013])

In [4]: s.dt.day
Out[4]: array([1, 2, 3])

And its specific to the type of wrapped delegate

In [5]: p = Series(period_range('20130101',periods=3,freq='D').asobject)

In [6]: p.dt.
p.dt.day         p.dt.dayofweek   p.dt.dayofyear   p.dt.hour        p.dt.minute      p.dt.month       p.dt.quarter     p.dt.qyear       p.dt.second      p.dt.week        p.dt.weekofyear  p.dt.year

rosnfeld · 2014-08-07T19:18:55Z

Should the return types be arrays? I kind of thought they would be Series, like with the 'str' methods.

jreback · 2014-08-07T19:20:34Z

they are arrays now (that is unchanged)

jreback · 2014-08-07T19:24:00Z

e.g.

generally

s = Series(date_range('20130101',periods=3))

s[s.dt.day==1]

versus currently

s[pd.DatetimeIndex(s.values).day==1]

jreback · 2014-08-07T19:24:23Z

we could probably return an Index but that's a separate issue

shoyer · 2014-08-07T19:26:42Z

Why not return a new Series with the same index as the original series, as @rosnfeld suggests? That does seem a little more consistent with how most Series operations work.

jreback · 2014-08-07T19:27:42Z

hmm. These are actually index attributes, but easy enough to return a Series. sure that makes sense.

jreback · 2014-08-07T19:33:45Z

easy enough

In [3]: s
Out[3]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
dtype: datetime64[ns]

In [4]: s.dt.day
Out[4]: 
0    1
1    2
2    3
dtype: int64

In [5]: s[s.dt.day==1]
Out[5]: 
0   2013-01-01
dtype: datetime64[ns]

shoyer · 2014-08-07T19:36:52Z

@jreback yes, the idea is to hide the (surprising) DatetimeIndex under the hood. (which, IMHO, is still a bit of a hack -- it would be nice to have the non-index functionality separated out in a DatetimeArray object like Categorical)

jorisvandenbossche · 2014-08-07T19:38:12Z

+1 on returning Series. That is the point of these methods I think: it are attributes/methods that act on the Series, so should return Series. That it does a pass-by through DatetimeIndex under the hood is an implementation detail.

jreback · 2014-08-07T19:38:45Z

@shoyer I think DatetimeArray could be made to work, but might not be necessary.

If you have a sketch of it, pls post.

shoyer · 2014-08-07T19:52:56Z

@jreback Here is the general idea: DatetimeArray would wrap a np.datetime64, but include all the necessary numpy fixes (e.g., NaT, comparison) and the non-index specific properties/methods of DatetimeIndex (e.g., all these datetime components like day, week, month, etc.). This would leave DatetimeIndex as an Index + DatetimeArray adding the index specific methods.

As I mentioned above, you could even return a DatetimeArray from series.values (although that might be a surprising, given how everyone is used to values being a ndarray).

I think that it would be a nice idea in theory.... but again it's not my top priority.

jankatins · 2014-08-07T22:53:20Z

I would suggest going the other way by adding a CategoricalNamespace (or CategoricalSeriesAPI or whatever it is named) object which would wrap Categorical. If category support ever changes (-> numpy implementation), this wouldn't need a API change.

jreback · 2014-08-07T23:39:51Z

@JanSchulz Categorical IS already a namespace. No need for further indirection

jankatins · 2014-08-08T08:53:15Z

@jreback: I disagree: it has the same problem as an index, lot's of methods which are not API (basically the API as Series.cat is reorder_levels, levels, and remove_unused_levels). Should we comment that in #7768?

jreback · 2014-08-08T12:24:01Z

@JanSchulz let's leave the Categorical stuff to #7768

jreback · 2014-08-08T14:30:21Z

pls have a look at updated #7953. removed a lot of the clutter from the previous impl (which did a form of redirection), and added docs.

jreback added this to the 0.14.1 milestone May 22, 2014

jreback added the API Design label May 22, 2014

jreback mentioned this issue May 22, 2014

Feedback on v0.14.0 RC1 #7146

Closed

jreback added the Enhancement label May 23, 2014

hayd mentioned this issue May 29, 2014

Accessing fields in datetime64 Series #2315

Closed

jreback modified the milestones: 0.15.0, 0.14.1 Jun 10, 2014

jreback mentioned this issue Aug 7, 2014

API: add Series.dt delegator for datetimelike methods (GH7207) #7953

Merged

jreback modified the milestones: 0.15.1, 0.15.0 Aug 7, 2014

jankatins mentioned this issue Aug 7, 2014

Categorical fixups #7768

Closed

5 tasks

jreback closed this as completed in #7953 Aug 10, 2014

API: revisit adding datetime-like ops in Series #7207

API: revisit adding datetime-like ops in Series #7207

Comments

jreback commented May 22, 2014

jreback commented May 22, 2014

cpcloud commented May 22, 2014

jorisvandenbossche commented May 22, 2014

jorisvandenbossche commented May 23, 2014

jreback commented May 23, 2014

jorisvandenbossche commented May 23, 2014

jreback commented May 23, 2014

rosnfeld commented May 23, 2014

jorisvandenbossche commented May 23, 2014

jreback commented May 23, 2014

TomAugspurger commented May 23, 2014

shoyer commented May 24, 2014

jreback commented May 24, 2014

shoyer commented May 24, 2014

jreback commented May 24, 2014

shoyer commented May 24, 2014

jorisvandenbossche commented May 24, 2014

jonblunt commented May 26, 2014

shoyer commented May 27, 2014

hayd commented May 28, 2014

jreback commented Jun 10, 2014

jankatins commented Jun 13, 2014

jreback commented Aug 6, 2014

jankatins commented Aug 7, 2014

jreback commented Aug 7, 2014

jreback commented Aug 7, 2014

jorisvandenbossche commented Aug 7, 2014

rosnfeld commented Aug 7, 2014

jreback commented Aug 7, 2014

jankatins commented Aug 7, 2014

jankatins commented Aug 7, 2014

jreback commented Aug 7, 2014

rosnfeld commented Aug 7, 2014

jreback commented Aug 7, 2014

jreback commented Aug 7, 2014

jreback commented Aug 7, 2014

shoyer commented Aug 7, 2014

jreback commented Aug 7, 2014

jreback commented Aug 7, 2014

shoyer commented Aug 7, 2014

jorisvandenbossche commented Aug 7, 2014

jreback commented Aug 7, 2014

shoyer commented Aug 7, 2014

jankatins commented Aug 7, 2014

jreback commented Aug 7, 2014

jankatins commented Aug 8, 2014

jreback commented Aug 8, 2014

jreback commented Aug 8, 2014