Skip to content

API: revisit adding datetime-like ops in Series #7207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue May 22, 2014 · 59 comments · Fixed by #7953
Closed

API: revisit adding datetime-like ops in Series #7207

jreback opened this issue May 22, 2014 · 59 comments · Fixed by #7953

Comments

@jreback
Copy link
Contributor

jreback commented May 22, 2014

related is #7146, #7206

maybe provide a to_index() function and/or a class-like accessor

.values_to_index (too long though)

@jreback jreback added this to the 0.14.1 milestone May 22, 2014
@jreback
Copy link
Contributor Author

jreback commented May 22, 2014

@cpcloud @jorisvandenbossche

cc @rosnfeld
cc @mtkni

continue the discussion here (but we'll consider for 0.14.1)

@cpcloud
Copy link
Member

cpcloud commented May 22, 2014

i think to_index good. never really use these that much. agree that the Index type dependence could be a bit a special case and possibly confusing to new users.

@jorisvandenbossche
Copy link
Member

I agree with @mtkni that adding a to_index() method does not really solve the issue (but apart from this issue, it can be a good enhancement to add such a method).

The problem in mindset with the datetime accessors, when you have a datetime column in a dataframe, is that you have to first make an index of it before you can access datetime fields. If you don't really know the background, this is not logical. "Why should you make an index of it? If I only need the day values?" Whether this is with DatetimeIndex(s) or with s.to_index() does not really matter that much in this sense (although I agree it makes it a bit easier/shorter).

I also agree that you 'expect' at first sight that a series method acts on the values, not on the index (for most methods, you have of course index-specific methods). So I think it is good for now to disable this (as you did in #7206) until we decide on the API.

@jorisvandenbossche
Copy link
Member

What would you actually think of having the attributes act on the Series values? (so having s.year and s.index.year) or would that be to dangerous to confuse them?

@jreback
Copy link
Contributor Author

jreback commented May 23, 2014

that's what I had originally; but it IS ambiguous if you have a datetime index AND datetime values on a Series (not that common but common enough I guess).

@jorisvandenbossche
Copy link
Member

In that case it is indeed a bit ambiguous, but I think less ambiguous as the other way around (the attributes acting on the index). And it would also be useful functionality to have such attributes on the datetime64 values of a series/column.

@jreback
Copy link
Contributor Author

jreback commented May 23, 2014

ahh..you want them to act on the VALUES? hmm, don't liek the s.to_index().day approach? (which IMHO is pretty explicity)....maybe could make a property like s.vindex.day or something

@rosnfeld
Copy link
Contributor

Just a comment on my motivation here (I can't speak for @mtkni, though curious to hear other use cases):

I have worked with various datasets with multiple datetime columns, where none of them should really be the "index", so I find this to be somewhat common. E.g. humanitarian funding data might have various "state transition" dates for each entry (date funding was pledged, date it was transferred, date it was deployed towards a project, etc). Or a big dataset of projects might have start and end dates. So it's not that I'm creating standalone Series with date values, but the columns of a DataFrame often fit the bill.

Maybe I want to extract the year or month from each of these columns, or maybe somebody would want to tz_localize/tz_convert. Writing projects.end_date.year seems natural and yet it doesn't work, and the error is confusing - why would that operation have anything to do with the index? It looks like it's an operation on the column.

Requiring conversion to an index is surprising, as I (and likely others who don't understand the code as well) don't understand what is "special" about an index (or special about datetimes).

I'd be curious to know how others view this.

@jorisvandenbossche
Copy link
Member

@jreback yes indeed the values :-)

And that having to convert it to an index is strange, is what I wanted to say in my comment above (#7207 (comment)), as also @rosnfeld argues.

For me it is also rather common to have datetime values in columns, eg start and end times of measurements.

I agree that having a datelike index and a non-datelike values is probably by far the more common use case, for that case it is not that much more work (and more explicit) to type s.index.day

To make it concrete, this would give something like this:

In [1]: s = pd.Series(pd.date_range("2013-01-01 09:00:00", periods=3))

In [2]: s
Out[2]:
0   2013-01-01 09:00:00
1   2013-01-02 09:00:00
2   2013-01-03 09:00:00
dtype: datetime64[ns]

In [3]: s.day
0    1
1    2
2    3
dtype: int64

@jreback
Copy link
Contributor Author

jreback commented May 23, 2014

@rosnfeld ok, so you like the idea of using datetime-like ops (e.g. year,month,seconds...etc)

on the VALUES of a series (e.g. s.year)

rather that having them act on the index (which s.index.year will accomplish in any event)

so going back to common methods/properties for index/series (so they act similary, a good thing).

Note that their are currently several methods that DO NOT DO this, e.g. at_time/between_time/tz_localize/tz_convert which act on the index! (and the removed Series.weekday).

note also that tshift/asfreq acts on the index (but is more of a reshaping operation so that seems ok)

@TomAugspurger
Copy link
Contributor

Just to confirm, with a Series with datetime values and a datetime Index, the ops will still operate on the values right?

If so then +1 from me on operating on the values.

@shoyer
Copy link
Member

shoyer commented May 24, 2014

Instead of adding to_index() just so you can write s.to_index().day, what about creating some sort of "DatetimeArray" which separates out the part of DatetimeIndex which fixes np.datetime64 quirks from the Index part? This seems like the cleaner, more modular solution. DatetimeArray would implement properties like .day without the need to create an Index.

Not having looked at DatetimeIndex internals very carefully, I'm not sure how painful this would be. Of course, better would be to try to fix datetime64 upstream in numpy...

@jreback
Copy link
Contributor Author

jreback commented May 24, 2014

@shoyer not sure what u mean by datetime64 quirks?
DatetimeIndex already fixes everything (numpy quirks) and provides quite a lot of functionality
the issue is that Series and Index are really quite similar (and share a hierarchy to some extent in. 0.14.0)

only issue is how to unambiguously do things with the Series values. (eg Series.day) or the index
Series.index.day or have the user be more explicit by doing something like
Series.to_index().day or Series.values_as_index.day
(as Series.values returns the raw numpy data and cannot be used to this in a nice clean way) and relying on numpy for that functionality is not possible ATM

@shoyer
Copy link
Member

shoyer commented May 24, 2014

I know DatetimeIndex fixes the numpy quirks (comparison, casting, NaT, etc.) and provides some nice extra functionality (like the .day properties). My point is that most of these features are actually entirely distinct from the index part, so you can imagine exposing those in something like a "DatetimeArray" which makes the index part optional. DatetimeIndex would then be a relatively shallow wrapper around DatetimeArray.

I guess this would be API breaking, but if s.values could be an nd-array-like DatetimeArray (possibly even an ndarray subclass), then it would simplify this design issue because it would be obvious that Series would just pass on the "day" property from "s.values", like how it already does for ndarray properties.

Maybe this is not terribly useful for pandas because creating an index larger than a certain size only bothers to create the index lookup table on demand anyways.

@jreback
Copy link
Contributor Author

jreback commented May 24, 2014

it's possible that I could simply have Series.values return a DatetimeIndex if the dtype is correct (in fact other dtypes eg Categorical and Sparse do this) This might work and is not really a breaking API change - not need to create another class for that reason

that said it IS natural to simply use Series.day which applies to the values of the Series
just like virtually any other function (with a couple of exceptions as noted above)

the only issue I have is that doing Series.day on say a float dtype series will have to raise a TypeError
but I think that's ok

@shoyer
Copy link
Member

shoyer commented May 24, 2014

I guess the reason to (possibly) make another class would be so this same sort of thing works even with multi-dimensional datetime arrays, e.g., on a DataFrame with datetime values (df.day). That said, I'm 👍 on Series.values returning a DatetimeIndex as a step in the right direction. Bare np.datetime64 arrays are not terribly useful.

@jorisvandenbossche
Copy link
Member

I am not too fond of returning a DatetimeIndex for s.values, as this is well known to return a numpy array with the values. That would also be strange I think, that this would be different for datetimes.

And if you want to have these things for a whole dataframe of datetimes, you can always use apply and use the (possible) series attribute.

@jonblunt
Copy link

FWIW how about s.values.asdatetime().day .

For the uniinitiated the "index" information is not interesting, and
confusing. We have column that is datetime, the values are a not very
useful datetime64 but we want to use thme as a pandas datetime.

I assume that if the underlying array is not datetime64 this throws an
error.

On Sat, May 24, 2014 at 2:33 PM, Joris Van den Bossche <
[email protected]> wrote:

I am not too fond of returning a DatetimeIndex for s.values, as this is
well known to return a numpy array with the values. That would also be
strange I think, that this would be different for datetimes.

And if you want to have these things for a whole dataframe of datetimes,
you can always use apply and use the (possible) series attribute.


Reply to this email directly or view it on GitHubhttps://github.com//issues/7207#issuecomment-44097276
.

@shoyer
Copy link
Member

shoyer commented May 27, 2014

@jonblunt Unfortunately s.values.asdatetime() can't work unless s.values changes from being a raw numpy.ndarray -- it would need to be some sort of custom object for it to be possible to add a new method like asdatetime().

I agree that seeing a DatetimeIndex could be somewhat confusing, which is why I suggested a more generic DatetimeArray. However, pandas.Index objects do act almost exactly like ndarrays with a few extra methods, so from a functionality perspective it would be almost equivalent -- just with a slightly confusing repr when you print the values.

@hayd
Copy link
Contributor

hayd commented May 28, 2014

Would it make sense to make a Series with datetime values a subclass of Series (with DatetimeIndex date-y methods?)... Or perhaps a property as datetime? That way we tab complete the methods.

@property
def as_datetime(self):
    # assert it's a datetime dtype
    return pd.DatetimeIndex(self)

pd.Series.as_datetime = as_datetime

s.as_datetime.<tab>  # shows datetimeindex methods if datetime

@jreback
Copy link
Contributor Author

jreback commented Jun 10, 2014

pushing to 0.15; would be an API change

@jankatins
Copy link
Contributor

Not sure if this is related, but #7217 just introduced Series.cat.<categorical functions> (if Series is of type Categorical).

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

http://stackoverflow.com/questions/25129144/pandas-return-hour-from-datetime-column-directly?noredirect=1#comment39179371_25129144

I think easiest / best way is simply add:

Series.to_index(), nice clean and consistent.

any takers for a PR?

@jankatins
Copy link
Contributor

I like s.dt.{minute|year|...}, it's short and s.date would be not intuitive if you want a time.

There are now at least s.str and s.cat so I don't see why there shouldn't be a s.dt...

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

its also easy to make .dt (or other namespaces) raise if its actually accessed by an incorrect dtype (and not show up in tab completion)

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

ok, pls take a look at #7953 which implements the .dt namespace

@jorisvandenbossche
Copy link
Member

s.cat is not really a namespace, it is just an attribute to access the Categorical (which has of course special attributes/methods).
And I don't think that this is exactly what we want in this case, providing a DatetimeIndex via an attribute. Because in this way, the argument of "grouping of methods/attributes" from above does not apply (you have just all index methods, not the specific datetime attributes), and then I don't see an advantage of this nesting above just direct attributes?

@jreback this is at once also my comment on the PR

@rosnfeld
Copy link
Contributor

rosnfeld commented Aug 7, 2014

I had the same reaction - it would be nice if the tab-complete just showed the timeseries-specific attributes.

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

not hard to just show only datetimelike attributes....

@jankatins
Copy link
Contributor

s.cat could become a namespace, if numpy starts providing a categorical datatype itself, that was at least my understanding when that was discussed. Maybe that should be more clear in the documentation, that only a few methods on that "namespace" are guaranteed to work in the future.

@jankatins
Copy link
Contributor

Reading the tab completion in #7953, it would be nice if there would be a way to hide the "none-Api" methods. Kind of like a wrapper object which would created by passing in a list of attributes which are accessable (see this SO question) + adding these methods to tab completion (no idea how that works but someone mentioned it above)

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

This does look nicer (latest version of #7953)

In [1]: s = Series(date_range('20130101',periods=3))

In [2]: s.dt.
s.dt.date              s.dt.dayofweek         s.dt.hour              s.dt.is_month_start    s.dt.is_quarter_start  s.dt.is_year_start     s.dt.minute            s.dt.nanosecond        s.dt.second            s.dt.week              s.dt.year              
s.dt.day               s.dt.dayofyear         s.dt.is_month_end      s.dt.is_quarter_end    s.dt.is_year_end       s.dt.microsecond       s.dt.month             s.dt.quarter           s.dt.time              s.dt.weekofyear        

In [2]: s.dt.hour
Out[2]: array([0, 0, 0])

In [3]: s.dt.year
Out[3]: array([2013, 2013, 2013])

In [4]: s.dt.day
Out[4]: array([1, 2, 3])

And its specific to the type of wrapped delegate

In [5]: p = Series(period_range('20130101',periods=3,freq='D').asobject)

In [6]: p.dt.
p.dt.day         p.dt.dayofweek   p.dt.dayofyear   p.dt.hour        p.dt.minute      p.dt.month       p.dt.quarter     p.dt.qyear       p.dt.second      p.dt.week        p.dt.weekofyear  p.dt.year        

@rosnfeld
Copy link
Contributor

rosnfeld commented Aug 7, 2014

Should the return types be arrays? I kind of thought they would be Series, like with the 'str' methods.

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

they are arrays now (that is unchanged)

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

e.g.

generally

s = Series(date_range('20130101',periods=3))

s[s.dt.day==1]

versus currently

s[pd.DatetimeIndex(s.values).day==1]

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

we could probably return an Index but that's a separate issue

@shoyer
Copy link
Member

shoyer commented Aug 7, 2014

Why not return a new Series with the same index as the original series, as @rosnfeld suggests? That does seem a little more consistent with how most Series operations work.

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

hmm. These are actually index attributes, but easy enough to return a Series. sure that makes sense.

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

easy enough

In [3]: s
Out[3]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
dtype: datetime64[ns]

In [4]: s.dt.day
Out[4]: 
0    1
1    2
2    3
dtype: int64

In [5]: s[s.dt.day==1]
Out[5]: 
0   2013-01-01
dtype: datetime64[ns]

@shoyer
Copy link
Member

shoyer commented Aug 7, 2014

@jreback yes, the idea is to hide the (surprising) DatetimeIndex under the hood. (which, IMHO, is still a bit of a hack -- it would be nice to have the non-index functionality separated out in a DatetimeArray object like Categorical)

@jorisvandenbossche
Copy link
Member

+1 on returning Series. That is the point of these methods I think: it are attributes/methods that act on the Series, so should return Series. That it does a pass-by through DatetimeIndex under the hood is an implementation detail.

@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

@shoyer I think DatetimeArray could be made to work, but might not be necessary.

If you have a sketch of it, pls post.

@shoyer
Copy link
Member

shoyer commented Aug 7, 2014

@jreback Here is the general idea: DatetimeArray would wrap a np.datetime64, but include all the necessary numpy fixes (e.g., NaT, comparison) and the non-index specific properties/methods of DatetimeIndex (e.g., all these datetime components like day, week, month, etc.). This would leave DatetimeIndex as an Index + DatetimeArray adding the index specific methods.

As I mentioned above, you could even return a DatetimeArray from series.values (although that might be a surprising, given how everyone is used to values being a ndarray).

I think that it would be a nice idea in theory.... but again it's not my top priority.

@jankatins
Copy link
Contributor

I would suggest going the other way by adding a CategoricalNamespace (or CategoricalSeriesAPI or whatever it is named) object which would wrap Categorical. If category support ever changes (-> numpy implementation), this wouldn't need a API change.

@jankatins jankatins mentioned this issue Aug 7, 2014
5 tasks
@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

@JanSchulz Categorical IS already a namespace. No need for further indirection

@jankatins
Copy link
Contributor

@jreback: I disagree: it has the same problem as an index, lot's of methods which are not API (basically the API as Series.cat is reorder_levels, levels, and remove_unused_levels). Should we comment that in #7768?

@jreback
Copy link
Contributor Author

jreback commented Aug 8, 2014

@JanSchulz let's leave the Categorical stuff to #7768

@jreback
Copy link
Contributor Author

jreback commented Aug 8, 2014

pls have a look at updated #7953. removed a lot of the clutter from the previous impl (which did a form of redirection), and added docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants