Skip to content

Non-inclusive partial string indexing on DatetimeIndex #16571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
linar-jether opened this issue Jun 1, 2017 · 6 comments
Open

Non-inclusive partial string indexing on DatetimeIndex #16571

linar-jether opened this issue Jun 1, 2017 · 6 comments
Labels
Datetime Datetime data dtype Docs good first issue Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@linar-jether
Copy link

linar-jether commented Jun 1, 2017

Currently, the only way to query a time series object in the form of start_time < t < end_time, is by using a boolean mask array, as using loc[slice] includes both the start and end of the range.

Querying using a boolean array is very slow compared to a slice on large dataframes, and pretty much makes no sense, as it always returns an array the size of the dataframe when my query only looks at a fraction of the data (which is the reason for indexing).

How does df.query perform range queries?
Is there another method for closed/open ranges?

I'm looking for something similar to df.between_time that allows to optionally include the start/end


[from @TomAugspurger]
I think we're reluctant to add more complexity / options to indexing with .loc and friends, but this would be a good doc example of how to, achieve it using lower-level methods:

In [24]: import pandas.util.testing as tm

In [25]: ts = tm.makeTimeSeries()

In [26]: ts
Out[26]:
2000-01-03    0.804101
2000-01-04    0.042160
2000-01-05   -0.580078
2000-01-06    0.757864
2000-01-07   -0.349766
2000-01-10   -0.058222
2000-01-11   -0.274172
2000-01-12   -1.539538
2000-01-13    0.505398
2000-01-14    0.665445
2000-01-17    0.998438
...
Freq: B, dtype: float64

Say you want to slice [2000-01-04, 2000-01-10) (so excluding the right endpoint)

In [27]: lo = ts.index.get_slice_bound("2000-01-04", "left", "loc")

In [28]: hi = ts.index.get_slice_bound("2000-01-10", "left", "loc")

In [29]: ts.iloc[lo:hi]
Out[29]:
2000-01-04    0.042160
2000-01-05   -0.580078
2000-01-06    0.757864
2000-01-07   -0.349766
Freq: B, dtype: float64
@jreback
Copy link
Contributor

jreback commented Jun 1, 2017

pls read the docs: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#indexing

These are the recommended way to query. These by-definition support slicing when used via .loc

you are in fact using a convenience method, which only supports boolean indexing, powered by numexpr.

@jreback jreback closed this as completed Jun 1, 2017
@jreback jreback added Datetime Datetime data dtype Usage Question Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 1, 2017
@jreback jreback added this to the No action milestone Jun 1, 2017
@linar-jether
Copy link
Author

Yes, I've read the docs...
Note that the issue is about slicing with open/closed ranges, similar to df.between_time

Currently the .loc slicing is always inclusive at both ends of the range

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 1, 2017

I think we're reluctant to add more complexity / options to indexing with .loc and friends.

Here's an example of how you can achieve it, using lower-level methods:

In [24]: import pandas.util.testing as tm

In [25]: ts = tm.makeTimeSeries()

In [26]: ts
Out[26]:
2000-01-03    0.804101
2000-01-04    0.042160
2000-01-05   -0.580078
2000-01-06    0.757864
2000-01-07   -0.349766
2000-01-10   -0.058222
2000-01-11   -0.274172
2000-01-12   -1.539538
2000-01-13    0.505398
2000-01-14    0.665445
2000-01-17    0.998438
...
Freq: B, dtype: float64

Say you want to slice [2000-01-04, 2000-01-10) (so excluding the right endpoint)

In [27]: lo = ts.index.get_slice_bound("2000-01-04", "left", "loc")

In [28]: hi = ts.index.get_slice_bound("2000-01-10", "left", "loc")

In [29]: ts.iloc[lo:hi]
Out[29]:
2000-01-04    0.042160
2000-01-05   -0.580078
2000-01-06    0.757864
2000-01-07   -0.349766
Freq: B, dtype: float64

@jorisvandenbossche
Copy link
Member

I think that would be a very nice "advanced indexing trick" for somewhere in the docs.

@TomAugspurger TomAugspurger changed the title Timeseries range query Non-inclusive partial datetime indexing Jun 1, 2017
@TomAugspurger TomAugspurger changed the title Non-inclusive partial datetime indexing Non-inclusive partial string indexing on DatetimeIndex Jun 1, 2017
@TomAugspurger
Copy link
Contributor

Added the example to the original.

I think a cookbook example is probably most appropriate? @linar-jether would you mind submitting a PR?

@TomAugspurger TomAugspurger reopened this Jun 1, 2017
@linar-jether
Copy link
Author

@TomAugspurger Yea sure, I'll add a cookbook example

But as time series data is a major use case for pandas I think something as basic as a range query should be implemented in the API, maybe extend between_time to support timestamp objects instead of just time?

And thanks for your response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Docs good first issue Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

6 participants