Skip to content

timeseries branch, intervals w/ alternate durations #901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adamklein opened this issue Mar 11, 2012 · 11 comments
Closed

timeseries branch, intervals w/ alternate durations #901

adamklein opened this issue Mar 11, 2012 · 11 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Datetime Datetime data dtype Enhancement Ideas Long-Term Enhancement Discussions

Comments

@adamklein
Copy link
Contributor

Can the scikits.timeseries based interval code be extended to handle multiples of base durations?

Is it worth considering dropping the existing implementation and use, eg, start + end points (either two arrays, or one structured array), also have for fuzzier indexing some sort of

http://en.wikipedia.org/wiki/Interval_tree

"Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point"

ie log(n) operations for querying range intersections

Another possibility is an interval skip list, probabilistic log(n) behavior.

@alvorithm
Copy link

Hey, I am also very interested in this: http://permalink.gmane.org/gmane.comp.python.pydata/53

@hbredin
Copy link

hbredin commented Nov 27, 2012

+1. I would love to see support for (possibly overlapping) time intervals with variable duration.

@jreback jreback modified the milestones: Someday, 0.14.0 Feb 14, 2014
@shoyer
Copy link
Member

shoyer commented Jan 23, 2015

Just stumbled across this issue. I have most of an implementation for an IntervalIndex up in #8707, but it currently relies on sortedness for both starts and ends which is not ideal (and not always possible). It looks like an interval-tree could solve that problem... I may give that a try.

@hbredin
Copy link

hbredin commented Jan 23, 2015

I rely on the interval tree implementation available at https://pypi.python.org/pypi/Banyan for one of my projects. It might be useful here...

@cpcloud
Copy link
Member

cpcloud commented Jan 23, 2015

In fact at SciPy 2014 I was messing around with just this idea: using banyan to back an interval index

@shoyer
Copy link
Member

shoyer commented Jan 28, 2015

I have most of a Cython implementation of a centered interval-tree together (based off the description on Wikipedia). I'm pretty pleased with the initial performance numbers vs banyan:

x = np.random.rand(1e5)
y = x + 1e-4 * np.random.rand(1e5)
tree = IntervalTree(x, y, leaf_size=100)

%timeit tree = IntervalTree(x, y, leaf_size=100)
%timeit tree.query(0.5)
# 10 loops, best of 3: 41.9 ms per loop
# 10000 loops, best of 3: 47.2 µs per loop

import banyan

intervals = [((xi, yi), n) for n, (xi, yi) in enumerate(zip(x, y))]
d = banyan.FrozenSortedDict(intervals, updator=banyan.OverlappingIntervalsUpdator)

%timeit d = banyan.FrozenSortedDict(intervals, updator=banyan.OverlappingIntervalsUpdator)
%timeit d.overlap_point(0.5)
# 1 loops, best of 3: 301 ms per loop
# 100 loops, best of 3: 15.3 ms per loop

Yes, queries are 300x faster.

I'll have more details + code (maybe a blog post?) up soon. The downside vs the augmented tree approach in banyan is that the centered interval tree only works for numeric values, not strings.

Although I would like to use this for IntervalIndex (#7640), I would rather not bury this in pandas -- I think it makes to release it in an external project, which pandas can depend on, either explicitly or bundled via a git submodule.

@jreback How we feel about adding external deps to pandas? My general feeling is that this would be a net plus for the community, but it does make packaging harder. This less of an issue with tools like conda and pip, though.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2015

closing is favor of #7640

@jreback jreback closed this as completed Jan 28, 2015
@jreback
Copy link
Contributor

jreback commented Jan 28, 2015

@shoyer external impl are already good. But prob makes sense to ship this directly with pandas. Having more dependencies is not a +.

(maybe think about contributing this to cytoolz?) might be in-scope there.

@shoyer
Copy link
Member

shoyer commented Jan 28, 2015

@jreback you should ask @mrocklin about his approach to factoring out dependencies in blaze. I have given him a hard time for the rapidly multiplying sub-projects but I think he may have a point.

If pandas consisted of smaller, composable parts it would make it easier for the broader eco-system to reuse parts. I don't think this would be a good fit for cytoolz (which doesn't even have a numpy dependency), but it might make sense for its own project.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2015

too many sub projects breeds complexities
I don't think pandas needs more dependencies ATM
in fact we r try to reduce the scope somewhat
but a complete package IS more than the sum of its parts because it is well integrated (eg this is the motivation for not splitting out io capability)

@mrocklin
Copy link
Contributor

There are costs and benefits to either approach, both in development and in maintenance.

While I often speak out for the theoretical benefits of separable software packages I wouldn't, in general, impose recommendations to any project one way or the other.

The practice of many dependencies can be annoying during development. Hopefully improved package managers and build systems reduce these costs in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Datetime Datetime data dtype Enhancement Ideas Long-Term Enhancement Discussions
Projects
None yet
Development

No branches or pull requests

7 participants