timeseries branch, intervals w/ alternate durations #901

adamklein · 2012-03-11T20:55:39Z

Can the scikits.timeseries based interval code be extended to handle multiples of base durations?

Is it worth considering dropping the existing implementation and use, eg, start + end points (either two arrays, or one structured array), also have for fuzzier indexing some sort of

http://en.wikipedia.org/wiki/Interval_tree

"Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point"

ie log(n) operations for querying range intersections

Another possibility is an interval skip list, probabilistic log(n) behavior.

alvorithm · 2012-04-14T19:24:26Z

Hey, I am also very interested in this: http://permalink.gmane.org/gmane.comp.python.pydata/53

hbredin · 2012-11-27T21:36:08Z

+1. I would love to see support for (possibly overlapping) time intervals with variable duration.

shoyer · 2015-01-23T08:29:04Z

Just stumbled across this issue. I have most of an implementation for an IntervalIndex up in #8707, but it currently relies on sortedness for both starts and ends which is not ideal (and not always possible). It looks like an interval-tree could solve that problem... I may give that a try.

hbredin · 2015-01-23T08:42:07Z

I rely on the interval tree implementation available at https://pypi.python.org/pypi/Banyan for one of my projects. It might be useful here...

cpcloud · 2015-01-23T11:17:45Z

In fact at SciPy 2014 I was messing around with just this idea: using banyan to back an interval index

shoyer · 2015-01-28T02:43:35Z

I have most of a Cython implementation of a centered interval-tree together (based off the description on Wikipedia). I'm pretty pleased with the initial performance numbers vs banyan:

x = np.random.rand(1e5)
y = x + 1e-4 * np.random.rand(1e5)
tree = IntervalTree(x, y, leaf_size=100)

%timeit tree = IntervalTree(x, y, leaf_size=100)
%timeit tree.query(0.5)
# 10 loops, best of 3: 41.9 ms per loop
# 10000 loops, best of 3: 47.2 µs per loop

import banyan

intervals = [((xi, yi), n) for n, (xi, yi) in enumerate(zip(x, y))]
d = banyan.FrozenSortedDict(intervals, updator=banyan.OverlappingIntervalsUpdator)

%timeit d = banyan.FrozenSortedDict(intervals, updator=banyan.OverlappingIntervalsUpdator)
%timeit d.overlap_point(0.5)
# 1 loops, best of 3: 301 ms per loop
# 100 loops, best of 3: 15.3 ms per loop

Yes, queries are 300x faster.

I'll have more details + code (maybe a blog post?) up soon. The downside vs the augmented tree approach in banyan is that the centered interval tree only works for numeric values, not strings.

Although I would like to use this for IntervalIndex (#7640), I would rather not bury this in pandas -- I think it makes to release it in an external project, which pandas can depend on, either explicitly or bundled via a git submodule.

@jreback How we feel about adding external deps to pandas? My general feeling is that this would be a net plus for the community, but it does make packaging harder. This less of an issue with tools like conda and pip, though.

jreback · 2015-01-28T11:05:37Z

closing is favor of #7640

jreback · 2015-01-28T11:06:35Z

@shoyer external impl are already good. But prob makes sense to ship this directly with pandas. Having more dependencies is not a +.

(maybe think about contributing this to cytoolz?) might be in-scope there.

shoyer · 2015-01-28T22:42:03Z

@jreback you should ask @mrocklin about his approach to factoring out dependencies in blaze. I have given him a hard time for the rapidly multiplying sub-projects but I think he may have a point.

If pandas consisted of smaller, composable parts it would make it easier for the broader eco-system to reuse parts. I don't think this would be a good fit for cytoolz (which doesn't even have a numpy dependency), but it might make sense for its own project.

jreback · 2015-01-28T22:50:48Z

too many sub projects breeds complexities
I don't think pandas needs more dependencies ATM
in fact we r try to reduce the scope somewhat
but a complete package IS more than the sum of its parts because it is well integrated (eg this is the motivation for not splitting out io capability)

mrocklin · 2015-01-28T22:54:47Z

There are costs and benefits to either approach, both in development and in maintenance.

While I often speak out for the theoretical benefits of separable software packages I wouldn't, in general, impose recommendations to any project one way or the other.

The practice of many dependencies can be annoying during development. Hopefully improved package managers and build systems reduce these costs in the future.

jreback modified the milestones: Someday, 0.14.0 Feb 14, 2014

shoyer mentioned this issue Jan 28, 2015

Proposal: New Index type for binned data (IntervalIndex) #7640

Closed

jreback closed this as completed Jan 28, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

timeseries branch, intervals w/ alternate durations #901

timeseries branch, intervals w/ alternate durations #901

adamklein commented Mar 11, 2012

alvorithm commented Apr 14, 2012

hbredin commented Nov 27, 2012

shoyer commented Jan 23, 2015

hbredin commented Jan 23, 2015

cpcloud commented Jan 23, 2015

shoyer commented Jan 28, 2015

jreback commented Jan 28, 2015

jreback commented Jan 28, 2015

shoyer commented Jan 28, 2015

jreback commented Jan 28, 2015

mrocklin commented Jan 28, 2015

timeseries branch, intervals w/ alternate durations #901

timeseries branch, intervals w/ alternate durations #901

Comments

adamklein commented Mar 11, 2012

alvorithm commented Apr 14, 2012

hbredin commented Nov 27, 2012

shoyer commented Jan 23, 2015

hbredin commented Jan 23, 2015

cpcloud commented Jan 23, 2015

shoyer commented Jan 28, 2015

jreback commented Jan 28, 2015

jreback commented Jan 28, 2015

shoyer commented Jan 28, 2015

jreback commented Jan 28, 2015

mrocklin commented Jan 28, 2015