timeseries branch, intervals w/ alternate durations #901
Comments
Hey, I am also very interested in this: http://permalink.gmane.org/gmane.comp.python.pydata/53
+1. I would love to see support for (possibly overlapping) time intervals with variable duration.
Just stumbled across this issue. I have most of an implementation for an
I rely on the interval tree implementation available at https://pypi.python.org/pypi/Banyan for one of my projects. It might be useful here...
In fact at SciPy 2014 I was messing around with just this idea: using banyan to back an interval index
I have most of a Cython implementation of a centered interval-tree together (based off the description on Wikipedia). I'm pretty pleased with the initial performance numbers vs banyan:
Yes, queries are 300x faster. I'll have more details + code (maybe a blog post?) up soon. The downside vs the augmented tree approach in banyan is that the centered interval tree only works for numeric values, not strings. Although I would like to use this for IntervalIndex (#7640), I would rather not bury this in pandas -- I think it makes sense to release it in an external project, which pandas can depend on, either explicitly or bundled via a git submodule. @jreback How do we feel about adding external deps to pandas? My general feeling is that this would be a net plus for the community, but it does make packaging harder. This is less of an issue with tools like conda and pip, though.
closing in favor of #7640
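For readers unfamiliar with the structure, here is a minimal pure-Python sketch of a centered interval tree along the lines of the Wikipedia description. It is not the Cython implementation mentioned above (that code is not shown in this thread); the names `Node`, `build` and `query_point` are made up for illustration, and intervals are assumed to be closed `(start, end)` pairs of numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # closed interval (start, end); an assumption of this sketch


@dataclass
class Node:
    center: float
    by_start: List[Interval]  # intervals containing center, ascending by start
    by_end: List[Interval]    # the same intervals, descending by end
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(intervals: List[Interval]) -> Optional[Node]:
    """Recursively build the tree, splitting on the median endpoint."""
    if not intervals:
        return None
    endpoints = sorted(p for iv in intervals for p in iv)
    center = endpoints[len(endpoints) // 2]
    left = [iv for iv in intervals if iv[1] < center]    # entirely left of center
    right = [iv for iv in intervals if iv[0] > center]   # entirely right of center
    mid = [iv for iv in intervals if iv[0] <= center <= iv[1]]
    return Node(
        center,
        sorted(mid),
        sorted(mid, key=lambda iv: iv[1], reverse=True),
        build(left),
        build(right),
    )


def query_point(node: Optional[Node], x: float) -> List[Interval]:
    """Return every stored interval that contains the point x."""
    out: List[Interval] = []
    while node is not None:
        if x < node.center:
            # Every interval here ends at or after center > x, so only starts matter.
            for iv in node.by_start:
                if iv[0] > x:
                    break
                out.append(iv)
            node = node.left
        elif x > node.center:
            # Every interval here starts at or before center < x, so only ends matter.
            for iv in node.by_end:
                if iv[1] < x:
                    break
                out.append(iv)
            node = node.right
        else:
            out.extend(node.by_start)  # x == center: all intervals at this node match
            break
    return out
```

Because the split point is a numeric median of the endpoints, a sketch like this only makes sense for values that can be ordered and compared against a center, which fits the numeric-only limitation noted above.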
@shoyer external impl are already good. But prob makes sense to ship this directly with pandas. Having more dependencies is not a +. (maybe think about contributing this to
@jreback you should ask @mrocklin about his approach to factoring out dependencies in blaze. I have given him a hard time for the rapidly multiplying sub-projects but I think he may have a point. If pandas consisted of smaller, composable parts it would make it easier for the broader ecosystem to reuse parts. I don't think this would be a good fit for cytoolz (which doesn't even have a numpy dependency), but it might make sense for its own project.
too many sub projects breeds complexities
There are costs and benefits to either approach, both in development and in maintenance. While I often speak out for the theoretical benefits of separable software packages I wouldn't, in general, impose recommendations on any project one way or the other. The practice of many dependencies can be annoying during development. Hopefully improved package managers and build systems reduce these costs in the future.
Can the scikits.timeseries-based interval code be extended to handle multiples of base durations?
Is it worth considering dropping the existing implementation and using, e.g., start + end points (either two arrays, or one structured array), and also having, for fuzzier indexing, some sort of interval tree?
http://en.wikipedia.org/wiki/Interval_tree
"Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point"
i.e. log(n) operations for querying range intersections
Another possibility is an interval skip list, with probabilistic log(n) behavior.
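To make the "start + end points in two arrays" idea concrete, here is a rough sketch using plain Python lists. The class name `IntervalArrays` and its `overlapping` method are hypothetical, not anything in pandas or scikits.timeseries, and intervals are assumed to be closed on both ends. The query is a linear scan, so it is the baseline an interval tree or interval skip list would be meant to improve on.

```python
from datetime import datetime


class IntervalArrays:
    """Variable-duration intervals stored as two parallel lists (hypothetical helper)."""

    def __init__(self, starts, ends):
        if len(starts) != len(ends):
            raise ValueError("starts and ends must have the same length")
        if any(s > e for s, e in zip(starts, ends)):
            raise ValueError("each start must be <= its end")
        self.starts = list(starts)
        self.ends = list(ends)

    def overlapping(self, lo, hi):
        """Positions of all intervals that overlap the closed range [lo, hi].

        This is an O(n) scan over every stored interval.
        """
        return [
            i
            for i, (s, e) in enumerate(zip(self.starts, self.ends))
            if s <= hi and e >= lo
        ]


# Intervals of different durations: one week, a bit over three weeks, three days.
ivs = IntervalArrays(
    starts=[datetime(2012, 1, 1), datetime(2012, 1, 8), datetime(2012, 2, 1)],
    ends=[datetime(2012, 1, 7), datetime(2012, 1, 31), datetime(2012, 2, 3)],
)
hits = ivs.overlapping(datetime(2012, 1, 5), datetime(2012, 1, 10))
```

A point lookup is the special case `overlapping(t, t)`. Nothing here requires the intervals to be sorted, non-overlapping, or multiples of a base duration.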