-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: IntervalArray/IntervalIndex Range Joins #43031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Nested containment lists are usually significantly faster than IntervalTree's by an order of magnitude, so I guess it would be cool to have NCLS-based interval joins. I already played a bit with NCLS, the library that powers pyranges: def rage_join_idx(df_left, df_right, on, by):
for idx, df_left_subset in df_left.groupby(by):
# execute join for every group separately
# an index on df_right[by] would be helpful here
df_right_subset = df_right[df_right[by] == idx]
on_start_col, on_end_col = on
ncls = NCLS(
df_left_subset[on_start_col].values,
df_left_subset[on_end_col].values,
np.arange(df_left.shape[0])
)
l_idxs, r_idxs = ncls.all_overlaps_both(
df_right_subset[on_start_col].values,
df_right_subset[on_end_col].values,
np.arange(df_right_subset.shape[0])
)
yield pd.DataFrame({
"left": df_left_subset.index.iloc[l_idxs],
"right": df_right_subset.index.iloc[r_idxs],
})
def rage_join(df_left, df_right, on, by):
# ensure uniqueness of index, keep index as columns
df_left = df_left.reset_index()
df_right = df_right.reset_index()
# concatenate indices of every group in `df_left.groupby(by)`
idx_df = pd.concat(list(rage_join_idx(df_left, df_right, on, by)), axis=0)
# get indices
return pd.concat([
df_left.iloc[idx_df["left"].values],
df_right.iloc[idx_df["right"].values],
], axis=1)
df_vars # chrom, start, end, ref, alt
df_ival # chrom, start, end
# range join on ["chrom", "start", "end"]
rage_join(df_vars, df_ival, on=("start", "end"), by="chrom") Basically, I get the intersecting indices for every chromosome and then yield the matching rows. Unfortunately, I have no clue how to implement this properly inside Pandas. |
Another interesting library that seems to be even more efficient than NCLS (did not try that yet): |
Joins on IntervalIndex and IntervalArray have been implemented in piso as a stop gap until implementation in pandas. It is currently limited to either left-closed or right-closed intervals. Example here. |
I haven't had any time to work on this but it's still a background interest of mine. This paper looks relevant for anybody else interested: A Scalable and Generic Approach to Range Joins |
Is your feature request related to a problem?
I'm interested in adding additional options that would modify the behavior of merge, join, and concatenate operations when performed between IntervalArrays or IntervalIndexes.
There has been a fair amount of related discussion in a number of somewhat stale issues. I'd be happy to continue the discussion on one of those if it makes more sense.
Describe the solution you'd like
I'd like to be able to do merges on joins using Intervals where instead of finding exact matches the overlapping ranges are matched and split up as needed. Suppose we have the following two DataFrames.
Currently, an outer join would not match any of the rows at all because it only joins Intervals that are equal.
df_A.join(df_B, how='outer')
I'm interested in having an option that would change the behavior so that the overlapping portions of the intervals would be aligned. So the result would look like this.
Ideally, the behavior would extend to multiple DataFrames or Series. So concat would work as well.
API breaking implications
I imagine it would be best to avoid breaking the existing behavior of merge, join, concat, etc. on IntervalArrays. I could see either creating new functions (like
pd.merge_asof
) or adding either additional parameters to the various methods or additional options to existing parameters (such ashow
).Describe alternatives you've considered
I've considered just writing my own functions on that use IntervalArrays to achieve the following and looked at a number of other libraries.
Additional context
Some other related concepts or implementations:
The text was updated successfully, but these errors were encountered: