Skip to content

ENH: IntervalArray/IntervalIndex Range Joins #43031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sterlinm opened this issue Aug 14, 2021 · 4 comments
Open

ENH: IntervalArray/IntervalIndex Range Joins #43031

sterlinm opened this issue Aug 14, 2021 · 4 comments
Labels
Enhancement Index Related to the Index class or subclasses Interval Interval data type Reshaping Concat, Merge/Join, Stack/Unstack, Explode setops union, intersection, difference, symmetric_difference

Comments

@sterlinm
Copy link

Is your feature request related to a problem?

I'm interested in adding additional options that would modify the behavior of merge, join, and concatenate operations when performed between IntervalArrays or IntervalIndexes.

There has been a fair amount of related discussion in a number of somewhat stale issues. I'd be happy to continue the discussion on one of those if it makes more sense.

Describe the solution you'd like

I'd like to be able to do merges on joins using Intervals where instead of finding exact matches the overlapping ranges are matched and split up as needed. Suppose we have the following two DataFrames.

idx_A = pd.IntervalIndex.from_tuples([(1, 4), (4, 6), (8, 9)])
df_A = pd.DataFrame(data={'A':['A1', 'A2', 'A3']}, index=idx_A)

idx_B = pd.IntervalIndex.from_tuples([(2, 3), (3, 5), (5, 10)])
df_B = pd.DataFrame(data={'B':['B1', 'B2', 'B3']}, index=idx_B)

Screen Shot 2021-08-13 at 8 57 04 PM

Currently, an outer join would not match any of the rows at all because it only joins Intervals that are equal.

df_A.join(df_B, how='outer')

Screen Shot 2021-08-13 at 9 00 15 PM

I'm interested in having an option that would change the behavior so that the overlapping portions of the intervals would be aligned. So the result would look like this.

df_a.join(df_B, how='overlapping') # not at all wedded to that parameter
idx_expected = pd.IntervalIndex.from_tuples([
    (1,2), (2,3), (3,4), (4,5),
    (5,6), (6,9),(9,10)])

df_expected = pd.DataFrame(
    index=idx_expected, data={
        'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', np.nan],
        'B': [np.nan, 'B1', 'B2', 'B2', 'B3', 'B3', 'B3']
    }
    
)

Screen Shot 2021-08-13 at 9 06 22 PM

Ideally, the behavior would extend to multiple DataFrames or Series. So concat would work as well.

idx_C = pd.IntervalIndex.from_tuples([(3, 5), (8, 12), (14, 20)])
df_C = pd.DataFrame(data={'C':['C1', 'C2', 'C3']}, index=idx_C)

concat_df = pd.concat([df_A, df_B, df_C], axis=1)

Screen Shot 2021-08-13 at 9 14 15 PM

Screen Shot 2021-08-13 at 9 14 23 PM

API breaking implications

I imagine it would be best to avoid breaking the existing behavior of merge, join, concat, etc. on IntervalArrays. I could see either creating new functions (like pd.merge_asof) or adding either additional parameters to the various methods or additional options to existing parameters (such as how).

Describe alternatives you've considered

I've considered just writing my own functions on that use IntervalArrays to achieve the following and looked at a number of other libraries.

Additional context

Some other related concepts or implementations:

@sterlinm sterlinm added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 14, 2021
@simonjayhawkins simonjayhawkins added Index Related to the Index class or subclasses Interval Interval data type setops union, intersection, difference, symmetric_difference Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 16, 2021
@Hoeze
Copy link

Hoeze commented Aug 24, 2021

Nested containment lists are usually significantly faster than IntervalTree's by an order of magnitude, so I guess it would be cool to have NCLS-based interval joins.

I already played a bit with NCLS, the library that powers pyranges:

def rage_join_idx(df_left, df_right, on, by):
    for idx, df_left_subset in df_left.groupby(by):
        # execute join for every group separately
        # an index on df_right[by] would be helpful here
        df_right_subset = df_right[df_right[by] == idx]
        
        on_start_col, on_end_col = on
        ncls = NCLS(
            df_left_subset[on_start_col].values, 
            df_left_subset[on_end_col].values, 
            np.arange(df_left.shape[0])
        )
        l_idxs, r_idxs = ncls.all_overlaps_both(
            df_right_subset[on_start_col].values,
            df_right_subset[on_end_col].values,
            np.arange(df_right_subset.shape[0])
        )
        
        yield pd.DataFrame({
            "left": df_left_subset.index.iloc[l_idxs], 
            "right": df_right_subset.index.iloc[r_idxs],
        })
        

def rage_join(df_left, df_right, on, by):
    # ensure uniqueness of index, keep index as columns
    df_left = df_left.reset_index()
    df_right = df_right.reset_index()
    
    # concatenate indices of every group in `df_left.groupby(by)`
    idx_df = pd.concat(list(rage_join_idx(df_left, df_right, on, by)), axis=0)
    
    # get indices
    return pd.concat([
        df_left.iloc[idx_df["left"].values],
        df_right.iloc[idx_df["right"].values],
    ], axis=1)

df_vars # chrom, start, end, ref, alt
df_ival # chrom, start, end

# range join on ["chrom", "start", "end"]
rage_join(df_vars, df_ival, on=("start", "end"), by="chrom")

Basically, I get the intersecting indices for every chromosome and then yield the matching rows.
This is certainly a bit slow because of the groupby-for-loop in python, but the idea behind it would be easily adoptable to some NCLS-based IntervalIndex.

Unfortunately, I have no clue how to implement this properly inside Pandas.

@Hoeze
Copy link

Hoeze commented Aug 24, 2021

Another interesting library that seems to be even more efficient than NCLS (did not try that yet):
https://github.com/kylessmith/ailist

@venaturum
Copy link
Contributor

Joins on IntervalIndex and IntervalArray have been implemented in piso as a stop gap until implementation in pandas. It is currently limited to either left-closed or right-closed intervals. Example here.

@sterlinm
Copy link
Author

sterlinm commented Oct 9, 2022

I haven't had any time to work on this but it's still a background interest of mine.

This paper looks relevant for anybody else interested: A Scalable and Generic Approach to Range Joins

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Index Related to the Index class or subclasses Interval Interval data type Reshaping Concat, Merge/Join, Stack/Unstack, Explode setops union, intersection, difference, symmetric_difference
Projects
None yet
Development

No branches or pull requests

4 participants