ENH: Add Option to Include Array Offset as MultiIndex Level in `explode()` #59163

chelsea-lin · 2024-07-01T19:22:04Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Currently, df.explode() and s.explode() flatten lists/arrays within Series/DataFrames. However, information about the original position of each element within its list is lost. This makes it difficult to:

Easily access specific sub-values after exploding.
Reconstruct the original nested structure if needed.

Proposed Solution:
Introduce a new parameter, offset, to both df.explode() and s.explode().

Example Usage:

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode() # <- Current behavior:
0         1
0         2
0         3
1       foo
2       NaN
3         3
3         4
dtype: object

>>> s.explode(offset=True) # <- With proposed feature
0  1         1
   2         2
   3         3
1  1       foo
2  1       NaN
3  1         3
   2         4
dtype: object

Feature Description

Introduce a new parameter, offset, to both df.explode() and s.explode().

def explode(self, ..., offset: bool = False):  # Default to False for backward compatibility
    """
    Parameters:
        ...
        offset: If True, include the original array offset as a level in the resulting MultiIndex.
    """

Alternative Solutions

While it's technically possible to infer the offset in some cases, it requires additional steps and assumptions about the data. The offset parameter provides a direct, intuitive solution.
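For reference, the inference mentioned above can be sketched with existing pandas calls. This is only an illustration: `explode_with_offset` and its `start` argument are hypothetical names, not part of pandas, and the sketch assumes the original index is unique. It uses 1-based offsets to match the example output in the problem description.

```python
import pandas as pd


def explode_with_offset(s: pd.Series, start: int = 1) -> pd.Series:
    """Hypothetical helper emulating the proposed ``offset=True`` behavior."""
    exploded = s.explode()
    # Position of each element within its source row.
    # Assumption: the original index labels are unique.
    offset = exploded.groupby(level=0).cumcount() + start
    exploded.index = pd.MultiIndex.from_arrays(
        [exploded.index, offset.to_numpy()]
    )
    return exploded


s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])
result = explode_with_offset(s)
```

The extra `groupby` pass and the index rebuild are the "additional steps" that a built-in parameter would hide from the user.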

Additional Context

No response


ritwizsinha · 2024-07-02T16:38:05Z

take

ritwizsinha · 2024-07-02T17:13:06Z

@chelsea-lin With the small utility below, isn't it always possible to get the column offsets?

import pandas as pd

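def get_col_and_row_offsets(s):
    # Expects a Series: explode() repeats each index label once per element.
    exploded_df = s.explode(ignore_index=False).to_frame(name='col1')
    # 0-based position of each element within its source row.
    exploded_df['col_offset'] = exploded_df.groupby(level=0).cumcount()
    # Index label of the source row.
    exploded_df['row_offset'] = exploded_df.index

    return exploded_df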

This gives the following result:

  col1  col_offset  row_offset
0    1           0           0
0    2           1           0
0    3           2           0
1  foo           0           1
2  NaN           0           2
3    3           0           3
3    4           1           3

Am I missing some edge case here?
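One edge case that may be worth checking is a non-unique index. Grouping on the index label treats rows that share a label as a single group, so the running count does not restart at the second row. The sketch below shows this and one possible way around it by grouping on positional row numbers. The names `naive`, `row_id`, and `by_position` are illustrative, and the length calculation assumes list cells, with scalars and empty lists each producing one output row.

```python
import numpy as np
import pandas as pd

# Two rows share the label "a".
s = pd.Series([[1, 2], [3]], index=["a", "a"])
exploded = s.explode()

# Grouping on the index label merges both rows into one group.
naive = exploded.groupby(level=0).cumcount().tolist()  # [0, 1, 2]

# Grouping on the positional row number keeps the rows apart.
# Assumption: scalars and empty lists each yield exactly one output row.
lengths = np.array([max(len(v), 1) if isinstance(v, list) else 1 for v in s])
row_id = np.repeat(np.arange(len(s)), lengths)
by_position = exploded.groupby(row_id).cumcount().tolist()  # [0, 1, 0]
```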

chelsea-lin · 2024-07-02T17:31:33Z

@ritwizsinha Thanks for tackling this!
You've got it right. The DataFrame/Series relies (implicitly) on its index and offset for ordering. The get_col_and_row_offsets function is the alternative solution. However, it could be expensive on larger datasets. That's why I'm curious whether explode could provide the offset directly, potentially with better performance.

ritwizsinha · 2024-07-03T06:29:26Z

If we need to show the column offsets of every item in the Series/DataFrame, the best case is linear time, O(N), where N is the number of items. I don't think we can do better than that if all offsets have to be produced. Getting the offset of a single element might be possible in constant time, but that needs more research.

chelsea-lin · 2024-07-03T18:32:03Z

I agree that the expected time complexity is likely O(N). Intuitively, the difference is that explode(offset=True) would scan the data once, while get_col_and_row_offsets might require two scans. However, I'm not entirely familiar with pandas internals, so further investigation is needed.

ritwizsinha · 2024-07-04T08:04:03Z

I did some research.
The explode function is defined here

There are several ways to add an offset list to the explode API. The Python explode calls reshape.explode, a Cython function that returns the items and the count of items in each row.

1. Calculate the offsets in Cython and return the offset list as well. This would be the most efficient option, but it changes the return type of the function, which is an intrusive change.
2. Recalculate the column offsets in Python, inside the explode function, after getting the values and the per-row item counts. This would be slower but less intrusive.
3. Add a new Cython function that takes the per-row item counts and builds an offset Series from them.
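As a rough illustration of the second option, the offsets can be derived from the per-row counts alone with a few vectorized NumPy operations. This is a sketch, not the pandas implementation: `offsets_from_counts` is a hypothetical name, and it assumes `counts` holds the number of output rows produced by each input row.

```python
import numpy as np


def offsets_from_counts(counts) -> np.ndarray:
    """Hypothetical helper: 0-based within-row offsets from per-row counts."""
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    # Start position of each input row in the flattened output.
    starts = np.cumsum(counts) - counts
    # Global position minus the start of the owning row.
    return np.arange(total, dtype=np.int64) - np.repeat(starts, counts)


# Counts for the example Series: 3 items, 1 scalar, 1 NaN row, 2 items.
offsets = offsets_from_counts([3, 1, 1, 2])
```

This stays O(N) in the number of output rows and avoids a `groupby`, though whether it is fast enough compared with a Cython version would still need benchmarking.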

Before benchmarking any of this, I think we need to decide whether we want to support this at all.

chelsea-lin · 2024-07-08T17:15:48Z

Thank you for your research!
Given the implementation complexity, the second option works for me, especially since the ignore_index option also requires additional data scans. While the second option might perform similarly to the workaround (the get_col_and_row_offsets function), it could be more intuitive for users.
I am not familiar with Cython functions, so this is just my two cents. I'm interested to hear what others with more expertise have to say.

tswast · 2024-07-10T17:01:38Z

> I think we need to ensure that we need to support this or not.

IMO post-explode, it would be great to have an option that gives a unique index that can be used to recover the original lists. This could be quite useful for joining data from multiple sources, for example.
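A minimal sketch of that round trip, using only existing pandas calls, might look like the following. The level names `row` and `offset` are illustrative, and the example assumes a unique original index and list-only cells.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [3, 4]])
exploded = s.explode()

# Attach a 0-based offset level so each element has a unique (row, offset) key.
offset = exploded.groupby(level=0).cumcount()
exploded.index = pd.MultiIndex.from_arrays(
    [exploded.index, offset.to_numpy()], names=["row", "offset"]
)

# Shuffle to simulate rows arriving out of order, e.g. after a join.
shuffled = exploded.sample(frac=1, random_state=0)

# Sorting on (row, offset) restores element order; grouping rebuilds the lists.
restored = shuffled.sort_index().groupby(level="row").agg(list)
```

Note that the round trip is lossy for cells that are not non-empty lists: after exploding, an empty list and a scalar each become a single row, so they come back as one-element lists.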

ritwizsinha · 2024-07-14T13:44:31Z

Pinging @mroeschke to comment on whether this addition is wanted before I further improve my current implementation.
Also, should we support this in both DataFrame and Series?

chelsea-lin · 2024-09-26T21:07:44Z

@ritwizsinha It appears that this change will become stale after 30 days without activity. Can we continue it?

Regarding your earlier questions, IMO it makes sense to support this functionality in both DataFrames and Series. Additionally, any tests you could provide would be very helpful. I would also love to hear from @mroeschke for any further suggestions.
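For the DataFrame case, the same idea can be approximated today as sketched below. The column names `id`, `vals`, and `offset` are made up for the example, and the sketch assumes a unique index and a single exploded column.

```python
import pandas as pd

df = pd.DataFrame({"id": ["x", "y"], "vals": [[1, 2], [3]]})

exploded = df.explode("vals")
# 0-based position of each element within its source row.
exploded["offset"] = exploded.groupby(level=0).cumcount()
# Move the offset into the index as a second level.
exploded = exploded.set_index("offset", append=True)
```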

chelsea-lin added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jul 1, 2024

github-actions bot assigned ritwizsinha Jul 2, 2024

chelsea-lin changed the title ~~ENH: ENH: Add Option to Include Array Offset as MultiIndex Level in explode()~~ ENH: Add Option to Include Array Offset as MultiIndex Level in explode() Jul 2, 2024

ritwizsinha mentioned this issue Jul 12, 2024 in "Add column_offset to explode" #59238 (closed, 5 tasks)
