ENH: Add Option to Include Array Offset as MultiIndex Level in explode() #59163

Open
1 of 3 tasks
chelsea-lin opened this issue Jul 1, 2024 · 10 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@chelsea-lin

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, df.explode() and s.explode() flatten lists/arrays within Series/DataFrames. However, information about the original position of each element within its list is lost. This makes it difficult to:

  • Easily access specific sub-values after exploding.
  • Reconstruct the original nested structure if needed.

Proposed Solution:
Introduce a new parameter, offset, to both df.explode() and s.explode().

Example Usage:

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode() # <- Current behavior:
0         1
0         2
0         3
1       foo
2       NaN
3         3
3         4
dtype: object

>>> s.explode(offset=True) # <- With proposed feature
0  1         1
   2         2
   3         3
1  1       foo
2  1       NaN
3  1         3
   2         4
dtype: object

Feature Description

Introduce a new parameter, offset, to both df.explode() and s.explode().

def explode(self, ..., offset: bool = False):  # Default to False for backward compatibility
    """
    Parameters:
        ...
        offset: If True, include the original array offset as a level in the resulting MultiIndex.
    """

Alternative Solutions

While it's technically possible to infer the offset in some cases, it requires additional steps and assumptions about the data. The offset parameter provides a direct, intuitive solution.
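
For reference, the additional steps mentioned above can be sketched with current pandas. This only emulates the proposed output and is not an implementation of the feature: `explode_with_offset` is a hypothetical helper name, the 1-based offsets simply follow the example output in this issue, and the sketch assumes the input Series has a unique index.

```python
import pandas as pd


def explode_with_offset(s: pd.Series) -> pd.Series:
    """Hypothetical helper: emulate the proposed s.explode(offset=True)."""
    exploded = s.explode()
    # Number the elements that came from the same original row, starting at 1
    # to match the example output in this issue. Assumes s.index is unique.
    offset = exploded.groupby(level=0).cumcount() + 1
    # Pair each original index label with its offset as a two-level index.
    exploded.index = pd.MultiIndex.from_arrays(
        [exploded.index, offset.to_numpy()]
    )
    return exploded


s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])
result = explode_with_offset(s)
print(result)
```

The extra `groupby` pass over the exploded data is the cost that a built-in `offset` option could potentially avoid.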

Additional Context

No response

@chelsea-lin chelsea-lin added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jul 1, 2024
@ritwizsinha
Contributor

take

@ritwizsinha
Contributor

@chelsea-lin with the small utility below, isn't it always possible to get the column offsets?

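import pandas as pd

def get_col_and_row_offsets(s):
    # s is a Series of list-likes; Series.explode() takes no column argument.
    exploded_df = s.explode(ignore_index=False).to_frame(name='col1')
    # Position of each element within its original row.
    exploded_df['col_offset'] = exploded_df.groupby(level=0).cumcount()
    # Index label of the original row.
    exploded_df['row_offset'] = exploded_df.index

    return exploded_df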

This gives the following result:

  col1  col_offset  row_offset
0    1           0           0
0    2           1           0
0    3           2           0
1  foo           0           1
2  NaN           0           2
3    3           0           3
3    4           1           3

Am I missing some edge case here?
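
One case that may be worth checking is a Series whose index is not unique. The sketch below is an illustration under that assumption, not something reported in this thread: grouping by index label merges rows that share a label, so the running count no longer restarts for each original row.

```python
import pandas as pd

# Two distinct rows that share the index label "a".
s = pd.Series([[1, 2], [3]], index=["a", "a"])
exploded = s.explode()

# Grouping by label treats both rows as one group, so the count keeps running.
by_label = exploded.groupby(level=0).cumcount().tolist()
print(by_label)  # [0, 1, 2]

# Resetting the index first gives every original row a unique label,
# so the count restarts at each row boundary.
by_position = (
    s.reset_index(drop=True).explode().groupby(level=0).cumcount().tolist()
)
print(by_position)  # [0, 1, 0]
```

An offset computed inside `explode` from the per-row element counts would not depend on the index labels at all.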

@chelsea-lin
Author

@ritwizsinha Thanks for tackling this!
You've got it right. The DataFrame/Series relies on its index and (implicitly) the offset for ordering. The get_col_and_row_offsets function is the alternative solution. However, it could be expensive on larger datasets. That's why I'm curious whether explode could provide the offset directly, potentially with better performance.

@chelsea-lin chelsea-lin changed the title from "ENH: ENH: Add Option to Include Array Offset as MultiIndex Level in explode()" to "ENH: Add Option to Include Array Offset as MultiIndex Level in explode()" Jul 2, 2024
@ritwizsinha
Contributor

If we need to show the column offsets of every item in the Series/DataFrame, the best case is linear, O(N), where N is the number of items. I don't think we can do better than that if all offsets are needed. Getting the offset of a single element might be possible in constant time, but I need to research that more.

@chelsea-lin
Author

I agree that the expected time complexity is likely O(N). Intuitively, the difference is that explode(offset=True) scans the data once, while get_col_and_row_offsets might require two scans. However, I'm not entirely familiar with pandas internals, so further investigation is needed.

@ritwizsinha
Contributor

ritwizsinha commented Jul 4, 2024

Did some research
The explode function is defined here

There are several ways to add an offset list to the explode API:

  1. The Python explode calls reshape.explode, a Cython function that returns the items and the number of items in each row.
    It would be more efficient to calculate the offsets in Cython and return the offset list as well, but that changes the function's return type, which is an intrusive change.
  2. The other option is to compute the column offsets in Python, inside explode, after we get the values and the per-row item counts. This would be slower but less intrusive.
  3. A third option is to add a new Cython function that takes the per-row item counts and builds the offsets from them.

Before benchmarking all of this, I think we need to ensure that we need to support this or not.
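
As a rough illustration of the second and third options, the offsets can be derived from the per-row counts alone, without touching the exploded values. This is a NumPy sketch rather than the pandas implementation: `offsets_from_counts` is a hypothetical name, and the counts below are the number of output rows per input row in the example from this issue (3, 1, 1, 2).

```python
import numpy as np


def offsets_from_counts(counts: np.ndarray) -> np.ndarray:
    """Hypothetical helper: per-element offsets from per-row element counts."""
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    # Position in the flattened output where each row starts.
    starts = np.cumsum(counts) - counts
    # Global position minus the start of the owning row gives the offset.
    return np.arange(total, dtype=np.int64) - np.repeat(starts, counts)


counts = np.array([3, 1, 1, 2])
offsets = offsets_from_counts(counts)
print(offsets)  # [0 1 2 0 0 0 1]
```

This is still O(N) in the number of output rows, but it avoids a groupby over the exploded result. Whether it is actually faster in practice would need the benchmarking mentioned above.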

@chelsea-lin
Author

Thank you for your research!
Given the implementation complexity, the second option works for me, especially since the ignore_index option also requires additional data scans. While the second option might perform similarly to the workaround (the get_col_and_row_offsets function), it could be more intuitive for users.
I am not familiar with Cython functions, so this is just my two cents. I'm interested to hear what others with more expertise have to say.

@tswast
Contributor

tswast commented Jul 10, 2024

I think we need to ensure that we need to support this or not.

IMO post-explode, it would be great to have an option that gives a unique index that can be used to recover the original lists. This could be quite useful for joining data from multiple sources, for example.
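
A minimal sketch of that round trip, using current pandas and assuming every row holds a non-empty list (the level names `row` and `offset` are illustrative, not part of any pandas API):

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [4], [5, 6]])
exploded = s.explode()

# Build a (row, offset) index so every exploded element has a unique key.
offset = exploded.groupby(level=0).cumcount()
exploded.index = pd.MultiIndex.from_arrays(
    [exploded.index, offset.to_numpy()], names=["row", "offset"]
)
assert exploded.index.is_unique

# Collect the elements of each original row back into a list.
recovered = exploded.groupby(level="row").agg(list)
print(recovered.tolist())  # [[1, 2, 3], [4], [5, 6]]
```

Empty lists and scalars do not round-trip this way: after explode an empty list becomes a single NaN row and a scalar such as 'foo' becomes a single row, so both come back as one-element lists.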

@ritwizsinha
Contributor

Pinging @mroeschke to comment on whether this addition is needed before I further improve my current implementation.
Also, do we support this in both DataFrame and Series?

@chelsea-lin
Author

@ritwizsinha It appears that this change will become stale after 30 days without activity. Can we continue it?

Regarding your earlier question, IMO it makes sense to support this functionality in both DataFrame and Series. Additionally, any tests you could provide would be very helpful. I would also love to hear any further suggestions from @mroeschke.
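
For the DataFrame case, the same idea can be sketched with the existing API. This is only an assumption about what an equivalent result might look like, not a proposed implementation, and the column names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "y"], "vals": [[1, 2], [3]]})
exploded = df.explode("vals")

# Offset of each element within its original row (0-based here).
exploded["offset"] = exploded.groupby(level=0).cumcount().to_numpy()
# Append the offset as a second index level next to the original row label.
exploded = exploded.set_index("offset", append=True)
print(exploded)
```

A test for the feature could compare the output of explode with the new option against a frame built this way.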

3 participants