ENH: Add Option to Include Array Offset as MultiIndex Level in explode()
#59163
Comments
take
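@chelsea-lin With the small utility below, isn't it always possible to get the column offsets?
import pandas as pd

def get_col_and_row_offsets(s):
    # s is a Series of lists; explode() repeats the original index label for each element
    exploded_df = s.explode(ignore_index=False).to_frame(name='col1')
    # Position of each element within its original list
    exploded_df['col_offset'] = exploded_df.groupby(level=0).cumcount()
    # Label of the row each element came from
    exploded_df['row_offset'] = exploded_df.index
    return exploded_df
This gives the row and column offset of each element.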
Am I missing some edge case here?
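One case worth checking, as a minimal sketch on made-up sample data (the Series `s` and its labels are assumptions for illustration): `explode()` emits a single NaN row for an empty list, so the `cumcount()` approach assigns it offset 0, the same offset a one-element list gets.

```python
import pandas as pd

# Sample data (assumed): row "b" holds an empty list.
s = pd.Series([[10, 20, 30], [], [40]], index=["a", "b", "c"])

exploded = s.explode().to_frame(name="col1")
# Position of each element within its original list.
exploded["col_offset"] = exploded.groupby(level=0).cumcount()

print(exploded)
```

Whether an empty list should get offset 0, NaN, or no row at all is a design choice the thread does not settle.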
@ritwizsinha Thanks for tackling this!
If we need to show the column offsets of every item in the Series/DataFrame, the best case is linear, O(N), where N is the number of items. I don't think we can do better than that if all offsets must be shown. Getting the offset of a single element might be possible in constant time, but that needs more research.
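To make the O(N) bound concrete, here is one possible vectorized sketch that computes every offset in a single pass over the list lengths. It is not the implementation discussed in this thread, and it assumes every row holds a non-empty list (empty lists and NaN would need separate handling, since `explode()` still emits one row for them).

```python
import numpy as np
import pandas as pd

# Sample data (assumed): every row is a non-empty list.
s = pd.Series([[10, 20, 30], [40], [50, 60]])

lengths = s.map(len).to_numpy()          # elements per row
starts = np.cumsum(lengths) - lengths    # flat position where each row begins
# Flat position minus the start of the owning row gives the within-list offset.
offsets = np.arange(lengths.sum()) - np.repeat(starts, lengths)

print(offsets)
```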
I agree that the expected time complexity is likely |
I did some research. There are plenty of ways to add an offset list to the explode API:
Before benchmarking all of these, I think we need to decide whether we want to support this at all.
Thank you for your research!
IMO post-explode, it would be great to have an option that gives a unique index that can be used to recover the original lists. This could be quite useful for joining data from multiple sources, for example.
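As a rough sketch of that round trip using only existing pandas calls (the level names "row" and "offset" are assumptions, not a proposed API): appending the offset to the index makes each (row, offset) pair unique, and grouping on the first level rebuilds the lists.

```python
import pandas as pd

# Sample data (assumed): non-empty lists with string row labels.
s = pd.Series([[10, 20, 30], [40], [50, 60]], index=["a", "b", "c"])

exploded = s.explode()
offset = exploded.groupby(level=0).cumcount()
# (original label, offset) uniquely identifies every exploded element.
exploded.index = pd.MultiIndex.from_arrays(
    [exploded.index, offset], names=["row", "offset"]
)

# Group on the original label to recover the lists.
recovered = exploded.groupby(level="row", sort=False).agg(list)
print(recovered)
```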
Pinging @mroeschke to comment on whether this addition is needed before I further improve my current implementation.
@ritwizsinha It appears that this change will become stale after 30 days without activity. Can we continue it? Regarding your earlier questions, IMO it makes sense to support this functionality in both DataFrames and Series. Additionally, any tests you could provide would be very helpful. I would also love to hear any further suggestions from @mroeschke.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Currently, df.explode() and s.explode() flatten lists/arrays within Series/DataFrames. However, information about the original position of each element within its list is lost. This makes it difficult to:
Proposed Solution:
Introduce a new parameter, offset, to both df.explode() and s.explode().
Example Usage:
Feature Description
Introduce a new parameter, offset, to both df.explode() and s.explode().
.Alternative Solutions
While it's technically possible to infer the offset in some cases, it requires additional steps and assumptions about the data. The offset parameter provides a direct, intuitive solution.
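For comparison, the extra steps look roughly like this today. `explode_with_offset` is a hypothetical helper written for illustration, not part of pandas, and the "offset" level name is an assumption; the proposed parameter would presumably replace it with a single call.

```python
import pandas as pd

def explode_with_offset(df, column):
    """Hypothetical helper (not a pandas API): explode `column` and append
    each element's position in its original list as an extra index level."""
    exploded = df.explode(column)
    offset = exploded.groupby(level=0).cumcount().rename("offset")
    return exploded.set_index(offset, append=True)

# Sample data (assumed).
df = pd.DataFrame({"id": [1, 2], "vals": [["x", "y"], ["z"]]})
out = explode_with_offset(df, "vals")
print(out)
```

The result is indexed by (original row label, offset), which matches the MultiIndex level requested in the issue title.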
Additional Context
No response