ENH: Dataframe metadata preservation on join operation #47238

Closed
ajcost opened this issue Jun 5, 2022 · 1 comment
Labels
Duplicate Report (Duplicate issue or pull request), Enhancement, metadata (_metadata, .attrs)

Comments

ajcost commented Jun 5, 2022

Is your feature request related to a problem?

DataFrame metadata is fragile: it is lost on joins and other manipulations, whereas similar data manipulation packages such as xarray offer options to preserve metadata across these operations (https://xarray.pydata.org/en/stable/getting-started-guide/faq.html?highlight=keep_attrs#what-is-your-approach-to-metadata). I realize that attrs is still experimental (the docs say "attrs is experimental and may change without warning."), but I think improving this functionality would greatly improve pandas and allow for better documentation and data storage, especially when other, even less mature, data packages are beginning to implement solid metadata handling.

Example reproducible code below:

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
df.attrs = {
    "created" : "now",
    "type" : "test"
}

df.attrs # {'created': 'now', 'type': 'test'}

df.join(other, lsuffix='_caller', rsuffix='_other').attrs # {}

Describe the solution you'd like

After the join, the metadata is empty. There should be a flag for metadata handling with the options left, right, combine, and None, perhaps with None as the default.

I realize that there is currently an open issue, #28283, that may be related. Happy to help on this also! But I was hoping .join and concat could be added to the list. There is the comment:

We don't have a good sense for what should happen to attrs when there are
multiple NDFrames involved with differing attrs (e.g. in concat). The safest
approach is to probably drop the attrs when they don't match, but this will
need some thought.
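The "drop the attrs when they don't match" rule quoted above can be sketched in user code. This is only an illustration under stated assumptions: concat_keep_matching_attrs is a hypothetical helper name, not a pandas function, and "match" is assumed to mean plain dict equality of the attrs.

```python
import pandas as pd


def concat_keep_matching_attrs(frames, **kwargs):
    """Hypothetical helper: concat, keeping attrs only if all inputs agree."""
    frames = list(frames)
    result = pd.concat(frames, **kwargs)
    first = frames[0].attrs
    if all(frame.attrs == first for frame in frames[1:]):
        # Every input carries the same attrs, so it is safe to keep them.
        result.attrs = dict(first)
    else:
        # Differing attrs: drop them rather than guess which one wins.
        result.attrs = {}
    return result


a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"x": [5, 6]})
a.attrs = {"unit": "m"}
b.attrs = {"unit": "m"}
c.attrs = {"unit": "km"}

same = concat_keep_matching_attrs([a, b], ignore_index=True)
mixed = concat_keep_matching_attrs([a, c], ignore_index=True)
print(same.attrs)   # {'unit': 'm'}
print(mixed.attrs)  # {}
```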

I think the options should be left, right, combine, and None (default None). Drop the attrs if the flag is None, or if it is combine but the attrs cannot be combined given their types or values across the DataFrames. In the latter case, print a warning telling the user that the attrs were not combinable across the DataFrames and were therefore dropped.
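A minimal sketch of how the proposed flag could behave, written as a user-level helper rather than a change to pandas. The names resolve_attrs and join_with_attrs and the attrs parameter are hypothetical and not part of the pandas API. "Combinable" is assumed here to mean that any keys shared by both frames hold equal values, and the values are assumed to support a plain `!=` comparison.

```python
import warnings

import pandas as pd


def resolve_attrs(left_attrs, right_attrs, policy=None):
    """Hypothetical helper: choose the attrs of a join result by policy."""
    if policy is None:
        return {}
    if policy == "left":
        return dict(left_attrs)
    if policy == "right":
        return dict(right_attrs)
    if policy == "combine":
        shared = left_attrs.keys() & right_attrs.keys()
        conflicts = [k for k in shared if left_attrs[k] != right_attrs[k]]
        if conflicts:
            # Not combinable: warn and drop, as the proposal describes.
            warnings.warn(f"attrs conflict on keys {conflicts!r}; dropping attrs")
            return {}
        return {**left_attrs, **right_attrs}
    raise ValueError(f"unknown attrs policy: {policy!r}")


def join_with_attrs(left, right, attrs=None, **kwargs):
    """Hypothetical wrapper around DataFrame.join with an attrs policy."""
    result = left.join(right, **kwargs)
    result.attrs = resolve_attrs(left.attrs, right.attrs, attrs)
    return result


df = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})
df.attrs = {"created": "now", "type": "test"}
other.attrs = {"source": "example"}

kept = join_with_attrs(df, other, attrs="left",
                       lsuffix="_caller", rsuffix="_other")
merged = join_with_attrs(df, other, attrs="combine",
                         lsuffix="_caller", rsuffix="_other")
print(kept.attrs)    # {'created': 'now', 'type': 'test'}
print(merged.attrs)  # {'created': 'now', 'type': 'test', 'source': 'example'}
```

Keeping the policy logic in a function that only sees two dicts means the same rule could be reused for concat or merge.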

API breaking implications

I don't believe it would break anything else in the API.

Describe alternatives you've considered

Of course, you can easily work around this by hand by saving the attrs and restoring them after the join:

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
df.attrs = {
    "created" : "now",
    "type" : "test"
}

save_attrs = df.attrs # {'created': 'now', 'type': 'test'}

new_df = df.join(other, lsuffix='_caller', rsuffix='_other')

new_df.attrs = save_attrs

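However, this is kind of ugly. You could also write your own join function:

def join_with_attrs(left, right, **kwargs):
    # Remember the caller's attrs before joining
    attrs = left.attrs
    new_df = left.join(right, **kwargs)
    # Restore the attrs on the result and hand it back
    new_df.attrs = attrs
    return new_df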

However, this is also kind of ugly, and the documentation for the join method doesn't automatically propagate to this function. You have to re-document the possible parameters in **kwargs, whereas editors like VSCode will often pop up the docs while you're typing the join method.
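One partial mitigation for the documentation concern, sketched under assumptions: the wrapper can reuse the docstring of DataFrame.join at runtime, so help() shows the original parameter descriptions. Whether an editor displays a docstring assigned this way depends on the editor and is not guaranteed. The wrapper name and the added summary line are illustrative only.

```python
import pandas as pd


def join_with_attrs(left, right, **kwargs):
    result = left.join(right, **kwargs)
    # Carry over the left frame's attrs onto the joined result.
    result.attrs = dict(left.attrs)
    return result


# Prepend a short summary to the docstring pandas ships for DataFrame.join,
# so the parameters do not need to be documented a second time.
join_with_attrs.__doc__ = (
    "Like DataFrame.join, but the result keeps the attrs of `left`.\n\n"
    "Original DataFrame.join documentation:\n"
    + (pd.DataFrame.join.__doc__ or "")
)

df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})
df.attrs = {"created": "now", "type": "test"}

joined = join_with_attrs(df, other, lsuffix="_caller", rsuffix="_other")
print(joined.attrs)  # {'created': 'now', 'type': 'test'}
```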

ajcost added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jun 5, 2022
phofl (Member) commented Jun 6, 2022

Hi,

thanks for your report. This is a duplicate of #28283.

phofl closed this as completed Jun 6, 2022
phofl added the Duplicate Report (Duplicate issue or pull request) and metadata (_metadata, .attrs) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Jun 6, 2022