Skip to content

method df.merge is resetting the index #51796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
aviadr1 opened this issue Mar 5, 2023 · 3 comments
Open

method df.merge is resetting the index #51796

aviadr1 opened this issue Mar 5, 2023 · 3 comments

Comments

@aviadr1
Copy link

aviadr1 commented Mar 5, 2023

method df.merge will throw away / reset the index of df. I dont think this is desired result.

here's an example:

df = pd.DataFrame({'key' : [1,2,3]}, index=['very','important','index'])
df2 = pd.DataFrame({'key' : [3,2,1], 'data':['c','b','a']}, index=[100,200,300])
df.merge(df2, on='key')

the result will lose the very important index in df and instead will have a generic 0 1 2 index.

   key data
0    1    a
1    2    b
2    3    c

here is a method based on merge that will preseve the original index (which should be the intended result IMHO)

def merge_with_perfect_index(df, df2, on, how):
    original_index_name = df.index.name
    TEMP_INDEX = 'temp_index_aviad'
    return df.rename_axis(TEMP_INDEX).reset_index().merge(df2, on=on, how=how).set_index(TEMP_INDEX).rename_axis(original_index_name)

merge_with_perfect_index(df, df2, on='key', how='left')

which outputs

                    key  data
very           1      a
important      2      b
index          3      c

tested on 1.5.3

@samukweku
Copy link
Contributor

have a look at the docs regarding merge and caveats on index preservation

@aviadr1
Copy link
Author

aviadr1 commented Mar 5, 2023

I see, thanks for the quick reply!
I usually read the reference API docs and havent read through the user guide. this appears rather as a footnote in the user guide, and I found this behavior to be unexpected.

note, that in the example, the join is in practice a 1-to-1 join as the keys are unique.

IMHO having this documented in the merge reference to having a parameter to control this behavior would be great.
for instance a reindexing parameter could be handy. it could have the values:

  • left keep the index from the left dataframe. if this is impossible due to many-on-many then raise error
  • auto - current behavior: keep index if merging on unique indexes, or reset otherwise
  • reset - always reset

having the parameter available in merge would be an opportunity to reference this gotcha

in my ML project, this issue caused training data to leak into the test data because I was relying on indexes to keep them apart.

@quantoid
Copy link

The only way to preserve the left index seems to be to reset_index before the merge and then set_index after, which surely defeats the purpose of the index which is to make joins faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants