Skip to content

REF: avoid internals in merge code #48082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 17, 2022

Conversation

jbrockmendel
Copy link
Member

After this, concatenate_block_managers is only used in one place.

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Internals Related to non-user accessible pandas implementation labels Aug 16, 2022
@mroeschke mroeschke added this to the 1.5 milestone Aug 17, 2022
@mroeschke mroeschke merged commit 5d115bb into pandas-dev:main Aug 17, 2022
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the ref-concat-internals-ax0 branch August 17, 2022 00:59
@jorisvandenbossche
Copy link
Member

I haven't yet investigated why, but this has broken a lot of tests in geopandas' CI against pandas main (so if we want to do an RC tomorrow, the easiest might be to revert this for now, since it's only a refactor?)

Will try to make a minimal reproducer

@jorisvandenbossche
Copy link
Member

I can't reproduce this without geopandas, but the issue boils down to that merge now calls pd.concat (and thus _constructor / __finalize__) before adding suffices for duplicated column names.
So maybe this might be sufficient:

--- a/pandas/core/reshape/merge.py
+++ b/pandas/core/reshape/merge.py
@@ -763,9 +763,12 @@ class _MergeOperation:
         right.index = join_index
 
         from pandas import concat

+        left.columns = llabels
+        right.columns = rlabel
         result = concat([left, right], axis=1, copy=copy)
-        result.columns = llabels.append(rlabels)
         return result

One reproducer with geopandas is:

import geopandas
df1 = geopandas.GeoDataFrame({"key": [1, 2, 3]}, geometry=geopandas.points_from_xy([1, 2, 3], [1, 2, 3]))
df2 = geopandas.GeoDataFrame({"key": [2, 3, 4]}, geometry=geopandas.points_from_xy([1, 2, 3], [1, 2, 3]))

df1.merge(df2, on="key")

@jbrockmendel
Copy link
Member Author

I have no objections to the suggested patch. Can you open an issue in geopandas to figure out how they are relying on our internals and ween off of it (xref #40226)

@jorisvandenbossche
Copy link
Member

Can you open an issue in geopandas to figure out how they are relying on our internals and ween off of it (xref #40226)

To be clear, geopandas is not directly relying on the internals (we are not calling an internal method), but we are indirectly relying on the logic here, because we disallow dataframes with duplicate geometry columns (and thus the fact whether the suffixes are applied before or after the concat step matters).

Opened #48416 with the patch proposed above.

@phofl
Copy link
Member

phofl commented Sep 17, 2022

It looks like this caused a performance regression in our merge benchmarks, https://asv-runner.github.io/asv-collection/pandas/#join_merge.JoinEmpty.time_inner_join_right_empty

This is the most relevant commit in the timespan when the regression occurred

@jbrockmendel
Copy link
Member Author

Bummer. Not ideal, but won't be the end of the world if we have to revert.

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* REF: avoid internals in merge code

* fix condition

* use reindex_indexer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants