Skip to content

PERF: pd.merge(on=index) is faster than pd.merge(left_index=True, right_index=True) if index is duplicated #56564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
starhel opened this issue Dec 19, 2023 · 3 comments
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@starhel
Copy link

starhel commented Dec 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd  # 2.1.4 or 2.2.0.dev0+925.gac170fdc35 (current master)

df1 = pd.DataFrame({
    "key": [x // 100 for x in range(1_000_000)],
    "val1": 10
}).set_index("key")

df2 = pd.DataFrame({
    "key": [x // 100 for x in range(200_000, 800_000)],
    "val2": 20
}).set_index("key")

%timeit -r7 -n3 pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
# 2.1.4: 1.57 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.1 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

%timeit -r7 -n3 pd.merge(df1, df2, on=["key"], how="inner")
# 2.1.4: 1.25 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.06 s ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

Installed Versions

Exceptions:
ModuleNotFoundError: No module named 'PySide6'
QtBindingsNotFoundError: No Qt bindings could be found

Prior Performance

Both merges give the same output (verified using pd.testing.assert_frame_equal).

pd.merge(..., on=index)

  • 2.1.4: 20% faster, more stable (14.3 ms vs 247 ms std. dev.)
  • master: average time is almost same, but there's massive regression: it's almost 65% slower than 2.1.4.

This behaviour is observed only when index has duplicated values in both dataframes. The expected behavior is at least the same execution time regardless of the provided arguments.

Master regression may be caused by #56523 (@lukemanley @phofl FYI).

@starhel starhel added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Dec 19, 2023
@phofl
Copy link
Member

phofl commented Dec 21, 2023

cc @lukemanley any idea where this is coming from?

@lukemanley
Copy link
Member

I took a quick look to see if #56523 is the cause and it doesn't seem to be.

@starhel - would you mind posting the timings for your example with sort=True passed? The reason I ask is that
prior to 2.2 many:many joins were always sorting which was a bug fixed in #54611. I realize the keys in your example are already sorted, but it still goes down a different path with sort=True. Given the current implementation, I believe that passing sort=True will improve performance in the case of many:many joins and might explain at least some of the 2.1.4 -> main difference you are seeing.

@starhel
Copy link
Author

starhel commented Dec 22, 2023

It's quite counter-intuitive that returning sorted data is faster than preserving the input order, but that's what the test results show.

%timeit -r7 -n3 pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
# 1.93 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
%timeit -r7 -n3 pd.merge(df1, df2, on=["key"], how="inner")
# 2.03 s ± 31.5 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
%timeit -r7 -n3 pd.merge(df1, df2, left_index=True, right_index=True, how="inner", sort=True)
# 1.2 s ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
%timeit -r7 -n3 pd.merge(df1, df2, on=["key"], how="inner", sort=True)
# 1.31 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

From the test, it also appears that perf issue is fixed on the main branch; merging by index is faster than merging by column. In my opinion, it's worth adding to the documentation that sort=True may be faster (according to my test, only for both duplicated indexes) than sort=False.

@lukemanley lukemanley added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants