PERF: join/merge on subset of MultiIndex #48611
pandas/core/reshape/merge.py
Outdated
if dropped_level_name in left.names:
    if lindexer is None:
Why are you pulling this inside the loop? This is harder to read than before
Agreed. We can actually avoid calling take_nd when the indexers are None as that means "take everything". I made another commit which further improves times and should be clearer.
       before           after         ratio
     [a712c501]       [ad3f42b5]
     <main>           <multiindex-join-subset>
-      43.8±1ms         10.5±1ms     0.24  join_merge.JoinMultiindexSubset.time_join_multiindex_subset
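The idea in the reply above (an indexer of None means "take everything", so the take can be skipped) can be sketched roughly as follows. This is only an illustration, not the code in the PR: `take_or_passthrough` is a hypothetical helper, NumPy's `ndarray.take` stands in for the internal `take_nd`, and the handling of `-1` (missing) positions is ignored.

```python
import numpy as np


def take_or_passthrough(values, indexer):
    """Hypothetical helper: reorder `values` by `indexer`, or return them as-is.

    A None indexer is treated as "take everything", so no take is performed.
    """
    if indexer is None:
        # Nothing to reorder: hand back the original array without copying.
        return values
    # Stand-in for the internal take; assumes all positions are valid (no -1).
    return values.take(indexer)


codes = np.array([2, 0, 1, 2])

same = take_or_passthrough(codes, None)                  # skips the take
picked = take_or_passthrough(codes, np.array([3, 0]))    # positions 3 and 0
```

Returning the input unchanged when the indexer is None avoids both the call and the copy it would produce, which is presumably where part of the improvement in the timings comes from.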
LGTM. Merge when ready @phofl
thx @lukemanley
doc/source/whatsnew/v1.6.0.rst file if fixing a bug or adding a new feature.
Existing code passes a range to algos.take_nd. Passing an ndarray is faster.
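A minimal sketch of the range-versus-ndarray point, under stated assumptions: NumPy's `ndarray.take` is used here in place of the internal `algos.take_nd`, and the variable names are made up for illustration. Both calls give the same result; the difference is that a `range` has to be converted to an array before the take, while an integer ndarray can be used directly.

```python
import numpy as np

values = np.array([10, 20, 30, 40])
n = len(values)

# A range is accepted as an indexer, but it must first be converted to an array.
from_range = values.take(range(n))

# Building the positions as an integer ndarray up front avoids that conversion.
positions = np.arange(n, dtype=np.intp)
from_array = values.take(positions)
```

No timings are claimed for this sketch; the measured improvement for the actual change is the asv result reported in the conversation.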