Skip to content

PERF: enhance MultiIndex.remove_unused_levels when no levels are unused #20703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Apr 15, 2018

This uses np.bincount rather than algos.unique to check if there are unused levels: since levels indexes are adjacent integers, the former is faster (~50% in my tests). It then fallbacks to algos.unique if there are unused levels.

The result is that asv tests on --bench reshape give

       before           after         ratio
     [d5d5a718]       [499ae6bf]
-     1.43±0.01ms      1.19±0.01ms     0.83  reshape.SparseIndex.time_unstack

There is a penalty (of the same order of magnitude) for cases in which there are unused levels, but I think this is worth because:

  • there is room for cythonizing and bringing this penalty close to zero in the average case (see comment in code)
  • I think the case of no unused level is much more frequent
  • if people want to avoid this penalty they can just clean their index from unused levels once for all
  • in any case, asv tests on --bench multiindex_object (where the index in Duplicates always has unused levels) do not find any change in performance

@codecov
Copy link

codecov bot commented Apr 15, 2018

Codecov Report

Merging #20703 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20703      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49277    49279       +2     
==========================================
  Hits        45257    45257              
- Misses       4020     4022       +2
Flag Coverage Δ
#multiple 90.22% <100%> (-0.01%) ⬇️
#single 41.9% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.06% <100%> (ø) ⬆️
pandas/util/testing.py 84.38% <0%> (-0.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5d5a71...c61dd9a. Read the comment docs.

@jreback jreback added Performance Memory or execution speed performance MultiIndex labels Apr 15, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 15, 2018
@jreback jreback merged commit e1e1e54 into pandas-dev:master Apr 15, 2018
@jreback
Copy link
Contributor

jreback commented Apr 15, 2018

thanks @toobaz I closed the issue, but I suppose can revisit a re-factorting of unstack and/or cythonize removing unused levels at some point.

@toobaz toobaz deleted the remove_unused_level_with_bincount_19289 branch April 15, 2018 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unstack performance regression
2 participants