PERF: remove_unused_levels is very slow #16556
Comments
Look at my comment here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/multi.py#L1314 You are welcome to provide another implementation if you'd like.
Cool, I may take a swing at that. Would a replacement implementation have to produce level lists with the same order as the current implementation, or does that not matter as long as
I think you DO need to produce the same orderings, otherwise things won't be sorted. But I'm not 100% sure this is a strict guarantee. We have some tests which exercise this (we probably need a few more). The before and after have to be equal. So technically your example doesn't meet this test. However, it may be possible to recompute the missing levels, then simply reorder them. (again the internals), not the
Hm. Then I think there's not just a performance issue, but an actual bug. Should I open a second issue or will this one serve for both?
No, these won't compare equal.
I think you're misreading me? I'm comparing the input and output of
* Add a large random test case for remove_unused_levels that failed the previous implementation
* Fix pandas-dev#16556, a performance issue with the previous implementation
* Add inplace functionality
* Always return (if not inplace) at least a view instead of the original index
* Add a large random test case for remove_unused_levels that failed the previous implementation
* Fix pandas-dev#16556, a performance issue with the previous implementation
* Always return at least a view instead of the original index
Code Sample
Problem description
On my laptop, x takes 20 to 40 times as long as y, despite y doing the extra work of sorting the second level and reindexing the series in the process. The outputs, except for the sorting of the second level, are identical. Why is remove_unused_levels so slow?
Expected Output
remove_unused_levels should be at least as fast on large indexes as the reset_index/set_index hack.
Output of pd.show_versions()
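To make the discussion above concrete, here is a minimal pure-Python sketch of what removing unused levels amounts to. It is not the pandas implementation and not the reporter's code sample: the names drop_unused and materialize, and the plain-list representation of levels and codes, are assumptions made for illustration only. Each level is remapped in one pass, and the surviving values keep their original relative order, so the index tuples before and after compare equal.

```python
def drop_unused(levels, codes):
    """Return (new_levels, new_codes) with unreferenced level values removed.

    levels: list of lists, the distinct values of each level.
    codes:  list of lists, one integer position per row for each level
            (-1 marks a missing value and is passed through unchanged).
    Hypothetical helper for illustration; not part of pandas.
    """
    new_levels, new_codes = [], []
    for values, level_codes in zip(levels, codes):
        # Codes that actually occur, kept in their original relative order.
        used = sorted({c for c in level_codes if c >= 0})
        # Map each surviving old code to its new, compacted position.
        remap = {old: new for new, old in enumerate(used)}
        new_levels.append([values[c] for c in used])
        new_codes.append([remap[c] if c >= 0 else -1 for c in level_codes])
    return new_levels, new_codes


def materialize(levels, codes):
    """Expand (levels, codes) into the list of row tuples they represent."""
    n_rows = len(codes[0]) if codes else 0
    return [
        tuple(
            values[level_codes[i]] if level_codes[i] >= 0 else None
            for values, level_codes in zip(levels, codes)
        )
        for i in range(n_rows)
    ]


# Toy index: "b", 10 and 30 are present in the levels but never referenced.
levels = [["a", "b", "c"], [10, 20, 30, 40]]
codes = [[0, 0, 2, 2], [3, 1, 1, 3]]

new_levels, new_codes = drop_unused(levels, codes)
```

Because the surviving values are never reordered, a level that was sorted before stays sorted afterwards, which is the property the comments above are concerned with.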