Commit e1e1e54

toobaz authored and jreback committed
PERF: enhance MultiIndex.remove_unused_levels when no levels are unused (pandas-dev#20703)
closes pandas-dev#19289
1 parent b16974a commit e1e1e54

2 files changed, +18 -9 lines changed

doc/source/whatsnew/v0.23.0.txt

+1
@@ -901,6 +901,7 @@ Performance Improvements
 - :func:`Series` / :func:`DataFrame` tab completion limits to 100 values, for better performance. (:issue:`18587`)
 - Improved performance of :func:`DataFrame.median` with ``axis=1`` when bottleneck is not installed (:issue:`16468`)
 - Improved performance of :func:`MultiIndex.get_loc` for large indexes, at the cost of a reduction in performance for small ones (:issue:`18519`)
+- Improved performance of :func:`MultiIndex.remove_unused_levels` when there are no unused levels, at the cost of a reduction in performance when there are (:issue:`19289`)
 - Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`)
 - Improved performance of :func:`pandas.core.groupby.GroupBy.rank` (:issue:`15779`)
 - Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`)
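
For readers unfamiliar with the method named in the new whatsnew entry, the following is a minimal sketch of what "unused levels" means. The index contents and variable names (`full`, `subset`, `trimmed`) are illustrative assumptions, not taken from the commit or its tests.

```python
import pandas as pd

# Hypothetical example data: a 2x2 product index.
full = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# Slicing keeps the original level values, even those no longer referenced.
subset = full[:2]                 # only ("a", 1) and ("a", 2) remain
print(list(subset.levels[0]))     # still lists both "a" and "b"

# remove_unused_levels() drops level values that no entry refers to.
trimmed = subset.remove_unused_levels()
print(list(trimmed.levels[0]))    # only "a" is left

# The result compares equal to the original index; only its internal
# level storage differs.
print(trimmed.equals(subset))
```

The case this commit speeds up is the one where nothing needs trimming, such as calling `full.remove_unused_levels()` above.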

pandas/core/indexes/multi.py

+17 -9
@@ -1476,27 +1476,35 @@ def remove_unused_levels(self):
         changed = False
         for lev, lab in zip(self.levels, self.labels):

-            uniques = algos.unique(lab)
-            na_idx = np.where(uniques == -1)[0]
-
-            # nothing unused
-            if len(uniques) != len(lev) + len(na_idx):
+            # Since few levels are typically unused, bincount() is more
+            # efficient than unique() - however it only accepts positive values
+            # (and drops order):
+            uniques = np.where(np.bincount(lab + 1) > 0)[0] - 1
+            has_na = int(len(uniques) and (uniques[0] == -1))
+
+            if len(uniques) != len(lev) + has_na:
+                # We have unused levels
                 changed = True

-                if len(na_idx):
+                # Recalculate uniques, now preserving order.
+                # Can easily be cythonized by exploiting the already existing
+                # "uniques" and stop parsing "lab" when all items are found:
+                uniques = algos.unique(lab)
+                if has_na:
+                    na_idx = np.where(uniques == -1)[0]
                     # Just ensure that -1 is in first position:
                     uniques[[0, na_idx[0]]] = uniques[[na_idx[0], 0]]

                 # labels get mapped from uniques to 0:len(uniques)
                 # -1 (if present) is mapped to last position
-                label_mapping = np.zeros(len(lev) + len(na_idx))
+                label_mapping = np.zeros(len(lev) + has_na)
                 # ... and reassigned value -1:
-                label_mapping[uniques] = np.arange(len(uniques)) - len(na_idx)
+                label_mapping[uniques] = np.arange(len(uniques)) - has_na

                 lab = label_mapping[lab]

                 # new levels are simple
-                lev = lev.take(uniques[len(na_idx):])
+                lev = lev.take(uniques[has_na:])

             new_levels.append(lev)
             new_labels.append(lab)
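
The two-stage logic in this hunk can be sketched outside pandas with NumPy alone. The function names (`used_codes_sorted`, `has_unused`, `compact`) are hypothetical, and `np.unique(..., return_index=True)` stands in for the internal `algos.unique`, which is assumed here to return values in order of first appearance. Labels are shifted by one before counting because `np.bincount` rejects negative input, and `-1` marks missing values.

```python
import numpy as np

def used_codes_sorted(lab):
    """Sorted distinct labels (ints >= -1), found with bincount."""
    lab = np.asarray(lab, dtype=np.intp)
    # Shift so the -1 missing-value marker becomes 0, then shift back.
    return np.where(np.bincount(lab + 1) > 0)[0] - 1

def has_unused(n_levels, lab):
    """Cheap check: is some level value referenced by no label?"""
    uniques = used_codes_sorted(lab)
    has_na = int(len(uniques) and (uniques[0] == -1))
    return len(uniques) != n_levels + has_na

def compact(n_levels, lab):
    """Return (new labels, positions of the level values to keep)."""
    lab = np.asarray(lab, dtype=np.intp)
    if not has_unused(n_levels, lab):
        return lab, np.arange(n_levels)
    # Slow path: distinct labels in order of first appearance
    # (stand-in for the order-preserving unique used in the commit).
    _, first = np.unique(lab, return_index=True)
    uniques = lab[np.sort(first)]
    has_na = int((uniques == -1).any())
    if has_na:
        na_idx = np.where(uniques == -1)[0]
        # Move -1 to the first position.
        uniques[[0, na_idx[0]]] = uniques[[na_idx[0], 0]]
    # Index -1 addresses the last slot, so the mapping array gets one
    # extra element when missing values are present.
    mapping = np.zeros(n_levels + has_na, dtype=np.intp)
    mapping[uniques] = np.arange(len(uniques)) - has_na
    return mapping[lab], uniques[has_na:]

new_lab, keep = compact(4, [2, -1, 0, 2])
print(new_lab, keep)
```

With four level values and labels `[2, -1, 0, 2]`, level positions 1 and 3 are unused, so the slow path runs. When every level value is referenced, only the `bincount` pass is executed, which is the case the commit targets.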
