Skip to content

BUG/PERF: MultiIndex.value_counts returning flat index #49558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 9, 2022

Conversation

lukemanley
Copy link
Member

I believe it is a bug that MultiIndex.value_counts returns a Series indexed by a flat index of tuples rather than a MultiIndex. By returning a flat index, we lose nullable dtypes, index names, etc.

This PR changes MultiIndex.value_counts to return a Series indexed by a MultiIndex.

This also has some perf improvements:

import pandas as pd
import numpy as np

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])

%timeit mi.value_counts()

974 ms ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
182 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <- PR

@lukemanley lukemanley added Performance Memory or execution speed performance MultiIndex labels Nov 6, 2022
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

Can we send this down a different path directly from MultiIndex.value_counts and what happens when bins is not None?
Edit: Missed the normalise part at the end, so forget my comment about sending it down a different path

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending green.

@lukemanley
Copy link
Member Author

lukemanley commented Nov 6, 2022

Two things:

Can we send this down a different path directly from MultiIndex.value_counts and what happens when bins is not None? Edit: Missed the normalise part at the end, so forget my comment about sending it down a different path

I also thought this path made sense since algos.union_with_duplicates calls algos.value_counts and will benefit from the performance improvements. The value_counts calls are currently bottlenecks when calling union_with_duplicates with MultiIndexes.

bins is documented to only work for numeric data. Passing anything other than None for a MultiIndex raises - this PR does not change that behavior. The error message could be improved as it simply raises when trying to construct a Series from a MultiIndex.

@phofl
Copy link
Member

phofl commented Nov 6, 2022

Thx, could you open an issue about improving the error message?

@mroeschke mroeschke added this to the 2.0 milestone Nov 7, 2022
@@ -556,6 +556,7 @@ MultiIndex
- Bug in :meth:`MultiIndex.union` not sorting when sort=None and index contains missing values (:issue:`49010`)
- Bug in :meth:`MultiIndex.append` not checking names for equality (:issue:`48288`)
- Bug in :meth:`MultiIndex.symmetric_difference` losing extension array (:issue:`48607`)
- Bug in :meth:`MultiIndex.value_counts` returning a :class:`Series` indexed by flat index of tuples (:issue:`49558`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in :meth:`MultiIndex.value_counts` returning a :class:`Series` indexed by flat index of tuples (:issue:`49558`)
- Bug in :meth:`MultiIndex.value_counts` returning a :class:`Series` indexed by flat index of tuples instead of a :class:`MultiIndex` (:issue:`49558`)

Nit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks

@lukemanley
Copy link
Member Author

Thx, could you open an issue about improving the error message?

will do

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM merge when ready @phofl

@phofl phofl merged commit 60209e5 into pandas-dev:main Nov 9, 2022
@phofl
Copy link
Member

phofl commented Nov 9, 2022

thx @lukemanley

@lukemanley lukemanley deleted the mi-value-counts branch November 10, 2022 03:18
codamuse pushed a commit to codamuse/pandas that referenced this pull request Nov 12, 2022
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants