Skip to content

BUG: Fix is_unique regression for slices of Indexes #57958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 6, 2024

Conversation

rob-sil
Copy link
Contributor

@rob-sil rob-sil commented Mar 22, 2024

Slicing a unique Index will always give another unique Index, so inheriting uniqueness flags is safe and efficient. However, the slice of a non-unique Index can end up with unique elements. Inheriting in the non-unique case caused the regression so this PR changes the code to just inherit when the base Index is marked as unique.

@rob-sil rob-sil marked this pull request as ready for review March 22, 2024 13:30
@rob-sil rob-sil requested a review from WillAyd as a code owner March 22, 2024 13:30
@rob-sil
Copy link
Contributor Author

rob-sil commented Mar 26, 2024

A note or question for reviewers: With this extra code, is pre-computing uniqueness and monotonicity for indexes still a performance boost?

I'm still seeing gains for calling is_monotonic_increasing on a sliced Index, almost as large as in #51738. However, pre-computing slows down slicing when neither is_unique nor is_monotonic_increasing is called on the resulting Index. With no benefit, pre-computing is a performance decrease. I'd expect that slicing data frames and series is more common than calling is_monotonic_increasing on a sliced index, but is it significantly more common? For a full offset, I think it would have to be more than 500x as frequent.

This PR:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=list(range(1_000_000)))

In [3]: df.index.is_monotonic_increasing
Out[3]: True

In [4]: %timeit df.index[:].is_monotonic_increasing
3.79 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit df[:]
8.23 µs ± 63.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Without pre-computing:

In [5]: %timeit df.index[:].is_monotonic_increasing
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit df[:]
4.75 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

@Aloqeely
Copy link
Member

But it only slows it down by a large margin if is_unique or is_monotonic_increasing were called on the original index, right?
Thoughts @mroeschke?

@rob-sil
Copy link
Contributor Author

rob-sil commented Mar 26, 2024

The pre-computing code should run anytime Index has generated its _engine, which can come from a couple other methods in addition to is_unique and is_monotonic_increasing. I think DataFrame.loc indirectly calls one of these indexer methods.

EDIT: as a quick check, I raised in _update_from_sliced (where the pre-computing happens) and ran part of the test suite. Looks like it can run in other places too, like DataFrame.join.

@rob-sil
Copy link
Contributor Author

rob-sil commented Mar 29, 2024

I think DataFrame.join uses is_monotonic_increasing and is_unique, although I'm not sure if they're in a place to benefit from the pre-computation. On net, the code's impact on efficiency looks pretty complicated and might take awhile to fully count.

To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

assert filtered_index.is_unique

def test_slice_is_montonic(self):
"""Test that is_monotonic resets on slices."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly "resets on slices" after your last commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a docstring and a comment referencing the GitHub issue too much?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function name explains it all, but it's up to you

@Aloqeely
Copy link
Member

To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

Sure, sounds good to me.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 30, 2024
@Aloqeely
Copy link
Member

Aloqeely commented May 7, 2024

Sorry this is not stale, @WillAyd mind having a look? Initially looks good to me.
(I just pinged because you got assigned to this, let me know if I should ping someone else)

@rob-sil
Copy link
Contributor Author

rob-sil commented May 27, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Updated for 3.0.0 now.

@rob-sil
Copy link
Contributor Author

rob-sil commented Jun 4, 2024

Is there anything else I should do to make this PR not stale any more?

@Aloqeely Aloqeely removed the Stale label Jul 20, 2024
@Aloqeely
Copy link
Member

Is there anything else I should do to make this PR not stale any more?

I removed the label. @mroeschke mind taking a look please?

Co-authored-by: Abdulaziz Aloqeely <[email protected]>
index = Index([1, 2, 3, 3])
assert not index.is_monotonic_decreasing

filtered_index = index[2:].copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also test a null slice i.e. index[:]

Co-authored-by: Matthew Roeschke <[email protected]>
@mroeschke mroeschke added the Index Related to the Index class or subclasses label Aug 5, 2024
@mroeschke mroeschke added this to the 3.0 milestone Aug 6, 2024
@mroeschke mroeschke merged commit dd6843d into pandas-dev:main Aug 6, 2024
43 of 49 checks passed
@mroeschke
Copy link
Member

Thanks @rob-sil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: regression, is_unique is incorrect since pandas 2.1.0
4 participants