BUG: Fix `is_unique` regression for slices of `Index`es #57958

rob-sil · 2024-03-22T04:17:50Z

closes BUG: regression, is_unique is incorrect since pandas 2.1.0 #57911
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

Slicing a unique Index will always give another unique Index, so inheriting uniqueness flags is safe and efficient. However, the slice of a non-unique Index can end up with unique elements. Inheriting in the non-unique case caused the regression, so this PR changes the code to inherit the flag only when the base Index is marked as unique.
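The inheritance rule described above can be sketched without pandas. This is a toy model, not the code in pandas/_libs/index.pyx: `ToyIndex`, its `slice` method, and the `inherit_only_if_unique` switch are hypothetical names invented for illustration.

```python
from __future__ import annotations


class ToyIndex:
    """Hypothetical stand-in for an Index with a cached uniqueness flag."""

    def __init__(self, values, cached_unique: bool | None = None):
        self.values = list(values)
        # None means "not computed yet"; True/False is a cached answer.
        self._cached_unique = cached_unique

    @property
    def is_unique(self) -> bool:
        if self._cached_unique is None:
            self._cached_unique = len(set(self.values)) == len(self.values)
        return self._cached_unique

    def slice(self, key: slice, inherit_only_if_unique: bool = True) -> "ToyIndex":
        parent_flag = self._cached_unique
        if inherit_only_if_unique and parent_flag is not True:
            # A cached False (or no cache) says nothing about the slice,
            # so leave the flag unset and let it be recomputed lazily.
            parent_flag = None
        return ToyIndex(self.values[key], cached_unique=parent_flag)


index = ToyIndex([1, 1, 2, 3])
assert not index.is_unique  # caches False on the parent

buggy = index.slice(slice(1, None), inherit_only_if_unique=False)
fixed = index.slice(slice(1, None), inherit_only_if_unique=True)

print(buggy.is_unique)  # False: stale flag inherited, although [1, 2, 3] is unique
print(fixed.is_unique)  # True: recomputed for the slice
```

A cached `True` is still passed down, so slices of a unique parent keep the shortcut; only the unsafe `False` case falls back to recomputation.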

rob-sil · 2024-03-26T04:40:00Z

A note or question for reviewers: With this extra code, is pre-computing uniqueness and monotonicity for indexes still a performance boost?

I'm still seeing gains for calling is_monotonic_increasing on a sliced Index, almost as large as in #51738. However, pre-computing slows down slicing when neither is_unique nor is_monotonic_increasing is called on the resulting Index. With no benefit, pre-computing is a performance decrease. I'd expect that slicing data frames and series is more common than calling is_monotonic_increasing on a sliced index, but is it significantly more common? For a full offset, I think it would have to be more than 500x as frequent.

This PR:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=list(range(1_000_000)))

In [3]: df.index.is_monotonic_increasing
Out[3]: True

In [4]: %timeit df.index[:].is_monotonic_increasing
3.79 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit df[:]
8.23 µs ± 63.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Without pre-computing:

In [5]: %timeit df.index[:].is_monotonic_increasing
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit df[:]
4.75 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
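As a rough back-of-the-envelope check on the trade-off, the four timings above can be turned into a break-even ratio. This is only a sketch under a simplifying assumption: every plain slice pays the difference between the two `df[:]` timings, and every `is_monotonic_increasing` call on a sliced index saves the difference between the two index timings. The variable names are made up for this illustration.

```python
# Timings copied from the two sessions above, in microseconds.
mono_with_precompute = 3.79
mono_without_precompute = 1.28e3  # 1.28 ms
slice_with_precompute = 8.23
slice_without_precompute = 4.75

# Extra cost paid on every df[:] when flags are pre-computed.
cost_per_slice = slice_with_precompute - slice_without_precompute
# Time saved each time is_monotonic_increasing is read from a sliced index.
saving_per_check = mono_without_precompute - mono_with_precompute

break_even_ratio = saving_per_check / cost_per_slice
print(f"cost per slice:   {cost_per_slice:.2f} µs")
print(f"saving per check: {saving_per_check:.2f} µs")
print(f"break-even: about {break_even_ratio:.0f} plain slices per check")
```

Under this simple model the break-even lands in the same few-hundred range as the estimate above. It ignores index size, other code paths that build the engine, and measurement noise, so it should be read as an order of magnitude only.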

Aloqeely · 2024-03-26T10:24:18Z

But it only slows it down by a large margin if is_unique or is_monotonic_increasing were called on the original index, right?
Thoughts @mroeschke?

rob-sil · 2024-03-26T12:58:20Z

The pre-computing code should run anytime Index has generated its _engine, which can come from a couple other methods in addition to is_unique and is_monotonic_increasing. I think DataFrame.loc indirectly calls one of these indexer methods.

EDIT: as a quick check, I raised an exception in _update_from_sliced (where the pre-computing happens) and ran part of the test suite. Looks like it can run in other places too, like DataFrame.join.
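The "raise and see who gets there" check can be sketched in plain Python. Everything below is hypothetical: `_update_from_sliced_stub`, `slice_index`, and `join_frames` are stand-ins invented for this illustration and are not pandas functions. The idea is just to make the function of interest raise, call a few entry points, and read the call chain from the traceback.

```python
import traceback


def _update_from_sliced_stub():
    """Hypothetical stand-in for the function under investigation."""
    raise RuntimeError("reached _update_from_sliced")


def slice_index():
    # Hypothetical code path that ends up in the stub.
    _update_from_sliced_stub()


def join_frames():
    # Hypothetical higher-level operation that slices internally.
    slice_index()


def plain_lookup():
    # Hypothetical operation that never touches the stub.
    return 42


def callers_that_reach_stub(entry_points):
    """Return the call chain for each entry point that hits the stub."""
    reached = []
    for entry in entry_points:
        try:
            entry()
        except RuntimeError as exc:
            frames = traceback.extract_tb(exc.__traceback__)
            reached.append([frame.name for frame in frames])
    return reached


chains = callers_that_reach_stub([join_frames, plain_lookup])
for chain in chains:
    print(" -> ".join(chain))
```

Only `join_frames` shows up in the result, with the full chain down to the stub, which is the same kind of evidence a failing test run gives when the real function is made to raise.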

rob-sil · 2024-03-29T04:08:54Z

I think DataFrame.join uses is_monotonic_increasing and is_unique, although I'm not sure if they're in a place to benefit from the pre-computation. On net, the code's impact on efficiency looks pretty complicated and might take a while to fully account for.

To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

Aloqeely · 2024-03-30T19:57:02Z

pandas/tests/indexes/test_base.py

+        assert filtered_index.is_unique
+
+    def test_slice_is_montonic(self):
+        """Test that is_monotonic resets on slices."""


Not exactly "resets on slices" after your last commit

Is a docstring and a comment referencing the GitHub issue too much?

I think the function name explains it all, but it's up to you

Aloqeely · 2024-03-30T19:59:36Z

> To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

Sure, sounds good to me.

github-actions · 2024-04-30T00:05:33Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Aloqeely · 2024-05-07T03:27:31Z

Sorry, this is not stale. @WillAyd, mind having a look? Initially looks good to me.
(I just pinged because you got assigned to this, let me know if I should ping someone else)

rob-sil · 2024-05-27T21:52:02Z

> This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Updated for 3.0.0 now.

rob-sil · 2024-06-04T01:23:44Z

Is there anything else I should do to make this PR not stale any more?

Aloqeely · 2024-07-20T20:12:17Z

> Is there anything else I should do to make this PR not stale any more?

I removed the label. @mroeschke mind taking a look please?

mroeschke · 2024-07-22T20:41:04Z

pandas/tests/indexes/test_base.py

+        index = Index([1, 2, 3, 3])
+        assert not index.is_monotonic_decreasing
+
+        filtered_index = index[2:].copy()


Could you also test a null slice i.e. index[:]
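The values in the diff hunk above show why a cached "not monotonic" answer cannot be passed to a slice, and why the null slice is a useful extra case. The sketch below uses plain lists and two helper functions written for this illustration (they mirror the non-strict meaning of monotonic, but they are not the pandas implementation).

```python
def is_monotonic_increasing(values) -> bool:
    # Non-strict: equal neighbours are allowed.
    return all(a <= b for a, b in zip(values, values[1:]))


def is_monotonic_decreasing(values) -> bool:
    # Non-strict: equal neighbours are allowed.
    return all(a >= b for a, b in zip(values, values[1:]))


values = [1, 2, 3, 3]
cases = [
    ("[2:]", slice(2, None)),            # proper slice
    ("[:]", slice(None)),                # null slice, same elements
    ("[::-1]", slice(None, None, -1)),   # reversed slice
]

results = {}
for label, key in cases:
    part = values[key]
    results[label] = (is_monotonic_increasing(part), is_monotonic_decreasing(part))
    print(label, part, results[label])
```

The parent is increasing but not decreasing. The proper slice `[3, 3]` is both, so inheriting the parent's "not decreasing" answer would be wrong; the null slice keeps the parent's answers unchanged; the reversed slice swaps them.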

mroeschke · 2024-08-06T01:06:36Z

Thanks @rob-sil

rob-sil added 2 commits March 21, 2024 21:07

Fix is_unique for slices of Indexes

e14f3e5

Merge branch 'main' into index-is_unique-slice

78bca47

rob-sil marked this pull request as ready for review March 22, 2024 13:30

rob-sil requested a review from WillAyd as a code owner March 22, 2024 13:30

Aloqeely reviewed Mar 22, 2024

pandas/_libs/index.pyx (outdated, resolved)

doc/source/whatsnew/v2.2.2.rst (outdated, resolved)

rob-sil and others added 3 commits March 24, 2024 15:50

Update doc/source/whatsnew/v2.2.2.rst

360aa3b

Co-authored-by: Abdulaziz Aloqeely <[email protected]>

Handle monotonic on slices

4ab184b

Restore and fix monotonic code

3fb2b4f

Merge branch 'main' into index-is_unique-slice

4916d5c

Aloqeely reviewed Mar 30, 2024

Update docstring and comment for test

512ba5c

Aloqeely mentioned this pull request Apr 29, 2024

BUG: regression, is_unique is incorrect since pandas 2.1.0 #57911

Closed

3 tasks

github-actions bot added the Stale label Apr 30, 2024

rob-sil added 2 commits May 27, 2024 14:03

Merge branch 'main' into index-is_unique-slice

06e21ab

Update whatsnew for v3.0.0

8aff236

Aloqeely removed the Stale label Jul 20, 2024

Update pandas/_libs/index.pyx

1d68e8e

Co-authored-by: Abdulaziz Aloqeely <[email protected]>

mroeschke reviewed Jul 22, 2024

pandas/_libs/index.pyx (outdated, resolved)

mroeschke reviewed Jul 22, 2024

Update pandas/_libs/index.pyx

3f5bdca

Co-authored-by: Matthew Roeschke <[email protected]>

rob-sil and others added 2 commits August 4, 2024 14:22

Add test for a null slice

8a2fd84

Merge branch 'main' into index-is_unique-slice

dc2ce58

mroeschke added the Index Related to the Index class or subclasses label Aug 5, 2024

mroeschke approved these changes Aug 6, 2024

mroeschke added this to the 3.0 milestone Aug 6, 2024

mroeschke merged commit dd6843d into pandas-dev:main Aug 6, 2024
43 of 49 checks passed
