
Get rid of MultiIndex conversion in IntervalIndex.is_unique #25159


Closed · wants to merge 4 commits
14 changes: 13 additions & 1 deletion — asv_bench/benchmarks/index_object.py

@@ -1,7 +1,7 @@
 import numpy as np
 import pandas.util.testing as tm
-from pandas import (Series, date_range, DatetimeIndex, Index, RangeIndex,
-                    Float64Index)
+from pandas import (Series, date_range, DatetimeIndex, Index, RangeIndex,
+                    Float64Index, IntervalIndex)


 class SetOperations(object):

@@ -181,4 +181,16 @@ def time_get_loc(self):
         self.ind.get_loc(0)

+class IntervalIndexMethod(object):
+    # GH 24813

Contributor: Is this where we asv on is_unique?

+    def setup(self):
+        N = 10**5
+        left = np.append(np.arange(N), np.array(0))
+        right = np.append(np.arange(1, N + 1), np.array(1))
+        self.intv = IntervalIndex.from_arrays(left, right)

+    def time_is_unique(self):
+        self.intv.is_unique

Contributor: Can you add another benchmark with N**3, e.g. a small case.


from .pandas_vb_common import setup # noqa: F401
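As a sanity check on what this benchmark measures: `setup` builds an index whose first pair (0, 1] is repeated at the end, so the index is deliberately non-unique. A minimal sketch of the same construction outside the asv harness:

```python
import numpy as np
import pandas as pd

# Same construction as the benchmark's setup(): arange pairs plus one
# duplicate of the very first interval appended at the end.
N = 10**5
left = np.append(np.arange(N), np.array(0))
right = np.append(np.arange(1, N + 1), np.array(1))
intv = pd.IntervalIndex.from_arrays(left, right)

print(intv[0] == intv[-1])  # True: both are the interval (0, 1]
print(intv.is_unique)       # False: that pair appears twice
```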
1 change: 1 addition & 0 deletions — doc/source/whatsnew/v0.25.0.rst

@@ -177,6 +177,7 @@ Performance Improvements
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)
- Improved performance of :meth:`read_csv` by faster tokenizing and faster parsing of small float numbers (:issue:`25784`)
- Improved performance of :meth:`read_csv` by faster parsing of N/A and boolean values (:issue:`25804`)
- Improved performance of :meth:`IntervalIndex.is_unique` by removing conversion to :class:`MultiIndex` (:issue:`24813`)

.. _whatsnew_0250.bug_fixes:

8 changes: 7 additions & 1 deletion — pandas/core/indexes/interval.py

@@ -463,7 +463,13 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left

Contributor: Why isn't the answer just: self.left.is_unique & self.right.is_unique?

Member (@jschendel, Mar 24, 2019): That's a little too strict; you don't necessarily need left or right uniqueness, only pairwise uniqueness, as you can have duplicate endpoints on one side as long as the other side also isn't the same, e.g. [Interval(0, 1), Interval(0, 2), Interval(0, 3)] should be unique.
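The pairwise-uniqueness point above can be checked directly (a small sketch, assuming a pandas install):

```python
import pandas as pd

# Duplicate left endpoints, but every (left, right) pair is distinct,
# so the index as a whole is unique.
idx = pd.IntervalIndex.from_arrays([0, 0, 0], [1, 2, 3])

print(idx.is_unique)       # True: pairs (0,1], (0,2], (0,3] are distinct
print(idx.left.is_unique)  # False: the left endpoints are all 0
```

This is exactly why `self.left.is_unique & self.right.is_unique` would be too strict: it would report this index as non-unique.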

Contributor: I think you can simply construct a numpy array and then check that via np.unique, by first reshaping.

Contributor Author (@makbigc, Apr 7, 2019): I have tried something like the following:

arr = np.array(list(zip(self.values.left, self.values.right)))
np.unique(arr, axis=0)

But np.unique cannot filter out the duplicate np.nan:

In [2]: arr = np.array([1, 3, 3, np.nan, np.nan])

In [3]: np.unique(arr)
Out[3]: array([ 1.,  3., nan, nan])

I'm still looking for another way.

Contributor: Use pd.unique.
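A quick sketch of the difference the reviewer is pointing at: pd.unique treats NaN values as equal and keeps only one, whereas np.unique historically kept every NaN because NaN != NaN (np.unique only gained an equal_nan option in later NumPy releases, around 1.24):

```python
import numpy as np
import pandas as pd

arr = np.array([1, 3, 3, np.nan, np.nan])

# pd.unique collapses repeated NaNs and preserves order of appearance.
print(pd.unique(arr))  # [ 1.  3. nan]
```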

+        right = self.values.right
+        for i in range(len(self)):
+            mask = (left[i] == left) & (right[i] == right)
+            if mask.sum() > 1:
+                return False
+        return True
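Note that the loop above compares every interval against all others, which is O(n²) in the length of the index. A hypothetical linear-time alternative (not from this PR; names are illustrative) is to track each (left, right) pair in a set, normalizing NaN endpoints to a sentinel so that repeated NaN pairs compare equal:

```python
def pairs_unique(left, right):
    """Return True if no (left, right) endpoint pair repeats.

    Hypothetical O(n) sketch: NaN endpoints are mapped to a sentinel
    string, because NaN != NaN would otherwise hide duplicate pairs.
    """
    seen = set()
    for lv, rv in zip(left, right):
        key = ("nan" if lv != lv else lv,   # x != x is True only for NaN
               "nan" if rv != rv else rv)
        if key in seen:
            return False
        seen.add(key)
    return True

print(pairs_unique([0, 0, 0], [1, 2, 3]))  # True: all pairs distinct
print(pairs_unique([0, 1, 0], [1, 2, 1]))  # False: (0, 1) repeats
```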

@cache_readonly
@Appender(_interval_shared_docs['is_non_overlapping_monotonic']