-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Improve performance of CategoricalIndex.is_unique #21107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Improve performance of CategoricalIndex.is_unique #21107
Conversation
7cc1a45
to
6071248
Compare
Codecov Report
@@ Coverage Diff @@
## master #21107 +/- ##
=======================================
Coverage 91.83% 91.83%
=======================================
Files 153 153
Lines 49498 49498
=======================================
Hits 45458 45458
Misses 4040 4040
Continue to review full report at Codecov.
|
@@ -581,6 +581,16 @@ def test_is_monotonic(self, data, non_lexsorted_data): | |||
assert c.is_monotonic_increasing | |||
assert not c.is_monotonic_decreasing | |||
|
|||
@pytest.mark.parametrize('unique, not_unique', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be slightly cleaner to do the parametrize like:
@pytest.mark.parametrize('values, expected', [([1, 2, 3], True), ...])
Then just have one assert
in the test:
assert ci.is_unique is expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is cleaner. I've changed it.
6071248
to
0b7309e
Compare
(list('aba'), False)]) | ||
def test_is_unique(self, values, expected): | ||
ci = CategoricalIndex(values) | ||
assert ci.is_unique == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use is
instead of ==
? A little bit of paranoia on my end, but it'd guard against the unlikely event that is_unique
mistakenly starts returning equivalent non-booleans (e.g. 0/1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've changed that.
0b7309e
to
06aef54
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an asv for this? if not can you add (ideally for all index types)
So this PR improves first-time calls to |
how does this diminish the value of an asv? sure if the subsequent calls are cached the mean time is lower but this should still show an asv improvement yes? |
My impression has been that these neasurement are done in a “best of 3” style (like %timeit). For such tests the uncached version always loses. Haven’t looked into how asv does it, so I could be mistaken here. |
It seems to me that ASV does "best of X" style of testing. from the docs:
I could add the timing tests, but I assume these improvement is not seen well in repeated calls to |
we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)
|
Those two are in a setup method, so timing are not done on them: class DuplicatedUniqueIndex(object):
goal_time = 0.2
def setup(self):
N = 10**5
self.idx_int_dup = pd.Int64Index(np.arange(N * 5))
# cache is_unique
self.idx_int_dup.is_unique
def time_duplicated_unique_int(self): So currently there is no tests for |
hmm ok. can you open a new issue to create asv's in some form or another for this (and is_unique in general). w/o asv's someone will ineveitable change something which breaks this. maybe should just temporary turn cached_properties off with a context manager when running them. |
Is there currently a context manager or option for this in Pandas? |
no i think would need to create one for purposes of measuring cached properties |
(cherry picked from commit 9f95f7d)
(cherry picked from commit 9f95f7d)
CategoricalIndex.is_unique
creates an extraneous boolean array. By changingCategoricalIndex.is_unique
to useCategoricalIndex._engine.is_unique
instead, this array creation is avoided. We simultaneously get to setis_monotonic*
for free, and therefore will save time, if that property is called later.Demonstration
Setup:
Currently,
ci.is_unique
is about the same (disregarding@readonly_cache
) as:Notice that the
duplicated_int64()
creates an boolean array, which is not needed and slows the operation down.If we instead use
ci._engine.is_unique
to check for uniqueness, the check is roughly similar to:This is faster than the other version, as the intermediate boolean array is not created in this version. Also, is it (IMO) more idiomatic, as
index._engine
is in general supposed to be used for this kind of index content checks.