Skip to content

PERF: Improve performance of CategoricalIndex.is_unique #21107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2018

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented May 17, 2018

CategoricalIndex.is_unique creates an extraneous boolean array. By changing CategoricalIndex.is_unique to use CategoricalIndex._engine.is_unique instead, this array creation is avoided. We simultaneously get to set is_monotonic* for free, and therefore will save time, if that property is called later.

Demonstration

Setup:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))

Currently, ci.is_unique is about the same (disregarding@readonly_cache) as:

>>> from pandas._libs.hashtable import duplicated_int64
>>> not duplicated_int64(ci.codes.astype('int64')).any()
False
>>> %timeit duplicated_int64(ci.codes.astype('int64')).any()
46.7 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notice that the duplicated_int64() creates an boolean array, which is not needed and slows the operation down.

If we instead use ci._engine.is_unique to check for uniqueness, the check is roughly similar to:

>>> from pandas._libs.algos import is_monotonic_int64
>>> is_monotonic_int64(ci.codes.astype('int64'), False)
(True, False, False)  # (is_monotonic_inc, is_monotonic_dec, is_unique)
>>> %timeit is_monotonic_int64(ci.codes.astype('int64'), False)
23.3 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is faster than the other version, as the intermediate boolean array is not created in this version. Also, is it (IMO) more idiomatic, as index._engine is in general supposed to be used for this kind of index content checks.

@topper-123 topper-123 force-pushed the categorical_index_is_unique branch 2 times, most recently from 7cc1a45 to 6071248 Compare May 17, 2018 19:21
@codecov
Copy link

codecov bot commented May 17, 2018

Codecov Report

Merging #21107 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21107   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files         153      153           
  Lines       49498    49498           
=======================================
  Hits        45458    45458           
  Misses       4040     4040
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.03% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e033c06...06aef54. Read the comment docs.

@@ -581,6 +581,16 @@ def test_is_monotonic(self, data, non_lexsorted_data):
assert c.is_monotonic_increasing
assert not c.is_monotonic_decreasing

@pytest.mark.parametrize('unique, not_unique',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be slightly cleaner to do the parametrize like:

@pytest.mark.parametrize('values, expected', [([1, 2, 3], True), ...])

Then just have one assert in the test:

assert ci.is_unique is expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is cleaner. I've changed it.

@jschendel jschendel added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 17, 2018
@topper-123 topper-123 force-pushed the categorical_index_is_unique branch from 6071248 to 0b7309e Compare May 18, 2018 05:43
(list('aba'), False)])
def test_is_unique(self, values, expected):
ci = CategoricalIndex(values)
assert ci.is_unique == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use is instead of ==? A little bit of paranoia on my end, but it'd guard against the unlikely event that is_unique mistakenly starts returning equivalent non-booleans (e.g. 0/1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've changed that.

@topper-123 topper-123 force-pushed the categorical_index_is_unique branch from 0b7309e to 06aef54 Compare May 18, 2018 18:49
@jreback jreback added this to the 0.23.1 milestone May 19, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an asv for this? if not can you add (ideally for all index types)

@topper-123
Copy link
Contributor Author

topper-123 commented May 19, 2018

is_unique is a cached property, so I don't think an ASV makes sense in this case. (I assume setup in ASV is called once for each time method, but not once for every call to the time method, haven't checked).

So this PR improves first-time calls to is_unique and also makes calling is_unique and is_monotonic* in succesion faster, as both are set simultaneously. So e.g. ci.is_monotonic_increasing and ci.is_unique will be much faster (on the first call, on subsequent calls both are cached).

@jreback
Copy link
Contributor

jreback commented May 19, 2018

how does this diminish the value of an asv?

sure if the subsequent calls are cached the mean time is lower but this should still show an asv improvement yes?

@topper-123
Copy link
Contributor Author

My impression has been that these neasurement are done in a “best of 3” style (like %timeit). For such tests the uncached version always loses.

Haven’t looked into how asv does it, so I could be mistaken here.

@topper-123
Copy link
Contributor Author

It seems to me that ASV does "best of X" style of testing. from the docs:

The timing itself is based on the Python standard library’s timeit module, with some extensions for automatic heuristics shamelessly stolen from IPython’s %timeit magic function

I could add the timing tests, but I assume these improvement is not seen well in repeated calls to is_unique. You want them added still?

@jreback
Copy link
Contributor

jreback commented May 21, 2018

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py:        # cache is_unique
asv_bench/benchmarks//algorithms.py:        self.idx_int_dup.is_unique

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 4, 2018

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py: # cache is_unique
asv_bench/benchmarks//algorithms.py: self.idx_int_dup.is_unique

Those two are in a setup method, so timing are not done on them:

class DuplicatedUniqueIndex(object):

    goal_time = 0.2

    def setup(self):
        N = 10**5
        self.idx_int_dup = pd.Int64Index(np.arange(N * 5))
        # cache is_unique
        self.idx_int_dup.is_unique

     def time_duplicated_unique_int(self):

So currently there is no tests for is_unique in asv_bench, and it probably doesn't make sense IMO, as is_unique is a cached property and timing it will just return the cached value, and not time the speed of the algorithm.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

hmm ok. can you open a new issue to create asv's in some form or another for this (and is_unique in general). w/o asv's someone will ineveitable change something which breaks this.

maybe should just temporary turn cached_properties off with a context manager when running them.

@jreback jreback merged commit 9f95f7d into pandas-dev:master Jun 4, 2018
@topper-123
Copy link
Contributor Author

maybe should just temporary turn cached_properties off with a context manager when running them.

Is there currently a context manager or option for this in Pandas?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

no i think would need to create one for purposes of measuring cached properties

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018
TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
@topper-123 topper-123 deleted the categorical_index_is_unique branch October 27, 2018 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants