Skip to content

PERF: __contains__ method for Categorical #21022

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented May 13, 2018

Calling key in Categorical(...) trivially falls back to calling __iter__ and then forces a full construction of the array by invoking get_values.
This implementation is faster in best case situations than using .e.g.any(Categorical.isin(key)) since we do not care about the complete array. Not sure if I need to handle any edge cases, though.

This obviously can't reach the performance of a simple Index but is a bit faster than before

before:

· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[100.00%] ··· Running categoricals.Slicing.time_loc_categorical                                         2.04ms

after

· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[100.00%] ··· Running categoricals.Slicing.time_loc_categorical                                       852.58μs

cc @topper-123

@topper-123
Copy link
Contributor

topper-123 commented May 13, 2018

Nice.

wrt #20395: Now I get test result of 4.1 ms as compared to 13.8 ms beforehand. So while this is much better, the test case of pd.Index is at 125 µs, so #20395 is not completely solved...

@codecov
Copy link

codecov bot commented May 13, 2018

Codecov Report

Merging #21022 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21022      +/-   ##
==========================================
+ Coverage   91.85%   91.89%   +0.04%     
==========================================
  Files         153      153              
  Lines       49549    49585      +36     
==========================================
+ Hits        45512    45565      +53     
+ Misses       4037     4020      -17
Flag Coverage Δ
#multiple 90.29% <100%> (+0.04%) ⬆️
#single 41.85% <16.66%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.73% <100%> (+0.03%) ⬆️
pandas/core/reshape/melt.py 97.34% <0%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.8% <0%> (+0.02%) ⬆️
pandas/core/series.py 94.17% <0%> (+0.05%) ⬆️
pandas/core/reshape/pivot.py 97.03% <0%> (+0.05%) ⬆️
pandas/io/formats/style.py 96.12% <0%> (+0.08%) ⬆️
pandas/core/dtypes/missing.py 92.52% <0%> (+0.57%) ⬆️
pandas/plotting/_core.py 83.53% <0%> (+1.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c65c57...a3550e8. Read the comment docs.

@topper-123
Copy link
Contributor

topper-123 commented May 13, 2018

Your code was good, though use get_loc rather than get_indexer as this is faster.

A related improvement would be to look at CategoricalIndex.__contains__. There the line

    return key in self.values

can be replaced with

        code_value = self.categories.get_loc(key)
        return code_value in self._engine

to gain some additional speedup when using CategoricalIndex rather than Categorical:

>>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))  # using get_loc
>>> %timeit 'b' in c
4.49 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit 'b' in ci
3,11 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Also in CategoricalIndex.contains you can replace return key in self.values with return key in self.

@@ -1846,6 +1846,11 @@ def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values().tolist())

def __contains__(self, key):
"""Returns True if `key` is in this Categorical."""
code_values = self.categories.get_indexer([key])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_loc is much faster than get_indexer as it returns a slice object versus a numpy array.

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 15, 2018

class Slicing(object):

def setup(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check that we don't have similar in indexing.py

@@ -1080,6 +1080,7 @@ Performance Improvements
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
- Fixed a performance regression for :func:`GroupBy.nth` and :func:`GroupBy.last` with some object columns (:issue:`19283`)
- Improved performance of :meth:`Categorical.loc` (:issue:`20395`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't exist, rather its indexing on a Series/DataFrame with a CategoricalIndex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.23.1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep it for 0.24.0, so @fjetter you can move it to v0.24.0.txt


def setup(self):
n = 1 * 10**4
self.df = pd.DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add for a Series as well

@@ -1080,6 +1080,7 @@ Performance Improvements
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
- Fixed a performance regression for :func:`GroupBy.nth` and :func:`GroupBy.last` with some object columns (:issue:`19283`)
- Improved performance of :meth:`Categorical.loc` (:issue:`20395`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.23.1

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ensure that we have tests directly for

  • basic case (positive and negative)
  • np.nan in categorical
  • key is in the categories but not the codes

@topper-123
Copy link
Contributor

Hey @fjetter, This is a very nice PR. Want to take this over the finishing line? I could make the CategoricalIndex case myself, if you want.

@jorisvandenbossche
Copy link
Member

, though use get_loc rather than get_indexer as this is faster.

General question, wouldn't simply val in self.categories not even be faster and simpler?

@topper-123
Copy link
Contributor

topper-123 commented Jun 4, 2018

@jorisvandenbossche: in general that wouldn't work, as some categories might not be present in the Categorical/CategoricalIndex...

@TomAugspurger
Copy link
Contributor

as some categories might not be present in the

How is that possible? The only one I can think of is NA, so the check could be something like val in self.categories or isna(val) and self.isna().any()

@topper-123
Copy link
Contributor

Maybe I wasn't communicating clearly, sorry about that. What I meant was this:

>>> c = pd.CategoricalIndex('a a'.split(), categories='a b'.split())
>>> 'b' in c
False
>>< 'b' in c.categories
True

So to check for membership in categories can give different results than checking membership in the values. Was this not the question from @jorisvandenbossche, or did I misunderstand?

@TomAugspurger
Copy link
Contributor

Oh, of course.

Though I wonder if checking val in self.categories is still worthwhile to get a quick no / maybe.

@jorisvandenbossche
Copy link
Member

Ah yes, that's a good reason!

@jorisvandenbossche
Copy link
Member

Exactly this reason actually is a bug in CategoricalIndex.__contains__ for interval categories:

In [14]: intervals = pd.interval_range(1, 4)

In [15]: intervals
Out[15]: 
IntervalIndex([(1, 2], (2, 3], (3, 4]]
              closed='right',
              dtype='interval[int64]')

In [16]: cat = pd.CategoricalIndex(list(intervals[:-1]), categories=intervals)

In [17]: cat
Out[17]: CategoricalIndex([(1, 2], (2, 3]], categories=[(1, 2], (2, 3], (3, 4]], ordered=False, dtype='category')

In [18]: intervals[-1] in cat
Out[18]: True

@fjetter fjetter force-pushed the performance/improve_contains_in_categorical branch from 8a77f9e to 40f1cac Compare June 6, 2018 22:13
@fjetter
Copy link
Member Author

fjetter commented Jun 6, 2018

Sorry, I've been swamped the past few weeks.

@TomAugspurger I extended the CategoricalIndex unit test and extended the benchmarks to cover all combinations you mentioned. Is there anything else you'd want to be covered?

@jorisvandenbossche Indeed this bug exists but I think it's unrelated to this PR. It's connected to the _defer_to_indexing attribute
see

if self.categories._defer_to_indexing:

which appears to be only True for IntervalIndex. Unfortunately I have no idea what this is for and would suggest to fix this in another PR

@jreback You asked me to check whether there are similar benchmarks in indexing.py and I believe the short answer would be yes. However, most tests/benchmark for categorical related cases are in their own module which is why I put it there. I do not have a strong opinion about this topic and would move the code wherever you feel most appropriate

The currently implemented benchmarks show a speedup between ~2-1000 depending on the task we're looking at.

BEFORE

~/workspace/pandas/asv_bench(performance/improve_contains_in_categorical*) » asv run --python=same --show-stderr --bench=categoricals.Indexing
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[ 50.00%] ··· Running categoricals.Indexing.time_loc_df                                                                                   ok
[ 50.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a      3.90±0.7ms   1.95±0.05ms
                  c     3.41±0.02ms    2.37±0.1ms
               ======= ============= =============

[100.00%] ··· Running categoricals.Indexing.time_loc_ser                                                                                  ok
[100.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a     2.43±0.05ms    3.15±0.5ms
                  c      2.85±0.1ms   2.91±0.05ms
               ======= ============= =============

------------------------------------------------------------
~/workspace/pandas/asv_bench(performance/improve_contains_in_categorical*) » asv run --python=same --show-stderr --bench=categoricals.Contains
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[ 50.00%] ··· Running categoricals.Contains.time_cat_isin                                                                                 ok
[ 50.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a     1.20±0.02ms   1.19±0.04ms
                  c     1.54±0.03ms   1.50±0.04ms
                  d     1.74±0.05ms   1.69±0.05ms
                  z     1.78±0.08ms   1.78±0.06ms
                 nan    1.19±0.01ms   1.84±0.04ms
               ======= ============= =============

[100.00%] ··· Running categoricals.Contains.time_contains_index                                                                           ok
[100.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a     1.21±0.05ms   1.20±0.02ms
                  c     1.54±0.02ms   1.64±0.09ms
                  d      1.83±0.3ms   1.76±0.04ms
                  z     1.72±0.05ms   1.71±0.05ms
                 nan    1.25±0.08ms   1.85±0.06ms
               ======= ============= =============

AFTER

~/workspace/pandas/asv_bench(performance/improve_contains_in_categorical*) » asv run --python=same --show-stderr --bench=categoricals.Indexing
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[ 50.00%] ··· Running categoricals.Indexing.time_loc_df                                                                                   ok
[ 50.00%] ····
               ======= ============= ==========
               --              has_nan
               ------- ------------------------
                value       True       False
               ======= ============= ==========
                  a     1.86±0.04ms   734±20μs
                  c     1.75±0.01ms   634±20μs
               ======= ============= ==========

[100.00%] ··· Running categoricals.Indexing.time_loc_ser                                                                                  ok
[100.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a     1.25±0.07ms   1.36±0.08ms
                  c     1.30±0.07ms   1.38±0.07ms
               ======= ============= =============


~/workspace/pandas/asv_bench(performance/improve_contains_in_categorical*) » asv run --python=same --show-stderr --bench=categoricals.Contains
· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Building for existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda2_envs_pandas-dev_bin_python
[ 50.00%] ··· Running categoricals.Contains.time_cat_isin                                                                                 ok
[ 50.00%] ····
               ======= ============= =============
               --                has_nan
               ------- ---------------------------
                value       True         False
               ======= ============= =============
                  a      15.2±0.4μs    15.1±0.1μs
                  c      16.5±0.3μs     15.7±2μs
                  d      17.8±0.8μs    16.3±0.5μs
                  z     1.96±0.07μs   1.98±0.02μs
                 nan     11.5±0.1μs    11.6±0.3μs
               ======= ============= =============

[100.00%] ··· Running categoricals.Contains.time_contains_index                                                                           ok
[100.00%] ····
               ======= ============= ============
               --               has_nan
               ------- --------------------------
                value       True        False
               ======= ============= ============
                  a      17.1±0.4μs   16.3±0.2μs
                  c      17.7±0.5μs   17.9±0.4μs
                  d      18.7±0.6μs   17.4±0.3μs
                  z     3.34±0.06μs   3.37±0.1μs
                 nan     14.3±0.2μs   13.3±0.5μs
               ======= ============= ============

"""Returns True if `key` is in this Categorical."""
if key in self.categories:
return self.categories.get_loc(key) in self.codes
elif isna(key) and self.isna().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is isna implemented for Categoricals? I'm wondering if it'd be alot faster to instead do elif isna(key) and -1 in self.codes or if it'd be about the same is your version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this in my #21369 and they seem both to be the same speed and the na-version reads better, so IMO thatis ok to use.


class Contains(object):

params = (["a", "c", "d", "z", np.nan], [True, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob don't need all of these, maybe just a, d, np.nan

if has_nan:
obj_values = [np.nan] + obj_values[:-2] + [np.nan]

self.ci = pd.CategoricalIndex(obj_values, categories=list("abcd"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are not using then don't assign to self e.g. self.ci

is self.cat used?

@@ -64,7 +64,7 @@ Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of :func:`Series.describe` in case of numeric dtpyes (:issue:`21274`)
-
- Improved performance of indexing on a Series/DataFrame with a CategoricalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number, use double-backticks on CategoricalIndex

@topper-123
Copy link
Contributor

topper-123 commented Jun 7, 2018

Please compare my #21369 to your __contains__ implementation.

My version is slightly different:

    def __contains__(self, key):
        hash(key)
        if isna(key):
            return self.isna().any()
        elif self.categories._defer_to_indexing:  # e.g. Interval values
            loc = self.categories.get_loc(key)
            return np.isin(self.codes, loc).any()
        elif key in self.categories:
            return self.categories.get_loc(key) in self._engine
        else:
            return False

The isna section comes first to guard against passing nans to the elif self.categories._defer_to_indexing case.

In the elif self.categories._defer_to_indexing case, the key may be in several intervals, so loc may be an array. Therefor I used np.isin for the membership check.

The rest is like your version, except I use _engine, which is what gives CategoricalIndex a nice speedup. Comments/reviews welcome.

@jreback
Copy link
Contributor

jreback commented Jun 8, 2018

since we merged @topper-123 other fix, can you confirm this (the correct fix should be in Categorical and CI can just call it)

@fjetter
Copy link
Member Author

fjetter commented Jun 9, 2018

@topper-123 Your version doesn't work since there is no Categorical._engine. The engine does only exist for the CategoricalIndex. I exchanged it with self._codes. The timing is basically the same as for my version but since you added the additional case for the _defer_to_indexing I took your version

@jreback Are you referring to #21025? Not sure how #21025 relates and I couldn't find any other merged PRs related to this. I assume you meant that the CategoricalIndex should just forward the call to the Categorical underneath. I removed any special cases in the CategoricalIndex and with the additional if case @topper-123 suggested, the bug with the IntervalIndex is now also fixed. I believe this is what #21369 is doing, isn't it? I'd hate to compete with another PR. If there are any major issues left and @topper-123 is willing to take care of it, I'll gladly step back since I wouldn't want this PR to blow up any further

@fjetter
Copy link
Member Author

fjetter commented Jun 9, 2018

Tests seem to break because np.isin was only introduced in rather recent numpy versions but CI tests also up to 1.10 where this doesn't exist. Can someone explain to me what the flag _defer_to_indexing is for and why we even require special treatment in this case?

@topper-123
Copy link
Contributor

topper-123 commented Jun 9, 2018

@fjetter , Sorry about that, I was also confused by the _defer_to_indexing. I looked more into it and the underlying issue is that self.categories.get_loc may return a array, if self.categories is a IntervalIndex.

So I suggest doing this, whích also avoids special-casing Intervals.

    def __contains__(self, key):
        hash(key)
        if isna(key):
            return self.isna().any()
        try:
            loc = self.categories.get_loc(key)
        except KeyError:
            return False
        if is_scalar(loc):
            return loc in self._codes
        else:  # if self.categories is IntervalIndex, loc is an array
            return any(loc_ in self._codes for loc_ in loc)

EDIT: An example where self.categories.get_loc returns an array:

>>> ii = pd.IntervalIndex.from_arrays((0,0,0), (1,1,2))
>>> c = pd.Categorical(ii)
>>> c.categories.get_loc(1)
array([0, 1], dtype=int64)

@topper-123
Copy link
Contributor

Hi @fjetter, I’ve gotten merged #21369, which is very similar to your PR. You only need to replace my self._engine with self.codes in yout PR, and your code should work.

Want to take it the last few steps? Otherwise I can take this, it’s not much work remaining.

@jreback
Copy link
Contributor

jreback commented Jun 19, 2018

closing in favor of #21508

@jreback jreback closed this Jun 19, 2018
@jreback jreback added this to the No action milestone Jun 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index
5 participants