-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: __contains__ method for Categorical #21022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: __contains__ method for Categorical #21022
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21022 +/- ##
==========================================
+ Coverage 91.85% 91.89% +0.04%
==========================================
Files 153 153
Lines 49549 49585 +36
==========================================
+ Hits 45512 45565 +53
+ Misses 4037 4020 -17
Continue to review full report at Codecov.
|
Your code was good, though use A related improvement would be to look at return key in self.values can be replaced with code_value = self.categories.get_loc(key)
return code_value in self._engine to gain some additional speedup when using >>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n)) # using get_loc
>>> %timeit 'b' in c
4.49 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit 'b' in ci
3,11 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) Also in |
pandas/core/arrays/categorical.py
Outdated
@@ -1846,6 +1846,11 @@ def __iter__(self): | |||
"""Returns an Iterator over the values of this Categorical.""" | |||
return iter(self.get_values().tolist()) | |||
|
|||
def __contains__(self, key): | |||
"""Returns True if `key` is in this Categorical.""" | |||
code_values = self.categories.get_indexer([key]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_loc
is much faster than get_indexer
as it returns a slice object versus a numpy array.
asv_bench/benchmarks/categoricals.py
Outdated
|
||
class Slicing(object): | ||
|
||
def setup(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check that we don't have similar in indexing.py
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1080,6 +1080,7 @@ Performance Improvements | |||
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`) | |||
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`) | |||
- Fixed a performance regression for :func:`GroupBy.nth` and :func:`GroupBy.last` with some object columns (:issue:`19283`) | |||
- Improved performance of :meth:`Categorical.loc` (:issue:`20395`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't exist, rather its indexing on a Series/DataFrame with a CategoricalIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.23.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep it for 0.24.0, so @fjetter you can move it to v0.24.0.txt
asv_bench/benchmarks/categoricals.py
Outdated
|
||
def setup(self): | ||
n = 1 * 10**4 | ||
self.df = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add for a Series as well
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1080,6 +1080,7 @@ Performance Improvements | |||
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`) | |||
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`) | |||
- Fixed a performance regression for :func:`GroupBy.nth` and :func:`GroupBy.last` with some object columns (:issue:`19283`) | |||
- Improved performance of :meth:`Categorical.loc` (:issue:`20395`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.23.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you ensure that we have tests directly for
- basic case (positive and negative)
np.nan
in categorical- key is in the categories but not the codes
Hey @fjetter, This is a very nice PR. Want to take this over the finishing line? I could make the CategoricalIndex case myself, if you want. |
General question, wouldn't simply |
@jorisvandenbossche: in general that wouldn't work, as some categories might not be present in the |
How is that possible? The only one I can think of is NA, so the check could be something like |
Maybe I wasn't communicating clearly, sorry about that. What I meant was this: >>> c = pd.CategoricalIndex('a a'.split(), categories='a b'.split())
>>> 'b' in c
False
>>< 'b' in c.categories
True So to check for membership in categories can give different results than checking membership in the values. Was this not the question from @jorisvandenbossche, or did I misunderstand? |
Oh, of course. Though I wonder if checking |
Ah yes, that's a good reason! |
Exactly this reason actually is a bug in
|
8a77f9e
to
40f1cac
Compare
Sorry, I've been swamped the past few weeks. @TomAugspurger I extended the @jorisvandenbossche Indeed this bug exists but I think it's unrelated to this PR. It's connected to the pandas/pandas/core/indexes/category.py Line 328 in 0c65c57
which appears to be only True for IntervalIndex . Unfortunately I have no idea what this is for and would suggest to fix this in another PR
@jreback You asked me to check whether there are similar benchmarks in The currently implemented benchmarks show a speedup between ~2-1000 depending on the task we're looking at. BEFORE
AFTER
|
pandas/core/arrays/categorical.py
Outdated
"""Returns True if `key` is in this Categorical.""" | ||
if key in self.categories: | ||
return self.categories.get_loc(key) in self.codes | ||
elif isna(key) and self.isna().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is isna
implemented for Categoricals? I'm wondering if it'd be alot faster to instead do elif isna(key) and -1 in self.codes
or if it'd be about the same is your version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this in my #21369 and they seem both to be the same speed and the na-version reads better, so IMO thatis ok to use.
asv_bench/benchmarks/categoricals.py
Outdated
|
||
class Contains(object): | ||
|
||
params = (["a", "c", "d", "z", np.nan], [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob don't need all of these, maybe just a, d, np.nan
if has_nan: | ||
obj_values = [np.nan] + obj_values[:-2] + [np.nan] | ||
|
||
self.ci = pd.CategoricalIndex(obj_values, categories=list("abcd")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are not using then don't assign to self e.g. self.ci
is self.cat used?
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -64,7 +64,7 @@ Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- Improved performance of :func:`Series.describe` in case of numeric dtpyes (:issue:`21274`) | |||
- | |||
- Improved performance of indexing on a Series/DataFrame with a CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number, use double-backticks on CategoricalIndex
Please compare my #21369 to your My version is slightly different: def __contains__(self, key):
hash(key)
if isna(key):
return self.isna().any()
elif self.categories._defer_to_indexing: # e.g. Interval values
loc = self.categories.get_loc(key)
return np.isin(self.codes, loc).any()
elif key in self.categories:
return self.categories.get_loc(key) in self._engine
else:
return False The In the The rest is like your version, except I use |
since we merged @topper-123 other fix, can you confirm this (the correct fix should be in Categorical and CI can just call it) |
@topper-123 Your version doesn't work since there is no @jreback Are you referring to #21025? Not sure how #21025 relates and I couldn't find any other merged PRs related to this. I assume you meant that the CategoricalIndex should just forward the call to the Categorical underneath. I removed any special cases in the |
Tests seem to break because |
@fjetter , Sorry about that, I was also confused by the So I suggest doing this, whích also avoids special-casing Intervals. def __contains__(self, key):
hash(key)
if isna(key):
return self.isna().any()
try:
loc = self.categories.get_loc(key)
except KeyError:
return False
if is_scalar(loc):
return loc in self._codes
else: # if self.categories is IntervalIndex, loc is an array
return any(loc_ in self._codes for loc_ in loc) EDIT: An example where >>> ii = pd.IntervalIndex.from_arrays((0,0,0), (1,1,2))
>>> c = pd.Categorical(ii)
>>> c.categories.get_loc(1)
array([0, 1], dtype=int64) |
closing in favor of #21508 |
Calling
key in Categorical(...)
trivially falls back to calling__iter__
and then forces a full construction of the array by invokingget_values
.This implementation is faster in best case situations than using .e.g.
any(Categorical.isin(key))
since we do not care about the complete array. Not sure if I need to handle any edge cases, though.This obviously can't reach the performance of a simple Index but is a bit faster than before
before:
after
git diff upstream/master -u -- "*.py" | flake8 --diff
cc @topper-123