-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21699 +/- ##
==========================================
+ Coverage 92.18% 92.19% +<.01%
==========================================
Files 169 169
Lines 50820 50821 +1
==========================================
+ Hits 46850 46852 +2
+ Misses 3970 3969 -1
Continue to review full report at Codecov.
|
so 2) is the prefered solution here. you can simply use the templating like we do in most modules which generates code. its not trivial but not super complicated either. |
@topper-123 what issues were you running into with Cython? As far as generating the indexes goes I would think you could just add the appropriate types below:
Not sure how you were planning to dynamically select the appropriate engine in |
Ok, I think I understand the design now, at least. What is the templating language used and how is the pxi-file generated? So would the correct approach be to:
|
Ok, got algos.common_helper.pxi.in compiling and works fine. index_helper.pxi throws a compiler error. I added to the dtype list in index_helper.pxi.in:
but get this error when compiling: cythoning pandas\_libs/index.pyx to pandas\_libs\index.c
Error compiling Cython file:
------------------------------------------------------------
...
cdef _maybe_get_bool_indexer(self, object val):
cdef:
ndarray[uint8_t, ndim=1, cast=True] indexer
ndarray[intp_t, ndim=1] found
ndarray[int8_t] values
^
------------------------------------------------------------
pandas\_libs\index_class_helper.pxi:179:26: Invalid type. It appears the compiler does not like the int8_t type. Using Any ideas what's happening? |
you need to do the imports of |
Thanks, got everything compiling with Int8Engine, which is good. However, when running, it complains there is no Int8HashTable. Seems like there isn't an easy way to create an Int8HashTable, is that correct? ~\Documents\Python\pandasdev\pandasdev\pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int8Engine._make_hash_table()
212
213 cdef _make_hash_table(self, n):
--> 214 return _hash.Int8HashTable(n)
AttributeError: module 'pandas._libs.hashtable' has no attribute 'Int8HashTable' I tried looking into using a Int64HashTable in Int8Engine, but that seems to bring a different set of tradeoffs... Is there an easy way to create a Int8HashTable, or is it reasonable to use a Int64HashTable in Int8Engine? |
you might need to create the other hash tables (definitions) as well |
fbb711f
to
db5c447
Compare
Ok, I've tried creating the int8 hash table, but got some issues. Everything compiles up till and including " add Int8Index", but that runs into the issue that no Int8HashTable existst. I've tried creating a Int8HashTable (see the commit "add Int8HashTable"), but I can't get to compile. It gives this compilation error:
I've tried messing with khash.pxd to get that kh_*_int8 working, but can't get it running. What does khash.pxd do? It seems to in turn depend on khash_python.h, which only seems to have int64, int32 anf float64 version available. Must that be changed before getting a Int8HashTable working? Guidance would be very much appreciated. |
hmm, i guess khash doesn't support this either. ok, so maybe let's do a precursor PR to do this properly |
Ok, just to confirm, so you/someone else will do the PR to get Int8HashTable etc. working, and I'll wait for that and afterwards I will implement Int8Engine etc? |
@topper-123 well I suppose someone can do a PR to make other types of hashtables work. cc @pandas-dev/pandas-core |
That would be great. If someone makes the hash tables I will take this the rest of the way. |
e28b353
to
6e1adff
Compare
I've made a solution for the problem of not having Int8HashTable, Int16HashTable and Int32HashTable by using Int64HashTable in Int8Engine etc. All the tests pass and performance is the same as above. Is this an ok approach? I could then add some tests and ASVs as needed. A nice side effect BTW is that we now have and can use >>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
# the below is called in ci.is_monotonic etc. in master
>>> %timeit pd._libs.algos.is_monotonic_int64(ci.codes.astype('int64'), False)
21.6 ms # master
# the below is called in ci.is_monotonic etc. in this PR
>>> %timeit pd._libs.algos.is_monotonic_int8(ci.codes, False)
5.6 ms # this PR So all code that calls CategoricalIndex.is_monotonic/.is_unique will benefit also. |
@@ -429,10 +437,11 @@ def get_loc(self, key, method=None): | |||
>>> non_monotonic_index.get_loc('b') | |||
array([False, True, False, True], dtype=bool) | |||
""" | |||
codes = self.categories.get_loc(key) | |||
if (codes == -1): | |||
raise KeyError(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codes can never be -1. If key is not in self.categories, a KeyError was always raised.
@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs): | |||
def _engine(self): | |||
|
|||
# we are going to look things up with the codes themselves | |||
return self._engine_type(lambda: self.codes.astype('i8'), len(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need any changes in cython if you simply so: .astype('i8', copy=False)
, which is the cause of the perf issue, no?
its still not as nice as actually using type specific hashtables though (which this PR is not addressing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried changing that. Still is slow, 14 ms.
In index_class_helper.pxi.in, I also tried changing
cdef _get_index_values(self):
return algos.ensure_{{dtype}}(self.vgetter())
to
cdef _get_index_values(self):
return self.vgetter()
But also slow, 14 ms.
I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then you have something else going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numpy docs say about the copy
parameter of array.astype
:
By default, astype always returns a newly allocated array. If this is set
to false, and the dtype, order, and subok requirements are satisfied,
the input array is returned instead of a copy.
So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so try using algos.ensure_int64(values)
I find this extremely odd that you have to do the if inside cython.
pandas/core/indexes/category.py
Outdated
raise KeyError(key) | ||
return self._engine.get_loc(codes) | ||
code = self.categories.get_loc(key) | ||
# dtype must be same as dtype for codes for searchsorted not lose speed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line between
6e1adff
to
912d81b
Compare
Hey, I don't see any good way forward other than someone commiting to make an Int8/16/32HashTable at some point. The question is thought if this PR needs to wait for that, or can be pulled in now and the relevant lines be changed when the hash table is implemented. I argue pulling this in now, as index operations are critical for performance and that would allow me to pursue a few more index-related performance issues. |
@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs): | |||
def _engine(self): | |||
|
|||
# we are going to look things up with the codes themselves | |||
return self._engine_type(lambda: self.codes.astype('i8'), len(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so try using algos.ensure_int64(values)
I find this extremely odd that you have to do the if inside cython.
912d81b
to
c4f74cb
Compare
A technically possible solution to avoid that if statement while keeping the speed improvements would be to input a int64 array into _engine: def _engine(self):
codes = self.codes.astype('int64')
return self._engine_type(lambda: codes, len(self)) But that would keep an int64 array in memory, in addition to an int8 array, so doesn't make any sense memorywise. We want to just use the existing int8 array. I made another solution, where I've moved this specific implementation to index_class_helper.pxi.in. This makes its location more specific to the problem it solves. |
doc/source/whatsnew/v0.24.0.txt
Outdated
- Improved performance of membership checks in :class:`Categorical` and :class:`CategoricalIndex` | ||
(i.e. ``x in cat``-style checks are much faster). :meth:`CategoricalIndex.contains` | ||
is likewise much faster (:issue:`21369`, :issue:`21508`) | ||
- Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally pls try to avoid changing orthogonal things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how that may be confusing. I'll limit doing this from now on, but this was just a spelling error, that I corrected, as it was adjacent to my text..
pandas/tests/indexes/conftest.py
Outdated
'UInt64', 'UInt32', 'UInt16', 'UInt8', | ||
'Float64', 'Float32', | ||
]) | ||
def num_engine(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this more descriptive, numeric_indexing_engine
class TestCategoricalIndexEngine(object): | ||
|
||
@pytest.mark.parametrize('nbits', [8, 16, 32, 64]) | ||
def test_engine_type(self, nbits): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, this is very complicated. why is this necessary? can you make much more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've made it simpler.
pandas/tests/indexes/test_engine.py
Outdated
|
||
class TestNumericEngine(object): | ||
|
||
@pytest.mark.parametrize('data', [[0, 1, 2]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this parameterized over data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. You start out with a certain structure and forget to check if is sensible for the end product...
pandas/tests/indexes/test_engine.py
Outdated
|
||
@pytest.mark.parametrize('data', [[0, 1, 2]]) | ||
def test_is_monotonic_ordered(self, data, num_engine): | ||
codes = np.array(data, dtype=num_engine._dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you have multiple tests for monotonic? and not monotonic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there just one data to check for. Fixed.
adac8f4
to
9836666
Compare
Changes according to comments. |
Is this ok to merge in? |
ping @jreback |
will have a look |
not sure of the state of thIS PR, needs a rebase |
@jreback : I believe this PR is ready for review see above. However, time as a resource is limited, and this one got overlooked unfortunately. @topper-123 : Sorry for the delay, but can you rebase this? |
9836666
to
1a438a3
Compare
Hello @topper-123! Thanks for updating the PR.
|
Rebased and green. EDIT: ok, the whatsnew disappeared in the rebase. I've added it again, while the code itself unchanged. |
1a438a3
to
6aa94e9
Compare
Hello @topper-123! Thanks for updating the PR.
|
@topper-123 : Thanks! Can you double check if you addressed all comments? IINM, I see a few that might need attention... Feel free to resolve those that you think have completed discussion. |
I am not convinced this is actually a good solution here. I think there was a much simpler version that worked. This is adding quite a bit of code and complexity. I would like to see the performance tests split off into a new PR (and the codes can never be -1) bug fix as a first step. |
Closed in favor of #23235. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This is the the final puzzle for #20395.
The problem with the current implementation is that
CategoricalIndex._engine
constantly recodes int8 arrays to int64 arrays:pandas/pandas/core/indexes/category.py
Lines 365 to 369 in dc45fba
Notice the
lambda: self.codes.astype('i8')
part, which means that every time_engine.vgetter
is called, an int64-array is created. This is very expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.A complicating issue is that it is not enough to just avoid the int64 transformation, as
array.searchsorted
apparantly needs a dtype-compatible input or else it is also very slow:Solution options
As CategoricalIndex.codes may be int8, int16, etc, the solution must be to (1) have an indexing engine for each integer dtype or an indexing engine that accepts all int types, not just int64 and (2) that the key must be translated into the same dtype as the codes array before calling searchsorted. So either:
algos.is_monotonic_int64
for checking monotonicity)I assume option 1 is not desired, and option 3 assumedly likewise. In the updated PR I've made a proposal in Cython, that attains the needed speed.
Benchmarks from
asv_bench/indexing.py