PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

topper-123 · 2018-07-01T19:35:53Z

closes PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is the the final puzzle for #20395.

The problem with the current implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

pandas/pandas/core/indexes/category.py

Lines 365 to 369 in dc45fba

    
           @cache_readonly 
        
           def _engine(self): 
        
               # we are going to look things up with the codes themselves 
        
               return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part, which means that every time _engine.vgetter is called, an int64-array is created. This is very expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.

A complicating issue is that it is not enough to just avoid the int64 transformation, as array.searchsorted apparantly needs a dtype-compatible input or else it is also very slow:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
7.38 ms   # slow
>>> code = np.int8(1)
>>> %timeit ci.codes.searchsorted(code)
2.57 µs  # fast

Solution options

As CategoricalIndex.codes may be int8, int16, etc, the solution must be to (1) have an indexing engine for each integer dtype or an indexing engine that accepts all int types, not just int64 and (2) that the key must be translated into the same dtype as the codes array before calling searchsorted. So either:

Change Int64Engine to be a IntEngine (i.e. accept all integer dtypes)
Make new IntEngine classes, with the appropriate flexibility for accepting all integer dtypes, but defers to Int64 version if/when needed (e.g. if codes is int8, but we only have algos.is_monotonic_int64 for checking monotonicity)
Do everything in Python.

I assume option 1 is not desired, and option 3 assumedly likewise. In the updated PR I've made a proposal in Cython, that attains the needed speed.

Benchmarks from asv_bench/indexing.py

      before           after         ratio
     [dc45fbaf]       [bc03b8bd]
-     1.54±0.02ms         9.80±0μs     0.01  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

codecov · 2018-07-01T20:48:23Z

Codecov Report

Merging #21699 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21699      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50821       +1     
==========================================
+ Hits        46850    46852       +2     
+ Misses       3970     3969       -1

Flag	Coverage Δ
#multiple	`90.6% <100%> (ø)`	⬆️
#single	`42.38% <57.14%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/base.py	`96.39% <ø> (ø)`	⬆️
pandas/core/indexes/category.py	`97.54% <100%> (+0.27%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab9dbd...6aa94e9. Read the comment docs.

jreback · 2018-07-02T23:27:06Z

so 2) is the prefered solution here. you can simply use the templating like we do in most modules which generates code. its not trivial but not super complicated either.

WillAyd · 2018-07-04T03:58:53Z

@topper-123 what issues were you running into with Cython? As far as generating the indexes goes I would think you could just add the appropriate types below:

pandas/pandas/_libs/index_class_helper.pxi.in

Line 14 in 2b13605

dtypes = [('Float64', 'float64', 'float64_t'),

Not sure how you were planning to dynamically select the appropriate engine in category.py but as far as generating those engines in the Cython layer I think most of the foundation is already in place

topper-123 · 2018-07-04T09:20:43Z

Ok, I think I understand the design now, at least. What is the templating language used and how is the pxi-file generated?

So would the correct approach be to:

add int8_t, int16_t etc. to that list in index_helper_pxi.in,
similarly add them to the list in algos.common_helper.pxi.in,
ensure that CategoricalIndex._engine is of the correct type,
avoid the casting to int64 in CategoricalIndex._engine?

topper-123 · 2018-07-04T09:45:53Z

Ok, got algos.common_helper.pxi.in compiling and works fine.

index_helper.pxi throws a compiler error. I added to the dtype list in index_helper.pxi.in:

          ('Int8', 'int8', 'int8_t'),

but get this error when compiling:

cythoning pandas\_libs/index.pyx to pandas\_libs\index.c

Error compiling Cython file:
------------------------------------------------------------
...

    cdef _maybe_get_bool_indexer(self, object val):
        cdef:
            ndarray[uint8_t, ndim=1, cast=True] indexer
            ndarray[intp_t, ndim=1] found
            ndarray[int8_t] values
                         ^
------------------------------------------------------------

pandas\_libs\index_class_helper.pxi:179:26: Invalid type.

It appears the compiler does not like the int8_t type. Using ('UInt8', 'uint8', 'uint8_t'), instead compiles without errors.

Any ideas what's happening?

jreback · 2018-07-04T12:55:17Z

you need to do the imports of int8_t in index.pyx itself (see the top).

topper-123 · 2018-07-04T21:21:07Z

Thanks, got everything compiling with Int8Engine, which is good.

However, when running, it complains there is no Int8HashTable. Seems like there isn't an easy way to create an Int8HashTable, is that correct?

~\Documents\Python\pandasdev\pandasdev\pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int8Engine._make_hash_table()
    212
    213     cdef _make_hash_table(self, n):
--> 214         return _hash.Int8HashTable(n)

AttributeError: module 'pandas._libs.hashtable' has no attribute 'Int8HashTable'

I tried looking into using a Int64HashTable in Int8Engine, but that seems to bring a different set of tradeoffs...

Is there an easy way to create a Int8HashTable, or is it reasonable to use a Int64HashTable in Int8Engine?

jreback · 2018-07-04T21:52:40Z

you might need to create the other hash tables (definitions) as well
i think we only have 32 & 64 hits atm

topper-123 · 2018-07-05T12:17:35Z

Ok, I've tried creating the int8 hash table, but got some issues. Everything compiles up till and including " add Int8Index", but that runs into the issue that no Int8HashTable existst. I've tried creating a Int8HashTable (see the commit "add Int8HashTable"), but I can't get to compile.

It gives this compilation error:

Error compiling Cython file:
------------------------------------------------------------
...
        self.external_view_exists = True
        return self.ao

    cdef inline void append(self, int8_t x):

        if needs_resize(self.data):
                      ^
------------------------------------------------------------

pandas\_libs\hashtable_class_helper.pxi:287:23: no suitable method found

Error compiling Cython file:
------------------------------------------------------------
...


cdef class Int8HashTable(HashTable):

    def __cinit__(self, size_hint=1):
        self.table = kh_init_int8()
                                ^
------------------------------------------------------------

pandas\_libs\hashtable_class_helper.pxi:1291:33: undeclared name not builtin: kh_init_int8

I've tried messing with khash.pxd to get that kh_*_int8 working, but can't get it running.

What does khash.pxd do? It seems to in turn depend on khash_python.h, which only seems to have int64, int32 anf float64 version available. Must that be changed before getting a Int8HashTable working?

Guidance would be very much appreciated.

jreback · 2018-07-05T16:55:05Z

hmm, i guess khash doesn't support this either.

ok, so maybe let's do a precursor PR to do this properly

topper-123 · 2018-07-05T21:32:44Z

Ok, just to confirm, so you/someone else will do the PR to get Int8HashTable etc. working, and I'll wait for that and afterwards I will implement Int8Engine etc?

jreback · 2018-07-08T15:32:13Z

@topper-123 well I suppose someone can do a PR to make other types of hashtables work. cc @pandas-dev/pandas-core

topper-123 · 2018-07-08T21:41:06Z

That would be great. If someone makes the hash tables I will take this the rest of the way.

topper-123 · 2018-08-01T09:39:16Z

I've made a solution for the problem of not having Int8HashTable, Int16HashTable and Int32HashTable by using Int64HashTable in Int8Engine etc. All the tests pass and performance is the same as above.

Is this an ok approach? I could then add some tests and ASVs as needed.

A nice side effect BTW is that we now have and can use pd._libs.algos.is_monotonic_int8, so we don't have to convert to int64 when checking for monoticity and uniqueness:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
# the below is called in ci.is_monotonic etc. in master
>>> %timeit pd._libs.algos.is_monotonic_int64(ci.codes.astype('int64'), False)
21.6 ms  # master
# the below is called in ci.is_monotonic etc. in this PR
>>> %timeit pd._libs.algos.is_monotonic_int8(ci.codes, False)
5.6 ms  # this PR

So all code that calls CategoricalIndex.is_monotonic/.is_unique will benefit also.

topper-123 · 2018-08-01T09:42:20Z

pandas/core/indexes/category.py

@@ -429,10 +437,11 @@ def get_loc(self, key, method=None):
        >>> non_monotonic_index.get_loc('b')
        array([False,  True, False,  True], dtype=bool)
        """
-        codes = self.categories.get_loc(key)
-        if (codes == -1):
-            raise KeyError(key)


codes can never be -1. If key is not in self.categories, a KeyError was always raised.

jreback · 2018-08-01T10:28:22Z

pandas/core/indexes/category.py

@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs):
    def _engine(self):

        # we are going to look things up with the codes themselves
-        return self._engine_type(lambda: self.codes.astype('i8'), len(self))


I don't think you need any changes in cython if you simply so: .astype('i8', copy=False), which is the cause of the perf issue, no?

its still not as nice as actually using type specific hashtables though (which this PR is not addressing)

I tried changing that. Still is slow, 14 ms.

In index_class_helper.pxi.in, I also tried changing

cdef _get_index_values(self): return algos.ensure_{{dtype}}(self.vgetter())

to

cdef _get_index_values(self): return self.vgetter()

But also slow, 14 ms.

I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables.

then you have something else going on

The numpy docs say about the copy parameter of array.astype:

By default, astype always returns a newly allocated array. If this is set to false, and the dtype, order, and subok requirements are satisfied, the input array is returned instead of a copy.

So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals.

so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.

jreback · 2018-08-01T10:28:43Z

pandas/core/indexes/category.py

-            raise KeyError(key)
-        return self._engine.get_loc(codes)
+        code = self.categories.get_loc(key)
+        # dtype must be same as dtype for codes for searchsorted not lose speed


blank line between

topper-123 · 2018-08-09T17:04:03Z

Hey, array.astype('int64', copy=False) won't work as explained above.

I don't see any good way forward other than someone commiting to make an Int8/16/32HashTable at some point. The question is thought if this PR needs to wait for that, or can be pulled in now and the relevant lines be changed when the hash table is implemented.

I argue pulling this in now, as index operations are critical for performance and that would allow me to pursue a few more index-related performance issues.

jreback · 2018-08-09T22:57:36Z

pandas/core/indexes/category.py

@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs):
    def _engine(self):

        # we are going to look things up with the codes themselves
-        return self._engine_type(lambda: self.codes.astype('i8'), len(self))


so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.

topper-123 · 2018-08-10T04:36:33Z

algos.ensure_int64(values) ends up calling values.astype(np.int64), and therefore also causes the slowdown for int8/16/32 arrays that we want to avoid.

A technically possible solution to avoid that if statement while keeping the speed improvements would be to input a int64 array into _engine:

    def _engine(self):
        codes = self.codes.astype('int64')
        return self._engine_type(lambda: codes, len(self))

But that would keep an int64 array in memory, in addition to an int8 array, so doesn't make any sense memorywise. We want to just use the existing int8 array.

I made another solution, where I've moved this specific implementation to index_class_helper.pxi.in. This makes its location more specific to the problem it solves.

jreback · 2018-08-17T10:32:21Z

doc/source/whatsnew/v0.24.0.txt

 - Improved performance of membership checks in :class:`Categorical` and :class:`CategoricalIndex`
  (i.e. ``x in cat``-style checks are much faster). :meth:`CategoricalIndex.contains`
  is likewise much faster (:issue:`21369`, :issue:`21508`)
+- Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`)


generally pls try to avoid changing orthogonal things.

I can see how that may be confusing. I'll limit doing this from now on, but this was just a spelling error, that I corrected, as it was adjacent to my text..

jreback · 2018-08-17T10:35:20Z

pandas/tests/indexes/conftest.py

+        'UInt64', 'UInt32', 'UInt16', 'UInt8',
+        'Float64', 'Float32',
+    ])
+def num_engine(request):


make this more descriptive, numeric_indexing_engine

jreback · 2018-08-17T10:36:08Z

pandas/tests/indexes/test_category.py

+class TestCategoricalIndexEngine(object):
+
+    @pytest.mark.parametrize('nbits', [8, 16, 32, 64])
+    def test_engine_type(self, nbits):


woa, this is very complicated. why is this necessary? can you make much more clear

Yeah, I've made it simpler.

jreback · 2018-08-17T10:37:09Z

pandas/tests/indexes/test_engine.py

+
+class TestNumericEngine(object):
+
+    @pytest.mark.parametrize('data', [[0, 1, 2]])


why is this parameterized over data?

Fixed. You start out with a certain structure and forget to check if is sensible for the end product...

jreback · 2018-08-17T10:37:57Z

pandas/tests/indexes/test_engine.py

+
+    @pytest.mark.parametrize('data', [[0, 1, 2]])
+    def test_is_monotonic_ordered(self, data, num_engine):
+        codes = np.array(data, dtype=num_engine._dtype)


why do you have multiple tests for monotonic? and not monotonic?

Yeah, there just one data to check for. Fixed.

topper-123 · 2018-08-17T20:57:45Z

Changes according to comments.

topper-123 · 2018-08-25T09:57:40Z

Is this ok to merge in?

gfyoung · 2018-08-25T10:16:52Z

ping @jreback

jreback · 2018-08-25T10:44:24Z

will have a look

jreback · 2018-09-23T21:40:26Z

not sure of the state of thIS PR, needs a rebase

gfyoung · 2018-09-23T21:43:26Z

@jreback : I believe this PR is ready for review see above. However, time as a resource is limited, and this one got overlooked unfortunately.

@topper-123 : Sorry for the delay, but can you rebase this?

pep8speaks · 2018-09-24T18:22:26Z

Hello @topper-123! Thanks for updating the PR.

There are no PEP8 issues in the file asv_bench/benchmarks/indexing.py !
There are no PEP8 issues in the file asv_bench/benchmarks/indexing_engines.py !
There are no PEP8 issues in the file pandas/conftest.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/indexes/category.py !
There are no PEP8 issues in the file pandas/tests/indexes/conftest.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_category.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_engine.py !

topper-123 · 2018-09-24T19:55:32Z

Rebased and green.

EDIT: ok, the whatsnew disappeared in the rebase. I've added it again, while the code itself unchanged.

pep8speaks · 2018-09-24T22:16:30Z

Hello @topper-123! Thanks for updating the PR.

There are no PEP8 issues in the file asv_bench/benchmarks/indexing.py !
There are no PEP8 issues in the file asv_bench/benchmarks/indexing_engines.py !
There are no PEP8 issues in the file pandas/conftest.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/indexes/category.py !
There are no PEP8 issues in the file pandas/tests/indexes/conftest.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_category.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_engine.py !

gfyoung · 2018-09-24T23:06:54Z

@topper-123 : Thanks! Can you double check if you addressed all comments? IINM, I see a few that might need attention...

Feel free to resolve those that you think have completed discussion.

jreback · 2018-09-25T12:22:56Z

I am not convinced this is actually a good solution here. I think there was a much simpler version that worked. This is adding quite a bit of code and complexity. I would like to see the performance tests split off into a new PR (and the codes can never be -1) bug fix as a first step.

topper-123 · 2018-10-19T02:23:49Z

Closed in favor of #23235.

jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Jul 2, 2018

topper-123 force-pushed the CategoricalIndex.get_loc branch from fbb711f to db5c447 Compare July 5, 2018 12:15

topper-123 mentioned this pull request Jul 23, 2018

BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019

Closed

4 tasks

topper-123 force-pushed the CategoricalIndex.get_loc branch 3 times, most recently from e28b353 to 6e1adff Compare August 1, 2018 09:28

topper-123 commented Aug 1, 2018

View reviewed changes

jreback requested changes Aug 1, 2018

View reviewed changes

topper-123 mentioned this pull request Aug 1, 2018

BUG: _cython_table bug (part 2) #22156

Closed

topper-123 force-pushed the CategoricalIndex.get_loc branch from 6e1adff to 912d81b Compare August 1, 2018 13:24

topper-123 changed the title ~~PERF/WIP: CategoricalIndex.get_loc should avoid expensive cast to int64~~ PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 Aug 9, 2018

jreback requested changes Aug 9, 2018

View reviewed changes

topper-123 force-pushed the CategoricalIndex.get_loc branch from 912d81b to c4f74cb Compare August 10, 2018 04:32

jreback requested changes Aug 17, 2018

View reviewed changes

topper-123 force-pushed the CategoricalIndex.get_loc branch from adac8f4 to 9836666 Compare August 17, 2018 20:53

tp and others added 8 commits September 23, 2018 23:07

Proof of performance

297ec19

add Int8Index

91ee55d

make Int8/16/32Engine work with Int64HashTable

4d0612e

move int8/16/32 type conversions to index_class_helper.pxi.in

5a89a51

changed according to comments

5575e93

correct ncategories size + add tests

2fa526f

add ASVs

4bc74f5

clean up ASV code a bit

c9f1166

topper-123 force-pushed the CategoricalIndex.get_loc branch from 9836666 to 1a438a3 Compare September 24, 2018 18:22

adjust for that is_monotonic uses fused types

6aa94e9

topper-123 force-pushed the CategoricalIndex.get_loc branch from 1a438a3 to 6aa94e9 Compare September 24, 2018 22:16

This was referenced Oct 11, 2018

ASV: add tests for indexing engines and Uint64Engine #23090

Merged

clean CategoricalIndex.get_loc and improve error message #23091

Merged

topper-123 mentioned this pull request Oct 19, 2018

PERF: speed up CategoricalIndex.get_loc #23235

Merged

4 tasks

jreback closed this Oct 19, 2018

topper-123 deleted the CategoricalIndex.get_loc branch February 2, 2019 18:45

	@cache_readonly
	def _engine(self):

	# we are going to look things up with the codes themselves
	return self._engine_type(lambda: self.codes.astype('i8'), len(self))


		class TestNumericEngine(object):

		@pytest.mark.parametrize('data', [[0, 1, 2]])

PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

Conversation

topper-123 commented Jul 1, 2018 • edited Loading

Solution options

codecov bot commented Jul 1, 2018 • edited Loading

Codecov Report

jreback commented Jul 2, 2018

WillAyd commented Jul 4, 2018

topper-123 commented Jul 4, 2018

topper-123 commented Jul 4, 2018 • edited Loading

jreback commented Jul 4, 2018

topper-123 commented Jul 4, 2018 • edited Loading

jreback commented Jul 4, 2018

topper-123 commented Jul 5, 2018

jreback commented Jul 5, 2018

topper-123 commented Jul 5, 2018

jreback commented Jul 8, 2018

topper-123 commented Jul 8, 2018

topper-123 commented Aug 1, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Aug 6, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Aug 9, 2018

Choose a reason for hiding this comment

topper-123 commented Aug 10, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Aug 17, 2018

topper-123 commented Aug 25, 2018

gfyoung commented Aug 25, 2018

jreback commented Aug 25, 2018

jreback commented Sep 23, 2018

gfyoung commented Sep 23, 2018

pep8speaks commented Sep 24, 2018

topper-123 commented Sep 24, 2018 • edited Loading

pep8speaks commented Sep 24, 2018

gfyoung commented Sep 24, 2018

jreback commented Sep 25, 2018

topper-123 commented Oct 19, 2018

topper-123 commented Jul 1, 2018 •

edited

Loading

codecov bot commented Jul 1, 2018 •

edited

Loading

topper-123 commented Jul 4, 2018 •

edited

Loading

topper-123 commented Jul 4, 2018 •

edited

Loading

topper-123 commented Aug 1, 2018 •

edited

Loading

topper-123 Aug 6, 2018 •

edited

Loading

topper-123 commented Sep 24, 2018 •

edited

Loading