Skip to content

PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jul 1, 2018

This is the the final puzzle for #20395.

The problem with the current implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

@cache_readonly
def _engine(self):
# we are going to look things up with the codes themselves
return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part, which means that every time _engine.vgetter is called, an int64-array is created. This is very expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.

A complicating issue is that it is not enough to just avoid the int64 transformation, as array.searchsorted apparantly needs a dtype-compatible input or else it is also very slow:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
7.38 ms   # slow
>>> code = np.int8(1)
>>> %timeit ci.codes.searchsorted(code)
2.57 µs  # fast

Solution options

As CategoricalIndex.codes may be int8, int16, etc, the solution must be to (1) have an indexing engine for each integer dtype or an indexing engine that accepts all int types, not just int64 and (2) that the key must be translated into the same dtype as the codes array before calling searchsorted. So either:

  1. Change Int64Engine to be a IntEngine (i.e. accept all integer dtypes)
  2. Make new IntEngine classes, with the appropriate flexibility for accepting all integer dtypes, but defers to Int64 version if/when needed (e.g. if codes is int8, but we only have algos.is_monotonic_int64 for checking monotonicity)
  3. Do everything in Python.

I assume option 1 is not desired, and option 3 assumedly likewise. In the updated PR I've made a proposal in Cython, that attains the needed speed.

Benchmarks from asv_bench/indexing.py

      before           after         ratio
     [dc45fbaf]       [bc03b8bd]
-     1.54±0.02ms         9.80±0μs     0.01  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@codecov
Copy link

codecov bot commented Jul 1, 2018

Codecov Report

Merging #21699 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21699      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50821       +1     
==========================================
+ Hits        46850    46852       +2     
+ Misses       3970     3969       -1
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.38% <57.14%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.39% <ø> (ø) ⬆️
pandas/core/indexes/category.py 97.54% <100%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab9dbd...6aa94e9. Read the comment docs.

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Jul 2, 2018
@jreback
Copy link
Contributor

jreback commented Jul 2, 2018

so 2) is the prefered solution here. you can simply use the templating like we do in most modules which generates code. its not trivial but not super complicated either.

@WillAyd
Copy link
Member

WillAyd commented Jul 4, 2018

@topper-123 what issues were you running into with Cython? As far as generating the indexes goes I would think you could just add the appropriate types below:

dtypes = [('Float64', 'float64', 'float64_t'),

Not sure how you were planning to dynamically select the appropriate engine in category.py but as far as generating those engines in the Cython layer I think most of the foundation is already in place

@topper-123
Copy link
Contributor Author

Ok, I think I understand the design now, at least. What is the templating language used and how is the pxi-file generated?

So would the correct approach be to:

  • add int8_t, int16_t etc. to that list in index_helper_pxi.in,
  • similarly add them to the list in algos.common_helper.pxi.in,
  • ensure that CategoricalIndex._engine is of the correct type,
  • avoid the casting to int64 in CategoricalIndex._engine?

@topper-123
Copy link
Contributor Author

topper-123 commented Jul 4, 2018

Ok, got algos.common_helper.pxi.in compiling and works fine.

index_helper.pxi throws a compiler error. I added to the dtype list in index_helper.pxi.in:

          ('Int8', 'int8', 'int8_t'),

but get this error when compiling:

cythoning pandas\_libs/index.pyx to pandas\_libs\index.c

Error compiling Cython file:
------------------------------------------------------------
...

    cdef _maybe_get_bool_indexer(self, object val):
        cdef:
            ndarray[uint8_t, ndim=1, cast=True] indexer
            ndarray[intp_t, ndim=1] found
            ndarray[int8_t] values
                         ^
------------------------------------------------------------

pandas\_libs\index_class_helper.pxi:179:26: Invalid type.

It appears the compiler does not like the int8_t type. Using ('UInt8', 'uint8', 'uint8_t'), instead compiles without errors.

Any ideas what's happening?

@jreback
Copy link
Contributor

jreback commented Jul 4, 2018

you need to do the imports of int8_t in index.pyx itself (see the top).

@topper-123
Copy link
Contributor Author

topper-123 commented Jul 4, 2018

Thanks, got everything compiling with Int8Engine, which is good.

However, when running, it complains there is no Int8HashTable. Seems like there isn't an easy way to create an Int8HashTable, is that correct?

~\Documents\Python\pandasdev\pandasdev\pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int8Engine._make_hash_table()
    212
    213     cdef _make_hash_table(self, n):
--> 214         return _hash.Int8HashTable(n)

AttributeError: module 'pandas._libs.hashtable' has no attribute 'Int8HashTable'

I tried looking into using a Int64HashTable in Int8Engine, but that seems to bring a different set of tradeoffs...

Is there an easy way to create a Int8HashTable, or is it reasonable to use a Int64HashTable in Int8Engine?

@jreback
Copy link
Contributor

jreback commented Jul 4, 2018

you might need to create the other hash tables (definitions) as well
i think we only have 32 & 64 hits atm

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from fbb711f to db5c447 Compare July 5, 2018 12:15
@topper-123
Copy link
Contributor Author

Ok, I've tried creating the int8 hash table, but got some issues. Everything compiles up till and including " add Int8Index", but that runs into the issue that no Int8HashTable existst. I've tried creating a Int8HashTable (see the commit "add Int8HashTable"), but I can't get to compile.

It gives this compilation error:

Error compiling Cython file:
------------------------------------------------------------
...
        self.external_view_exists = True
        return self.ao

    cdef inline void append(self, int8_t x):

        if needs_resize(self.data):
                      ^
------------------------------------------------------------

pandas\_libs\hashtable_class_helper.pxi:287:23: no suitable method found

Error compiling Cython file:
------------------------------------------------------------
...


cdef class Int8HashTable(HashTable):

    def __cinit__(self, size_hint=1):
        self.table = kh_init_int8()
                                ^
------------------------------------------------------------

pandas\_libs\hashtable_class_helper.pxi:1291:33: undeclared name not builtin: kh_init_int8

I've tried messing with khash.pxd to get that kh_*_int8 working, but can't get it running.

What does khash.pxd do? It seems to in turn depend on khash_python.h, which only seems to have int64, int32 anf float64 version available. Must that be changed before getting a Int8HashTable working?

Guidance would be very much appreciated.

@jreback
Copy link
Contributor

jreback commented Jul 5, 2018

hmm, i guess khash doesn't support this either.

ok, so maybe let's do a precursor PR to do this properly

@topper-123
Copy link
Contributor Author

Ok, just to confirm, so you/someone else will do the PR to get Int8HashTable etc. working, and I'll wait for that and afterwards I will implement Int8Engine etc?

@jreback
Copy link
Contributor

jreback commented Jul 8, 2018

@topper-123 well I suppose someone can do a PR to make other types of hashtables work. cc @pandas-dev/pandas-core

@topper-123
Copy link
Contributor Author

That would be great. If someone makes the hash tables I will take this the rest of the way.

@topper-123
Copy link
Contributor Author

topper-123 commented Aug 1, 2018

I've made a solution for the problem of not having Int8HashTable, Int16HashTable and Int32HashTable by using Int64HashTable in Int8Engine etc. All the tests pass and performance is the same as above.

Is this an ok approach? I could then add some tests and ASVs as needed.

A nice side effect BTW is that we now have and can use pd._libs.algos.is_monotonic_int8, so we don't have to convert to int64 when checking for monoticity and uniqueness:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
# the below is called in ci.is_monotonic etc. in master
>>> %timeit pd._libs.algos.is_monotonic_int64(ci.codes.astype('int64'), False)
21.6 ms  # master
# the below is called in ci.is_monotonic etc. in this PR
>>> %timeit pd._libs.algos.is_monotonic_int8(ci.codes, False)
5.6 ms  # this PR

So all code that calls CategoricalIndex.is_monotonic/.is_unique will benefit also.

@@ -429,10 +437,11 @@ def get_loc(self, key, method=None):
>>> non_monotonic_index.get_loc('b')
array([False, True, False, True], dtype=bool)
"""
codes = self.categories.get_loc(key)
if (codes == -1):
raise KeyError(key)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codes can never be -1. If key is not in self.categories, a KeyError was always raised.

@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs):
def _engine(self):

# we are going to look things up with the codes themselves
return self._engine_type(lambda: self.codes.astype('i8'), len(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need any changes in cython if you simply so: .astype('i8', copy=False), which is the cause of the perf issue, no?

its still not as nice as actually using type specific hashtables though (which this PR is not addressing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing that. Still is slow, 14 ms.

In index_class_helper.pxi.in, I also tried changing

    cdef _get_index_values(self):
        return algos.ensure_{{dtype}}(self.vgetter())

to

    cdef _get_index_values(self):
        return self.vgetter()

But also slow, 14 ms.

I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then you have something else going on

Copy link
Contributor Author

@topper-123 topper-123 Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numpy docs say about the copy parameter of array.astype:

By default, astype always returns a newly allocated array. If this is set
to false, and the dtype, order, and subok requirements are satisfied,
the input array is returned instead of a copy.

So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.

raise KeyError(key)
return self._engine.get_loc(codes)
code = self.categories.get_loc(key)
# dtype must be same as dtype for codes for searchsorted not lose speed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line between

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from 6e1adff to 912d81b Compare August 1, 2018 13:24
@topper-123 topper-123 changed the title PERF/WIP: CategoricalIndex.get_loc should avoid expensive cast to int64 PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 Aug 9, 2018
@topper-123
Copy link
Contributor Author

Hey, array.astype('int64', copy=False) won't work as explained above.

I don't see any good way forward other than someone commiting to make an Int8/16/32HashTable at some point. The question is thought if this PR needs to wait for that, or can be pulled in now and the relevant lines be changed when the hash table is implemented.

I argue pulling this in now, as index operations are critical for performance and that would allow me to pursue a few more index-related performance issues.

@@ -366,7 +370,7 @@ def argsort(self, *args, **kwargs):
def _engine(self):

# we are going to look things up with the codes themselves
return self._engine_type(lambda: self.codes.astype('i8'), len(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from 912d81b to c4f74cb Compare August 10, 2018 04:32
@topper-123
Copy link
Contributor Author

algos.ensure_int64(values) ends up calling values.astype(np.int64), and therefore also causes the slowdown for int8/16/32 arrays that we want to avoid.

A technically possible solution to avoid that if statement while keeping the speed improvements would be to input a int64 array into _engine:

    def _engine(self):
        codes = self.codes.astype('int64')
        return self._engine_type(lambda: codes, len(self))

But that would keep an int64 array in memory, in addition to an int8 array, so doesn't make any sense memorywise. We want to just use the existing int8 array.

I made another solution, where I've moved this specific implementation to index_class_helper.pxi.in. This makes its location more specific to the problem it solves.

- Improved performance of membership checks in :class:`Categorical` and :class:`CategoricalIndex`
(i.e. ``x in cat``-style checks are much faster). :meth:`CategoricalIndex.contains`
is likewise much faster (:issue:`21369`, :issue:`21508`)
- Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally pls try to avoid changing orthogonal things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how that may be confusing. I'll limit doing this from now on, but this was just a spelling error, that I corrected, as it was adjacent to my text..

'UInt64', 'UInt32', 'UInt16', 'UInt8',
'Float64', 'Float32',
])
def num_engine(request):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this more descriptive, numeric_indexing_engine

class TestCategoricalIndexEngine(object):

@pytest.mark.parametrize('nbits', [8, 16, 32, 64])
def test_engine_type(self, nbits):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, this is very complicated. why is this necessary? can you make much more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've made it simpler.


class TestNumericEngine(object):

@pytest.mark.parametrize('data', [[0, 1, 2]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this parameterized over data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. You start out with a certain structure and forget to check if is sensible for the end product...


@pytest.mark.parametrize('data', [[0, 1, 2]])
def test_is_monotonic_ordered(self, data, num_engine):
codes = np.array(data, dtype=num_engine._dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you have multiple tests for monotonic? and not monotonic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there just one data to check for. Fixed.

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from adac8f4 to 9836666 Compare August 17, 2018 20:53
@topper-123
Copy link
Contributor Author

Changes according to comments.

@topper-123
Copy link
Contributor Author

Is this ok to merge in?

@gfyoung
Copy link
Member

gfyoung commented Aug 25, 2018

ping @jreback

@jreback
Copy link
Contributor

jreback commented Aug 25, 2018

will have a look

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

not sure of the state of thIS PR, needs a rebase

@gfyoung
Copy link
Member

gfyoung commented Sep 23, 2018

@jreback : I believe this PR is ready for review see above. However, time as a resource is limited, and this one got overlooked unfortunately.

@topper-123 : Sorry for the delay, but can you rebase this?

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from 9836666 to 1a438a3 Compare September 24, 2018 18:22
@pep8speaks
Copy link

Hello @topper-123! Thanks for updating the PR.

@topper-123
Copy link
Contributor Author

topper-123 commented Sep 24, 2018

Rebased and green.

EDIT: ok, the whatsnew disappeared in the rebase. I've added it again, while the code itself unchanged.

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc branch from 1a438a3 to 6aa94e9 Compare September 24, 2018 22:16
@pep8speaks
Copy link

Hello @topper-123! Thanks for updating the PR.

@gfyoung
Copy link
Member

gfyoung commented Sep 24, 2018

@topper-123 : Thanks! Can you double check if you addressed all comments? IINM, I see a few that might need attention...

Feel free to resolve those that you think have completed discussion.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

I am not convinced this is actually a good solution here. I think there was a much simpler version that worked. This is adding quite a bit of code and complexity. I would like to see the performance tests split off into a new PR (and the codes can never be -1) bug fix as a first step.

@topper-123
Copy link
Contributor Author

Closed in favor of #23235.

@topper-123 topper-123 deleted the CategoricalIndex.get_loc branch February 2, 2019 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index
7 participants