PERF: MultiIndex._engine use smaller dtypes #58411
# Check the total number of bits needed for our representation:
if lev_bits[0] > 64:
    # The levels would overflow a 64 bit uint - use Python integers:
    return MultiIndexPyIntEngine(self.levels, self.codes, offsets)
return MultiIndexUIntEngine(self.levels, self.codes, offsets)
if lev_bits[0] > 32:
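For readers following the dtype question, the selection logic in this diff can be approximated as below. This is a minimal sketch, not the pandas implementation: `NULLS_SHIFT`, `total_bits`, and `pick_engine_dtype` are hypothetical names, and the assumption that two code values are reserved per level is mine.

```python
import numpy as np

# Assumption: each level reserves two extra code values (e.g. for missing data).
NULLS_SHIFT = 2


def total_bits(level_sizes):
    """Bits needed to pack one code per level into a single integer."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return sum((size + NULLS_SHIFT - 1).bit_length() for size in level_sizes)


def pick_engine_dtype(bits):
    """Return the smallest unsigned dtype that holds `bits` bits, else object."""
    for width in (8, 16, 32, 64):
        if bits <= width:
            return np.dtype(f"uint{width}")
    # More than 64 bits: fall back to Python integers stored as objects.
    return np.dtype(object)


small = pick_engine_dtype(total_bits([254]))
larger = pick_engine_dtype(total_bits([255]))
two_levels = pick_engine_dtype(total_bits([10_000, 10_000]))
```

Under these assumptions a single level of 254 values needs 8 bits, while 255 values need 9 bits and therefore move up to a 16-bit representation.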
Is it possible for the existing engine to get resized so that it would exceed e.g. int32 max? Or would a new engine just be created?
Could you clarify in what cases the existing engine would exceed int32 max?
In cases where MultiIndex._set_levels() is called internally, this would also call _reset_cache(), so the engine is recreated.
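The reset-then-rebuild behaviour described here can be illustrated with a toy model. This is only a sketch of the caching pattern, assuming a dict-based cache: `ToyIndex`, `engine_width`, and `set_n_codes` are hypothetical names and do not correspond to pandas internals.

```python
class ToyIndex:
    """Toy stand-in for an index with a lazily built, cached engine."""

    def __init__(self, n_codes):
        self._n_codes = n_codes
        self._cache = {}

    @property
    def engine_width(self):
        # Build the "engine" (here just a bit width) on first access only.
        if "engine_width" not in self._cache:
            # Assumption: two code values are reserved, so the largest
            # stored value is n_codes + 1.
            bits = (self._n_codes + 1).bit_length()
            # Raises StopIteration above 64 bits; acceptable for a toy.
            width = next(w for w in (8, 16, 32, 64) if bits <= w)
            self._cache["engine_width"] = width
        return self._cache["engine_width"]

    def set_n_codes(self, n_codes):
        # Changing the levels clears the cache, so the next access
        # rebuilds the engine with a width that fits the new size.
        self._n_codes = n_codes
        self._cache.clear()


idx = ToyIndex(254)
first = idx.engine_width
idx.set_n_codes(255)
cache_after_reset = dict(idx._cache)
second = idx.engine_width
```

Because the cached value is dropped whenever the levels change, the old narrow engine is never asked to hold values it cannot represent.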
I don't exactly remember what path this goes through, but a snippet like this:
    n = np.iinfo(np.uint8).max
    ser = pd.Series(range(n), index=pd.MultiIndex.from_arrays([range(n)]))
    ser.loc[n+1] = n+1
    ser.index.some_method_that_uses_the_engine
I tried the snippet above with some slight adjustment:
In [34]: n = np.iinfo(np.uint8).max - 1
In [35]: ser = pd.Series(range(n), index=pd.MultiIndex.from_arrays([range(n)]))
In [36]: ser.index._cache
Out[36]: {}
In [37]: ser.index._engine
Out[37]: <pandas.core.indexes.multi.MultiIndexUInt8Engine at 0x118e0fed0>
In [38]: ser.index._engine.values
Out[38]:
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196,
197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222,
223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235,
236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248,
249, 250, 251, 252, 253, 254, 255], dtype=uint8)
In [39]: n
Out[39]: 254
In [40]: ser.index._cache
Out[40]:
{'levels': FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]]),
'_engine': <pandas.core.indexes.multi.MultiIndexUInt8Engine at 0x118e0fed0>}
In [41]: ser.loc[n+1] = n+1
In [42]: ser.index._cache
Out[42]: {'levels': FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]])}
In [43]: ser.index._engine
Out[43]: <pandas.core.indexes.multi.MultiIndexUInt16Engine at 0x107288270>
In [44]: ser.index._cache
Out[44]:
{'levels': FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]]),
'_engine': <pandas.core.indexes.multi.MultiIndexUInt16Engine at 0x107288270>}
So it seems that after calling ser.loc, the engine is automatically deleted from the cache (or the index is a copy, since id returns a different number).
Great, thanks for checking.
I executed the asv benchmarks for [...]
All of them improved, except [...]
Side note, I also tried to move the class [...]
Thanks @GianlucaFicarelli
* PERF: MultiIndex._engine use smaller dtypes
* Move offsets downcasting to MultiIndex._engine
* Remove unused import uint64_t
Use smaller dtypes in MultiIndex._engine if possible. This reduces both loading time and memory (peak and used).
The improvement should be more relevant when working with big indices.
Two examples follow.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Improvement when engine.values uses uint32 instead of uint64:
* idx._engine: from 8.53 s to 2.45 s
* _engine.values.nbytes: from 800000000 to 400000000 bytes
* _engine:
  Time, Pandas 2.2.2:
  Time, PR branch:
  Memory, Pandas 2.2.2:
  Memory, PR branch:

Improvement when engine.values still needs to use uint64:
* idx._engine: from 17.6 s to 4.56 s
* _engine.values.nbytes: from 390625000 to 390625000 bytes (unchanged since the dtype is the same)
* _engine:
  Time, Pandas 2.2.2:
  Time, PR branch:
  Memory, Pandas 2.2.2:
  Memory, PR branch:
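The nbytes figures above follow directly from the item size of the engine's dtype. The sketch below shows the arithmetic. The row counts are inferred by dividing the reported byte totals by 8 and are not stated in the PR, and `engine_values_nbytes` is a hypothetical helper.

```python
import numpy as np


def engine_values_nbytes(n_rows, dtype):
    """Bytes used by a 1-D array of `n_rows` packed codes of the given dtype."""
    return n_rows * np.dtype(dtype).itemsize


# Inferred row counts: 800000000 / 8 and 390625000 / 8.
rows_first = 100_000_000
rows_second = 48_828_125

before_first = engine_values_nbytes(rows_first, "uint64")   # 8 bytes per row
after_first = engine_values_nbytes(rows_first, "uint32")    # 4 bytes per row
second = engine_values_nbytes(rows_second, "uint64")        # dtype unchanged
```

Halving the item size halves the memory held by the values array, which is why the second example, still on uint64, shows no change in nbytes even though its load time improves.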