Building a MultiIndex from a CategoricalIndex should be instantaneous: the labels are the codes, and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1], as intended in

pandas/pandas/core/arrays/categorical.py, line 2510 (at edb71fd).

Unfortunately that call is painfully slow. The line above, in categorical.py, is the first bottleneck: it takes 130 ms in this example. We could work around it and replace that line with something along the lines of the sketch below; on my machine, such a construction takes ~2 ms.
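A minimal sketch of this kind of construction (names and data here are illustrative, not the exact code in question): reuse the existing dtype and build the level with Categorical.from_codes, so the category values never get re-hashed or re-validated.

import numpy as np
import pandas as pd

# illustrative data: a large Categorical with many distinct string categories
values = np.random.choice([f"key_{i}" for i in range(300_000)], size=2_000_000)
cat = pd.Categorical(values)

# slow: re-validates / re-factorizes the category values against themselves
# pd.CategoricalIndex(cat.categories, categories=cat.categories, ordered=cat.ordered)

# fast: reuse the existing dtype and build codes [0 ... N-1] directly,
# which is exactly what the MultiIndex level should contain
level = pd.CategoricalIndex(
    pd.Categorical.from_codes(np.arange(len(cat.categories)), dtype=cat.dtype)
)

The codes and categories already exist, so nothing needs to be hashed or checked beyond the cheap range check that from_codes performs.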
The other 50 milliseconds are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be that shallow: it looks like this boils down to yet another variation on the CategoricalIndex constructor:
%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note however that
%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I would have liked to make a PR, but I don't fully understand the code and all the class overloading.
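In the meantime, one user-level workaround is to assemble the MultiIndex by hand from the codes. This is a rough sketch with made-up column names; verify_integrity=False skips consistency checks that an input built this way does not need.

import numpy as np
import pandas as pd

# illustrative frame: a categorical key column plus an integer column
df = pd.DataFrame({
    "key": pd.Categorical(np.random.choice([f"key_{i}" for i in range(300_000)], 2_000_000)),
    "block": np.random.randint(0, 10, 2_000_000),
})

cat = df["key"].values  # the underlying Categorical
key_level = pd.CategoricalIndex(
    pd.Categorical.from_codes(np.arange(len(cat.categories)), dtype=cat.dtype)
)
block_codes, block_uniques = pd.factorize(df["block"].to_numpy())

mi = pd.MultiIndex(
    levels=[key_level, pd.Index(block_uniques)],
    codes=[cat.codes, block_codes],
    names=["key", "block"],
    verify_integrity=False,  # skip the expensive integrity checks
)
df.index = mi  # unlike set_index, this keeps "key" and "block" as columns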
Just wanted to chime in that this seems to be the main performance bottleneck for me when reading in string index columns from Parquet (using, e.g., pyarrow).
Because they're dictionary-encoded on disk, they're efficiently read in as Categorical columns; calling set_index on that single column (to make a CategoricalIndex), or set_index with the string column and a few integer columns (to make a MultiIndex), takes 1-2 minutes for 250M rows, which is longer than it actually takes to read the data from disk.
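For reference, a minimal sketch of that workflow (illustrative sizes and names; assumes pyarrow is installed as the parquet engine):

import numpy as np
import pandas as pd

# illustrative data: a dictionary-encoded string column plus a couple of integer columns
df = pd.DataFrame({
    "name": pd.Categorical(np.random.choice([f"s{i}" for i in range(200_000)], 2_000_000)),
    "a": np.random.randint(0, 100, 2_000_000),
    "b": np.random.randint(0, 100, 2_000_000),
})
df.to_parquet("demo.parquet")  # written dictionary-encoded by pyarrow

df2 = pd.read_parquet("demo.parquet")    # fast: "name" comes back as a Categorical
ci = df2.set_index("name")               # slow: builds a CategoricalIndex
mi = df2.set_index(["name", "a", "b"])   # slow: builds a MultiIndex, same bottleneck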