API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False #48749

Open
@mroeschke

xref: dask/dask#9486 (comment)

TLDR: When calling df.groupby(key=categorical<ordered=False>, sort=False, observed=False), the resulting CategoricalIndex has both its values and its categories reordered into as-encountered order.

In [1]:     df = DataFrame(
   ...:         [
   ...:             ["(7.5, 10]", 10, 10],
   ...:             ["(7.5, 10]", 8, 20],
   ...:             ["(2.5, 5]", 5, 30],
   ...:             ["(5, 7.5]", 6, 40],
   ...:             ["(2.5, 5]", 4, 50],
   ...:             ["(0, 2.5]", 1, 60],
   ...:             ["(5, 7.5]", 7, 70],
   ...:         ],
   ...:         columns=["range", "foo", "bar"],
   ...:     )

In [2]: col = "range"

In [3]: df["range"] = Categorical(df["range"], ordered=False)

In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')

It's reasonable that the values are not sorted when sort=False, but a lot of extra work can be spent reordering the categories in:

# sort=False should order groups in as-encountered order (GH-8868)
cat = c.unique()
# See GH-38140 for block below
# exclude nan from indexer for categories
take_codes = cat.codes[cat.codes != -1]
if cat.ordered:
    take_codes = np.sort(take_codes)
cat = cat.set_categories(cat.categories.take(take_codes))
# But for groupby to work, all categories should be present,
# including those missing from the data (GH-13179), which .unique()
# above dropped
cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])
return c.reorder_categories(cat.categories), None
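
To see what this block does in isolation, here is a minimal sketch that wraps the same steps in a standalone function using only public `Categorical` methods. The name `recode_as_encountered` is made up for illustration, the `if len(missing)` guard is an addition of this sketch, and it is not claimed to match the pandas internals line for line.

```python
import numpy as np
import pandas as pd


def recode_as_encountered(c: pd.Categorical) -> pd.Categorical:
    """Reorder the categories of ``c`` into as-encountered order.

    Hypothetical helper mirroring the block quoted above; not pandas API.
    """
    cat = c.unique()
    # Drop the NaN code (-1) before using the codes as a category indexer.
    take_codes = cat.codes[cat.codes != -1]
    if cat.ordered:
        take_codes = np.sort(take_codes)
    cat = cat.set_categories(cat.categories.take(take_codes))
    # Re-append categories that never occur in the data.
    missing = c.categories[~c.categories.isin(cat.categories)]
    if len(missing):  # guard added for this sketch
        cat = cat.add_categories(missing)
    # Same values as the input, but with the category order rewritten.
    return c.reorder_categories(cat.categories)


values = ["(7.5, 10]", "(7.5, 10]", "(2.5, 5]", "(5, 7.5]",
          "(2.5, 5]", "(0, 2.5]", "(5, 7.5]"]
c = pd.Categorical(values, ordered=False)
recoded = recode_as_encountered(c)
print(list(c.categories))        # original category order
print(list(recoded.categories))  # as-encountered category order
```

The values are identical before and after; only the category order, and therefore the integer codes, changes. That recoding pass over the full-length array is the extra work in question.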

This reordering may have been an outcome of fixing #8868, but if the as-encountered group order for sort=False can be achieved without reordering the categories, there would probably be a nice performance benefit.
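
One way to picture the idea: the as-encountered order of the groups can be derived from the integer codes alone, leaving the categories of the original `Categorical` untouched. The sketch below is only an illustration under that assumption. `encounter_order` is a hypothetical helper, not a pandas function, and it is not a proposal for how groupby should implement this.

```python
import numpy as np
import pandas as pd


def encounter_order(c: pd.Categorical) -> np.ndarray:
    """Return category positions in as-encountered order.

    Hypothetical helper for illustration; not part of pandas.
    """
    codes = c.codes
    # pd.unique keeps first-appearance order; -1 marks missing values.
    observed = pd.unique(codes[codes != -1])
    # Unobserved categories go last, in their existing category order.
    all_codes = np.arange(len(c.categories))
    missing = np.setdiff1d(all_codes, observed)
    return np.concatenate([observed, missing]).astype(np.intp)


c = pd.Categorical(["b", "b", "a", None, "c", "a"],
                   categories=["a", "b", "c", "z"])
order = encounter_order(c)
# Group labels in as-encountered order, taken from the unchanged categories.
group_labels = c.categories.take(order)
print(order.tolist())
print(list(group_labels))
print(list(c.categories))
```

Here the only per-row work is a single pass over the codes to find the unique values; the input keeps its original categories and codes.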

Labels: Categorical, Groupby, Performance
