CLN/PERF: group quantile #43510
Conversation
def blk_func(values: ArrayLike) -> ArrayLike:
    mask = isna(values)
    vals, inference = pre_processor(values)

    ncols = 1
    if vals.ndim == 2:
        ncols = vals.shape[0]
    shaped_labels = np.broadcast_to(
why is the broadcast necessary here?
The problem was that labels is 1-dimensional, but values can be 2-dimensional. I couldn't figure out a way to make lexsort broadcast naturally in that case. I think this broadcast is relatively cheap, though: np.broadcast_to returns views, so it shouldn't increase memory usage (and this op doesn't show up in a profile).
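As a minimal sketch of the memory claim (the shapes here are invented for illustration, not taken from the PR), np.broadcast_to returns a read-only, zero-strided view, so expanding the 1-D labels to match a 2-D values block allocates no new data:

```python
import numpy as np

# Hypothetical 1-D group labels and a 2-D values block; the shapes
# here are made up purely for illustration.
labels = np.array([0, 0, 1, 1, 2], dtype=np.intp)
ncols = 3

# np.broadcast_to returns a read-only view, not a copy: the new
# leading dimension gets stride 0, so every row aliases the same
# underlying 5-element buffer.
shaped_labels = np.broadcast_to(labels, (ncols, len(labels)))

assert shaped_labels.shape == (3, 5)
assert shaped_labels.strides[0] == 0   # zero-strided: no data copied
assert shaped_labels.base is not None  # still a view on `labels`
```

The zero stride is also why downstream access patterns can behave differently from a contiguous array, which is the concern raised below.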
The broadcast itself is definitely cheap. In some cases, though, I've later found that operations done on these broadcast ndarrays are surprisingly slow. But if you've profiled it and it's not an issue, then no complaints here.
Oh, interesting. The profile showed that all the time is spent in lexsort, so it's definitely possible that slow access on the broadcast array is slowing it down. I'll check how lexsort timing compares with this broadcast vs. a contiguous array.
Adding a copy after the broadcast doesn't seem to affect lexsort time in a profile.
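A rough way to set up that comparison (shapes and sizes are invented, not from the PR's benchmarks): the copy changes the memory layout but not the sort result, so the two lexsort calls can be timed against each other with any profiler or %timeit:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.standard_normal((4, 10_000))       # invented block shape
labels = rng.integers(0, 50, size=10_000)

shaped = np.broadcast_to(labels, vals.shape)  # zero-copy view
copied = np.ascontiguousarray(shaped)         # materialized copy

# np.lexsort uses the *last* key as the primary key, so this sorts
# each row by group label first, then by value within the label.
order_view = np.lexsort((vals, shaped))
order_copy = np.lexsort((vals, copied))

# Identical ordering either way; only the memory access pattern
# differs, which is exactly what a timing comparison would measure.
assert np.array_equal(order_view, order_copy)
```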
thanks @mzeitlin11
Inspired by an idea from @jbrockmendel in #43489 (comment).
Didn't end up seeing a huge perf improvement, though; for example, results on a wide frame:
Based on a quick profile, almost all time is spent in lexsort both on master and with this PR. Lexsorting the entire block at once seems to have performance comparable to going column by column (I'd guess more improvement would show up on a really wide frame). Regardless, I think moving this out of Cython is an improvement: there's no perf benefit to keeping these in Cython when they're just numpy functions, and it saves ~80 KB of compiled code.