PERF: GroupBy.quantile #51722
Conversation
qs : ndarray[float64_t]
    The quantile values to search for.
starts : ndarray[int64]
Do starts and ends need to be passed separately? On initial review I'm expecting that the groups are all monotonic in each array, so does starts[i] always equal ends[i-1] - 1?
lib.generate_slices returns a pair of ndarrays; I'd have to look at that more closely to comment on the why.
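A minimal numpy illustration of the question above, assuming contiguous sorted labels (this is not pandas internals, and the inclusive/exclusive convention of the real starts/ends is not asserted here):

```python
import numpy as np

# hypothetical sorted group labels, analogous to the ids in the sorted data
labels = np.array([0, 0, 1, 1, 1, 2])
ngroups = 3

# [start, end) offsets of each group within the sorted labels
starts = np.searchsorted(labels, np.arange(ngroups), side="left")   # [0 2 5]
ends = np.searchsorted(labels, np.arange(ngroups), side="right")    # [2 5 6]

# with no empty groups, starts[i] == ends[i - 1]; an empty group breaks that
# invariant, which would be one reason to keep both arrays around
assert (starts[1:] == ends[:-1]).all()
```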
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I'd lean away from the pure-Python/Cython hybrid approach; I think this looks great as is. Does the Boolean issue (#51424) hold this up with the Cython approach?
I don't think so, no.
@jbrockmendel - I'm good here if you're okay with the trade-offs you mentioned in the OP. If that's the case, can you merge main?
lgtm - could use a line in the whatsnew
lgtm
Thanks @jbrockmendel
The existing implementation does a lexsort over (values, ids) for each column, which gets very expensive. By using grouper._get_splitter()._sorted_data, we only sort by ids once, then cheaply iterate over groups and do group-by-group argsorts. This is roughly equivalent to the sketch below.
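A rough stand-in for the idea (not the PR's actual code), assuming plain numpy arrays of values and already-factorized integer group ids:

```python
import numpy as np

def group_quantile_sketch(values, ids, qs):
    # sort by group ids once (the analogue of _sorted_data)
    order = np.argsort(ids, kind="stable")
    sorted_vals = values[order]
    sorted_ids = ids[order]

    ngroups = int(ids.max()) + 1
    # [start, end) offsets of each group within the sorted data
    bounds = np.arange(ngroups)
    starts = np.searchsorted(sorted_ids, bounds, side="left")
    ends = np.searchsorted(sorted_ids, bounds, side="right")

    out = np.empty((ngroups, len(qs)), dtype=np.float64)
    for i in range(ngroups):
        grp = sorted_vals[starts[i]:ends[i]]
        # a cheap per-group sort/quantile instead of one big lexsort over (values, ids)
        out[i] = np.quantile(grp, qs) if len(grp) else np.nan
    return out
```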
I tried an implementation that used _python_apply_general and ripped out the cython group_quantile entirely, and found it had some rough edges with a) axis=1 and b) dtypes where GroupBy.quantile behaves differently from DataFrame.quantile (xref #51424); it also performed poorly as ngroups becomes large. With a large number of rows, though, it performs better than this PR. I speculate that np.percentile is doing something more efficiently than our cython group_quantile, but I haven't figured it out yet.
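For reference, a simplified sketch of that pure-Python flavor (hypothetical; it ignores the axis=1 and dtype edge cases mentioned above and assumes numeric columns):

```python
import pandas as pd

def quantile_via_apply(df, by, qs):
    # compute each group's quantiles with DataFrame.quantile instead of the
    # cython group_quantile kernel; `by` is a single column label here
    return df.groupby(by).apply(lambda g: g.drop(columns=by).quantile(qs))
```

Here qs would be a list of quantiles, e.g. [0.25, 0.5, 0.75], and the result comes back indexed by (group key, quantile).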
Some timings!
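The exact benchmark script isn't reproduced here; a minimal harness along these lines (the sizes and quantiles are assumptions) gives the flavor:

```python
import numpy as np
import pandas as pd
from time import perf_counter

# hypothetical grid; the PR's actual sizes are not reproduced here
for nrows in (10_000, 1_000_000):
    for ngroups in (100, 10_000):
        df = pd.DataFrame(
            {
                "key": np.random.randint(0, ngroups, size=nrows),
                "val": np.random.randn(nrows),
            }
        )
        gb = df.groupby("key")
        t0 = perf_counter()
        gb.quantile([0.25, 0.5, 0.75])
        print(f"nrows={nrows:>9,} ngroups={ngroups:>6,} {perf_counter() - t0:.3f}s")
```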
I did this for each of main, this PR, and the pure-Python implementation described above. The results:
On main I had to interrupt (with ctrl-z, not ctrl-c!) the largest cases.
This PR outperforms main in all cases. The pure-Python version outperforms this PR in small-ngroups and large-nrows cases, but suffers pretty dramatically in the opposite cases.
Potential caveats:
Potential downsides vs main:
We could plausibly keep multiple implementations and dispatch based on sizes, but I'm not sure groupby.quantile is important enough to really merit that. So which version to use really comes down to which sizes of cases we think are the most common.
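Purely as an illustration of that idea (nothing like this is in the PR), a size-based dispatch could look like:

```python
def choose_quantile_impl(nrows: int, ngroups: int) -> str:
    # hypothetical thresholds; real cutoffs would have to come from benchmarking
    if ngroups <= 100 and nrows >= 1_000_000:
        # few, large groups: the pure-Python / np.percentile path tends to win
        return "python_apply"
    # many groups: the cython group_quantile path scales better
    return "cython_group_quantile"
```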