
PERF: groupby.agg() regression #46505


Open
anmyachev opened this issue Mar 25, 2022 · 6 comments
Labels
Apply, Groupby, Performance

Comments

@anmyachev
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import string
import random

import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)

df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

def test2():
    for _ in range(3):
        start = time()
        result = df.groupby(3, sort=False).agg(
            col1=(2, _format),
            col2=(1, _format),
        )
        print(f"end: {time()-start}")

test2()

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 59.8.0
Cython : None
pytest : 7.0.0
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2022.01.0
gcsfs : None
matplotlib : 3.2.2
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.17.0
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.01.0
scipy : 1.8.0
sqlalchemy : 1.4.31
tables : 3.7.0
tabulate : None
xarray : 0.21.1
xlrd : 2.0.1
xlwt : None
zstandard : None

Prior Performance

For 1.3.5 - 4.9 sec
For 1.4.1 - 19.8 sec

anmyachev added the Needs Triage and Performance labels Mar 25, 2022
@rhshadrach
Member

rhshadrach commented Mar 25, 2022

Thanks for the report. On my laptop 1.3.x takes ~3s, main takes ~15s. I ran two git bisects, one failing if it took longer than 5s, one failing if it took longer than 10s.
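A threshold bisect like this can be automated with git bisect run and a small predicate script. The sketch below is one way to do it, not the exact driver used here; repro.py stands for the reproducer from the issue, and the 5-second threshold is one of the two mentioned above.

```python
import subprocess
import sys
import time


def slower_than(cmd, threshold_s):
    """Run cmd and report whether its wall-clock time exceeded threshold_s."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start > threshold_s


if __name__ == "__main__":
    # Exit non-zero ("bad" for git bisect run) when the repro is slow.
    sys.exit(1 if slower_than([sys.executable, "repro.py"], 5.0) else 0)
```

Invoked as, e.g., git bisect run python bisect_check.py after rebuilding pandas at each step.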

5s:

Author: jbrockmendel
Date: Fri Sep 10 17:17:41 2021 -0700

REF: remove libreduction.SeriesGrouper (#43505)

10s:

Author: Joris Van den Bossche
Date: Sun Jan 16 17:58:29 2022 +0100

BUG: correctly instantiate subclassed DataFrame/Series in groupby apply (#45363)

cc @jbrockmendel and @jorisvandenbossche

I will note that your other 4255 PRs are good @jbrockmendel, at least with regard to this issue.

rhshadrach added the Apply and Groupby labels and removed the Needs Triage label Mar 25, 2022
@jbrockmendel
Member

Hah, thanks @rhshadrach; a little bit of positive reinforcement goes a long way.

Not having looked at the function carefully, I'm assuming it's what we used to call the near-worst case, where the UDF does very little and all the time is spent iterating over the groups. I think in the original get-rid-of-libreduction PR I intentionally avoided using _constructor and __finalize__ to minimize the performance hit; IIRC these were later re-added in order to be Technically Correct.
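For context, a minimal illustration of what __finalize__ does: it propagates metadata such as attrs from one pandas object to another, and in the current group-iteration code path it runs once per chopped group.

```python
import pandas as pd

# Attach user-defined metadata to a frame.
df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["source"] = "sensor-1"

# __finalize__ copies metadata (attrs, flags) from `df` onto the new Series;
# this is the call that now happens for every group slice.
s = pd.Series([1, 2, 3]).__finalize__(df)
print(s.attrs)  # {'source': 'sensor-1'}
```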

Can you report some profiling results?

@jbrockmendel
Member

Profiling results on main vs 1.3.5

import string
import random

import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)

df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan


gb = df.groupby(3, sort=False)
func = {"col1": (2, _format), "col2": (1, _format)}
%prun -s cumtime gb.agg(**func)
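For anyone reproducing this outside IPython, an equivalent of the %prun magic can be sketched with cProfile directly (trivial stand-in workload below; substitute gb.agg(**func) from the snippet above):

```python
import cProfile
import io
import pstats

# Collect a profile around the workload of interest.
pr = cProfile.Profile()
pr.enable()
sum(x * x for x in range(10_000))  # stand-in for gb.agg(**func)
pr.disable()

# Print the top entries sorted by cumulative time, like `%prun -s cumtime`.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumtime").print_stats(10)
print(buf.getvalue())
```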

main

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   25.986   25.986 {built-in method builtins.exec}
        1    0.001    0.001   25.986   25.986 <string>:1(<module>)
        1    0.000    0.000   25.985   25.985 generic.py:874(aggregate)
        1    0.001    0.001   25.975   25.975 apply.py:151(agg)
        1    0.000    0.000   25.975   25.975 apply.py:453(agg_dict_like)
        1    0.000    0.000   25.943   25.943 apply.py:486(<dictcomp>)
      4/2    0.001    0.000   25.943   12.972 generic.py:257(aggregate)
        2    0.000    0.000   25.943   12.971 generic.py:312(_aggregate_multiple_funcs)
        2    0.002    0.001   25.941   12.970 groupby.py:1648(_python_agg_general)
        2    0.013    0.006   25.938   12.969 ops.py:1040(agg_series)
        2    1.982    0.991   25.859   12.930 ops.py:1082(_aggregate_series_pure_python)
  1000002    0.743    0.000   18.458    0.000 ops.py:1313(__iter__)
  1000000    1.878    0.000   17.631    0.000 ops.py:1335(_chop)
  1000004    2.796    0.000    6.994    0.000 generic.py:5849(__finalize__)
  1000000    2.478    0.000    5.822    0.000 managers.py:1997(get_slice)
  1000000    0.485    0.000    3.988    0.000 groupby.py:1651(<lambda>)
  1000000    0.657    0.000    3.503    0.000 <ipython-input-1-caf266d6801e>:16(_format)
  1000000    0.411    0.000    2.737    0.000 series.py:664(values)
  1000004    0.842    0.000    2.673    0.000 series.py:343(__init__)
  1000006    0.786    0.000    2.540    0.000 series.py:659(name)
  1000000    1.046    0.000    2.327    0.000 managers.py:2021(external_values)
  1000000    0.722    0.000    2.107    0.000 base.py:5385(_getitem_slice)
  1000008    0.471    0.000    1.753    0.000 common.py:1723(validate_all_hashable)
  1000008    1.364    0.000    1.743    0.000 generic.py:260(__init__)
  1000034    0.908    0.000    1.385    0.000 base.py:691(_simple_new)
  1000035    0.424    0.000    1.283    0.000 {built-in method builtins.all}

1.3.5

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.553    6.553 {built-in method builtins.exec}
        1    0.001    0.001    6.553    6.553 <string>:1(<module>)
        1    0.000    0.000    6.552    6.552 generic.py:964(aggregate)
        1    0.000    0.000    6.543    6.543 apply.py:143(agg)
        1    0.000    0.000    6.543    6.543 apply.py:405(agg_dict_like)
        1    0.000    0.000    6.512    6.512 apply.py:435(<dictcomp>)
      4/2    0.000    0.000    6.512    3.256 generic.py:225(aggregate)
        2    0.000    0.000    6.511    3.256 generic.py:278(_aggregate_multiple_funcs)
        2    0.002    0.001    6.509    3.254 groupby.py:1315(_python_agg_general)
        2    0.010    0.005    6.408    3.204 ops.py:1019(agg_series)
        2    0.001    0.000    6.330    3.165 ops.py:1069(_aggregate_series_fast)
        2    1.461    0.730    6.240    3.120 {method 'get_result' of 'pandas._libs.reduction.SeriesGrouper' objects}
  1000000    0.357    0.000    2.675    0.000 groupby.py:1318(<lambda>)
  1000000    0.487    0.000    2.317    0.000 <ipython-input-1-caf266d6801e>:16(_format)
  1000004    0.303    0.000    1.754    0.000 series.py:632(values)
  1000005    0.645    0.000    1.508    0.000 base.py:767(_engine)
  1000004    0.434    0.000    1.451    0.000 managers.py:1670(external_values)
  1000004    0.255    0.000    0.879    0.000 blocks.py:208(external_values)
4000375/3000291    0.446    0.000    0.663    0.000 {built-in method builtins.len}
  1000004    0.369    0.000    0.624    0.000 blocks.py:2114(external_values)
  1000000    0.526    0.000    0.596    0.000 managers.py:1712(set_values)
  1000008    0.230    0.000    0.347    0.000 base.py:4405(_get_engine_target)
  1000066    0.217    0.000    0.291    0.000 base.py:794(__len__)
  2001531    0.256    0.000    0.256    0.000 {built-in method builtins.isinstance}
  1000022    0.138    0.000    0.138    0.000 managers.py:1619(_block)
  1000056    0.117    0.000    0.117    0.000 base.py:4379(_values)

So, consistent with the bisect results, the big changes are the removal of the libreduction code path and the added __finalize__ calls.

  1. The perf hit we took by removing libreduction was expected and judged to be worth the improved correctness.
  2. We could plausibly get away without __finalize__ in many cases. I'm not sure how we could identify those cases, cc @jorisvandenbossche ?
  3. Looks like calls to external_values are about 60% slower than they used to be, maybe we could claw some of that back.

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 4, 2023

Timing #46505 (comment) locally on main, __finalize__ takes around ~35% of the total time for me, and about 60% of that finalize time is due to the Series name validation.
That part is something we control ourselves, and it is not there for the benefit of subclasses (it's Series itself that has "name" in _metadata).

Changing "name" to "_name" reduces the finalize part to ~20% of the overall time, according to the profiler.

The Python profiler can sometimes give a somewhat distorted picture, so confirming by actually timing this example: with current main I see ~13-14s, with "name" changed to "_name" I get ~11-12s, and with finalize completely removed from _chop I get ~9-9.5s.
So finalize is certainly a significant part, but this specific example is still slow regardless.
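As a rough illustration of the name-validation cost being discussed: the public name setter runs hashability validation on every assignment, while the private _name attribute (what the proposed change assigns inside __finalize__) is a plain attribute write. This is a micro-benchmark sketch, not the actual patch, and the numbers are machine-dependent.

```python
import timeit

import pandas as pd

s = pd.Series([1, 2, 3])

# Public setter: goes through the `name` property, which validates the value.
t_public = timeit.timeit(lambda: setattr(s, "name", "x"), number=50_000)

# Private attribute: skips validation but yields the same observable name.
t_private = timeit.timeit(lambda: setattr(s, "_name", "x"), number=50_000)

print(f"name: {t_public:.3f}s  _name: {t_private:.3f}s")
```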

We could consider inspecting whether we have a subclass or not (once), and then only call finalize if that's the case. That would avoid it for plain pandas DataFrames. But strictly speaking, even for plain pandas objects finalize does things a user could rely on in the UDF (flags and attrs), although that might be quite unlikely. EDIT: I think additionally checking for those attributes is what @jbrockmendel is doing in #51109
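A sketch of such a check (skip_finalize is a hypothetical helper, not pandas internals; it treats an exact pandas class with empty attrs and default flags as the skippable case):

```python
import pandas as pd


def skip_finalize(obj) -> bool:
    """Return True when metadata propagation could plausibly be skipped:
    the object is a plain (non-subclassed) DataFrame/Series, carries no
    attrs, and uses the default flags."""
    return (
        type(obj) in (pd.DataFrame, pd.Series)
        and not obj.attrs
        and obj.flags.allows_duplicate_labels
    )


df = pd.DataFrame({"a": [1]})
print(skip_finalize(df))  # True: plain frame, nothing to propagate

df.attrs["origin"] = "test"
print(skip_finalize(df))  # False: attrs must be carried through
```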

Will check if the "name"->"_name" seems a feasible change without breaking things.

@jorisvandenbossche
Member

Opened #51784 to see what the full test suite thinks of this idea.

@jbrockmendel
Member

The name->_name thing is a good catch!
