
Boolean operations in groupby objects are extremely slow compared to numpy counterpart. #15435


Closed
dsevero opened this issue Feb 17, 2017 · 3 comments · Fixed by #19722
Labels: Groupby, Performance (memory or execution speed)

Comments


dsevero commented Feb 17, 2017

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 20, 10000),
    "b": np.random.randint(0, 20, 10000),
    "c": np.random.choice([True, False], 10000)
})

%timeit -n 100 df.groupby(["a", "b"])["c"].any()
%timeit -n 100 df.groupby(["a", "b"])["c"].sum().astype(bool)
100 loops, best of 3: 40.9 ms per loop
100 loops, best of 3: 1.46 ms per loop

%timeit -n 10000 np.random.choice([True, False], 10000).any()
%timeit -n 10000 np.random.choice([True, False], 10000).sum().astype(bool)
10000 loops, best of 3: 83.7 µs per loop
10000 loops, best of 3: 106 µs per loop

Problem description

The issue here is that the any method on groupby objects seems to be freakishly slow. It is actually faster to sum all the boolean values and cast back with .astype(bool). In numpy the two operations have similar benchmarks, and the version using any is actually the faster one.
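
For reference, a quick sanity check (reusing the df built above; by_any and by_sum are just local names for this snippet) that the sum-and-cast workaround really produces the same result as any:

# The workaround should match .any() group for group
by_any = df.groupby(["a", "b"])["c"].any()
by_sum = df.groupby(["a", "b"])["c"].sum().astype(bool)
assert by_any.equals(by_sum)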

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: None
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: 2.45.0
pandas_datareader: None

jorisvandenbossche (Member)

@daniel-severo The reason for this difference is that for sum we have a specialized groupby version (in cython), and we don't have this for any. So in the case of any, the function is generally applied individually on each group, making it a lot slower.
But, if you or someone would be interested, I don't think it would be too hard to make such a specialized groupby version for any as well.
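
A rough way to see that per-group overhead, reusing the df from the report: apply forces the Python-level per-group path, while sum hits the specialized cython kernel (this is only an illustration, not what any literally calls under the hood).

g = df.groupby(["a", "b"])["c"]

# cython fast path: one specialized pass over the data
%timeit -n 100 g.sum()

# Python-level fallback: the function is called once per group
%timeit -n 100 g.apply(lambda s: s.any())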

jorisvandenbossche added the Groupby and Performance labels Feb 17, 2017
jreback added this to the Next Major Release milestone Feb 17, 2017

dsevero commented Feb 17, 2017

I see. I'll take a tackle at it :)

jorisvandenbossche (Member)

@daniel-severo The similar cython methods are located here: https://github.com/pandas-dev/pandas/blob/master/pandas/src/algos_groupby_helper.pxi.in
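
Not the actual cython code, just a numpy sketch of the kind of single-pass kernel such a helper implements; values, codes (the integer group labels) and ngroups here are hypothetical inputs mirroring what the existing kernels in that file receive.

import numpy as np

def group_any(values, codes, ngroups):
    # OR each value into its group's slot in one pass over the data;
    # a cython kernel would run the same loop without numpy dispatch overhead
    out = np.zeros(ngroups, dtype=bool)
    np.logical_or.at(out, codes, values)
    return out

# e.g. group_any(np.array([True, False, False]), np.array([0, 0, 1]), 2)
# -> array([ True, False])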

jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
WillAyd mentioned this issue Feb 16, 2018
jreback modified the milestones: Next Major Release, 0.23.0 Feb 27, 2018