
Boolean operations in groupby objects are extremely slow compared to numpy counterpart. #15435


Closed
dsevero opened this issue Feb 17, 2017 · 3 comments · Fixed by #19722
Labels: Groupby, Performance (memory or execution speed)

Comments


dsevero commented Feb 17, 2017

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 20, 10000),
    "b": np.random.randint(0, 20, 10000),
    "c": np.random.choice([True, False], 10000)
})

%timeit -n 100 df.groupby(["a", "b"])["c"].any()
%timeit -n 100 df.groupby(["a", "b"])["c"].sum().astype(bool)
100 loops, best of 3: 40.9 ms per loop
100 loops, best of 3: 1.46 ms per loop

%timeit -n 10000 np.random.choice([True, False], 10000).any()
%timeit -n 10000 np.random.choice([True, False], 10000).sum().astype(bool)
10000 loops, best of 3: 83.7 µs per loop
10000 loops, best of 3: 106 µs per loop

Problem description

The issue here is that the any method on groupby objects seems to be freakishly slow. It is actually faster to sum all the boolean values and cast back with .astype(bool). In numpy the two operations have similar benchmarks, and the version using any is actually the faster one.
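
For reference, a quick sanity check (reusing the df built above; by_any and by_sum are just local names for this snippet) that the sum-and-cast workaround really produces the same result as any:

# The workaround should match .any() group for group
by_any = df.groupby(["a", "b"])["c"].any()
by_sum = df.groupby(["a", "b"])["c"].sum().astype(bool)
assert by_any.equals(by_sum)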

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: None
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: 2.45.0
pandas_datareader: None

jorisvandenbossche (Member)

@daniel-severo The reason for this difference is that for sum we have a specialized groupby version (in cython), and we don't have this for any. So in the case of any, the function is generally applied individually on each group, making it a lot slower.
But, if you or someone would be interested, I don't think it would be too hard to make such a specialized groupby version for any as well.
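
A rough way to see that per-group overhead, reusing the df from the report: apply forces the Python-level per-group path, while sum hits the specialized cython kernel (this is only an illustration, not what any literally calls under the hood).

g = df.groupby(["a", "b"])["c"]

# cython fast path: one specialized pass over the data
%timeit -n 100 g.sum()

# Python-level fallback: the function is called once per group
%timeit -n 100 g.apply(lambda s: s.any())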

jorisvandenbossche added the Groupby and Performance labels Feb 17, 2017
jreback added this to the Next Major Release milestone Feb 17, 2017

dsevero commented Feb 17, 2017

I see. I'll take a tackle at it :)

jorisvandenbossche (Member)

@daniel-severo The similar cython methods are located here: https://github.com/pandas-dev/pandas/blob/master/pandas/src/algos_groupby_helper.pxi.in
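
Not the actual cython code, just a numpy sketch of the kind of single-pass kernel such a helper implements; values, codes (the integer group labels) and ngroups here are hypothetical inputs mirroring what the existing kernels in that file receive.

import numpy as np

def group_any(values, codes, ngroups):
    # OR each value into its group's slot in one pass over the data;
    # a cython kernel would run the same loop without numpy dispatch overhead
    out = np.zeros(ngroups, dtype=bool)
    np.logical_or.at(out, codes, values)
    return out

# e.g. group_any(np.array([True, False, False]), np.array([0, 0, 1]), 2)
# -> array([ True, False])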

jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
WillAyd mentioned this issue Feb 16, 2018
jreback modified the milestones: Next Major Release, 0.23.0 Feb 27, 2018