
DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes #25397


Closed
joseortiz3 opened this issue Feb 21, 2019 · 3 comments
Labels: Groupby, Performance


joseortiz3 (Contributor) commented Feb 21, 2019

groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.

Consider carefully the example below, where a DataFrame has two columns, c1 and c2. The number of unique values in the c1 column (regardless of its dtype) increases the runtime of first if and only if c2 is a categorical column.

import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols=['c2']):
    # Builds a DataFrame df with columns c1 and c2; the columns listed in
    # cat_cols are cast to the categorical dtype.  Times how long grouping
    # by c1 and calling nth(0) versus first() takes.
    global df
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1': np.arange(0, 10000) % N_CATEGORIES,
        'c2': np.arange(0, 10000)
        })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 = df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

print(results_c2_cat)
print(results_c2_not_cat)

The results are

>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382

As shown, when the column c2 is categorical, the runtime of first grows rapidly with the number of unique values in column c1 (and thus the number of groups), whereas nth(0) stays roughly constant.

first and last both suffer from this problem.
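
A minimal check of last(), reusing the test() helper above to rebuild df with 10000 groups and a categorical c2, and timing it against its nth(-1) counterpart, would look something like:

# Rebuild the global df with N_CATEGORIES=10000 and c2 categorical,
# then compare last() against nth(-1).
test(10000, cat_cols=['c2'])
t_last = timeit.timeit("df.groupby(['c1']).last()", setup="from __main__ import df", number=1)
t_nth_last = timeit.timeit("df.groupby(['c1']).nth(-1)", setup="from __main__ import df", number=1)
print(t_last, t_nth_last)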

(see also #19283 and #19598)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

jreback (Contributor) commented Feb 21, 2019

first/last are implemented in cython; cats are converted to object, hence the slowdown.
nth is basically an analytical solution, so it will scale better (also, first/last don't fully handle all dtypes).

not averse to just having first/last call nth
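
A rough sketch of a user-side workaround that follows from this explanation, assuming the slowdown really is triggered by the categorical dtype itself: replace categorical columns with their integer codes so first() stays on the fast cython path, then rebuild the categoricals on the result. The helper name groupby_first_via_codes is made up here, and note the NaN caveat in the comments:

import pandas as pd

def groupby_first_via_codes(df, key):
    # Hypothetical helper: swap categorical columns for their integer codes,
    # run the cython first(), then restore the categorical dtype.
    # Caveat: codes use -1 for NaN, so unlike first() this does not skip
    # missing values when picking the first row of each group.
    cat_cols = [c for c in df.columns if str(df[c].dtype) == 'category' and c != key]
    categories = {c: df[c].cat.categories for c in cat_cols}
    tmp = df.copy()
    for c in cat_cols:
        tmp[c] = tmp[c].cat.codes
    out = tmp.groupby(key).first()
    for c in cat_cols:
        out[c] = pd.Categorical.from_codes(out[c], categories=categories[c])
    return out

# e.g. groupby_first_via_codes(df, 'c1') on the benchmark DataFrame above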

mroeschke added the Groupby and Performance labels on Feb 22, 2019
joseortiz3 (Contributor, Author) commented Apr 6, 2019

nth has problems handling missing values (#26011)

As a workaround, here is a function that you can rely on to get the job done fast and correctly:

def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion.
    Meanwhile groupby.nth(0) has a bug with missing values. 
    This takes care of both problems.
    Issues #25397 and #26011"""
    # Get rid of missing values that break nth.
    df = df[~(df[groupby].isnull())]
    # Use nth instead of first for speed.
    return df.groupby(groupby).nth(0)
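
A usage sketch, assuming the df built by the benchmark script above (so the group key is 'c1'):

# Hypothetical usage with the benchmark DataFrame from above
result = groupby_first_fast(df, 'c1')
print(result.head())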

Edit: Didn't mean to push "close" button.

jbrockmendel (Member) commented

Looks like this got fixed by #52120, closing.
