
DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes #25397


Closed
joseortiz3 opened this issue Feb 21, 2019 · 3 comments
Labels: Groupby, Performance


joseortiz3 (Contributor) commented Feb 21, 2019

groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.

Consider carefully the example below, where a DataFrame has two columns, c1 and c2. The number of unique values in the c1 column (regardless of its dtype) increases the runtime of first if and only if c2 is a categorical column.

import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols=['c2']):
    # Builds a DataFrame df with columns c1 and c2; the columns listed in
    # cat_cols are cast to the categorical dtype.  Times how long grouping
    # by c1 and calling nth(0) versus first() takes.
    global df
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1': np.arange(0, 10000) % N_CATEGORIES,
        'c2': np.arange(0, 10000)
        })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 = df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

print(results_c2_cat)
print(results_c2_not_cat)

The results are

>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382

As shown, when the column c2 is categorical, the runtime of first grows rapidly with the number of unique values in column c1 (and thus the number of groups), whereas nth(0) stays roughly constant.

first and last both suffer from this problem.
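
A minimal check of last(), reusing the test() helper above to rebuild df with 10000 groups and a categorical c2, and timing it against its nth(-1) counterpart, would look something like:

# Rebuild the global df with N_CATEGORIES=10000 and c2 categorical,
# then compare last() against nth(-1).
test(10000, cat_cols=['c2'])
t_last = timeit.timeit("df.groupby(['c1']).last()", setup="from __main__ import df", number=1)
t_nth_last = timeit.timeit("df.groupby(['c1']).nth(-1)", setup="from __main__ import df", number=1)
print(t_last, t_nth_last)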

(see also #19283 and #19598)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

jreback (Contributor) commented Feb 21, 2019

first/last are implemented in cython; cats are converted to object, hence the slowdown.
nth is basically an analytical solution, so it will scale better (also, first/last don't fully handle all dtypes).

not averse to just having first/last call nth
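
A rough sketch of a user-side workaround that follows from this explanation, assuming the slowdown really is triggered by the categorical dtype itself: replace categorical columns with their integer codes so first() stays on the fast cython path, then rebuild the categoricals on the result. The helper name groupby_first_via_codes is made up here, and note the NaN caveat in the comments:

import pandas as pd

def groupby_first_via_codes(df, key):
    # Hypothetical helper: swap categorical columns for their integer codes,
    # run the cython first(), then restore the categorical dtype.
    # Caveat: codes use -1 for NaN, so unlike first() this does not skip
    # missing values when picking the first row of each group.
    cat_cols = [c for c in df.columns if str(df[c].dtype) == 'category' and c != key]
    categories = {c: df[c].cat.categories for c in cat_cols}
    tmp = df.copy()
    for c in cat_cols:
        tmp[c] = tmp[c].cat.codes
    out = tmp.groupby(key).first()
    for c in cat_cols:
        out[c] = pd.Categorical.from_codes(out[c], categories=categories[c])
    return out

# e.g. groupby_first_via_codes(df, 'c1') on the benchmark DataFrame above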

mroeschke added the Groupby and Performance labels on Feb 22, 2019
joseortiz3 (Contributor, Author) commented Apr 6, 2019

nth has problems handling missing values (#26011)

As a workaround, here is a function that you can rely on to get the job done fast and correctly:

def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion.
    Meanwhile groupby.nth(0) has a bug with missing values. 
    This takes care of both problems.
    Issues #25397 and #26011"""
    # Get rid of missing values that break nth.
    df = df[~(df[groupby].isnull())]
    # Use nth instead of first for speed.
    return df.groupby(groupby).nth(0)
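
A usage sketch, assuming the df built by the benchmark script above (so the group key is 'c1'):

# Hypothetical usage with the benchmark DataFrame from above
result = groupby_first_fast(df, 'c1')
print(result.head())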

Edit: Didn't mean to push "close" button.

jbrockmendel (Member) commented

Looks like this got fixed by #52120, closing.
