groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.
Consider the example below, where a dataframe has two columns, c1 and c2. The number of unique values in the c1 column (regardless of its datatype) increases the runtime of first when, and only when, c2 is a categorical column.
```python
import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols=['c2']):
    # Creates dataframe df with grouping column c1 and value column c2;
    # the columns listed in cat_cols are converted to categorical.
    # Times how long grouping by c1 and calling nth(0) and first() takes.
    global df
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1': np.arange(0, 10000) % N_CATEGORIES,
        'c2': np.arange(0, 10000)
    })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 = df.groupby(['c1']).nth(0, dropna='all')",
                          setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()",
                            setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1, 10, 100, 1000, 10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories],
                              index=test_N_categories,
                              columns=['nth', 'first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories],
                                  index=test_N_categories,
                                  columns=['nth', 'first'])

print(results_c2_cat)
print(results_c2_not_cat)
```
As shown, when the column c2 is categorical, the runtime of first grows very rapidly with the number of unique values in the column c1 (and thus the "width" of the groupby). first and last both suffer from this problem (see also #19283 and #19598).
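One way to see the effect directly, assuming the results_c2_cat and results_c2_not_cat frames from the benchmark script above are in scope, is to compute the slowdown factor of first() relative to nth(0):

```python
# Slowdown factor of first() vs nth(0) for each tested number of c1 categories.
# Per the report above, the ratio grows with the category count only when c2 is categorical.
print(results_c2_cat['first'] / results_c2_cat['nth'])
print(results_c2_not_cat['first'] / results_c2_not_cat['nth'])
```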
first/last are implemented in Cython; categoricals are converted to object, hence the slowdown.
nth is basically an analytical solution, so it will scale better (also, first/last don't fully handle all dtypes).
Not averse to just having first/last call nth.
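If the conversion explanation above is right, one mitigation is to keep first() on the fast path by replacing the categorical value column with its integer codes before grouping, then rebuilding the categorical afterwards. This is only a sketch under that assumption (the column names mirror the benchmark above and are not part of any pandas API), and note the caveat on missing values in the comments:

```python
import pandas as pd
import numpy as np

# Illustrative data mirroring the benchmark above.
df = pd.DataFrame({'c1': np.arange(10000) % 1000,
                   'c2': pd.Categorical(np.arange(10000))})

# Swap the categorical column for its integer codes so groupby.first() can use
# the Cython path, then rebuild the categorical from the aggregated codes.
# Caveat: missing values become code -1 here, so NaN handling differs from
# calling first() on the raw categorical (which skips NaNs per column).
tmp = df.assign(c2=df['c2'].cat.codes)
out = tmp.groupby('c1').first()
out['c2'] = pd.Categorical.from_codes(out['c2'], categories=df['c2'].cat.categories)
```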
As a workaround, here is a function that you can rely on to get the job done fast and correctly:
```python
def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion. Meanwhile groupby.nth(0)
    has a bug with missing values. This takes care of both problems.
    Issues #25397 and #26011."""
    # Get rid of missing values that break nth.
    df = df[~(df[groupby].isnull())]
    # Use nth instead of first for speed.
    return df.groupby(groupby).nth(0)
```
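For example (column names follow the benchmark script above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': np.arange(10000) % 1000,
                   'c2': pd.Categorical(np.arange(10000))})

fast = groupby_first_fast(df, 'c1')   # nth(0)-based, avoids the slow path
slow = df.groupby('c1').first()       # slow when c2 is categorical

# With no NaNs in the value columns both return the first row of each group.
# Keep in mind that first() skips missing values per column while nth(0) takes
# the literal first row, so the two can differ when value columns contain NaN,
# and result dtypes may differ across pandas versions.
print(fast.head())
print(slow.head())
```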
Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None