groupby().first() much slower with a str column present in the data. #19283
Comments
On 0.21.1:

In [6]: %timeit g1.first()
2.63 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit g2.first()
7.96 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: pd.__version__
Out[8]: '0.21.1'

On 0.22.0:

In [8]: g1 = msgs[['date','orderid','qty']].groupby(['date','orderid'])
In [9]: g2 = msgs.groupby(['date', 'orderid'])
In [10]: %timeit g1.first()
2.49 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit g2.first()
5.07 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: pd.__version__
Out[12]: '0.22.0'

Are there any other differences in your environments? 0.22.0 only contained one change relative to 0.21.1, and I'd be surprised if it had an impact on this.

Edit: Whoops, I'm apparently unable to read. I missed the timings above.
It might be worth removing your line "I'm unable to reproduce your slowdown". People like me might stop reading right there, assuming the OP was a crackpot. ;-)
Thanks @jdmarino - seems the nth-based path avoids the slowdown:

In [18]: g2 = msgs.groupby(['date', 'orderid'])
In [19]: %timeit g2.nth(0)
64.2 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: from pandas.testing import assert_frame_equal
In [17]: assert_frame_equal(g2.nth(0), g2.first())

@TomAugspurger - looks like the code here is to blame (pandas/core/groupby.py, line 2147 in 65d1b62). Forcing the Cython path shows the object kernel raising:
def fail():
raise Exception
g2 = msgs.groupby(['date', 'orderid'])
g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-220c58451090> in <module>()
----> 1 g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
3607 min_count=-1):
3608 new_items, new_blocks = self._cython_agg_blocks(
-> 3609 how, alt=alt, numeric_only=numeric_only, min_count=min_count)
3610 return self._wrap_agged_blocks(new_items, new_blocks)
3611
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
3651 try:
3652 result, _ = self.grouper.aggregate(
-> 3653 block.values, how, axis=agg_axis, min_count=min_count)
3654 except NotImplementedError:
3655 # generally if we have numeric_only=False
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in aggregate(self, values, how, axis, min_count)
2312 def aggregate(self, values, how, axis=0, min_count=-1):
2313 return self._cython_operation('aggregate', values, how, axis,
-> 2314 min_count=min_count)
2315
2316 def transform(self, values, how, axis=0):
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_operation(self, kind, values, how, axis, min_count)
2268 result = self._aggregate(
2269 result, counts, values, labels, func, is_numeric,
-> 2270 is_datetimelike, min_count)
2271 elif kind == 'transform':
2272 result = _maybe_fill(np.empty_like(values, dtype=out_dtype),
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _aggregate(self, result, counts, values, comp_ids, agg_func, is_numeric, is_datetimelike, min_count)
2330 min_count)
2331 else:
-> 2332 agg_func(result, counts, values, comp_ids, min_count)
2333
2334 return result
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in wrapper(*args, **kwargs)
2169
2170 def wrapper(*args, **kwargs):
-> 2171 return f(afunc, *args, **kwargs)
2172
2173 # need to curry our sub-function
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in <lambda>(func, a, b, c, d, e)
2113 'first': {
2114 'name': 'group_nth',
-> 2115 'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)
2116 },
2117 'last': 'group_last',
pandas/_libs/groupby.pyx in pandas._libs.groupby.group_nth_object()
TypeError: group_nth_object() takes exactly 5 positional arguments (6 given)
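To make the fast/slow distinction concrete, here is a minimal sketch of the setup under discussion. The frame below is illustrative only; the OP's `msgs` data is never shown in the thread, so the column names and values here are made up:

```python
import pandas as pd

# Illustrative frame; the OP's `msgs` data is not shown in the thread.
df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'orderid': [1, 1, 2],
                   'qty': [10, 20, 30],
                   'note': ['a', 'b', 'c']})  # the str column that triggers the slow path

g = df.groupby(['date', 'orderid'])

# On 0.22.0, g.first() fell back to a slow pure-Python aggregation once an
# object column was present, while g.nth(0) stayed on the fast Cython path.
print(g.first())
```

Note that the `nth(0)`/`first()` equivalence checked above is version-specific: in much later pandas releases `nth` was redefined to return the selected rows with their original index, so it no longer compares equal to `first()` with `assert_frame_equal`.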
Thanks @chris-b1. I added a …
Hi, I would like to add that we are having the same issue. We have a block of code that groups by and then calls first() as follows: […]

Under pandas 0.21 it takes less than 3 seconds.
The same performance issue also affects GroupBy.last.
Anybody interested in submitting a fix? It'll involve touching some Cython, but the only change is adding a min_count argument to the object-dtype kernel (pandas/_libs/groupby.pyx, line 35 in 238499a).
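For readers who don't want to read the Cython, here is a rough pure-Python sketch of what the kernel computes and where `min_count` fits in. The function name and structure are illustrative only, not the actual `group_nth_object` implementation:

```python
import numpy as np

def group_nth(values, labels, n=1, min_count=-1):
    """Pure-Python sketch (NOT the real Cython kernel) of what pandas'
    group_nth computes: the n-th non-null value per group, masked to NaN
    when a group has fewer than min_count non-null entries."""
    ngroups = int(labels.max()) + 1
    out = np.full(ngroups, np.nan, dtype=object)
    nonnull = np.zeros(ngroups, dtype=np.int64)
    for val, lab in zip(values, labels):
        if lab < 0 or val != val:      # skip unassigned rows and NaNs
            continue
        nonnull[lab] += 1
        if nonnull[lab] == n:          # record the n-th non-null value
            out[lab] = val
    if min_count > 0:                  # this is the parameter the object
        out[nonnull < min_count] = np.nan  # kernel was missing on 0.22.0
    return out
```

The traceback above shows why this matters: the dispatcher passes six arguments (the last being `min_count`), but the object-dtype kernel only accepted five, so the call raised and pandas fell back to a slow pure-Python path.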
A friendly question: Has this problem been resolved, or does the recommended solution remain to use .nth(0)? Tears fall down my face due to how slow .first() and .last() are. I have the latest Anaconda version installed, and pd.__version__ returns '0.22.0'.
This is fixed in the soon-ish to be released 0.23.0.
Opened an issue here: #20657

I also find that .last() does not perform as expected. Example:

pv = pd.DataFrame([[179293473, '2016-06-01 00:00:03.549745', 'http://www.dr.dk/nyheder/', 39169523],
                   [179293473, '2016-06-01 00:04:22.346018', 'https://www.information.dk/indland/2016/05/hvert-tredje-offer-naar-anmelde-voldtaegt-tide', 39125224],
                   [179773461, '2016-06-01 22:13:16.588146', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:14:04.059781', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:16:37.230587', np.nan, 31658124],
                   [179773461, '2016-06-01 22:23:09.847149', 'https://www.google.dk', 32718401],
                   [179773461, '2016-06-01 22:23:55.158929', np.nan, 32718401],
                   [179773461, '2016-06-01 22:27:00.857224', np.nan, 32718401]],
                  columns=['SessionID', 'PageTime', 'ReferrerURL', 'PageID'])

Now, when I run:

pv.groupby('SessionID').last()

I get:

SessionID  PageTime                    ReferrerURL                                        PageID
179293473  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...  39125224
179773461  2016-06-01 22:27:00.857224  https://www.google.dk                              32718401

When, in fact, I was expecting the same result as obtained from:

pv.groupby('SessionID').nth(-1)

SessionID  PageID    PageTime                    ReferrerURL
179293473  39125224  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...
179773461  32718401  2016-06-01 22:27:00.857224  NaN

And while we are at .nth(), why does it mix up my column order?
I think there's already an open issue for that, if not please make one.
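The difference the commenter observed is, at least in part, documented behavior rather than a bug: first()/last() are per-column reductions that skip nulls, while nth(-1) selects the literal final row of each group. A small illustration (with data simplified from the example above):

```python
import numpy as np
import pandas as pd

# last() is a per-column reduction that skips nulls;
# nth(-1) selects the literal final row of each group.
pv = pd.DataFrame({'SessionID': [1, 1, 1],
                   'ReferrerURL': ['u1', 'u2', np.nan],
                   'PageID': [10, 20, 30]})
g = pv.groupby('SessionID')

print(g.last())    # ReferrerURL -> 'u2' (last non-null), PageID -> 30
print(g.nth(-1))   # the final row itself: ReferrerURL is NaN
```

So for the session whose final page view has a missing ReferrerURL, last() reaches back to the previous non-null URL while nth(-1) faithfully reports the NaN.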
(I copied this code from a jupyter notebook)
Problem description
I find that the presence of a text column in a dataframe's data (i.e., a column that is not a groupby key) dramatically slows down groupby().first() in version 0.22.0 (but not 0.21.1), by two orders of magnitude: the operation takes 59 ms without a text column present in the data and 10.6 s with one. (The problem is not limited to this kind of made-up data; I discovered it in my own work after upgrading pandas.)
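The notebook code itself was not captured above; a synthetic reproduction along these lines shows the shape of the comparison (column names, sizes, and values here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook code, which was not captured above;
# column names and sizes are made up.
n = 200_000
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': rng.randint(0, 1000, n),
                   'val': rng.randn(n),
                   'txt': rng.choice(['a', 'b', 'c'], n)})

g_numeric = df[['key', 'val']].groupby('key')  # fast on both 0.21.1 and 0.22.0
g_mixed = df.groupby('key')                    # .first() ~100x slower on 0.22.0

# In IPython:
# %timeit g_numeric.first()
# %timeit g_mixed.first()
```

The only difference between the two groupbys is the presence of the object-dtype `txt` column, which is what pushes 0.22.0 onto the slow path.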
Expected Output
When I run the same code under 0.21.1 the times are 55 ms and 67 ms.
Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None