groupby().first() much slower with a str column present in the data. #19283
Comments
On 0.21.1:

In [6]: %timeit g1.first()
2.63 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit g2.first()
7.96 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: pd.__version__
Out[8]: '0.21.1'

On 0.22.0:

In [8]: g1 = msgs[['date','orderid','qty']].groupby(['date','orderid'])
In [9]: g2 = msgs.groupby(['date', 'orderid'])
In [10]: %timeit g1.first()
2.49 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit g2.first()
5.07 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: pd.__version__
Out[12]: '0.22.0'

Are there any other differences in your environments? 0.22.0 only contained one change relative to 0.21.1, and I'd be surprised if it had an impact on this.

Edit: Whoops, I'm apparently unable to read. I missed the timings above.
It might be worth removing your line "I'm unable to reproduce your slowdown". People like me might stop reading right there, assuming the OP was a crackpot. ;-)
Thanks @jdmarino - seems the nth-based path avoids the slowdown:

In [18]: g2 = msgs.groupby(['date', 'orderid'])
In [19]: %timeit g2.nth(0)
64.2 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: from pandas.testing import assert_frame_equal
In [17]: assert_frame_equal(g2.nth(0), g2.first())

@TomAugspurger - looks like the code here is to blame (pandas/core/groupby.py, line 2147 in 65d1b62). Forcing the Cython path shows the object kernel raising:
def fail():
raise Exception
g2 = msgs.groupby(['date', 'orderid'])
g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-220c58451090> in <module>()
----> 1 g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
3607 min_count=-1):
3608 new_items, new_blocks = self._cython_agg_blocks(
-> 3609 how, alt=alt, numeric_only=numeric_only, min_count=min_count)
3610 return self._wrap_agged_blocks(new_items, new_blocks)
3611
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
3651 try:
3652 result, _ = self.grouper.aggregate(
-> 3653 block.values, how, axis=agg_axis, min_count=min_count)
3654 except NotImplementedError:
3655 # generally if we have numeric_only=False
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in aggregate(self, values, how, axis, min_count)
2312 def aggregate(self, values, how, axis=0, min_count=-1):
2313 return self._cython_operation('aggregate', values, how, axis,
-> 2314 min_count=min_count)
2315
2316 def transform(self, values, how, axis=0):
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_operation(self, kind, values, how, axis, min_count)
2268 result = self._aggregate(
2269 result, counts, values, labels, func, is_numeric,
-> 2270 is_datetimelike, min_count)
2271 elif kind == 'transform':
2272 result = _maybe_fill(np.empty_like(values, dtype=out_dtype),
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _aggregate(self, result, counts, values, comp_ids, agg_func, is_numeric, is_datetimelike, min_count)
2330 min_count)
2331 else:
-> 2332 agg_func(result, counts, values, comp_ids, min_count)
2333
2334 return result
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in wrapper(*args, **kwargs)
2169
2170 def wrapper(*args, **kwargs):
-> 2171 return f(afunc, *args, **kwargs)
2172
2173 # need to curry our sub-function
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in <lambda>(func, a, b, c, d, e)
2113 'first': {
2114 'name': 'group_nth',
-> 2115 'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)
2116 },
2117 'last': 'group_last',
pandas/_libs/groupby.pyx in pandas._libs.groupby.group_nth_object()
TypeError: group_nth_object() takes exactly 5 positional arguments (6 given)
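To make the fast/slow distinction concrete, here is a minimal sketch of the setup under discussion. The frame below is illustrative only; the OP's `msgs` data is never shown in the thread, so the column names and values here are made up:

```python
import pandas as pd

# Illustrative frame; the OP's `msgs` data is not shown in the thread.
df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'orderid': [1, 1, 2],
                   'qty': [10, 20, 30],
                   'note': ['a', 'b', 'c']})  # the str column that triggers the slow path

g = df.groupby(['date', 'orderid'])

# On 0.22.0, g.first() fell back to a slow pure-Python aggregation once an
# object column was present, while g.nth(0) stayed on the fast Cython path.
print(g.first())
```

Note that the `nth(0)`/`first()` equivalence checked above is version-specific: in much later pandas releases `nth` was redefined to return the selected rows with their original index, so it no longer compares equal to `first()` with `assert_frame_equal`.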
Thanks @chris-b1. I added a …
Hi, I would like to add that we are having the same issue. We have a block of code that groups by and then calls first() as follows: […]

Under pandas 0.21 it takes less than 3 seconds.
The same performance issue also affects GroupBy.last.
Anybody interested in submitting a fix? It'll involve touching some Cython, but the only change is adding a min_count argument to the object-dtype kernel (pandas/_libs/groupby.pyx, line 35 in 238499a).
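For readers who don't want to read the Cython, here is a rough pure-Python sketch of what the kernel computes and where `min_count` fits in. The function name and structure are illustrative only, not the actual `group_nth_object` implementation:

```python
import numpy as np

def group_nth(values, labels, n=1, min_count=-1):
    """Pure-Python sketch (NOT the real Cython kernel) of what pandas'
    group_nth computes: the n-th non-null value per group, masked to NaN
    when a group has fewer than min_count non-null entries."""
    ngroups = int(labels.max()) + 1
    out = np.full(ngroups, np.nan, dtype=object)
    nonnull = np.zeros(ngroups, dtype=np.int64)
    for val, lab in zip(values, labels):
        if lab < 0 or val != val:      # skip unassigned rows and NaNs
            continue
        nonnull[lab] += 1
        if nonnull[lab] == n:          # record the n-th non-null value
            out[lab] = val
    if min_count > 0:                  # this is the parameter the object
        out[nonnull < min_count] = np.nan  # kernel was missing on 0.22.0
    return out
```

The traceback above shows why this matters: the dispatcher passes six arguments (the last being `min_count`), but the object-dtype kernel only accepted five, so the call raised and pandas fell back to a slow pure-Python path.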
A friendly question: Has this problem been resolved, or does the recommended solution remain to use .nth(0)? Tears fall down my face due to how slow .first() and .last() are. I have the latest Anaconda version installed, and pd.__version__ returns '0.22.0'.
This is fixed in the soon-ish to be released 0.23.0.
Opened an issue here: #20657

I also find that .last() does not perform as expected. Example:

pv = pd.DataFrame([[179293473, '2016-06-01 00:00:03.549745', 'http://www.dr.dk/nyheder/', 39169523],
                   [179293473, '2016-06-01 00:04:22.346018', 'https://www.information.dk/indland/2016/05/hvert-tredje-offer-naar-anmelde-voldtaegt-tide', 39125224],
                   [179773461, '2016-06-01 22:13:16.588146', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:14:04.059781', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:16:37.230587', np.nan, 31658124],
                   [179773461, '2016-06-01 22:23:09.847149', 'https://www.google.dk', 32718401],
                   [179773461, '2016-06-01 22:23:55.158929', np.nan, 32718401],
                   [179773461, '2016-06-01 22:27:00.857224', np.nan, 32718401]],
                  columns=['SessionID', 'PageTime', 'ReferrerURL', 'PageID'])

Now, when I run:

pv.groupby('SessionID').last()

I get:

SessionID  PageTime                    ReferrerURL                                        PageID
179293473  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...  39125224
179773461  2016-06-01 22:27:00.857224  https://www.google.dk                              32718401

When, in fact, I was expecting the same result as obtained from:

pv.groupby('SessionID').nth(-1)

SessionID  PageID    PageTime                    ReferrerURL
179293473  39125224  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...
179773461  32718401  2016-06-01 22:27:00.857224  NaN

And while we are at .nth(), why does it mix up my column order?
I think there's already an open issue for that, if not please make one.
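The difference the commenter observed is, at least in part, documented behavior rather than a bug: first()/last() are per-column reductions that skip nulls, while nth(-1) selects the literal final row of each group. A small illustration (with data simplified from the example above):

```python
import numpy as np
import pandas as pd

# last() is a per-column reduction that skips nulls;
# nth(-1) selects the literal final row of each group.
pv = pd.DataFrame({'SessionID': [1, 1, 1],
                   'ReferrerURL': ['u1', 'u2', np.nan],
                   'PageID': [10, 20, 30]})
g = pv.groupby('SessionID')

print(g.last())    # ReferrerURL -> 'u2' (last non-null), PageID -> 30
print(g.nth(-1))   # the final row itself: ReferrerURL is NaN
```

So for the session whose final page view has a missing ReferrerURL, last() reaches back to the previous non-null URL while nth(-1) faithfully reports the NaN.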
(I copied this code from a jupyter notebook)
Problem description
I find that the presence of a text column in a dataframe's data (i.e., a column that is not a groupby key) dramatically slows down groupby().first() in version 0.22.0 (but not 0.21.1), by two orders of magnitude: the operation takes 59 ms without a text column present in the data and 10.6 s with one. (The problem is not limited to this kind of made-up data; I discovered it in my own work after upgrading pandas.)
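The notebook code itself was not captured above; a synthetic reproduction along these lines shows the shape of the comparison (column names, sizes, and values here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook code, which was not captured above;
# column names and sizes are made up.
n = 200_000
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': rng.randint(0, 1000, n),
                   'val': rng.randn(n),
                   'txt': rng.choice(['a', 'b', 'c'], n)})

g_numeric = df[['key', 'val']].groupby('key')  # fast on both 0.21.1 and 0.22.0
g_mixed = df.groupby('key')                    # .first() ~100x slower on 0.22.0

# In IPython:
# %timeit g_numeric.first()
# %timeit g_mixed.first()
```

The only difference between the two groupbys is the presence of the object-dtype `txt` column, which is what pushes 0.22.0 onto the slow path.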
Expected Output
When I run the same code under 0.21.1 the times are 55 ms and 67 ms.
Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None