
Performance decrease of groupby.first for datetime64 in 0.14 #7555


Closed
Marigold opened this issue Jun 24, 2014 · 11 comments · Fixed by #7560
Labels
Dtype Conversions (unexpected or buggy dtype conversions), Groupby, Performance (memory or execution speed performance)

@Marigold

I have seen a dramatic (around 100x) slowdown of the groupby.first method for datetime64 columns after updating to 0.14.

Here is an example:

import pandas as pd
from time import time

# 100,000 rows, one group per row; column 'a' is datetime64
d = pd.DataFrame({'a': pd.date_range('1/1/2011', periods=100000, freq='s'), 'b': range(100000)})
t = time()
d.groupby('b').first()
print(time() - t)  # elapsed seconds

Result:

3.87164497375

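The same benchmark with an int column instead:

import pandas as pd
from time import time

# same shape as above, but column 'a' is int
d = pd.DataFrame({'a': range(100000), 'b': range(100000)})
t = time()
d.groupby('b').first()
print(time() - t)  # elapsed seconds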

Result:

0.0148799419403
@jreback
Contributor

jreback commented Jun 24, 2014

What version were you using?

@Marigold
Author

This is from 0.14.0, but I experienced the same on master yesterday.
UPDATE: Tested again on current master. Still the same.

@jreback
Contributor

jreback commented Jun 24, 2014

I see that too.

Did you have a version where it was fast?

@Marigold
Author

Latest 0.13 was definitely fine. I will check it again just to make sure.

UPDATE: Just tested 0.13.1 and it takes even longer. Maybe the bug is on the numpy side? I will try to go deeper into that and post the findings here.

@jreback added this to the 0.14.1 milestone Jun 24, 2014
@Marigold
Author

Downgrading numpy or pandas didn't help. Do you get the same results as I do? Could you point me in the right direction on where to look? Thanks.

@jreback
Contributor

jreback commented Jun 24, 2014

This was falling back to Python space for the aggregation, so it was much slower. Fix coming shortly.
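
To illustrate why a fallback to Python space is so costly, here is a minimal sketch comparing the built-in `first()` with an aggregation that calls a Python function once per group. The lambda is only a stand-in chosen for illustration; it is an assumption, not the actual code path that regressed, and timings will vary by machine and pandas version.

```python
import time

import pandas as pd

n = 5_000  # smaller than the report's 100,000 so the slow path finishes quickly
df = pd.DataFrame({
    "a": pd.date_range("2011-01-01", periods=n, freq="s"),
    "b": range(n),
})


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Built-in reduction: handled by pandas' optimized groupby machinery.
fast, t_fast = timed(lambda: df.groupby("b")["a"].first())

# Hypothetical stand-in for a Python-space fallback:
# one Python-level function call per group.
slow, t_slow = timed(lambda: df.groupby("b")["a"].agg(lambda s: s.iloc[0]))

print(f"built-in first():        {t_fast:.4f} s")
print(f"per-group Python lambda: {t_slow:.4f} s")
```

Both calls return the same values; only the per-group overhead differs, and it grows with the number of groups.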

@jreback
Contributor

jreback commented Jun 24, 2014

see #7560

@jreback
Contributor

jreback commented Jun 24, 2014

Give it a try on master.

@Marigold
Author

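datetime works fine now, but I see the same problem with object dtype:

import pandas as pd
from time import time

# same shape as before, but column 'a' holds Python strings (object dtype)
d = pd.DataFrame({'a': ['foo'] * 100000, 'b': range(100000)})
t = time()
d.groupby('b').first()
print(time() - t)  # elapsed seconds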
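
The three cases from this thread (datetime64, int, object) can also be timed side by side. This is a minimal sketch, assuming best-of-three timing with `timeit` is an acceptable substitute for the single `time()` measurement used above; the `columns` and `timings` names are introduced here only for illustration.

```python
import timeit

import pandas as pd

n = 100_000

# One value column per dtype discussed in this thread.
columns = {
    "datetime64": pd.date_range("1/1/2011", periods=n, freq="s"),
    "int": range(n),
    "object": ["foo"] * n,
}

timings = {}
for name, values in columns.items():
    d = pd.DataFrame({"a": values, "b": range(n)})
    # Best of three single runs, to reduce noise from one-off measurements.
    timings[name] = min(
        timeit.repeat(lambda: d.groupby("b").first(), number=1, repeat=3)
    )

for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.4f} s")
```

If one dtype is orders of magnitude slower than the others, that suggests it is not taking the optimized path.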

@jreback
Contributor

jreback commented Jun 25, 2014

@Marigold OK, fixed up. Thanks for the report. These cases were not being checked, so a regression slipped in.

@Marigold
Author

Works perfectly. Thanks for the quick feedback.
