-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: cythonize groupby.count #7016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jreback axis parameter has no effect here...i kept for back compat with |
Yeh not sure what it does in a groupby as an aside why do u still need the lambda expression in count? is it called? |
oh actually u need to pass a |
fallback if the dtype does work / match? FYI I don't think group count will work for datetime64; need to convert to i8 view and then compare against tslib.iNaT for nulls |
fallback is called if any exceptions occur in the cython call |
usually that is a dtype mismatch |
@jreback supporting date-likes here i think requires more than i originally set out to do with this ... about to push object dtype for counts |
I think datelikes DO need to be handled (you can't just take a i8 view when you pass in?) and then do your comparisons instead of
for nan testing? |
Yeah it's there. I was trying to implement ttimedelta arithmetic for groupby but the py26 oddities are in the way. I'll do it in another pr |
Sorry the numpy 1.6 oddities are there |
ok...no problem (or leave it in python land). is ok too for now; can't leave it hanging though because would break things, ahh yes, the 2.6 oddities, so maybe you can just make it use python and M8/m8 in 2.6 (e.g. have a branch in the code), ugly but what the hey |
@jreback yep took out the cython attempt ... now just squashing ... making sure that the cython path is hit for dates/timedeltas (but only |
@@ -2156,11 +2238,14 @@ def generate_put_template(template, use_ints = True, use_floats = True): | |||
('int32', 'int32_t', 'float64_t', 'np.float64'), | |||
('int64', 'int64_t', 'float64_t', 'np.float64'), | |||
] | |||
object_list = [('object', 'object', 'float64_t', 'np.float64')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should float really be here? (as that should be the use_floats)? no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes float
should be here use_(float|int)s
is really only for the first two arguments which refer to the name and the ctype, but the result of a groupby operation is always a float (even count
, which is astype
d to int64
as the last method call before returning to the user).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh..right..np
vbench? |
vbench coming soon |
758x speedup, think we're doing ok :) |
ah object dtype is still slow , might be another one of my dumb typos |
can you make the vbench a bit smaller (so the base case takes say 1/10 the time)? (shouldn't really affect the speedup much), which is nice! |
sure thing |
when you do v0.14.0, put in the perf section FYI (release notes goes in improvements) |
will do thanks for the heads up |
@jreback can we get this in soon as green? want to avoid doc merge conflicts. |
sure go ahead on green |
thinking about this int types can never be nan at all object types need comparison against the actual NaT value (I think) wonder if any of that even matters |
Right, really only float, object, and int64 and those are really a view on dates or time deltas which are just ints really. Could have size as the fallback function instead of current version which computes sum of nonnull. Then if I take out the generated for ints it will raise and use the fallback. |
yep |
I can do some microbenchmarks. If size is done with Len then we go from linear in the number of rows to constant so def worth an investigation |
Pandas isn't on iPhone so this may have to wait until tomorrow :/ |
hah no size is already computed (when the groupby happens) so should be trivial more worried about the issue of the type coercion in the current cython code vs iNaT which is a bigger value than say int8 |
@jreback i have the pr with |
sure |
can you also post the vbench results at the top of the PR? |
yep np |
oh...I meant the cythonized count vs before you merged it in! (in this PR) because when someone clicks on the release notes they go to this issue |
yep ... that's coming running it now |
adding to SO rep! http://stackoverflow.com/questions/23545834/speed-up-pandas-aggregation |
closes #7003
axis parameter(this only works in non-cython land)object
dtype if there are nonevbench results: