ENH: nlargest for DataFrame #3960
Comments
Which is shockingly slow... but I guess there is a lot going on there. |
Here's a comparison (using kth smallest):
heap
sort
and the winner
|
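The original timing snippets were dropped from the thread; a rough sketch of the kind of heap vs. sort vs. selection comparison being discussed (the data, sizes, and the numpy `partition` stand-in for the cython `kth_smallest` are all illustrative, not the original benchmark):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
k = 10

# heap-based: O(n log k)
heap_result = heapq.nsmallest(k, a)

# full sort: O(n log n)
sort_result = np.sort(a)[:k]

# partial selection ("the winner" in spirit): O(n) on average,
# analogous to the cython kth_smallest / bottleneck approach
part_result = np.sort(np.partition(a, k - 1)[:k])
```

All three agree on the values; only the amount of work done on the other n - k elements differs.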
Ah ha! so there is a (Was going to suggest bottleneck http://stackoverflow.com/a/10463648/1240268) |
But it might be useful to wrap these up into a Series method (as this is a cython method). And there's no kth_largest... so prob should just write that one (I don't think there is an easy inversion) |
I would propose using nlargest and nsmallest... I could have a crack at some cython for kth_largest. Similarly you could have a key (like for the heapq versions) e.g. #3942. Is the issue here that looking up a key (from python) for each value makes it incredibly slow? |
Instead of a key, just make a column of the values of the key; then it's all vectorized. So yes, the 'key' argument is a problem. |
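A sketch of what "make a column of the key values" looks like in practice (the `abs` key, the data, and the `_key` column name are illustrative, and `sort_values` is the modern spelling of sorting):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, -1, 2, -5, 4]})

# rather than a per-row key callable, compute the key as a column
# (vectorized, no python-level call per value), sort by it, drop it
df["_key"] = df["a"].abs()
out = df.sort_values("_key").drop(columns="_key")
```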
Just a thought - what if you followed some of the conventions of agg/apply: df.orderby([a, b, c], sum) (but with better syntax :P)
|
And then you could use the n* or other functions on that
|
Last thing - maybe better in reverse order: pass the function and then the cols.
|
cols, function I think is the way R works. I like the function, cols syntax more but that's just me |
@jreback I guess you could just create the column to use as the key on the fly. function, cols doesn't make sense if you're not passing a key though (most cases?)... |
what would |
Presumably ordering by the "key_column"
without actually creating the column:
|
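A sketch of ordering by a computed key without actually materializing the column (key function and data are illustrative): argsort the key values and take rows in that order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3, -1, 2, -5, 4]})

# argsort the computed key, then reorder rows by position --
# no temporary "key_column" ever lands in the frame
order = np.argsort(df["a"].abs().values)
out = df.iloc[order]
```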
yep that's what i was thinking too |
I'm not sure I'm entirely sold on using the cols in this way actually, usually when passing columns you order by a then b then c. Really I was thinking of this more as
:s |
Yeah, I guess that fits the model of other functions better if using (cols,
which is the same as np.sort(df[[a, b, c]].sum()), but orderby would allow — |
I think this is the same thing
why do we need a separate function? |
Well, similar to groupby you could intelligently handle it s.t if no func |
So the key would actually be applied column wise (and then sort by these) ?
I guess these examples make more sense when |
@jreback good point.
|
@jreback so kth_smallest just pulls out the kth smallest (obviously!), which isn't quite the same thing. Maybe I'll just look at how heapq.nlargest is implemented, or does anyone know a better one? |
you can prob just copy it and reverse the signs and you only need the kth smallest, because if |
Does this mean it doesn't sort those k values? And also... if there are duplicates this might not include the largest item? Isn't |
yes, I think you would need to sort the kth smallest... I think you are right about kth_largest... prob just as fast too (but not sure) |
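A sketch of that approach with numpy's partial selection: pick the n smallest, sort just those, and get nlargest by the sign-flip trick mentioned above (illustrative, not the cython implementation):

```python
import numpy as np

def nsmallest(a, n):
    # select the n smallest in O(len(a)) average time, then sort only them
    return np.sort(np.partition(a, n - 1)[:n])

def nlargest(a, n):
    # "copy it and reverse the signs": the n largest of a are the
    # n smallest of -a, negated back (and so come out descending)
    return -nsmallest(-np.asarray(a), n)

a = np.array([5, 1, 9, 3, 7])
nsmallest(a, 2)   # array([1, 3])
nlargest(a, 2)    # array([9, 7])
```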
fyi I believe the current kth algo and bottleneck are pretty similar; as I recall there was a discussion maybe last year between wes and ken about how to do a fast median (which is of course kth = n/2) |
If there are dupes this isn't well defined so I don't think it matters. |
2 issues to consider: non-numeric dtypes (though you can convert datetimelike via view('i8') - there is a function, needs_i8_conversion, which detects this); and nan's, which I would always exclude - maybe even do a dropna first (then you don't have to deal with them) |
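A sketch of the handling suggested here - dropna first, and view datetimelike values as int64 so the selection works on plain numbers. `nsmallest_series` is a hypothetical helper for illustration, not pandas API:

```python
import numpy as np
import pandas as pd

def nsmallest_series(s, n):
    s = s.dropna()                        # exclude nan's up front
    n = min(n, len(s))
    values = s.values
    if values.dtype.kind in "mM":         # datetime64 / timedelta64
        values = values.view("i8")        # compare as int64 nanoseconds
    take = np.argpartition(values, n - 1)[:n]
    take = take[np.argsort(values[take])]  # sort just the selected n
    return s.iloc[take]

s = pd.Series([3.0, np.nan, 1.0, 2.0])
nsmallest_series(s, 2)   # 1.0 then 2.0, nan excluded
```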
also I would only make this method for Series, can always be applied to frame if needed (e.g. this is much like say argsort) |
this is a nice little function, should do in 0.14 |
@hayd this look fine ...were you waiting on something? |
No, just sulking about perf, will have a look again this week. Should def put in 0.14. |
@hayd ping!!!! |
I don't think there is a way to get the nlargest elements in a DataFrame without sorting.
In ordinary python you'd use heapq's nlargest (and we can hack a bit to use it for a DataFrame):
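A sketch of the kind of heapq hack meant here (the column names and data are illustrative): feed the rows to heapq.nlargest with a key on the column of interest, then take those rows back out of the frame.

```python
import heapq
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 4, 1, 5], "b": list("vwxyz")})

# nlargest over the rows, keyed on column "a"; each tuple carries its
# original index so we can pull the winning rows back from the frame
rows = heapq.nlargest(2, df.itertuples(index=True), key=lambda t: t.a)
top = df.loc[[t.Index for t in rows]]
```

The per-row python-level tuple creation and key calls are exactly the overhead that makes this slower than a vectorized sort.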
This is much slower than sorting, presumably from the overhead; I thought I'd throw this out as a feature idea anyway.
see http://stackoverflow.com/a/17194717/1240268