ENH nlargest and nsmallest Series methods #5534

hayd · 2013-11-17T10:10:29Z

closes #3960.

Implemented using kth_largest; could be rewritten in Cython with a better algorithm later.

jreback · 2013-11-17T11:58:03Z

needs tests on other dtypes
will fail on datetime64 ATM (these need i8 conversion)

hayd · 2013-11-17T19:47:29Z

@y-p happy to change the name to topk and bottomk, was going off heapq notation.

@jreback any other dtypes to check? Will fix it up.

The plan is to rewrite this at some point in Cython with a proper algorithm...
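For reference, the heapq naming mentioned above looks like this in the standard library. This is only a small illustration of where the names come from, not code from this PR:

```python
import heapq

values = [3, 1, 4, 1, 5, 9, 2, 6]

# heapq.nlargest / heapq.nsmallest return the n extreme items,
# already ordered from most to least extreme.
largest = heapq.nlargest(3, values)
smallest = heapq.nsmallest(3, values)

print(largest)   # [9, 6, 5]
print(smallest)  # [1, 1, 2]
```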

jreback · 2013-11-17T20:07:40Z

check all!

float, int, datetime64 (use i8), timedelta64 (same), object that are numbers, object that are strings

for some you may want to simply raise NotImplementedError
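A rough sketch of the "convert to integers first" idea for datetimes, using only the standard library. It is an analogue of viewing datetime64 values as i8, not the pandas/numpy code itself; `to_int`, `from_int` and `nlargest_datetimes` are hypothetical names:

```python
import heapq
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
ONE_US = timedelta(microseconds=1)


def to_int(ts):
    """Integer microseconds since the epoch (stand-in for an i8 view)."""
    return (ts - EPOCH) // ONE_US


def from_int(i):
    """Inverse of to_int."""
    return EPOCH + i * ONE_US


def nlargest_datetimes(stamps, n):
    """Select on plain integers, then convert the winners back."""
    as_ints = [to_int(ts) for ts in stamps]
    return [from_int(i) for i in heapq.nlargest(n, as_ints)]
```

The selection routine then only ever sees integers, which is the point of the i8 conversion.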

jtratner · 2013-11-17T23:13:27Z

And make sure to include things with nan so we specify that behavior.
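NaN compares as False against everything, so comparison-based selection can return order-dependent results if NaNs are left in. One possible policy, assumed here purely for illustration (the thread does not say which behaviour was chosen), is to drop them before selecting:

```python
import heapq
import math


def nlargest_skipping_nan(values, n):
    """Return the n largest values, ignoring NaN (an assumed policy)."""
    clean = [
        v for v in values
        if not (isinstance(v, float) and math.isnan(v))
    ]
    return heapq.nlargest(n, clean)
```

A test with NaN in the input then pins the behaviour down explicitly.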

ghost · 2013-11-18T05:39:50Z

@hayd, good reason, though I still prefer topk for being short and familiar parlance. Your call.

jorisvandenbossche · 2013-11-19T08:42:26Z

pandas/core/series.py

@@ -1821,6 +1822,88 @@ def _try_kind_sort(arr):
        return self._constructor(arr[sortedIdx], index=self.index[sortedIdx])\
                   .__finalize__(self)

+    def nlargest(self, n=5, take_last=False, sort=True):
+        '''
+        Returns the largest n rows:


Detail, but colon can just be a full stop

jorisvandenbossche · 2013-11-19T08:50:58Z

Added two small detail comments on the docs. Two other things:

Can you add them to api.rst?
The parameters are still undocumented

hayd · 2013-12-29T20:46:27Z

Should this be using the quickselect algorithm? (Julia does this fast, at least for one-dimensional arrays, and in a more general way.)
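For anyone following along, a minimal pure-Python sketch of quickselect and an n-largest built on top of it. This is not the kth_largest routine from algos.pyx; the function names are hypothetical, and the loop is iterative so it cannot overflow the stack:

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) of values."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        highs = [x for x in items if x > pivot]
        if k < len(lows):
            items = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            items = highs


def nlargest(values, n):
    """Return the n largest values in descending order."""
    if n <= 0:
        return []
    if n >= len(values):
        return sorted(values, reverse=True)
    # The n-th largest value is the (len - n)-th smallest.
    threshold = quickselect(values, len(values) - n)
    above = [x for x in values if x > threshold]
    ties = [x for x in values if x == threshold]
    # Fill the remaining slots with values equal to the threshold.
    return sorted(above + ties[: n - len(above)], reverse=True)
```

Only the n selected values get sorted at the end, so the expected cost is linear in the input plus n log n for the final ordering.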

ghost · 2013-12-29T21:31:33Z

Numerical recipes tend to reincarnate from language to language, all the way back
to some fortran code written 40 years ago or a paper by some CS father figure.

The cython code you're relying on from algos.pyx probably originated here:
http://ndevilla.free.fr/median/median/index.html

and from a quick look at the graph there, it seems Wirth and quickselect have similar perf
(for the median special case k=n/2, I'm assuming that holds generally).

jreback · 2014-01-03T22:15:43Z

moving for 0.13.1

@hayd ok?

jreback · 2014-01-15T02:19:27Z

@hayd ready to merge?

jreback · 2014-01-15T02:19:49Z

doc/source/release.rst

@@ -60,6 +60,7 @@ New features
  - Clipboard functionality now works with PySide (:issue:`4282`)
  - New ``extract`` string method returns regex matches more conveniently
    (:issue:`4685`)
+  - Add nsmallest and nlargest Series methods (:issue:`3960`)


can u move these notes to 0.13.1..thxs

jorisvandenbossche · 2014-01-15T19:47:02Z

There were some comments on docs that would ideally be handled before merging.

jreback · 2014-01-17T13:43:57Z

@hayd can you address the comments, otherwise looks ok

hayd · 2014-01-17T17:54:27Z

@jreback I have a few issues assigned I think. Will do in the next few days.

I tried implementing quickselect to compare perf. IIRC the issue with the above is that it's significantly slower than sorting and taking the head for "small" (and even quite big) Series. The more conditions (type checks) I added, the slower it got.

Perhaps a solution is to check length before doing it and if "small" then sort and head :s
Will come back with some timings, and fix it up.
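The "check length first" idea could look roughly like the sketch below. The cutoff value is made up and would have to come from the timings, and `heapq.nlargest` is only a stand-in for whatever selection routine ends up being used:

```python
import heapq

SMALL_CUTOFF = 1000  # hypothetical; pick from real benchmarks


def nlargest_hybrid(values, n, cutoff=SMALL_CUTOFF):
    """Sort-and-head for small inputs, selection for large ones."""
    if n <= 0:
        return []
    if len(values) <= cutoff:
        # Small input: a full sort is cheap and has little overhead.
        return sorted(values, reverse=True)[:n]
    # Large input: avoid sorting everything.
    return heapq.nlargest(n, values)
```

Both branches return the same answer, so the cutoff only affects speed, not results.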

jreback · 2014-01-17T17:57:44Z

@hayd absolutely

jreback · 2014-01-21T22:12:43Z

@hayd how's this going?

jreback · 2014-01-24T22:12:31Z

@hayd see if you can do this soon...thxs

hayd · 2014-01-24T23:21:02Z

ok will do right this minute.

hayd · 2014-01-25T06:57:55Z

@jreback to put these in vbench, should I create a series_methods file? (Not sure where to put them).

Basically, atm it's slightly slower if lots of values are tied with the kth.

jreback · 2014-01-25T08:29:15Z

I think there is an algos file or something? You can either create series_methods or put them near rank

jreback · 2014-01-25T08:29:22Z

methods

jreback · 2014-01-26T15:15:28Z

@hayd bump to 0.14?

hayd · 2014-01-26T19:35:25Z

@jreback Will finish this off this evening, it's passing tests but want to do a quick refactor.

If it gets left behind in 0.13.1 no biggy...

hayd · 2014-01-27T07:04:07Z

Gave it a refactor, and put the numpy versions into util. I suspect it can be tweaked to be faster.

@jreback I tagged 0.14 but could change. Worried as kth_largest can cause a stack overflow if called incorrectly (though it's only called with i8, int and float without NaNs, so it should be ok). Perhaps it should sit in master for a while just in case?

jreback · 2014-01-27T11:14:47Z

@hayd look ok to me

can merge or wait to 0.14, up 2 u

prob want to add an example in docs - maybe in the min/max section

hayd · 2014-01-27T22:32:34Z

Would prefer to wait tbh, would like to kick it in master for a bit.

jreback · 2014-01-27T22:39:34Z

sure

jreback · 2014-02-16T22:47:45Z

@hayd looks ready to go

pls change release notes to 0.14
add a mention in v0.14.0.txt (example if you want)
maybe add in docs somewhere.....?

jorisvandenbossche · 2014-02-16T22:49:01Z

There were some doc comments above from me that still should be addressed, I think

jreback · 2014-02-26T22:38:26Z

@hayd can you rebase and address the comments....this should go in soon...thanks

jreback · 2014-04-05T23:43:32Z

@hayd would merge this soon

hayd · 2014-04-06T05:41:00Z

@jreback Sketched some time next week to fix up the several PRs I have in play atm. Thanks for pinging!

jreback · 2014-04-10T21:39:17Z

ping!

jreback · 2014-04-21T12:25:20Z

ping

jreback · 2014-04-27T23:43:27Z

@hayd ping....need to get this in ASAP

jreback · 2014-05-01T14:13:38Z

ping!

jreback · 2014-05-02T12:50:59Z

@hayd looks ok....let's get this in ASAP

jreback · 2014-05-05T00:24:56Z

@hayd ping!

jreback · 2014-05-06T12:04:20Z

@hayd can you rebase!

didn't we discuss calling these topk/bottomk (or top_n or topn or ntop)?

jreback · 2014-05-08T13:00:17Z

ping!

jreback · 2014-05-08T23:30:59Z

@cpcloud I want to get this in, but needs a rebase / maybe test fixes, can you give a shot?

cpcloud · 2014-05-09T00:15:52Z

Sure, no prob. Will be home in about an hour, will give it a whirlygig then.

jreback · 2014-05-12T10:53:16Z

bump to 0.14.1, but @cpcloud if you are able to work on this that would be great

cpcloud · 2014-05-12T14:53:49Z

I've got a pr coming for this today (on a plane to NYC right now). Mostly just a bit of clean up and dealing with the object dtype and special cases of n <= 0 or n >= len(series). Might need a few more test cases
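A sketch of how the n <= 0 and n >= len(series) special cases might be short-circuited before any selection happens. The chosen results (empty for n <= 0, everything sorted for n >= len) are assumptions for illustration, and `heapq.nlargest` again stands in for the real selection code:

```python
import heapq


def nlargest_with_edge_cases(values, n):
    """Handle degenerate n before calling the selection routine."""
    if n <= 0:
        # Nothing requested: return an empty result.
        return []
    if n >= len(values):
        # Everything requested: a plain descending sort is enough.
        return sorted(values, reverse=True)
    return heapq.nlargest(n, values)
```

Handling these up front means the selection routine is only ever called with 0 < n < len(values).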

cpcloud · 2014-05-12T18:45:58Z

is there any consensus on how to deal with unorderable types? e.g., disallow ... punt to numpy? python 3 disallows str int comparisons ... @jreback thoughts when u get a chance?

jreback · 2014-05-12T18:53:59Z

you could do something like this: https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L141

but I would just raise (e.g. if it's an object dtype and you have mixed str/number)
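The "just raise" option could be sketched like this in plain Python. `check_orderable` is a hypothetical helper, and the choice of TypeError is an assumption, not something decided in this thread:

```python
import numbers


def check_orderable(values):
    """Raise if values mix strings and numbers (unorderable in Python 3)."""
    has_str = any(isinstance(v, str) for v in values)
    has_num = any(isinstance(v, numbers.Number) for v in values)
    if has_str and has_num:
        raise TypeError("cannot order a mix of strings and numbers")
```

Calling this before selection gives a clear error up front instead of a comparison failure somewhere inside the algorithm.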

cpcloud · 2014-05-13T14:29:21Z

closing ... submitting a new pr based on this one

jorisvandenbossche reviewed Nov 19, 2013

jreback reviewed Jan 15, 2014

ghost assigned hayd Jan 17, 2014

jreback added the Numeric label Feb 16, 2014

hayd added 2 commits April 11, 2014 12:37

ENH nlargest and nsmallest Series methods

62b5b0c

wip

685ac64

jreback mentioned this pull request May 6, 2014

ENH: add nlargest/nsmallest to Series groupby #7053

Closed

jreback modified the milestones: 0.14.1, 0.14.0 May 12, 2014

cpcloud assigned cpcloud and unassigned hayd May 13, 2014

cpcloud closed this May 13, 2014

cpcloud mentioned this pull request May 13, 2014

ENH: add nlargest nsmallest to Series #7113

Merged
