Skip to content

ENH nlargest and nsmallest Series methods #5534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

hayd
Copy link
Contributor

@hayd hayd commented Nov 17, 2013

closes #3960.

Implemented using kth_largest, could cython using better algo later.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2013

needs tests on other dtypes
will fail on datetime64 ATM (these need i8 conversion)

@hayd
Copy link
Contributor Author

hayd commented Nov 17, 2013

@y-p happy to change the name to topk and bottomk, was going off heapq notation.

@jreback any other dtypes to check? Will fix it up.

The plan is to rewrite this at somepoint in cython with a proper alog...

@jreback
Copy link
Contributor

jreback commented Nov 17, 2013

check all!

float, int, datetime64 (use i8), timedelts64 (same), object that are numbers, object that are strings

some u may want to simple raise as notimplemented

@jtratner
Copy link
Contributor

And make sure to include things with nan so we specify that behavior.

@ghost
Copy link

ghost commented Nov 18, 2013

@hayd, good reason, though I still prefer topk for being short and familiar parlance. Your call.

@@ -1821,6 +1822,88 @@ def _try_kind_sort(arr):
return self._constructor(arr[sortedIdx], index=self.index[sortedIdx])\
.__finalize__(self)

def nlargest(self, n=5, take_last=False, sort=True):
'''
Returns the largest n rows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detail, but colon can just be a full stop

@jorisvandenbossche
Copy link
Member

Added two small detail comments on the docs. Two other things:

  • Can you add them to api.rst?
  • The parameters are still undocumented

@hayd
Copy link
Contributor Author

hayd commented Dec 29, 2013

Should be using quick select algorithm ? (Julia does this fast.. at least with one dim array... in a more general way.)

@ghost
Copy link

ghost commented Dec 29, 2013

Numerical recipes tend to reincarnate from langauge to language, all the way back
to some fortran code written 40 years ago or a paper by some CS father figure.

The cython code you're relying on from algos.pyx probably originated here:
http://ndevilla.free.fr/median/median/index.html

and from a quick look at the graph there, it seems wirth and quickselect have similar perf
(for the median special case k=n/2, I'm assuming that holds generally).

@jreback
Copy link
Contributor

jreback commented Jan 3, 2014

moving for 0.13.1

@hayd ok?

@jreback
Copy link
Contributor

jreback commented Jan 15, 2014

@hayd ready to merge?

@@ -60,6 +60,7 @@ New features
- Clipboard functionality now works with PySide (:issue:`4282`)
- New ``extract`` string method returns regex matches more conveniently
(:issue:`4685`)
- Add nsmallest and nlargest Series methods (:issue:`3960`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u move these notes to 0.13.1..thxs

@jorisvandenbossche
Copy link
Member

There were some comments on docs that would ideally be handled before merging.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2014

@hayd can you address the comments, otherwise looks ok

@ghost ghost assigned hayd Jan 17, 2014
@hayd
Copy link
Contributor Author

hayd commented Jan 17, 2014

@jreback I have a few issues assigned I think. Will do in the next few days.

I had tried implementing quick select to compare perf. IIRC The issue with the above is it's significantly slower than sorting and taking head for "small" (and even quite big) Series. The more conditions (type checks) the slower it was getting.

Perhaps a solution is to check length before doing it and if "small" then sort and head :s
Will come back with some timings, and fix it up.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2014

@hayd absolutely

@jreback
Copy link
Contributor

jreback commented Jan 21, 2014

@hayd hows this going?

@jreback
Copy link
Contributor

jreback commented Jan 24, 2014

@hayd see if you can do this soon...thxs

@hayd
Copy link
Contributor Author

hayd commented Jan 24, 2014

ok will do right this minute.

@hayd
Copy link
Contributor Author

hayd commented Jan 25, 2014

@jreback to put these in vbench, should I create a series_methods file? (Not sure where to put them).

Basically atm slightly slower if lots of values are the kth.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2014

i think their is an algos or something's? u can either create series_methiss or put near rank

@jreback
Copy link
Contributor

jreback commented Jan 25, 2014

methods

@jreback
Copy link
Contributor

jreback commented Jan 26, 2014

@hayd bump to 0.14?

@hayd
Copy link
Contributor Author

hayd commented Jan 26, 2014

@jreback Will finish this off this evening, it's passing tests but want to do a quick refactor.

If it gets left behind in 0.13.1 no biggy...

@hayd
Copy link
Contributor Author

hayd commented Jan 27, 2014

Gave it a refactor, and put the numpy versions into util. Suspect can be tweaked faster.

@jreback I tagged 0.14 but could change. Worried as kth_largest can cause stack overflow if called incorrectly (though it's only called with i8, int and float without NaNs so should be ok). Perhaps should sit in master for a while just in case ?

@jreback
Copy link
Contributor

jreback commented Jan 27, 2014

@hayd look ok to me

can merge or wait to 0.14, up 2 u

pprob want to add an example in docs - maybe in min/max section

@hayd
Copy link
Contributor Author

hayd commented Jan 27, 2014

Would prefer to wait tbh, would like to kick it in master for a bit.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2014

sure

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@hayd looks ready to go

  • pls change release notes to 0.14
  • add a mention in v0.14.0.txt (example if you want)
  • maybe add in docs somewhere.....?

@jorisvandenbossche
Copy link
Member

There were some doc comments above from me that still should be adressed I think

@jreback
Copy link
Contributor

jreback commented Feb 26, 2014

@hayd can you rebase and address the comments....this should go in soon...thanks

@jreback
Copy link
Contributor

jreback commented Apr 5, 2014

@hayd would merge this soon

@hayd
Copy link
Contributor Author

hayd commented Apr 6, 2014

@jreback Sketched some time next week to fix up the several PRs I have in play atm. Thanks for pinging!

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

ping!

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

ping

@jreback
Copy link
Contributor

jreback commented Apr 27, 2014

@hayd ping....need to get this in ASAP

@jreback
Copy link
Contributor

jreback commented May 1, 2014

ping!

@jreback
Copy link
Contributor

jreback commented May 2, 2014

@hayd looks ok....let's get this in ASAP

@jreback
Copy link
Contributor

jreback commented May 5, 2014

@hayd ping!

@jreback
Copy link
Contributor

jreback commented May 6, 2014

@hayd can you rebase!

didn't we discuss calling these topk/bottomk (or top_n or topn or ntop)?

@jreback
Copy link
Contributor

jreback commented May 8, 2014

ping!

@jreback
Copy link
Contributor

jreback commented May 8, 2014

@cpcloud I want to get this in, but needs a rebase / maybe test fixes, can you give a shot?

@cpcloud
Copy link
Member

cpcloud commented May 9, 2014

Sure no prob. Will be home in a bout an hour will give it a whirlygig then.

@jreback
Copy link
Contributor

jreback commented May 12, 2014

bump to 0.14.1, but @cpcloud if you are able to workon this would be great

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 12, 2014
@cpcloud
Copy link
Member

cpcloud commented May 12, 2014

I've got a pr coming for this today (on a plane to NYC right now). Mostly just a bit of clean up and dealing with the object dtype and special cases of n <= 0 or n >= len(series). Might need a few more test cases

@cpcloud
Copy link
Member

cpcloud commented May 12, 2014

is there any consensus on how to deal with unorderable types? e.g., disallow ... punt to numpy? python 3 disallows str int comparisons ... @jreback thoughts when u get a chance?

@jreback
Copy link
Contributor

jreback commented May 12, 2014

you could do something like this: https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L141

but I would just raise (e.g. if its a object dtype and you have mixed str/number)

@cpcloud cpcloud assigned cpcloud and unassigned hayd May 13, 2014
@cpcloud
Copy link
Member

cpcloud commented May 13, 2014

closing ... submitting a new pr based on this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: nlargest for DataFrame
5 participants