-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: _nsorted incorrect with duplicated values in index (#13412) #14707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I am not sold on this impl. Can you see if there are perf tests (asv) for this, and give a check. |
There wasn't an existing benchmark for Do you have any performance recommendations @jreback if I use the |
Current coverage is 85.27% (diff: 100%)@@ master #14707 diff @@
==========================================
Files 144 144
Lines 50946 50949 +3
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43445 43447 +2
- Misses 7501 7502 +1
Partials 0 0
|
can you add similar tests for series when duplicates are there the algo might work already, idk. further I would move |
the perf is fine. this is a pretty cheap operation. |
Tests passed for Series & reorganized the code @jreback |
@@ -717,6 +717,17 @@ def select_n(series, n, keep, method): | |||
return dropped.iloc[inds] | |||
|
|||
|
|||
def select_n_frame(frame, columns, n, method, keep): | |||
from pandas.core.series import Series | |||
if not is_list_like(columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for moving. If you are interested in next PR! would want to clean up this code w.r.t. largest/smallest a bit more anyhow; IOW, make it cleaner / nicer. (and I think we have a perf issue outstanding I think with Series).
@@ -717,6 +717,17 @@ def select_n(series, n, keep, method): | |||
return dropped.iloc[inds] | |||
|
|||
|
|||
def select_n_frame(frame, columns, n, method, keep): | |||
from pandas.core.series import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add doc-strings to select_n_*
methods?. otherwise lgtm. ping on green.
ping @jreback. Thanks for your help! I'll also look into reorganizing the nlargest/nsmallest code. |
@@ -1323,6 +1323,34 @@ def test_nsmallest_multiple_columns(self): | |||
expected = df.sort_values(['a', 'c']).head(5) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_nsmallest_nlargest_duplicate_index(self): | |||
df = pd.DataFrame({'a': [1, 2, 3, 4], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment
trivial change (and needs rebase). ping on green. |
…3412) Add note to whatsnew Add nlargest benchmark Add tests for Series organize nsorted methods pep 8 fixes passed test and pep8 add docstrings add github issue
ping @jreback. Thanks! |
thanks! |
git diff upstream/master | flake8 --diff
DataFrame.nlargest
andDataFrame.nsmallest
does not behave correctly when there's duplicate values in the index.Any feedback is appreciated!