Skip to content

BUG: _nsorted incorrect with duplicated values in index (#13412) #14707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mroeschke
Copy link
Member

DataFrame.nlargest and DataFrame.nsmallest does not behave correctly when there's duplicate values in the index.

Any feedback is appreciated!

@jreback
Copy link
Contributor

jreback commented Nov 22, 2016

I am not sold on this impl. Can you see if there are perf tests (asv) for this, and give a check.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 22, 2016
@mroeschke
Copy link
Member Author

There wasn't an existing benchmark for DataFrame.nlargest. I wrote my own and I got an after/before ratio of 1.37, so there is some speed regression.

Do you have any performance recommendations @jreback if I use the merge method? Or should I look into a new solution?

@codecov-io
Copy link

codecov-io commented Nov 23, 2016

Current coverage is 85.27% (diff: 100%)

Merging #14707 into master will decrease coverage by <.01%

@@             master     #14707   diff @@
==========================================
  Files           144        144          
  Lines         50946      50949     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43445      43447     +2   
- Misses         7501       7502     +1   
  Partials          0          0          

Powered by Codecov. Last update c0e13d1...d03be72

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

can you add similar tests for series when duplicates are there the algo might work already, idk.

further I would move _nsorted to algorithms. maybe rename select_n -> select_n_series and call this one select_n_frame

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

the perf is fine. this is a pretty cheap operation.

@mroeschke
Copy link
Member Author

Tests passed for Series & reorganized the code @jreback
(Looks like a Travis-CI job failed due to connectivity)

@jreback jreback added this to the 0.19.2 milestone Nov 25, 2016
@@ -717,6 +717,17 @@ def select_n(series, n, keep, method):
return dropped.iloc[inds]


def select_n_frame(frame, columns, n, method, keep):
from pandas.core.series import Series
if not is_list_like(columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for moving. If you are interested in next PR! would want to clean up this code w.r.t. largest/smallest a bit more anyhow; IOW, make it cleaner / nicer. (and I think we have a perf issue outstanding I think with Series).

@@ -717,6 +717,17 @@ def select_n(series, n, keep, method):
return dropped.iloc[inds]


def select_n_frame(frame, columns, n, method, keep):
from pandas.core.series import Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add doc-strings to select_n_* methods?. otherwise lgtm. ping on green.

@mroeschke
Copy link
Member Author

ping @jreback. Thanks for your help! I'll also look into reorganizing the nlargest/nsmallest code.

@@ -1323,6 +1323,34 @@ def test_nsmallest_multiple_columns(self):
expected = df.sort_values(['a', 'c']).head(5)
tm.assert_frame_equal(result, expected)

def test_nsmallest_nlargest_duplicate_index(self):
df = pd.DataFrame({'a': [1, 2, 3, 4],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

@jreback
Copy link
Contributor

jreback commented Dec 4, 2016

trivial change (and needs rebase). ping on green.

…3412)

Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings

add github issue
@mroeschke
Copy link
Member Author

ping @jreback. Thanks!

@jreback
Copy link
Contributor

jreback commented Dec 6, 2016

thanks!

@jreback jreback closed this in 6e514da Dec 6, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
@mroeschke mroeschke deleted the fix_13412 branch December 20, 2017 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: in _nsorted for frame with duplicated values index
3 participants