-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix nsmallest/nlargest With Identical Values #15299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix nsmallest/nlargest With Identical Values #15299
Conversation
@jreback What do you think of this approach? The issue before was on this line (obviously PR isn't ready to go, just wanted your opinion on the approach first) |
pandas/tests/frame/test_analytics.py
Outdated
@@ -1323,6 +1323,12 @@ def test_nlargest_multiple_columns(self): | |||
expected = df.sort_values(['a', 'b'], ascending=False).head(5) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_nlargest_identical_values(self): | |||
df = pd.DataFrame({'a': [1] * 5, 'b': [1, 2, 3, 4, 5]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here as a comment
test on nsmallest as well.
pandas/tests/frame/test_analytics.py
Outdated
result = df.nlargest(3, 'a') | ||
expected = pd.DataFrame({'a': [1] * 3, 'b': [1, 2, 3]}) | ||
tm.assert_frame_equal(result, expected) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add similar tests for series
pandas/core/algorithms.py
Outdated
@@ -911,10 +911,15 @@ def select_n_frame(frame, columns, n, method, keep): | |||
if not is_list_like(columns): | |||
columns = [columns] | |||
columns = list(columns) | |||
tmp = Series(frame.index) | |||
frame.reset_index(inplace=True, drop=True) | |||
ser = getattr(frame[columns[0]], method)(n, keep=keep) | |||
if isinstance(ser, Series): | |||
ser = ser.to_frame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the right way to fix this is to make the internal routines return an indexer instead (e.g. integers), select_n_series
already has this (just needs to return it).
then this is a simple .take()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback, i'm a little confused here. I don't see how the current implementation produces the desired result when using multiple columns, for example if we have:
df = pd.DataFrame(dict(
a=[1, 1, 1, 2, 2],
b=[4, 3, 8, 9, 5]
))
and we call df.nsmallest(2, ['a', 'b'])
There's no code in the current implementation that would try to order by both columns a
and b
in that order to give us a resulting frame of
pd.DataFrame(dict(
a=[1, 1],
b=[3, 4]
))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right I don't think it IS there. what you can easily do it simply use the routine to select the nsmallest/largest (its according to a single column), then sort accordting to all of the columns provided (this is still quite efficient as the selection uses partition sort which).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure we can do that @jreback, take this for example
df = pd.DataFrame(dict(
a=[2, 2, 1, 1, 1],
b=[4, 3, 8, 9, 5],
))
result = df.nsmallest(2, ['a', 'b'])
If we do nsmallest on col a
and then sort on cols a
and b
we'll get
a c
2 1 8
3 1 9
when the desired result is
a c
4 1 5
2 1 8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there's much more we can do than sort_values(columns).head/tail(n)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RogerThomas that defeats the entire purpose of partition sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a way to do this might be to (for say nsmallest
when multiple columns)
run nsmallest with the first column, take the last value (or values if they are duplicates), select from the frame and argsort them. Then your result are the top values from the first run + the (totalrequested - top values of the 2nd run).
e.g in your example above since nsmallest(2, 'a') yields 1 (two rows of them), you could then select the 1 from the frame and just argsort at that point.
This would allow the single column to be quite efficient and avoid the next-n problem.
You would only need to do this if you have multiple columns.
I would look around and see if there is a better solution out there (IOW google and/or look at how some other stats packages do this).
Codecov Report
@@ Coverage Diff @@
## master #15299 +/- ##
==========================================
+ Coverage 90.96% 90.96% +<.01%
==========================================
Files 145 145
Lines 49534 49560 +26
==========================================
+ Hits 45057 45082 +25
- Misses 4477 4478 +1
Continue to review full report at Codecov.
|
any progress? |
Ah, sorry @jreback been super busy, haven't gotten around to this. Will try and get to it over the weekend, is that ok? |
@RogerThomas np. just pushing things along. you can ping when you are ready for a review. |
here is another example: #14846 (comment) |
@RogerThomas can you rebase / update. |
d9db74c
to
739704e
Compare
@jreback sorry took so long on this, been super busy. |
f070082
to
4955d67
Compare
pandas/core/algorithms.py
Outdated
reverse = method == 'nlargest' | ||
for i, column in enumerate(columns): | ||
series = frame[column] | ||
if reverse: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this defeat the purpose of .nlargest()
? e.g. you are sorting the entire series.
why not just use series.nlargest(n)
here? (or .nsmallest
). And defer to the duplicate handling there (I think that works).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if there are duplicates I need to do another filter using isin
this returns an unsorted view of the series, I then have no way of aligning the series according to the sorting requirement, i'd still need to do an argsort after the nsmallest
/nlargest
call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sorta what you mean @jreback ?
Note I had to copy the index incase there is duplicate index in incoming frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roughly.
pandas/core/algorithms.py
Outdated
else: | ||
inds = series.argsort()[:n] | ||
values = series.iloc[inds] | ||
if i != len(columns) - 1 and values.duplicated().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values.is_unique
is more idiomatic
@@ -1455,6 +1455,16 @@ def test_nsmallest_nlargest(self): | |||
expected = s.sort_values().head(3) | |||
assert_series_equal(result, expected) | |||
|
|||
# GH 15297 | |||
s = Series([1] * 5, index=[1, 2, 3, 4, 5]) | |||
expected = Series([1] * 3, index=[1, 2, 3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add to the doc-string for both Series/DataFrame what the policy for dealing with duplicates is. IOW, we are going to always return n (or less values), so the duplicated portion would be in order of the index (see if you can reword!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback are you sure we need to add to the docstring, I think it's obvious enough here. keep=first
in this example, and the series values are all duplicated, so you would expect it to return [1, 2, 3]
as the index regardless of using nsmallest
or nlargest
pandas/core/algorithms.py
Outdated
cur_frame = cur_frame[series.isin(values)].sort_values( | ||
column, ascending=ascending | ||
) | ||
frame.index = tmp.ix[frame.index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use .ix
pandas/core/algorithms.py
Outdated
|
||
ascending = method == 'nsmallest' | ||
tmp = Series(frame.index) | ||
frame.reset_index(inplace=True, drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to do here?
inplace
is not idiomatic at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to reorder the incoming frame based on the specified sorting method. I can't do that using the original index as it might contain duplicates which gives rise to the original issue. Therefore I save the original index to temp and reset the index of frame inplace so we're not needlessly copying the entire frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, try to make as simple as possible then. Generally you want to use indexers (meaning locations/positions) for anything that could possibly be duplicated, then do a take at the end.
also you don't want to modify the input AT ALL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jreback, i understand the need to not change the input, but as you see on L968 I change the index back.
I don't think this solution is ideal, but I don't see how we can support duplicate indexes and not temporarily change the index on frame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can copy it if you need to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy is quite cheap (compared to everything else). we cannot modify the input at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK fair enough, just in my personal experience I deal with large dataframes (frames that just about fit in memory) and I have to be very careful about copying.
I have to sign off for the day, buy I'll clean this up tomorrow with a copy (also a positive is we only have to copy of the index is duplicated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*if the index is duplicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RogerThomas yes, this procedure only needs to be done on duplicated anyhow (which to be honest is not very common :)
thanks!
ping when you need a look
1664eb5
to
1a8043f
Compare
OK @jreback pretty sure this is ready to go now. Saving and restoring duplicated index turned out to be challenging when considering multi-indexes, but managed to get it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acutally I don't see any reason you need to actually copy the original frame, you simply need to keep a reference to it original_index
. Then you drop the index. I will reverse what I said earlier, it seems you can just do this for both duplicated and unique indexes.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -1057,3 +1057,4 @@ Bug Fixes | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
- Bug in ``pd.read_msgpack()`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`) | |||
- Bug in ``DataFrame.nsmallest`` and ``DataFrame.nlargest`` where identical values resulted in duplicated rows (:issue:`15297`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls move this to the correct section (reshaping) in Bug FIxes. be sure to rebase on master.
pandas/core/algorithms.py
Outdated
# We must save frame's index to tmp | ||
tmp = Series(np.arange(len(frame)), index=frame.index) | ||
frame = frame.reset_index(drop=True) | ||
for i, column in enumerate(columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on what this algo is doing.
pandas/core/algorithms.py
Outdated
if not index_is_unique: | ||
# If index not unique we must reset index to allow re-indexing below | ||
# We must save frame's index to tmp | ||
tmp = Series(np.arange(len(frame)), index=frame.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do
original_index = frame.index
2b47439
to
5f772db
Compare
['c'], | ||
['a', 'b'], | ||
['a', 'c'], | ||
['b', 'a'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RogerThomas you will see this test failing for the below. The issues is that if one of the columns you are n* on is object this will fail. So I think that should raise a nice error message (you can check if any are object upfront).
___________________________________________________________________________________________ TestNLargestNSmallest.test_n[10-order119] ___________________________________________________________________________________________
pandas/tests/frame/test_analytics.py:1945: in test_n
result = df.nsmallest(n, order)
pandas/core/frame.py:3478: in nsmallest
return algorithms.select_n_frame(self, columns, n, 'nsmallest', keep)
pandas/core/algorithms.py:966: in select_n_frame
values = getattr(series, method)(n, keep=keep)
pandas/core/series.py:1907: in nsmallest
method='nsmallest')
pandas/core/algorithms.py:915: in select_n_series
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
E TypeError: Cannot use method 'nsmallest' with dtype object
d08a60e
to
7f8cd04
Compare
@jreback updated tests for object dtypes Also had to update algorithm as your tests exposed a flaw in the previous algorithm |
pandas/core/algorithms.py
Outdated
return ser.merge(frame, on=columns[0], left_index=True)[frame.columns] | ||
for column in columns: | ||
dtype = frame[column].dtype | ||
if not issubclass(dtype.type, (np.integer, np.floating, np.datetime64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_integer_dtype, is_float_dtype, is_datetime64_any_dtype, is_timedelta64_dtype
; this would break on others (e.g. categorical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its possible we need additional tests for this (just checking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though, I think you call EACH column (as a Series) right? so this check is already there (well not 100% that is actually checks things).
Or you may skip a column if no dupes?
What I am trying to say is that I want to defer the dtype checks to the other algos if possible (and not do it here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, ok, sorry I thought by your "check upfront" comment you wanted to do it for each column in this method.
But you're right the line getattr(series, method)(cur_n, keep=keep)
should defer the check to n* on the series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, if we ARE checking all columns with this method (whether they are duplicated or not). Then this is fine. Is that always true?
pandas/core/algorithms.py
Outdated
series = cur_frame[column] | ||
values = getattr(series, method)(cur_n, keep=keep) | ||
is_last_column = len(columns) - 1 == i | ||
if is_last_column or len(values.unique()) == sum(series.isin(values)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use values.nunique() == series.isin(values).sum()
(you don't want to use python operators)
pandas/core/algorithms.py
Outdated
duplicated_filter = series.duplicated(keep=False) | ||
non_duplicated = values[~duplicated_filter] | ||
duplicated = values[duplicated_filter] | ||
if method == 'nsmallest': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be a tad more clear to do
def get_indexer(x, y):
if method == ....
then (eg.)
indexer = get_indexer(indexer, values.index)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what, I was SO close to doing that, but wasn't sure how you guys felt about inline functions. But, I agree, i'll do it now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am big fan of DRY, even at the expense of some readability.
pandas/tests/frame/test_analytics.py
Outdated
expected = df.sort_values(order, ascending=False).head(n) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_n_error(self, df_strings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob should expand this with a different df that has pretty much all dtypes, then can test ones which have an error
heres a sample
df = pd.DataFrame(
{'group': [1, 1, 2],
'int': [1, 2, 3],
'float': [4., 5., 6.],
'string': list('abc'),
'category_string': pd.Series(list('abc')).astype('category'),
'category_int': [7, 8, 9],
'datetime': pd.date_range('20130101', periods=3),
'datetimetz': pd.date_range('20130101',
periods=3,
tz='US/Eastern'),
'timedelta': pd.timedelta_range('1 s', periods=3, freq='s')},
columns=['group', 'int', 'float', 'string',
'category_string', 'category_int',
'datetime', 'datetimetz',
'timedelta'])
@pytest.mark.parametrize( | ||
"r", [Series([3., 2, 1, 2, '5'], dtype='object'), | ||
Series([3., 2, 1, 2, 5], dtype='object'), | ||
# not supported on some archs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a mini-example from test_error in frame
Hey @jreback |
pandas/tests/frame/test_analytics.py
Outdated
def test_n(self, df_strings, method, n, order): | ||
# GH10393 | ||
df = df_strings | ||
if order[0] == 'b': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this error if 'b' is in order
(IOW, it is by-definition an unorderable column)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, no because the only time the algorithm considers 'b' is if it's the first field it encounters, otherwise, because 'a' and 'c' are unique, 'b' will not be considered. See comment on line 1967
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think that is a bit odd then. Why shouldn't it raise if a non-valid column is in here?
# Top / bottom | ||
@pytest.mark.parametrize( | ||
'method, n, order', | ||
product(['nsmallest', 'nlargest'], range(1, 11), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding method doesn't add much here (and instead add both as comparision tests below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but after seeing your other methods, maybe ok. if its easier to debug then ok.
pandas/tests/frame/test_analytics.py
Outdated
# Only expect error when 'b' is first in order, as 'a' and 'c' are | ||
# unique | ||
error_msg = ( | ||
"'b' has dtype: object, cannot use method 'nsmallest' " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use tm.assertRaisesRegexp
when you want to compare the message as well (also a context manager)
this is prob because Series is not tested as well. Let me see if I can fix this (and you can rebase on top).
|
75ac0b7
to
901ab4f
Compare
@@ -944,14 +944,60 @@ def select_n_frame(frame, columns, n, method, keep): | |||
------- | |||
nordered : DataFrame | |||
""" | |||
from pandas.core.series import Series | |||
from pandas import Int64Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I only put this import here as the previous algorithm imported Series here, why do you do inline imports?
I would always have all my imports at the top
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because some of the routines from algorithms are imported at the very top level (e.g. pd.factorize
and you would get circular imports otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH ok, thanks!
@RogerThomas ok #15902 fixes the tz problem in series nsmallest. after I merge you should be able to easilly rebase on top, just take |
xref #15299 Author: Jeff Reback <[email protected]> Closes #15902 from jreback/series_n and squashes the following commits: 657eac8 [Jeff Reback] TST: better testing of Series.nlargest/nsmallest
@RogerThomas ok merged that PR in. pls rebase and should work! |
1d4be5e
to
333e126
Compare
OK @jreback tests passing for me |
@@ -981,14 +981,60 @@ def select_n_frame(frame, columns, n, method, keep): | |||
------- | |||
nordered : DataFrame | |||
""" | |||
from pandas.core.series import Series | |||
from pandas import Int64Index | |||
if not is_list_like(columns): | |||
columns = [columns] | |||
columns = list(columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to do this check on all the columns upfront (its cheap).
you can put this in a separate function on core/algorithms (called by there and here): https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L949
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do u want to reference the column in the exception? IOW
def _ensure_is_valid_dtype_n_method(method, dtype):
if not ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
needs_i8_conversion(dtype)):
raise TypeError("Cannot use method '{method}' with "
"dtype {dtype}".format(method=method, dtype=dtype))
or something like
def _is_valid_dtype_n_method(dtype):
return ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
needs_i8_conversion(dtype)):
and then in select_n_frame function something like
if not _is_valid_dtype_n_method(dtype):
raise TypeError("Column `column` has dtype `dtype` can't use n with column")
and then in select_n_series function something like
if not _is_valid_dtype_n_method(dtype):
raise TypeError("Cannot use method `method` with dtype")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _is_valid_dtype_n_method(dtype):
return ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
needs_i8_conversion(dtype))
then you can have separate messages for a series calling this (alone) and a dataframe with many columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though keep the message for Series pretty much the same (actually it IS tested).
4c1720d
to
d3964f8
Compare
thanks @RogerThomas very nice PR! and excellent responsiveness on changes. keem em coming! |
git diff upstream/master | flake8 --diff