Skip to content

API: ExtensionArray.argsort places the missing value at the end #26354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
c22c01b
Add mask as return value of _values_for_argsort
makbigc May 11, 2019
93ac589
Fix mypy typing and add annotation to the test
makbigc May 12, 2019
b401cba
Add whatsnew note
makbigc May 12, 2019
aa50eed
Modify nargsort
makbigc May 12, 2019
d38a5e1
Simplify the subfunction permutation
makbigc May 16, 2019
bae30fd
Remove test_argsort_nan_loc
makbigc May 16, 2019
effe18c
Change the docstring of _values_for_argsort
makbigc May 16, 2019
d1dd5fa
Remove the code related to test_argsort_nan_loc
makbigc May 16, 2019
bf1e927
Change docstring and move the entry in whatsnew
makbigc May 21, 2019
1925423
amend after review
makbigc May 26, 2019
c57dff7
Use sorting.nargsort and remove some _values_for_argsort
makbigc Jun 6, 2019
16cff4e
Fix lint error
makbigc Jun 6, 2019
bc713cd
remove relundent lines added in rebase
makbigc Jun 6, 2019
b10e5f9
Add test for NA-last behaviour
makbigc Jun 8, 2019
7811a08
Add _values_for_argsort in core/arrays/base.py
makbigc Jun 9, 2019
c172ba1
Fix PEP8 error
makbigc Jun 9, 2019
d9660b1
Add one more test
makbigc Jun 9, 2019
c333e8c
Add _values_for_argsort back to numpy_.py for efficiency
makbigc Jun 9, 2019
315e6f2
Set na_position last and revert _values_for_argsort in datetimelike.py
makbigc Jun 10, 2019
ebdfac2
Resolve conflict
makbigc Jun 11, 2019
09638f4
merge for update
makbigc Jun 14, 2019
0dc48a4
Add np.ndarray check to aviod double conversion
makbigc Jun 14, 2019
7186d9d
change whatsnew
makbigc Jun 14, 2019
84f87a8
Undo the addition
makbigc Jun 14, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -459,6 +459,7 @@ Other API Changes
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`)
- The ``.str``-accessor has been disabled for 1-level :class:`MultiIndex`, use :meth:`MultiIndex.to_flat_index` if necessary (:issue:`23679`)
- Removed support of gtk package for clipboards (:issue:`26563`)
- :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array, except :class:`Categorical` (:issue:`21801`)

.. _whatsnew_0250.deprecations:

Expand Down
9 changes: 5 additions & 4 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@

from pandas._typing import ArrayLike
from pandas.core import ops
from pandas.core.sorting import nargsort

_not_implemented_message = "{} does not implement {}."

Expand Down Expand Up @@ -403,13 +404,13 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""
# Implementor note: You have two places to override the behavior of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you augment the doc-string for argsort / _values_for_argsort to say that NaNs are placed at the end.

# argsort.
# 1. _values_for_argsort : construct the values passed to np.argsort
# 1. _values_for_argsort : construct the values passed to np.argsort.
# 2. argsort : total control over sorting.
ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)

values = self._values_for_argsort()
result = np.argsort(values, kind=kind, **kwargs)
if not ascending:
result = result[::-1]
result = nargsort(values, kind=kind, ascending=ascending,
na_position='last')
return result

def fillna(self, value=None, method=None, limit=None):
Expand Down
10 changes: 7 additions & 3 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1523,7 +1523,7 @@ def check_for_ordered(self, op):
def _values_for_argsort(self):
return self._codes.copy()

def argsort(self, *args, **kwargs):
def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
# TODO(PY2): use correct signature
# We have to do *args, **kwargs to avoid a a py2-only signature
# issue since np.argsort differs from argsort.
Expand Down Expand Up @@ -1567,8 +1567,12 @@ def argsort(self, *args, **kwargs):
>>> cat.argsort()
array([3, 0, 1, 2])
"""
# Keep the implementation here just for the docstring.
return super().argsort(*args, **kwargs)
ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)
values = self._values_for_argsort()
result = np.argsort(values, kind=kind, **kwargs)
if not ascending:
result = result[::-1]
return result

def sort_values(self, inplace=False, ascending=True, na_position='last'):
"""
Expand Down
17 changes: 0 additions & 17 deletions pandas/core/arrays/integer.py
Original file line number Diff line number Diff line change
Expand Up @@ -501,23 +501,6 @@ def value_counts(self, dropna=True):

return Series(array, index=index)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does we have a benchmark for sorting Int64? so does this change things?

Copy link
Contributor Author

@makbigc makbigc Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't found the benchmark for argsort. But this change slows down the performance.
HEAD:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
28.3 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

master:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
5.73 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this benchmark (make the arrays longer though)

"""Return values for sorting.

Returns
-------
ndarray
The transformed values should maintain the ordering between values
within the array.

See Also
--------
ExtensionArray.argsort
"""
data = self._data.copy()
data[self._mask] = data.min() - 1
return data

@classmethod
def _create_comparison_method(cls, op):
def cmp_method(self, other):
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/arrays/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -676,7 +676,7 @@ def _check_timedeltalike_freq_compat(self, other):
_raise_on_incompatible(self, other)

def _values_for_argsort(self):
return self._data
return self


PeriodArray._add_comparison_ops()
Expand Down
18 changes: 10 additions & 8 deletions pandas/core/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -255,15 +255,17 @@ def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
sorted_idx = np.roll(sorted_idx, cnt_null)
return sorted_idx

with warnings.catch_warnings():
# https://github.com/pandas-dev/pandas/issues/25439
# can be removed once ExtensionArrays are properly handled by nargsort
warnings.filterwarnings(
"ignore", category=FutureWarning,
message="Converting timezone-aware DatetimeArray to")
items = np.asanyarray(items)
idx = np.arange(len(items))
mask = isna(items)
if not isinstance(items, np.ndarray):
with warnings.catch_warnings():
# https://github.com/pandas-dev/pandas/issues/25439
# can be removed once ExtensionArrays are properly handled by
# nargsort
warnings.filterwarnings(
"ignore", category=FutureWarning,
message="Converting timezone-aware DatetimeArray to")
items = np.asanyarray(items)
idx = np.arange(len(items))
non_nans = items[~mask]
non_nan_idx = idx[~mask]
nan_idx = np.nonzero(mask)[0]
Expand Down
5 changes: 5 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,11 @@ def test_argsort_missing(self, data_missing_for_sorting):
expected = pd.Series(np.array([1, -1, 0], dtype=np.int64))
self.assert_series_equal(result, expected)

def test_argsort_nan_last(self, data_missing_for_sorting):
# GH 21801
result = data_missing_for_sorting.argsort()
assert result[-1] == 1

@pytest.mark.parametrize('ascending', [True, False])
def test_sort_values(self, data_for_sorting, ascending):
ser = pd.Series(data_for_sorting)
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/extension/decimal/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,12 @@ def _reduce(self, name, skipna=True, **kwargs):
"the {} operation".format(name))
return op(axis=0)

def _values_for_argsort(self):
data = self._data
mask = self.isna()
data[mask] = decimal.Decimal('Infinity')
return data


def to_decimal(values, context=None):
return DecimalArray([decimal.Decimal(x) for x in values], context=context)
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/extension/json/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -182,7 +182,10 @@ def _values_for_argsort(self):
# If all the elemnts of self are the same size P, NumPy will
# cast them to an (N, P) array, instead of an (N,) array of tuples.
frozen = [()] + [tuple(x.items()) for x in self]
return np.array(frozen, dtype=object)[1:]
data = np.array(frozen, dtype=object)[1:]
mask = self.isna()
data[mask] = np.nan
return data


def make_data():
Expand Down
5 changes: 5 additions & 0 deletions pandas/tests/extension/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -193,6 +193,11 @@ def test_searchsorted(self, data_for_sorting):
if not data_for_sorting.ordered:
raise pytest.skip(reason="searchsorted requires ordered data.")

def test_argsort_nan_last(self, data_missing_for_sorting):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this being skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NA-first behaviour of Categorical will be changed to NA-last in the next PR. So the original _values_for_argsort is placed here temporarily.

The corresponding test for NA-first behaviour of Categorical will also be changed.

# GH 21801
# TODO: Categorical.argsort places NA values at the end
pass


class TestCasting(base.BaseCastingTests):
pass
Expand Down
11 changes: 10 additions & 1 deletion pandas/tests/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@
import pytest

from pandas import (
DataFrame, MultiIndex, Series, array, concat, merge, to_datetime)
Categorical, DataFrame, MultiIndex, Series, array, concat, merge,
to_datetime)
from pandas.core import common as com
from pandas.core.sorting import (
decons_group_index, get_group_index, is_int64_overflow_possible,
Expand Down Expand Up @@ -188,6 +189,14 @@ def test_nargsort_datetimearray_warning(self):
with tm.assert_produces_warning(None):
nargsort(data)

@pytest.mark.parametrize('data', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually instead of here, can you add the test in pandas/tests/extension/base/missing.py (and it will be generic with the data_missing fixure)

Categorical(['a', 'c', 'a', 'b']),
to_datetime([0, 2, 0, 1]).tz_localize('Europe/Brussels')])
def test_nargsort_extension_array(self, data):
result = nargsort(data)
expected = np.array([0, 2, 3, 1], dtype=np.intp)
tm.assert_numpy_array_equal(result, expected)


class TestMerge:

Expand Down