Skip to content

API: ExtensionArray.argsort places the missing value at the end #26354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from
Closed
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
c22c01b
Add mask as return value of _values_for_argsort
makbigc May 11, 2019
93ac589
Fix mypy typing and add annotation to the test
makbigc May 12, 2019
b401cba
Add whatsnew note
makbigc May 12, 2019
aa50eed
Modify nargsort
makbigc May 12, 2019
d38a5e1
Simplify the subfunction permutation
makbigc May 16, 2019
bae30fd
Remove test_argsort_nan_loc
makbigc May 16, 2019
effe18c
Change the docstring of _values_for_argsort
makbigc May 16, 2019
d1dd5fa
Remove the code related to test_argsort_nan_loc
makbigc May 16, 2019
bf1e927
Change docstring and move the entry in whatsnew
makbigc May 21, 2019
1925423
amend after review
makbigc May 26, 2019
c57dff7
Use sorting.nargsort and remove some _values_for_argsort
makbigc Jun 6, 2019
16cff4e
Fix lint error
makbigc Jun 6, 2019
bc713cd
remove relundent lines added in rebase
makbigc Jun 6, 2019
b10e5f9
Add test for NA-last behaviour
makbigc Jun 8, 2019
7811a08
Add _values_for_argsort in core/arrays/base.py
makbigc Jun 9, 2019
c172ba1
Fix PEP8 error
makbigc Jun 9, 2019
d9660b1
Add one more test
makbigc Jun 9, 2019
c333e8c
Add _values_for_argsort back to numpy_.py for efficiency
makbigc Jun 9, 2019
315e6f2
Set na_position last and revert _values_for_argsort in datetimelike.py
makbigc Jun 10, 2019
ebdfac2
Resolve conflict
makbigc Jun 11, 2019
09638f4
merge for update
makbigc Jun 14, 2019
0dc48a4
Add np.ndarray check to aviod double conversion
makbigc Jun 14, 2019
7186d9d
change whatsnew
makbigc Jun 14, 2019
84f87a8
Undo the addition
makbigc Jun 14, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,6 @@ Other Enhancements
- :meth:`DataFrame.query` and :meth:`DataFrame.eval` now supports quoting column names with backticks to refer to names with spaces (:issue:`6508`)
- :func:`merge_asof` now gives a more clear error message when merge keys are categoricals that are not equal (:issue:`26136`)
- :meth:`pandas.core.window.Rolling` supports exponential (or Poisson) window type (:issue:`21303`)
-

.. _whatsnew_0250.api_breaking:

Expand Down Expand Up @@ -438,6 +437,7 @@ Other API Changes
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`)
- The ``.str``-accessor has been disabled for 1-level :class:`MultiIndex`, use :meth:`MultiIndex.to_flat_index` if necessary (:issue:`23679`)
- Removed support of gtk package for clipboards (:issue:`26563`)
- :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array (:issue:`21801`)

.. _whatsnew_0250.deprecations:

Expand Down
35 changes: 13 additions & 22 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@

from pandas._typing import ArrayLike
from pandas.core import ops
from pandas.core.sorting import nargsort

_not_implemented_message = "{} does not implement {}."

Expand Down Expand Up @@ -361,23 +362,6 @@ def isna(self) -> ArrayLike:
"""
raise AbstractMethodError(self)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we leave this and just enhance the signature, passing na_position ? then the default implementation will just work

Copy link
Contributor Author

@makbigc makbigc Jun 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values_for_argsort is added backed to ExtensionArray interface. Actually, _values_for_argsort which calls np.array does the same thing as np.asanyarray in nargsort. But the code in ExtensionArray.argsort is organized in more general way.

DatetimeLikeArray and PeriodArray's own _values_for_argsort methods are deleted. Otherwise, NA values in PeriodArray would get a negative number from self._data which would be placed in the beginning in nargsort.

"""
Return values for sorting.

Returns
-------
ndarray
The transformed values should maintain the ordering between values
within the array.

See Also
--------
ExtensionArray.argsort
"""
# Note: this is used in `ExtensionArray.argsort`.
return np.array(self)

def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""
Return the indices that would sort this array.
Expand All @@ -403,13 +387,20 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""
# Implementor note: You have two places to override the behavior of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you augment the doc-string for argsort / _values_for_argsort to say that NaNs are placed at the end.

# argsort.
# 1. _values_for_argsort : construct the values passed to np.argsort
# 1. _values_for_argsort : construct the values passed to np.argsort.
# The transformed values should maintain the ordering between values
# within the array. Preferably, the largest value is assiged to NA
# values in transformation.
# 2. argsort : total control over sorting.
ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)
values = self._values_for_argsort()
result = np.argsort(values, kind=kind, **kwargs)
if not ascending:
result = result[::-1]

if hasattr(self, '_values_for_argsort'):
values = self._values_for_argsort()
else:
values = self
na_position = 'last' if ascending else 'first'
result = nargsort(values, kind=kind, ascending=ascending,
na_position=na_position)
return result

def fillna(self, value=None, method=None, limit=None):
Expand Down
10 changes: 7 additions & 3 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1523,7 +1523,7 @@ def check_for_ordered(self, op):
def _values_for_argsort(self):
return self._codes.copy()

def argsort(self, *args, **kwargs):
def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
# TODO(PY2): use correct signature
# We have to do *args, **kwargs to avoid a a py2-only signature
# issue since np.argsort differs from argsort.
Expand Down Expand Up @@ -1567,8 +1567,12 @@ def argsort(self, *args, **kwargs):
>>> cat.argsort()
array([3, 0, 1, 2])
"""
# Keep the implementation here just for the docstring.
return super().argsort(*args, **kwargs)
ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)
values = self._values_for_argsort()
result = np.argsort(values, kind=kind, **kwargs)
if not ascending:
result = result[::-1]
return result

def sort_values(self, inplace=False, ascending=True, na_position='last'):
"""
Expand Down
3 changes: 0 additions & 3 deletions pandas/core/arrays/datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -620,9 +620,6 @@ def _values_for_factorize(self):
def _from_factorized(cls, values, original):
return cls(values, dtype=original.dtype)

def _values_for_argsort(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to cast to object dtype for datetimetz. I think this can be reverted.

In [38]: a = pd.date_range('2000', periods=100, tz='US/Eastern')._data

In [39]: %timeit a.argsort()
325 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]: b = pd.date_range('2000', periods=100)._data

In [41]: %timeit b.argsort()
44.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This should be reve

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc I believe you reverted this and pref is back to normal, can you confirm (also do we hav an asv for this example that @TomAugspurger gave; if not can you add).

return self._data

# ------------------------------------------------------------------
# Additional array methods
# These are not part of the EA API, but we implement them because
Expand Down
17 changes: 0 additions & 17 deletions pandas/core/arrays/integer.py
Original file line number Diff line number Diff line change
Expand Up @@ -512,23 +512,6 @@ def value_counts(self, dropna=True):

return Series(array, index=index)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does we have a benchmark for sorting Int64? so does this change things?

Copy link
Contributor Author

@makbigc makbigc Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't found the benchmark for argsort. But this change slows down the performance.
HEAD:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
28.3 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

master:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
5.73 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this benchmark (make the arrays longer though)

"""Return values for sorting.

Returns
-------
ndarray
The transformed values should maintain the ordering between values
within the array.

See Also
--------
ExtensionArray.argsort
"""
data = self._data.copy()
data[self._mask] = data.min() - 1
return data

@classmethod
def _create_comparison_method(cls, op):
def cmp_method(self, other):
Expand Down
3 changes: 0 additions & 3 deletions pandas/core/arrays/numpy_.py
Original file line number Diff line number Diff line change
Expand Up @@ -280,9 +280,6 @@ def take(self, indices, allow_fill=False, fill_value=None):
def copy(self, deep=False):
return type(self)(self._ndarray.copy())

def _values_for_argsort(self):
return self._ndarray

def _values_for_factorize(self):
return self._ndarray, -1

Expand Down
3 changes: 0 additions & 3 deletions pandas/core/arrays/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -673,9 +673,6 @@ def _check_timedeltalike_freq_compat(self, other):

_raise_on_incompatible(self, other)

def _values_for_argsort(self):
return self._data


PeriodArray._add_comparison_ops()

Expand Down
5 changes: 5 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,11 @@ def test_argsort_missing(self, data_missing_for_sorting):
expected = pd.Series(np.array([1, -1, 0], dtype=np.int64))
self.assert_series_equal(result, expected)

def test_argsort_nan_last(self, data_missing_for_sorting):
# GH 21801
result = data_missing_for_sorting.argsort()
assert result[-1] == 1

@pytest.mark.parametrize('ascending', [True, False])
def test_sort_values(self, data_for_sorting, ascending):
ser = pd.Series(data_for_sorting)
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/extension/decimal/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,12 @@ def _reduce(self, name, skipna=True, **kwargs):
"the {} operation".format(name))
return op(axis=0)

def _values_for_argsort(self):
data = self._data
mask = self.isna()
data[mask] = decimal.Decimal('Infinity')
return data


def to_decimal(values, context=None):
return DecimalArray([decimal.Decimal(x) for x in values], context=context)
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/extension/json/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -182,7 +182,10 @@ def _values_for_argsort(self):
# If all the elemnts of self are the same size P, NumPy will
# cast them to an (N, P) array, instead of an (N,) array of tuples.
frozen = [()] + [tuple(x.items()) for x in self]
return np.array(frozen, dtype=object)[1:]
data = np.array(frozen, dtype=object)[1:]
mask = self.isna()
data[mask] = np.nan
return data


def make_data():
Expand Down
5 changes: 5 additions & 0 deletions pandas/tests/extension/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -193,6 +193,11 @@ def test_searchsorted(self, data_for_sorting):
if not data_for_sorting.ordered:
raise pytest.skip(reason="searchsorted requires ordered data.")

def test_argsort_nan_last(self, data_missing_for_sorting):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this being skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NA-first behaviour of Categorical will be changed to NA-last in the next PR. So the original _values_for_argsort is placed here temporarily.

The corresponding test for NA-first behaviour of Categorical will also be changed.

# GH 21801
# TODO: Categorical.argsort places NA values at the end
pass


class TestCasting(base.BaseCastingTests):
pass
Expand Down