COMPAT: Argsort position matches NumPy #31210

Closed
wants to merge 5 commits into from
Changes from 1 commit
38 changes: 38 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -246,6 +246,44 @@ source, you should no longer need to install Cython into your build environment
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_100.api_breaking.nat_sort:

Changed sort position for ``NaT``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:attr:`NaT` will now sort at the *end* rather than the beginning in sorting functions (:issue:`29884`).
This matches the behavior in NumPy 1.18 and newer, which makes the ``NaT`` behavior consistent
with other missing values like :attr:`numpy.nan`.

.. ipython:: python

values = pd.Index(['2001', 'NaT', '2000'], dtype='datetime64[ns]')

*pandas 0.25.x*

.. code-block:: python

>>> values.sort_values()
DatetimeIndex(['NaT', '2000-01-01', '2001-01-01'], dtype='datetime64[ns]', freq=None)

>>> values.argsort()
array([1, 2, 0])


*pandas 1.0.0*

.. ipython:: python

values.sort_values()
values.argsort()

This affects all sorting functions on indexes, Series, DataFrames, and arrays.


.. note::

This change was made between pandas 1.0.0rc0 and pandas 1.0.0.

.. _whatsnew_100.api_breaking.MultiIndex._names:

Avoid using names from ``MultiIndex.levels``
8 changes: 4 additions & 4 deletions pandas/core/indexes/datetimelike.py
@@ -180,6 +180,9 @@ def map(self, mapper, na_action=None):
except Exception:
return self.astype(object).map(mapper)

+def argsort(self, *args, **kwargs) -> np.ndarray:
+    return np.argsort(self._data)

Member

I think if we want this to be consistent across DTI/TDI/PI, this needs to take a .view("M8[ns]").

Also, should this happen at the DTA/TDA/PA level and get passed through?
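
To make the `.view("M8[ns]")` point concrete, here is a minimal NumPy-only sketch. It assumes NumPy 1.18 or newer, and the array is a stand-in built for illustration (it reuses the three timestamps from the whatsnew entry, not the index internals touched by this PR).

```python
import numpy as np

# Same three timestamps as the whatsnew example: 2001, NaT, 2000.
values = np.array(["2001", "NaT", "2000"], dtype="M8[ns]")

# The int64 view stores NaT as the smallest int64, so it sorts first.
as_i8 = values.view("i8")
order_i8 = np.argsort(as_i8, kind="stable")

# Viewing the integers back as datetime64 lets NumPy (assumed >= 1.18)
# recognise NaT and place it last.
order_m8 = np.argsort(as_i8.view("M8[ns]"), kind="stable")

print(order_i8.tolist())
print(order_m8.tolist())
```

The integer ordering reproduces the pandas 0.25.x `argsort` result quoted in the whatsnew entry, while the `M8[ns]` ordering moves the `NaT` position to the end. `kind="stable"` is used here only to keep the output deterministic.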


def sort_values(self, return_indexer=False, ascending=True):
"""
Return sorted copy of Index.
@@ -191,10 +194,7 @@ def sort_values(self, return_indexer=False, ascending=True):
sorted_index = self.take(_as)
return sorted_index, _as
else:
-# NB: using asi8 instead of _ndarray_values matters in numpy 1.18
-# because the treatment of NaT has been changed to put NaT last
-# instead of first.
-sorted_values = np.sort(self.asi8)
+sorted_values = np.sort(self._ndarray_values)

Member

This means i8 for PeriodIndex (so NaT to the front), m8 for TimedeltaIndex (so NaT to the front until NumPy updates td64 to match dt64), and M8 for DatetimeIndex (so NaT to the end). Is that what you're going for here?

Contributor Author

NumPy has fixed the timedelta sorting, at least for 1.19+

In [22]: arr = np.array([np.timedelta64(2, 'ns'), np.timedelta64('NaT', 'ns'), np.timedelta64(1, 'ns')])

In [23]: arr.sort()

In [24]: arr
Out[24]: array([    1,     2, 'NaT'], dtype='timedelta64[ns]')
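
The question of which underlying dtype puts `NaT` where can be checked directly. The sketch below is illustrative only: `nat_sorts_last` is a hypothetical helper, not part of pandas or NumPy, and the results assume a recent NumPy in which both `datetime64` and `timedelta64` sort `NaT` last.

```python
import numpy as np

# NaT shares its bit pattern with the smallest int64.
NAT_I8 = np.iinfo(np.int64).min


def nat_sorts_last(dtype: str) -> bool:
    """Hypothetical helper: does np.sort put the NaT bit pattern last?"""
    raw = np.array([2, NAT_I8, 1], dtype="i8")
    # Reinterpret the same 8-byte values as the requested dtype, then sort.
    sorted_arr = np.sort(raw.view(dtype))
    return bool(sorted_arr.view("i8")[-1] == NAT_I8)


for dtype in ("i8", "m8[ns]", "M8[ns]"):
    print(dtype, nat_sorts_last(dtype))
```

Sorting the plain `i8` values treats the `NaT` bit pattern as an ordinary (very small) integer, which is why the choice between `asi8` and the datetime-like view changes where `NaT` lands.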


freq = self.freq
if freq is not None and not is_period_dtype(self):
9 changes: 6 additions & 3 deletions pandas/tests/indexes/datetimes/test_ops.py
@@ -247,7 +247,7 @@ def test_order_with_freq(self, idx):
),
(
[pd.NaT, "2011-01-03", "2011-01-05", "2011-01-02", pd.NaT],
-[pd.NaT, pd.NaT, "2011-01-02", "2011-01-03", "2011-01-05"],
+["2011-01-02", "2011-01-03", "2011-01-05", pd.NaT, pd.NaT],
),
],
)
@@ -269,14 +269,17 @@ def test_order_without_freq(self, index_dates, expected_dates, tz_naive_fixture)
ordered, indexer = index.sort_values(return_indexer=True)
tm.assert_index_equal(ordered, expected)

-exp = np.array([0, 4, 3, 1, 2])
+if index.isna().any():
+    exp = np.array([3, 1, 2, 0, 4])
+else:
+    exp = np.array([0, 4, 3, 1, 2])
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
assert ordered.freq is None

ordered, indexer = index.sort_values(return_indexer=True, ascending=False)
tm.assert_index_equal(ordered, expected[::-1])

-exp = np.array([2, 1, 3, 4, 0])
+exp = exp[::-1]
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
assert ordered.freq is None
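
The expected indexers for the `NaT` case in this test can be reproduced with plain NumPy. This is a sketch under two assumptions: NumPy 1.18 or newer, and a stable sort so that the two `NaT` entries keep their original relative order. It does not claim to be the code path pandas itself takes.

```python
import numpy as np

# The parametrized input from the test case that contains NaT.
dates = np.array(
    ["NaT", "2011-01-03", "2011-01-05", "2011-01-02", "NaT"],
    dtype="datetime64[ns]",
)

# Ascending order: real timestamps first, then the NaT positions (0 and 4).
ascending = np.argsort(dates, kind="stable")

# The updated test derives the descending indexer by reversing the ascending one.
descending = ascending[::-1]

print(ascending.tolist())
print(descending.tolist())
```

Reversing the ascending indexer puts the `NaT` positions at the front of the descending result, which is consistent with the test comparing the descending output against `expected[::-1]`.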
