-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
COMPAT: Argsort position matches NumPy #31210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/indexes/datetimelike.py
Outdated
@@ -180,6 +180,9 @@ def map(self, mapper, na_action=None): | |||
except Exception: | |||
return self.astype(object).map(mapper) | |||
|
|||
def argsort(self, *args, **kwargs) -> np.ndarray: | |||
return np.argsort(self._data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think if we want this to be consistent across DTI/TDI/PI this needs to take a .view("M8[ns]")
Also should this happen at the DTA/TDA/PA level and get passed through?
pandas/core/indexes/datetimelike.py
Outdated
# because the treatment of NaT has been changed to put NaT last | ||
# instead of first. | ||
sorted_values = np.sort(self.asi8) | ||
sorted_values = np.sort(self._ndarray_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means i8n for PeriodIndex (so NaT to the front), m8 for TimedeltaIndex (so NaT to the front until numpy updates td64 to match dt64) and M8 for DatetimeIndex, (so NaT to the end). Is that what you're going for here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy has fixed the timedelta sorting, at least for 1.19+
In [22]: arr = np.array([np.timedelta64(2, 'ns'), np.timedelta64('NaT', 'ns'), np.timedelta64(1, 'ns')])
In [23]: arr.sort()
In [24]: arr
Out[24]: array([ 1, 2, 'NaT'], dtype='timedelta64[ns]')
@@ -710,7 +710,7 @@ def _from_factorized(cls, values, original): | |||
return cls(values, dtype=original.dtype) | |||
|
|||
def _values_for_argsort(self): | |||
return self._data | |||
return self._data.view("M8[ns]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel moved this down to the arrays. Basic idea is to just sort as datetime64[ns], which should be correct for datetime, timedelta, and period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, thanks
Perf looks OK I think import pandas as pd
import numpy as np
arr = np.arange(10000)
np.random.shuffle(arr)
idx = pd.DatetimeIndex(arr)
%timeit idx.sort_values()
idx = pd.PeriodIndex(ordinal=arr, freq='D')
%timeit idx.sort_values() Master
PR
Not sure why they aren't identical yet. We're a bit outside the standard deviation on those runs. |
# because the treatment of NaT has been changed to put NaT last | ||
# instead of first. | ||
sorted_values = np.sort(self.asi8) | ||
values = self._data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use self._data.values_for_argsort?
def test_argsort(dtype): | ||
a = pd.array([2001, pd.NaT, 2000], dtype=dtype) | ||
result = a.argsort() | ||
expected = np.array([2, 0, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im a bit confused by this. how is this prevented from varying based on the numpy version?
i think we get one, maybe two extra function-call-overheads; does that account for 15-20 microseconds? |
That was my original question. Do we want to match the installed version of NumPy so that Both seem quite difficult, given that we sometimes do things like sort the datetimes, extract the i8 values, and then operate on them assuming they're sorted. I'm not confident that I can get this done for 1.0, if ever. |
This does sound like the kind of problem that goes away on its own once our numpy min-version catches up. I lean towards "dont sweat it" |
I'm going to close this to get it out of the way. |
@TomAugspurger what was missing / wrong / not yet working in the current PR? (or just still a lot of failing tests that need further investigation?) |
Anything that relied on |
Also affected: the ability to have DatetimeIndexOpsMixin.argsort dispatch to self._data.argsort |
Ah, that indeed sounds more complicated :) |
Closes #29884
Bit of a WIP, CI is probably going to fail here. One API question, do we want NaT to sort at the end, regardless of the NumPy version? Or should we match the behavior of the installed NumPy (NaT at the start for <1.18, at the end for >=1.18)?