Skip to content

COMPAT: Argsort position matches NumPy #31210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

TomAugspurger
Copy link
Contributor

Closes #29884

Bit of a WIP, CI is probably going to fail here. One API question, do we want NaT to sort at the end, regardless of the NumPy version? Or should we match the behavior of the installed NumPy (NaT at the start for <1.18, at the end for >=1.18)?

@TomAugspurger TomAugspurger added this to the 1.0.0 milestone Jan 22, 2020
@TomAugspurger TomAugspurger added Compat pandas objects compatability with Numpy or Python functions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Jan 22, 2020
@@ -180,6 +180,9 @@ def map(self, mapper, na_action=None):
except Exception:
return self.astype(object).map(mapper)

def argsort(self, *args, **kwargs) -> np.ndarray:
return np.argsort(self._data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think if we want this to be consistent across DTI/TDI/PI this needs to take a .view("M8[ns]")

Also should this happen at the DTA/TDA/PA level and get passed through?

# because the treatment of NaT has been changed to put NaT last
# instead of first.
sorted_values = np.sort(self.asi8)
sorted_values = np.sort(self._ndarray_values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means i8n for PeriodIndex (so NaT to the front), m8 for TimedeltaIndex (so NaT to the front until numpy updates td64 to match dt64) and M8 for DatetimeIndex, (so NaT to the end). Is that what you're going for here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy has fixed the timedelta sorting, at least for 1.19+

In [22]: arr = np.array([np.timedelta64(2, 'ns'), np.timedelta64('NaT', 'ns'), np.timedelta64(1, 'ns')])

In [23]: arr.sort()

In [24]: arr
Out[24]: array([    1,     2, 'NaT'], dtype='timedelta64[ns]')

@@ -710,7 +710,7 @@ def _from_factorized(cls, values, original):
return cls(values, dtype=original.dtype)

def _values_for_argsort(self):
return self._data
return self._data.view("M8[ns]")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel moved this down to the arrays. Basic idea is to just sort as datetime64[ns], which should be correct for datetime, timedelta, and period.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, thanks

@TomAugspurger
Copy link
Contributor Author

Perf looks OK I think

import pandas as pd
import numpy as np

arr = np.arange(10000)
np.random.shuffle(arr)
idx = pd.DatetimeIndex(arr)

%timeit idx.sort_values()

idx = pd.PeriodIndex(ordinal=arr, freq='D')
%timeit idx.sort_values()

Master

403 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
450 µs ± 9.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR

421 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
468 µs ± 8.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Not sure why they aren't identical yet. We're a bit outside the standard deviation on those runs.

# because the treatment of NaT has been changed to put NaT last
# instead of first.
sorted_values = np.sort(self.asi8)
values = self._data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use self._data.values_for_argsort?

def test_argsort(dtype):
a = pd.array([2001, pd.NaT, 2000], dtype=dtype)
result = a.argsort()
expected = np.array([2, 0, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im a bit confused by this. how is this prevented from varying based on the numpy version?

@jbrockmendel
Copy link
Member

Not sure why they aren't identical yet. We're a bit outside the standard deviation on those runs.

i think we get one, maybe two extra function-call-overheads; does that account for 15-20 microseconds?

@TomAugspurger
Copy link
Contributor Author

im a bit confused by this. how is this prevented from varying based on the numpy version?

That was my original question. Do we want to match the installed version of NumPy so that np.sort(ndarray[m8[ns]]) matches np.sort(DatetimeIndex), regardless of the NumPy versoin? Or do we want to adopt the new behavior of sorting last regardless of the NumPy version?

Both seem quite difficult, given that we sometimes do things like sort the datetimes, extract the i8 values, and then operate on them assuming they're sorted. I'm not confident that I can get this done for 1.0, if ever.

@jbrockmendel
Copy link
Member

I'm not confident that I can get this done for 1.0, if ever.

This does sound like the kind of problem that goes away on its own once our numpy min-version catches up. I lean towards "dont sweat it"

@TomAugspurger
Copy link
Contributor Author

I'm going to close this to get it out of the way.

@jorisvandenbossche
Copy link
Member

@TomAugspurger what was missing / wrong / not yet working in the current PR? (or just still a lot of failing tests that need further investigation?)

@TomAugspurger
Copy link
Contributor Author

Anything that relied on .sort_values().asi8 to be sorted was broken / incorrect. This includes .rolling, .resample, etc.

@jbrockmendel
Copy link
Member

Also affected: the ability to have DatetimeIndexOpsMixin.argsort dispatch to self._data.argsort

@jorisvandenbossche
Copy link
Member

Ah, that indeed sounds more complicated :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NumPy 1.18 behavior for sorting NaT
3 participants