COMPAT: Argsort position matches NumPy #31210

TomAugspurger · 2020-01-22T15:14:38Z

Bit of a WIP, CI is probably going to fail here. One API question, do we want NaT to sort at the end, regardless of the NumPy version? Or should we match the behavior of the installed NumPy (NaT at the start for <1.18, at the end for >=1.18)?

Closes pandas-dev#29884
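
For illustration only, here is a minimal sketch of the "NaT at the end, regardless of the NumPy version" option from the question above. The helper name `argsort_nat_last` is hypothetical and is not part of pandas or of this PR. It assumes the input is a contiguous `datetime64[ns]` ndarray and relies on NaT being stored as the minimum int64 value.

```python
import numpy as np

I8_NAT = np.iinfo(np.int64).min  # integer representation of NaT


def argsort_nat_last(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: argsort with NaT placed last on any NumPy version."""
    i8 = values.view("i8")
    n_nat = int((i8 == I8_NAT).sum())
    # In the int64 view NaT is the smallest value, so a stable sort puts
    # every NaT position first, in original order.
    order = np.argsort(i8, kind="stable")
    # Rotate the NaT positions from the front to the back.
    return np.concatenate([order[n_nat:], order[:n_nat]])


values = np.array(["2001-01-01", "NaT", "2000-01-01"], dtype="M8[ns]")
order = argsort_nat_last(values)
```

Because the sort runs on the integer view, the result does not depend on how the installed NumPy orders NaT.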

jbrockmendel · 2020-01-22T17:29:30Z

pandas/core/indexes/datetimelike.py

@@ -180,6 +180,9 @@ def map(self, mapper, na_action=None):
        except Exception:
            return self.astype(object).map(mapper)

+    def argsort(self, *args, **kwargs) -> np.ndarray:
+        return np.argsort(self._data)


i think if we want this to be consistent across DTI/TDI/PI this needs to take a .view("M8[ns]")

Also should this happen at the DTA/TDA/PA level and get passed through?

jbrockmendel · 2020-01-22T17:30:55Z

pandas/core/indexes/datetimelike.py

-            #  because the treatment of NaT has been changed to put NaT last
-            #  instead of first.
-            sorted_values = np.sort(self.asi8)
+            sorted_values = np.sort(self._ndarray_values)


this means i8 for PeriodIndex (so NaT to the front), m8 for TimedeltaIndex (so NaT to the front until numpy updates td64 to match dt64), and M8 for DatetimeIndex (so NaT to the end). Is that what you're going for here?

NumPy has fixed the timedelta sorting, at least for 1.19+

In [22]: arr = np.array([np.timedelta64(2, 'ns'), np.timedelta64('NaT', 'ns'), np.timedelta64(1, 'ns')])
In [23]: arr.sort()
In [24]: arr
Out[24]: array([ 1, 2, 'NaT'], dtype='timedelta64[ns]')

TomAugspurger · 2020-01-22T18:22:20Z

pandas/core/arrays/datetimelike.py

@@ -710,7 +710,7 @@ def _from_factorized(cls, values, original):
        return cls(values, dtype=original.dtype)

    def _values_for_argsort(self):
-        return self._data
+        return self._data.view("M8[ns]")


@jbrockmendel moved this down to the arrays. Basic idea is to just sort as datetime64[ns], which should be correct for datetime, timedelta, and period.

sounds good, thanks
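
A rough sketch of the "sort as datetime64[ns]" idea, using plain NumPy arrays as stand-ins for the data behind the timedelta and period arrays. The variable names are assumptions for illustration. Where NaT lands still depends on the installed NumPy, so the example uses NaT-free data.

```python
import numpy as np

# Hypothetical raw buffers standing in for timedelta and period-ordinal data.
td = np.array([2, 1, 3], dtype="m8[ns]")
ordinals = np.array([20, 10, 30], dtype="i8")

# .view reinterprets the same 8-byte values as datetime64[ns] without copying.
td_as_dt = td.view("M8[ns]")
ord_as_dt = ordinals.view("M8[ns]")

# Both now go through NumPy's datetime64 sorting rules.
order_td = np.argsort(td_as_dt)
order_ord = np.argsort(ord_as_dt)
```

Since the view shares memory with the original buffer, the extra cost should be limited to the call overhead rather than a copy.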

TomAugspurger · 2020-01-22T18:33:50Z

Perf looks OK I think

import pandas as pd
import numpy as np

arr = np.arange(10000)
np.random.shuffle(arr)
idx = pd.DatetimeIndex(arr)

%timeit idx.sort_values()

idx = pd.PeriodIndex(ordinal=arr, freq='D')
%timeit idx.sort_values()

Master

403 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
450 µs ± 9.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR

421 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
468 µs ± 8.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Not sure why they aren't identical yet. We're a bit outside the standard deviation on those runs.

jbrockmendel · 2020-01-23T00:07:03Z

pandas/core/indexes/datetimelike.py

-            #  because the treatment of NaT has been changed to put NaT last
-            #  instead of first.
-            sorted_values = np.sort(self.asi8)
+            values = self._data


maybe use self._data.values_for_argsort?

jbrockmendel · 2020-01-23T00:08:07Z

pandas/tests/arrays/test_datetimelike.py

+def test_argsort(dtype):
+    a = pd.array([2001, pd.NaT, 2000], dtype=dtype)
+    result = a.argsort()
+    expected = np.array([2, 0, 1])


im a bit confused by this. how is this prevented from varying based on the numpy version?

jbrockmendel · 2020-01-23T00:12:14Z

Not sure why they aren't identical yet. We're a bit outside the standard deviation on those runs.

i think we get one, maybe two extra function-call-overheads; does that account for 15-20 microseconds?

TomAugspurger · 2020-01-23T13:31:53Z

im a bit confused by this. how is this prevented from varying based on the numpy version?

That was my original question. Do we want to match the installed version of NumPy so that np.sort(ndarray[m8[ns]]) matches np.sort(DatetimeIndex), regardless of the NumPy version? Or do we want to adopt the new behavior of sorting last regardless of the NumPy version?

Both seem quite difficult, given that we sometimes do things like sort the datetimes, extract the i8 values, and then operate on them assuming they're sorted. I'm not confident that I can get this done for 1.0, if ever.
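
To make the first option concrete, a test that follows the installed NumPy could branch on the version, roughly as below. This is a sketch and not the test from the PR. The 1.18 cutoff is the one quoted in the PR description, and `nat_sorts_last` is a name introduced here for illustration.

```python
import numpy as np
from numpy.lib import NumpyVersion

# Per the PR description: NaT sorts first for NumPy < 1.18, last for >= 1.18.
nat_sorts_last = NumpyVersion(np.__version__) >= "1.18.0"

values = np.array(["2001", "NaT", "2000"], dtype="M8[ns]")
result = np.argsort(values)

# The expected positions depend on which behavior the installed NumPy has.
expected = np.array([2, 0, 1]) if nat_sorts_last else np.array([1, 2, 0])
```

The second option would instead hard-code `[2, 0, 1]` and require pandas to reorder NaT itself on older NumPy versions.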

jbrockmendel · 2020-01-24T01:08:56Z

I'm not confident that I can get this done for 1.0, if ever.

This does sound like the kind of problem that goes away on its own once our numpy min-version catches up. I lean towards "dont sweat it"

TomAugspurger · 2020-01-27T14:05:16Z

I'm going to close this to get it out of the way.

jorisvandenbossche · 2020-01-27T21:23:35Z

@TomAugspurger what was missing / wrong / not yet working in the current PR? (or just still a lot of failing tests that need further investigation?)

TomAugspurger · 2020-01-27T21:25:44Z

Anything that relied on .sort_values().asi8 to be sorted was broken / incorrect. This includes .rolling, .resample, etc.
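
A small sketch of why that assumption breaks. The NaT-last order is built by hand here so the result does not depend on the NumPy version. The variable names are illustrative only.

```python
import numpy as np

I8_NAT = np.iinfo(np.int64).min  # integer representation of NaT

values = np.array(["2001", "NaT", "2000"], dtype="M8[ns]")
i8 = values.view("i8")
mask = i8 == I8_NAT

# Emulate "sorted with NaT last", then look at the underlying integers.
nat_last_i8 = np.concatenate([np.sort(i8[~mask]), i8[mask]])

# Pairwise comparison avoids the int64 overflow that np.diff would hit here.
is_monotonic = bool(np.all(nat_last_i8[:-1] <= nat_last_i8[1:]))
```

Here `is_monotonic` comes out `False`, because the trailing NaT is the smallest int64 value. Code that takes the i8 values of a sorted index and assumes they are increasing would no longer be safe.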

jbrockmendel · 2020-01-28T01:48:41Z

Also affected: the ability to have DatetimeIndexOpsMixin.argsort dispatch to self._data.argsort

jorisvandenbossche · 2020-01-28T06:43:30Z

Ah, that indeed sounds more complicated :)

COMPAT: Argsort position matches NumPy

a9921ba

Closes pandas-dev#29884

TomAugspurger added this to the 1.0.0 milestone Jan 22, 2020

TomAugspurger added the Compat, Missing-data, and Datetime labels Jan 22, 2020

jbrockmendel reviewed Jan 22, 2020

TomAugspurger added 3 commits January 22, 2020 12:06

move to arrays

ee6b068

wip

f13c8ab

fixup

c387d85

TomAugspurger commented Jan 22, 2020

double-view

3b01363

jbrockmendel reviewed Jan 23, 2020

TomAugspurger mentioned this pull request Jan 24, 2020

NumPy 1.18 behavior for sorting NaT #29884

Closed

TomAugspurger closed this Jan 27, 2020
