Skip to content

NumPy 1.18 behavior for sorting NaT #29884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Nov 27, 2019 · 17 comments · Fixed by #36176
Closed

NumPy 1.18 behavior for sorting NaT #29884

TomAugspurger opened this issue Nov 27, 2019 · 17 comments · Fixed by #36176
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Followup to #29877. cc @jbrockmendel if you could fill out the details.

  • Adopt NaT at the end for Index? (we already do for Series)
  • Report timedelta NaT-sorting issue to NumPy?
@jbrockmendel jbrockmendel added the Compat pandas objects compatability with Numpy or Python functions label Nov 28, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Nov 29, 2019
@jorisvandenbossche
Copy link
Member

I reported the timedelta sorting to numpy, see numpy/numpy#15063, I think the plan is to fix this as well for numpy 1.8

I would propose we follow this change for Index.

@jbrockmendel
Copy link
Member

xref #30460.

@TomAugspurger
Copy link
Contributor Author

So the remaining item here is sorting NaT at the end for DatetimeIndex and TimedeltaIndex? So this would change from

In [4]: pd.DatetimeIndex(['2000', None, '2001']).sort_values()
Out[4]: DatetimeIndex(['NaT', '2000-01-01', '2001-01-01'], dtype='datetime64[ns]', freq=None)

to

DatetimeIndex(['2000-01-01', '2001-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

Is this a blocker for 1.0? If so, do we need a deprecation cycle?

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Dec 30, 2019
@jbrockmendel
Copy link
Member

jbrockmendel commented Jan 2, 2020 via email

@jorisvandenbossche
Copy link
Member

Since other index types sort the NaNs last, I would maybe consider this a bug and just change it (that's also what numpy did, IIUC). Eg:

In [22]: pd.Index([1, np.nan, 2]).sort_values() 
Out[22]: Float64Index([1.0, 2.0, nan], dtype='float64')

If we want to do a deprecation cycle, we probably need to add a na_position keyword to Index.sort_values? (Series.sort_values has that, but Index not)

@jbrockmendel
Copy link
Member

If we want to do a deprecation cycle, we probably need to add a na_position keyword to Index.sort_values?

if we're calling it a bugfix we can probably do without a deprecation cycle

Do we have consensus that long-term we want NaT to be sorted last for each of DTA/TDA/PA/DTI/TDI/PI? If so, we should be able to update the DTA/TDA/PA versions without a deprecation cycle.

@TomAugspurger
Copy link
Contributor Author

Following NumPy seems fine to me.

@jorisvandenbossche
Copy link
Member

I think this is a blocker for the RC ...

@jorisvandenbossche
Copy link
Member

I quickly looked at this. Fixing it for the latest numpy is easy (as we then just need to sort with the M8 values instead of i8 values), but we probably want to have it consistent for all numpy versions? (ensuring that might involve a bit more trickery)

The change to enable it for latest numpy would be something like:

--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -191,10 +191,7 @@ class DatetimeIndexOpsMixin(ExtensionIndex):
             sorted_index = self.take(_as)
             return sorted_index, _as
         else:
-            # NB: using asi8 instead of _ndarray_values matters in numpy 1.18
-            #  because the treatment of NaT has been changed to put NaT last
-            #  instead of first.
-            sorted_values = np.sort(self.asi8)
+            sorted_values = np.sort(self.values)
             attribs = self._get_attributes_dict()
             freq = attribs["freq"]

@TomAugspurger
Copy link
Contributor Author

I'm OK with this missing the RC (or we can include it in a second RC if we're doing that).

Otherwise, we'll likely need a keyword to control the na position, which can be done anytime.

@jorisvandenbossche
Copy link
Member

I am thinking we should maybe just change it, still for 1.0.0. Not ideal that it was not in the RC, but it's a clear inconsistency within pandas, and it would also be good to be consistent with latest numpy.

@jorisvandenbossche
Copy link
Member

@TomAugspurger @jbrockmendel thoughts on this?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 20, 2020 via email

@jorisvandenbossche
Copy link
Member

@jbrockmendel do you have time to look at this?

@TomAugspurger
Copy link
Contributor Author

I'm looking into it today.

@TomAugspurger
Copy link
Contributor Author

Just FYI, I started #31210, but am unsure how to proceed. The fact that DatetimeIndex.sort_values().asi8 will no longer be sorted makes this pretty complex to get right.

@TomAugspurger
Copy link
Contributor Author

Pushing this from 1.0 unfortunately. We'll need to re-evaluate how to proceed, but I think an option to control the na_position in sort_values makes the most sense.

@TomAugspurger TomAugspurger modified the milestones: 1.0.0, Contributions Welcome Jan 27, 2020
@mroeschke mroeschke added the Bug label Apr 2, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Sep 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
5 participants