Skip to content

DataFrame sort_values and multiple "by" columns fails to order NaT correctly (since v0.19) #16836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
arc12 opened this issue Jul 6, 2017 · 7 comments
Labels
Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@arc12
Copy link

arc12 commented Jul 6, 2017

Code Used to Demonstrate Issue

import pandas as pd
import numpy as np
import datetime as dt
import random

print 'Pandas Version', pd.__version__

now = dt.datetime.today()
d1 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,3)] * 2
d2 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,6)]
d1[3] = np.nan

print "\nSort on Dates"
df_t = pd.DataFrame({'a': d1, 'b':d2})
print "column type:", df_t.a.dtype
print "\npresorted:"
print df_t
print "\nsort by=['a', 'b'], na_position='last':"
print df_t.sort_values(by=['a', 'b'], na_position='last')
print "\nsort by=['a'], na_position='last':"
print df_t.sort_values(by=['a'], na_position='last')

Problem description

Considering sort_values(by=['a', 'b'], na_position='last')
In v0.19.2 this works as expected:

  • primary sort on a, putting NaT last
  • then secondary sort on b

In v0.20.2 this does not work as expected, with NaT in 'a' appearing first. Sorting only on 'a' does put NaT last.

Expected Output - as observed for v0.19

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705

sort by=['a', 'b'], na_position='last':
                        a                       b
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

Bad Output - observed for v0.20

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654

sort by=['a', 'b'], na_position='last':
                        a                       b
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.2
pytest: 2.8.5
pip: 8.1.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.1.1
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None

@jreback
Copy link
Contributor

jreback commented Jul 6, 2017

yeah looks a bit buggy. happy to have you take a look.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Datetime Datetime data dtype Difficulty Intermediate Regression Functionality that used to work in a prior pandas version and removed Bug labels Jul 6, 2017
@jreback jreback added this to the 0.21.0 milestone Jul 6, 2017
@jdeschenes
Copy link
Contributor

A bit of investigation:

The problem starts here

The array is converted to a int64 view and the nan is thus lost. I presume this is done for speed up?

I think lexsort_indexer will need to be aware that it is being passed a converted datetime and special case the code for NaT. What do you think?

@jreback
Copy link
Contributor

jreback commented Jul 14, 2017

These are converted just fine. (now I wouldn't mind actually moving the key conversions inside lexsort_indexer though (e.g. we do this for Categoricals now).

In [6]: pd.DatetimeIndex(['20170101', 'nat'])
Out[6]: DatetimeIndex(['2017-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

In [7]: pd.DatetimeIndex(['20170101', 'nat']).view('i8')
Out[7]: array([ 1483228800000000000, -9223372036854775808])

@jdeschenes
Copy link
Contributor

The problem is that lexsort_indexer doesn't know that it is being passed a datetime instead of a i8.

If the function was converting the array instead, it could special case the NaT.

@jreback
Copy link
Contributor

jreback commented Jul 14, 2017

actually in this case these can be simply converted to categoricals before passing to lexsort_indexer which will solve the problem (they are converted internallly anyhow).

In [2]: pd.Categorical(pd.DatetimeIndex(['20170101', 'nat']))
Out[2]: 
[2017-01-01, NaT]
Categories (1, datetime64[ns]): [2017-01-01]

In [3]: pd.Categorical(pd.DatetimeIndex(['20170101', 'nat'])).codes
Out[3]: array([ 0, -1], dtype=int8)

so first thing to try is to remove the i8 conversion entirely (as Categoricals already handle all of these endge cases).

@jdeschenes
Copy link
Contributor

as you suggested, just removing the conversion seems to solve the problem.

@jreback
Copy link
Contributor

jreback commented Jul 14, 2017

great!

jdeschenes pushed a commit to jdeschenes/pandas that referenced this issue Jul 17, 2017
* Removed unnecessary conversion to i8
* Fixed failed test (`test_frame_column_inplace_sort_exception`)
* Added check to ensure that the test is performing its intended goal(`test_sort_nan`)
jdeschenes pushed a commit to jdeschenes/pandas that referenced this issue Jul 17, 2017
* Removed unnecessary conversion to i8
* Fixed failed test (`test_frame_column_inplace_sort_exception`)
* Added check to ensure that the test is performing its intended goal(`test_sort_nan`)
jdeschenes pushed a commit to jdeschenes/pandas that referenced this issue Jul 18, 2017
* Removed unnecessary conversion to i8
* Fixed failed test (`test_frame_column_inplace_sort_exception`)
* Added check to ensure that the test is performing its intended goal(`test_sort_nan`)
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
…aT correctly

closes pandas-dev#16836

Author: Jean-Mathieu Deschenes <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes pandas-dev#16995 from jdeschenes/datetime_sort_issues and squashes the following commits:

257e10a [Jean-Mathieu Deschenes] Changes requested by @jreback
c6d55e2 [Jean-Mathieu Deschenes] Fix for pandas-dev#16836
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
…aT correctly

closes pandas-dev#16836

Author: Jean-Mathieu Deschenes <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes pandas-dev#16995 from jdeschenes/datetime_sort_issues and squashes the following commits:

257e10a [Jean-Mathieu Deschenes] Changes requested by @jreback
c6d55e2 [Jean-Mathieu Deschenes] Fix for pandas-dev#16836
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants