Skip to content

DataFrame sort_values and multiple "by" columns fails to order NaT correctly (since v0.19) #16836

Closed
@arc12

Description

@arc12

Code Used to Demonstrate Issue

import pandas as pd
import numpy as np
import datetime as dt
import random

print 'Pandas Version', pd.__version__

now = dt.datetime.today()
d1 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,3)] * 2
d2 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,6)]
d1[3] = np.nan

print "\nSort on Dates"
df_t = pd.DataFrame({'a': d1, 'b':d2})
print "column type:", df_t.a.dtype
print "\npresorted:"
print df_t
print "\nsort by=['a', 'b'], na_position='last':"
print df_t.sort_values(by=['a', 'b'], na_position='last')
print "\nsort by=['a'], na_position='last':"
print df_t.sort_values(by=['a'], na_position='last')

Problem description

Considering sort_values(by=['a', 'b'], na_position='last')
In v0.19.2 this works as expected:

  • primary sort on a, putting NaT last
  • then secondary sort on b

In v0.20.2 this does not work as expected, with NaT in 'a' appearing first. Sorting only on 'a' does put NaT last.

Expected Output - as observed for v0.19

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705

sort by=['a', 'b'], na_position='last':
                        a                       b
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

Bad Output - observed for v0.20

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654

sort by=['a', 'b'], na_position='last':
                        a                       b
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.2
pytest: 2.8.5
pip: 8.1.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.1.1
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None

Metadata

Metadata

Assignees

No one assigned

    Labels

    DatetimeDatetime data dtypeRegressionFunctionality that used to work in a prior pandas versionReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions