Closed
Description
Code Used to Demonstrate Issue
import pandas as pd
import numpy as np
import datetime as dt
import random
print 'Pandas Version', pd.__version__
now = dt.datetime.today()
d1 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,3)] * 2
d2 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,6)]
d1[3] = np.nan
print "\nSort on Dates"
df_t = pd.DataFrame({'a': d1, 'b':d2})
print "column type:", df_t.a.dtype
print "\npresorted:"
print df_t
print "\nsort by=['a', 'b'], na_position='last':"
print df_t.sort_values(by=['a', 'b'], na_position='last')
print "\nsort by=['a'], na_position='last':"
print df_t.sort_values(by=['a'], na_position='last')
Problem description
Considering sort_values(by=['a', 'b'], na_position='last')
In v0.19.2 this works as expected:
- primary sort on a, putting NaT last
- then secondary sort on b
In v0.20.2 this does not work as expected, with NaT in 'a' appearing first. Sorting only on 'a' does put NaT last.
Expected Output - as observed for v0.19
Sort on Dates
column type: datetime64[ns]
presorted:
a b
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
sort by=['a', 'b'], na_position='last':
a b
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
sort by=['a'], na_position='last':
a b
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
Bad Output - observed for v0.20
Sort on Dates
column type: datetime64[ns]
presorted:
a b
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
3 NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
sort by=['a', 'b'], na_position='last':
a b
3 NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
sort by=['a'], na_position='last':
a b
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
3 NaT 2017-07-15 12:13:02.654
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.2
pytest: 2.8.5
pip: 8.1.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.1.1
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None