Different order in datetime-column using sort_values on multiple columns #13230

marcomayer · 2016-05-19T18:09:42Z

Code Sample, a copy-pastable example if possible

# pandas 0.17.1:
import pandas as pd

df = pd.DataFrame([1,2,3,4,5], columns=list('A'), index=pd.date_range('2010-01-01 09:00:00', periods=5, freq='s')).reset_index()

df['date'] = df['index']
del df['index']
df.loc[4,'A'] = 4
df.loc[4,'date'] = pd.NaT

print(df.sort_values(['A','date']))

   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
4  4                 NaT
3  4 2010-01-01 09:00:03

# pandas 0.18.1:
df = pd.DataFrame([1,2,3,4,5], columns=list('A'), index=pd.date_range('2010-01-01 09:00:00', periods=5, freq='s')).reset_index()

df['date'] = df['index']
del df['index']
df.loc[4,'A'] = 4
df.loc[4,'date'] = pd.NaT

print(df.sort_values(['A','date']))

   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
3  4 2010-01-01 09:00:03
4  4                 NaT

Expected Output

This one was hard to find. With only sort_values('date'), the order is the same as in 0.17.1, but when sorting on multiple columns, the position of NaT in the datetime column changes. I couldn't find anything in the changelogs that points to a reason for this.
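To make the comparison above concrete, here is a minimal sketch that sorts the same data once on 'date' alone and once on ['A', 'date']. The frame is rebuilt directly from a dict for brevity (an assumption, not the exact construction used in the report), and the orders noted in the comments are what pandas 0.18.1 and later should give with the default na_position='last'; 0.17.1 behaviour cannot be reproduced this way.

```python
import pandas as pd

# Same data as in the report: rows 3 and 4 share A == 4, row 4 has NaT.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 4],
        "date": pd.to_datetime(
            [
                "2010-01-01 09:00:00",
                "2010-01-01 09:00:01",
                "2010-01-01 09:00:02",
                "2010-01-01 09:00:03",
                None,  # becomes NaT
            ]
        ),
    }
)

# Single-column sort: NaT goes last by default.
single = df.sort_values("date")

# Multi-column sort: within the A == 4 group, NaT also goes last by default.
multi = df.sort_values(["A", "date"])

print(single.index.tolist())
print(multi.index.tolist())
```

Both calls should leave the NaT row (index 4) at the end, so the two orderings agree.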

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.1


jreback · 2016-05-19T18:18:31Z

So the default in the code is na_position='last', so this is correct. I don't recall us changing this, though 'first' is listed in the docstring. Can you see if you can pinpoint the change?

In [15]: print(df.sort_values(['A','date'],na_position='last'))
   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
3  4 2010-01-01 09:00:03
4  4                 NaT

In [16]: print(df.sort_values(['A','date'],na_position='first'))
   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
4  4                 NaT
3  4 2010-01-01 09:00:03

marcomayer · 2016-05-19T19:08:45Z

Yes, it looks like it's correct now and was wrong before. I changed na_position to 'first', which solves the issue for me. I'd suggest putting it in the change notes regarding backwards compatibility.

Seems as if that was the last one; now all my stuff is running again. Thanks again for your help, and especially for the suggestions regarding float storage with units instead of decimal.Decimal. I'll give that a try when I find the time, as it could help a lot!

I'd love to contribute more than these issue reports, but right now I totally lack the time to do so. I hope this changes in the future so that I'll be able to give something back to the pydata community. Until then I'll keep spreading the word about it. I mentioned your "pandas for finance" talk in a YouTube video I did a few weeks ago (https://youtu.be/-0mmLkr420g?t=15m46s) and on Twitter etc. Also, should you ever be in Berlin, I'd be happy to have a beer or two with you!

jreback · 2016-05-19T19:17:12Z

@marcomayer sounds good, but I probably won't be in Berlin anytime soon :(

I think this 'change' was an oversight. So when this issue is addressed, we will correct it.

jreback · 2016-09-28T10:24:08Z

@jorisvandenbossche I think this just needs a note in the docs.

jorisvandenbossche · 2016-09-28T15:38:05Z

As I understand the above discussion, this was probably not an intentional change, but still a 'correct' one?
So we could add a note in the whatsnew docs (I don't think one is needed anywhere else in the docs), but I would also add a test to confirm the behaviour, so we don't change it again by accident.
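A confirmation test along these lines might look like the following sketch. The helper and test names are hypothetical and this is not the test that was eventually merged; it only pins down the two orderings shown earlier in the thread for na_position='last' (the default) and na_position='first'.

```python
import pandas as pd


def make_frame():
    # Hypothetical helper: rebuilds the frame from the original report.
    # Rows 3 and 4 share A == 4, and row 4 has NaT in the datetime column.
    df = pd.DataFrame(
        {
            "A": [1, 2, 3, 4, 4],
            "date": pd.date_range("2010-01-01 09:00:00", periods=5, freq="s"),
        }
    )
    df.loc[4, "date"] = pd.NaT
    return df


def test_sort_values_multicolumn_nat_position():
    df = make_frame()

    # Default na_position="last": NaT sorts after the valid timestamp
    # within the A == 4 group.
    result = df.sort_values(["A", "date"])
    assert result.index.tolist() == [0, 1, 2, 3, 4]

    # na_position="first": NaT sorts before the valid timestamp
    # within the A == 4 group (the ordering reported for 0.17.1).
    result = df.sort_values(["A", "date"], na_position="first")
    assert result.index.tolist() == [0, 1, 2, 4, 3]
```

Comparing index labels keeps the test independent of how NaT is rendered or compared, since NaT is not equal to itself.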

jreback · 2017-03-23T13:44:57Z

@marcomayer want to add a confirmation test for this?

jreback added the Datetime, Missing-data, and API Design labels May 19, 2016

jreback added this to the 0.18.2 milestone May 19, 2016

jreback modified the milestones: 0.19.0, 0.18.2 May 19, 2016

jreback added the Regression and Difficulty Novice labels May 19, 2016

jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Aug 11, 2016

jreback modified the milestones: 0.19.0, 0.19.1 Sep 28, 2016

jorisvandenbossche added the Testing and Docs labels and removed the API Design and Regression labels Sep 28, 2016

jorisvandenbossche modified the milestones: 0.20.0, 0.19.1 Oct 6, 2016

jreback modified the milestones: 0.20.0, Next Minor Release Apr 12, 2017

jreback removed this from the 0.20.0 milestone Apr 12, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

mroeschke added the Needs Tests label and removed the Docs, Effort Low, Missing-data, Testing, and Datetime labels Oct 7, 2019

mroeschke mentioned this issue Jan 7, 2020

TST: Add tests for fixed issues #30769

Merged

8 tasks

simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 7, 2020

mroeschke closed this as completed in #30769 Jan 7, 2020
