Different order in datetime-column using sort_values on multiple columns #13230

Closed
marcomayer opened this issue May 19, 2016 · 6 comments · Fixed by #30769
Labels: good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
@marcomayer

Code Sample, a copy-pastable example if possible

#0.17.1:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5], columns=list('A'), index=pd.date_range('2010-01-01 09:00:00', periods=5, freq='s')).reset_index()

df['date'] = df['index']
del df['index']
df.loc[4,'A'] = 4
df.loc[4,'date'] = pd.NaT

print(df.sort_values(['A','date']))

   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
4  4                 NaT
3  4 2010-01-01 09:00:03

#0.18.1:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5], columns=list('A'), index=pd.date_range('2010-01-01 09:00:00', periods=5, freq='s')).reset_index()

df['date'] = df['index']
del df['index']
df.loc[4,'A'] = 4
df.loc[4,'date'] = pd.NaT

print(df.sort_values(['A','date']))

   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
3  4 2010-01-01 09:00:03
4  4                 NaT

Expected Output

This one was hard to find. With only sort_values('date'), the order is the same as in 0.17.1, but when sorting on multiple columns the position of NaT changes: 0.17.1 puts the NaT row first within the tie on A, while 0.18.1 puts it last. I couldn't find anything in the changelogs that points to a reason for this.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.1

@jreback
Contributor

jreback commented May 19, 2016

So the default in the code is na_position='last', so this is correct. I don't recall us changing this, though 'first' is listed in the doc-string. Can you see if you can pinpoint the change?

In [15]: print(df.sort_values(['A','date'],na_position='last'))
   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
3  4 2010-01-01 09:00:03
4  4                 NaT

In [16]: print(df.sort_values(['A','date'],na_position='first'))
   A                date
0  1 2010-01-01 09:00:00
1  2 2010-01-01 09:00:01
2  3 2010-01-01 09:00:02
4  4                 NaT
3  4 2010-01-01 09:00:03

@jreback added the Datetime (Datetime data dtype), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and API Design labels May 19, 2016
@jreback added this to the 0.18.2 milestone May 19, 2016
@marcomayer
Author

Yes, it looks like it's correct now and was wrong before. I changed na_position to 'first', which solves the issue for me. I'd suggest mentioning it in the change notes as a backwards-compatibility item.
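
For anyone hitting the same difference after an upgrade, here is a minimal sketch of the workaround mentioned above: pass na_position explicitly so the result does not depend on the default. The frame is rebuilt from the report; the names nat_first and nat_last are only for illustration.

```python
import pandas as pd

# Rebuild the frame from the report: rows 3 and 4 tie on A,
# and row 4 has a missing date.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 4],
        "date": pd.date_range("2010-01-01 09:00:00", periods=5, freq="s"),
    }
)
df.loc[4, "date"] = pd.NaT

# Spell out na_position instead of relying on the default.
nat_first = df.sort_values(["A", "date"], na_position="first")
nat_last = df.sort_values(["A", "date"], na_position="last")

print(nat_first.index.tolist())  # [0, 1, 2, 4, 3]
print(nat_last.index.tolist())   # [0, 1, 2, 3, 4]
```

With na_position='first' the NaT row comes before the dated row inside the A == 4 tie, which matches the 0.17.1 output in the report.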

It seems that was the last one; now all my stuff is running again. Thanks again for your help, and especially for the suggestions regarding float storage with units instead of decimal.Decimal. I'll give that a try when I find the time, as it could help a lot!

I'd love to contribute more than these issue-reports but right now totally lack the time to do so. But I hope this changes in the future so that I'll be able to give some back to the pydata-community. Till then I'll keep on spreading the word about it. I mentioned your "pandas for finance" talk in a youtube video I did a few weeks ago (https://youtu.be/-0mmLkr420g?t=15m46s) and on twitter etc. Also should you ever be in Berlin, I'd be happy to have a beer or two with you!

@jreback
Contributor

jreback commented May 19, 2016

@marcomayer sounds good, but prob won't be in berlin anytime soon :(

I think this 'change' was an oversight. So when this issue is addressed, we will correct it.

@jreback modified the milestones: 0.19.0, 0.18.2 May 19, 2016
@jreback added the Regression (Functionality that used to work in a prior pandas version) and Difficulty Novice labels May 19, 2016
@jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Aug 11, 2016
@jreback
Contributor

jreback commented Sep 28, 2016

@jorisvandenbossche I think this just needs a note in the docs.

@jreback modified the milestones: 0.19.0, 0.19.1 Sep 28, 2016
@jorisvandenbossche
Member

As I understand the above discussion, this change was probably not intentional, but is still 'correct'?
So we could add a note in the whatsnew docs (don't think one is needed anywhere else in the docs), but I would also add a test to confirm the behaviour, so we don't change it again by accident.
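
A confirmation test along those lines might look like the sketch below. It is not the test that was eventually merged for this issue; the function names are hypothetical, and it assumes pytest-style test functions and pandas.testing.assert_frame_equal.

```python
import pandas as pd
import pandas.testing as tm


def make_frame():
    # Same data as the report: rows 3 and 4 tie on A, row 4 has NaT.
    df = pd.DataFrame(
        {
            "A": [1, 2, 3, 4, 4],
            "date": pd.date_range("2010-01-01 09:00:00", periods=5, freq="s"),
        }
    )
    df.loc[4, "date"] = pd.NaT
    return df


def test_sort_values_multicolumn_nat_default_is_last():
    # Hypothetical test name. The default na_position is 'last',
    # so the NaT row stays after the dated row within the tie.
    df = make_frame()
    result = df.sort_values(["A", "date"])
    expected = df.iloc[[0, 1, 2, 3, 4]]
    tm.assert_frame_equal(result, expected)


def test_sort_values_multicolumn_nat_first():
    # Hypothetical test name. With na_position='first' the NaT row
    # moves ahead of the dated row within the tie on A.
    df = make_frame()
    result = df.sort_values(["A", "date"], na_position="first")
    expected = df.iloc[[0, 1, 2, 4, 3]]
    tm.assert_frame_equal(result, expected)
```

Pinning both the default and the explicit 'first' case means an accidental change to either would fail a test.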

@jorisvandenbossche added the Testing (pandas testing functions or related to the test suite) and Docs labels and removed the API Design and Regression (Functionality that used to work in a prior pandas version) labels Sep 28, 2016
@jorisvandenbossche modified the milestones: 0.20.0, 0.19.1 Oct 6, 2016
@jreback
Contributor

jreback commented Mar 23, 2017

@marcomayer want to add a confirmation test for this?

@jreback modified the milestones: 0.20.0, Next Minor Release Apr 12, 2017
@jreback removed this from the 0.20.0 milestone Apr 12, 2017
@jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@mroeschke added the Needs Tests (Unit test(s) needed to prevent regressions) label and removed the Docs, Effort Low, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Testing (pandas testing functions or related to the test suite), and Datetime (Datetime data dtype) labels Oct 7, 2019
@simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 7, 2020