Skip to content

DataFrame sort_values and multiple "by" columns fails to order NaT correctly #16995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -222,6 +222,7 @@ Sparse
Reshaping
^^^^^^^^^
- Joining/Merging with a non unique ``PeriodIndex`` raised a TypeError (:issue:`16871`)
- Fixes regression when sorting by multiple columns on a ``datetime64`` dtype ``Series`` with ``NaT`` values (:issue:`16836`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime64[ns] dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say bug in DataFrame.sort_values()



Numeric
Expand Down
7 changes: 1 addition & 6 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3336,18 +3336,13 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
if len(by) > 1:
from pandas.core.sorting import lexsort_indexer

def trans(v):
if needs_i8_conversion(v):
return v.view('i8')
return v

keys = []
for x in by:
k = self.xs(x, axis=other_axis).values
if k.ndim == 2:
raise ValueError('Cannot sort by duplicate column %s' %
str(x))
keys.append(trans(k))
keys.append(k)
indexer = lexsort_indexer(keys, orders=ascending,
na_position=na_position)
indexer = _ensure_platform_int(indexer)
Expand Down
27 changes: 26 additions & 1 deletion pandas/tests/frame/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -269,6 +269,11 @@ def test_sort_datetimes(self):
df2 = df.sort_values(by=['B'])
assert_frame_equal(df1, df2)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a comment as to the issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular change does not have anything to do with the fix. I added it to fix the test since I saw that its goal was broken

df1 = df.sort_values(by='B')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

expected = 

result = 


df2 = df.sort_values(by=['C', 'B'])
assert_frame_equal(df1, df2)

def test_frame_column_inplace_sort_exception(self):
s = self.frame['A']
with tm.assert_raises_regex(ValueError, "This Series is a view"):
Expand Down Expand Up @@ -321,7 +326,27 @@ def test_sort_nat_values_in_int_column(self):
assert_frame_equal(df_sorted, df_reversed)

df_sorted = df.sort_values(["datetime", "float"], na_position="last")
assert_frame_equal(df_sorted, df_reversed)
assert_frame_equal(df_sorted, df)
Copy link
Member

@gfyoung gfyoung Jul 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this assertion statement change? (good to know for future reference)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, this was a bug in the test. Previously, the two assertions implied that na_position does not have any effect on the sort, which is incorrect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, makes sense.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for a new issue, generally like a separate test. can you fix here. otherwise lgtm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter test starting on line 337 is actually redundant with this test. Do you want me to remove it, or move it to a separate test?

# Ascending should not affect the results.
df_sorted = df.sort_values(["datetime", "float"], ascending=False)
assert_frame_equal(df_sorted, df)

# GH 16836

d1 = [Timestamp(x) for x in ['2016-01-01', '2015-01-01',
np.nan, '2016-01-01']]
d2 = [Timestamp(x) for x in ['2017-01-01', '2014-01-01',
'2016-01-01', '2015-01-01']]
df = pd.DataFrame({'a': d1, 'b': d2}, index=[0, 1, 2, 3])

d3 = [Timestamp(x) for x in ['2015-01-01', '2016-01-01',
'2016-01-01', np.nan]]
d4 = [Timestamp(x) for x in ['2014-01-01', '2015-01-01',
'2017-01-01', '2016-01-01']]
expected = pd.DataFrame({'a': d3, 'b': d4}, index=[1, 3, 0, 2])
sorted_df = df.sort_values(by=['a', 'b'], )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use result=

tm.assert_frame_equal(sorted_df, expected)


class TestDataFrameSortIndexKinds(TestData):
Expand Down