Commit 6bea827

uweschmitt authored and jreback committed
BUG: sorting with large float and multiple columns incorrect
closes #14922

Having the `int` equivalent of `NaT` in an `int64` column caused wrong sorting because this special value was considered as "missing value".

Author: Uwe <[email protected]>

Closes #14944 from uweschmitt/fix-gh-14922 and squashes the following commits:

c244438 [Uwe] further cleanup tests
4f28026 [Uwe] fixed typo in whatsnew/v0.20.0.txt
60cca5d [Uwe] add fix of GH14922 to release notes for 0.20.0
04dcbe8 [Uwe] further test cleanup
21e610c [Uwe] extended tests + minor cleanup
358a31e [Uwe] Merge branch 'fix-gh-14922' of github.com:uweschmitt/pandas into fix-gh-14922
03699c6 [Uwe] Fix GH 14922
1afdbb8 [Uwe] Fix GH 14922
1 parent 97b4295 commit 6bea827
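The collision described in the commit message can be illustrated with a small, self-contained sketch. It assumes that `NaT` is stored as the smallest `int64` value, which is why `int(NaT)` in an integer column looks like a missing value to a null check that ignores the dtype. The function names and the one-letter dtype tags are hypothetical and are not part of pandas.

```python
# Smallest value an int64 can hold. Assumption: this is the bit pattern
# that encodes NaT inside datetime64 data.
INT64_MIN = -(2 ** 63)


def is_missing_buggy(value, dtype_kind):
    """Toy model of the pre-fix behaviour: the sentinel is always 'missing'."""
    return value == INT64_MIN


def is_missing(value, dtype_kind):
    """Toy model of the intended behaviour.

    dtype_kind is a hypothetical tag: "M" for datetime64, "i" for int64.
    The sentinel only means 'missing' for datetime-like data.
    """
    return dtype_kind == "M" and value == INT64_MIN
```

With this model, `is_missing_buggy(INT64_MIN, "i")` is true while `is_missing(INT64_MIN, "i")` is false, which is the distinction the fix introduces.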

3 files changed: +50 -2 lines changed

doc/source/whatsnew/v0.20.0.txt

+1
@@ -283,6 +283,7 @@ Bug Fixes
 - Bug in ``astype()`` where ``inf`` values were incorrectly converted to integers. Now raises error now with ``astype()`` for Series and DataFrames (:issue:`14265`)
 - Bug in ``DataFrame(..).apply(to_numeric)`` when values are of type decimal.Decimal. (:issue:`14827`)
 - Bug in ``describe()`` when passing a numpy array which does not contain the median to the ``percentiles`` keyword argument (:issue:`14908`)
+- Bug in ``DataFrame.sort_values()`` when sorting by multiple columns where one column is of type ``int64`` and contains ``NaT`` (:issue:`14922`)

pandas/core/algorithms.py

+2 -1
@@ -349,7 +349,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):

     table = hash_klass(size_hint or len(vals))
     uniques = vec_klass()
-    labels = table.get_labels(vals, uniques, 0, na_sentinel, True)
+    check_nulls = not is_integer_dtype(values)
+    labels = table.get_labels(vals, uniques, 0, na_sentinel, check_nulls)

     labels = _ensure_platform_int(labels)

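To make the effect of the new `check_nulls` flag concrete, here is a simplified pure-Python model of the labelling step. It is a sketch under stated assumptions, not the pandas hash table implementation: `toy_get_labels` and `toy_factorize` are hypothetical names, the `is_integer` argument stands in for `is_integer_dtype(values)`, and `INT64_MIN` is assumed to be the integer encoding of `NaT`.

```python
INT64_MIN = -(2 ** 63)  # assumed integer encoding of NaT


def toy_get_labels(values, na_sentinel, check_nulls):
    """Label each distinct value by order of first appearance.

    When check_nulls is True, INT64_MIN is treated as missing and
    receives na_sentinel instead of a regular label.
    """
    table = {}      # value -> label
    uniques = []    # distinct non-missing values, in first-seen order
    labels = []
    for v in values:
        if check_nulls and v == INT64_MIN:
            labels.append(na_sentinel)
            continue
        if v not in table:
            table[v] = len(uniques)
            uniques.append(v)
        labels.append(table[v])
    return labels, uniques


def toy_factorize(values, is_integer, na_sentinel=-1):
    # Mirrors the shape of the fix: skip the null check for integer data.
    check_nulls = not is_integer
    return toy_get_labels(values, na_sentinel, check_nulls)
```

In this model, `toy_factorize([2, INT64_MIN], is_integer=True)` gives labels `[0, 1]`, so the large negative integer is an ordinary value. With `is_integer=False` the same input gives labels `[0, -1]`, which is how the value ended up being positioned as missing before the fix.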

pandas/tests/frame/test_sorting.py

+47 -1
@@ -6,7 +6,7 @@

 from pandas.compat import lrange
 from pandas import (DataFrame, Series, MultiIndex, Timestamp,
-                    date_range)
+                    date_range, NaT)

 from pandas.util.testing import (assert_series_equal,
                                  assert_frame_equal,
@@ -491,3 +491,49 @@ def test_frame_column_inplace_sort_exception(self):

         cp = s.copy()
         cp.sort_values()  # it works!
+
+    def test_sort_nat_values_in_int_column(self):
+
+        # GH 14922: "sorting with large float and multiple columns incorrect"
+
+        # cause was that the int64 value NaT was considered as "na". Which is
+        # only correct for datetime64 columns.
+
+        int_values = (2, int(NaT))
+        float_values = (2.0, -1.797693e308)
+
+        df = DataFrame(dict(int=int_values, float=float_values),
+                       columns=["int", "float"])
+
+        df_reversed = DataFrame(dict(int=int_values[::-1],
+                                     float=float_values[::-1]),
+                                columns=["int", "float"],
+                                index=[1, 0])
+
+        # NaT is not a "na" for int64 columns, so na_position must not
+        # influence the result:
+        df_sorted = df.sort_values(["int", "float"], na_position="last")
+        assert_frame_equal(df_sorted, df_reversed)
+
+        df_sorted = df.sort_values(["int", "float"], na_position="first")
+        assert_frame_equal(df_sorted, df_reversed)
+
+        # reverse sorting order
+        df_sorted = df.sort_values(["int", "float"], ascending=False)
+        assert_frame_equal(df_sorted, df)
+
+        # and now check if NaT is still considered as "na" for datetime64
+        # columns:
+        df = DataFrame(dict(datetime=[Timestamp("2016-01-01"), NaT],
+                            float=float_values), columns=["datetime", "float"])
+
+        df_reversed = DataFrame(dict(datetime=[NaT, Timestamp("2016-01-01")],
+                                     float=float_values[::-1]),
+                                columns=["datetime", "float"],
+                                index=[1, 0])
+
+        df_sorted = df.sort_values(["datetime", "float"], na_position="first")
+        assert_frame_equal(df_sorted, df_reversed)
+
+        df_sorted = df.sort_values(["datetime", "float"], na_position="last")
+        assert_frame_equal(df_sorted, df_reversed)
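The row order the integer-column assertions expect can be checked without pandas by sorting the same two rows as plain tuples. This is only a sketch of the expected ordering, assuming `int(NaT)` equals the smallest `int64` value; the variable names are illustrative.

```python
INT64_MIN = -(2 ** 63)  # assumed value of int(NaT)

# The two rows of the test frame as (int, float) tuples; position = index label.
rows = [(2, 2.0), (INT64_MIN, -1.797693e308)]

# Index labels in ascending and descending lexicographic order.
ascending_index = sorted(range(len(rows)), key=lambda i: rows[i])
descending_index = sorted(range(len(rows)), key=lambda i: rows[i], reverse=True)
```

Here `ascending_index` is `[1, 0]`, matching the `index=[1, 0]` of `df_reversed`, and `descending_index` is `[0, 1]`, matching the original frame. Because neither row holds a missing value, `na_position` has nothing to move.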
