Skip to content

BUG: allow missing values in Index when calling Index.sort_values #35604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 63 commits into from
Sep 6, 2020
Merged
Show file tree
Hide file tree
Changes from 58 commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
076061b
BUG: attempt initial fix
AlexKirko Aug 7, 2020
3fa9351
TST: add test
AlexKirko Aug 7, 2020
333c6e4
CLN: run black
AlexKirko Aug 7, 2020
8414fd0
CLN: clean up unnecessary print
AlexKirko Aug 7, 2020
fa12898
exempt MultiIndex from handling missing values
AlexKirko Aug 7, 2020
e2575c9
CLN: run black
AlexKirko Aug 7, 2020
4293e22
add na_position kwarg to sort_values
AlexKirko Aug 10, 2020
275067e
debug na_position logic in sort_values
AlexKirko Aug 10, 2020
098fe10
CLN: fix na_position type
AlexKirko Aug 10, 2020
f0d4c2d
DOC: mv to 1.2.0 since API is changed
AlexKirko Aug 10, 2020
743eca6
DOC: make the comment more specific
AlexKirko Aug 11, 2020
9d17d7a
TST: parametrize the test, add raise check
AlexKirko Aug 11, 2020
744e5c1
CLN: run black
AlexKirko Aug 11, 2020
13fcd94
CLN: fix versionadded
AlexKirko Aug 11, 2020
33b02da
CLN: move na_position before key
AlexKirko Aug 13, 2020
73cc94c
DOC: add comment for MultiIndex
AlexKirko Aug 13, 2020
153fb06
REFACT: switch to nargsort from sorting.py
AlexKirko Aug 13, 2020
3481b81
reverse equal els in nargsort of ascending is False
AlexKirko Aug 13, 2020
9d92ddd
switch Series to nargsort
AlexKirko Aug 13, 2020
7ec9e7e
Revert "reverse equal els in nargsort of ascending is False"
AlexKirko Aug 13, 2020
2d16b8d
add TypeError to exception catch in nargsort
AlexKirko Aug 13, 2020
8939040
reverse equal els order in nargsort (again)
AlexKirko Aug 13, 2020
066f0af
CLN: isort test_datetime.py
AlexKirko Aug 15, 2020
7e351cd
keep equal el order in nargsort (again)
AlexKirko Aug 15, 2020
79ed8ee
add equals_reversible to nargsort and a debugging branch
AlexKirko Aug 15, 2020
62268e8
CLN: remove debug code
AlexKirko Aug 15, 2020
5cab3d4
CLN: remove extra newline in test_datetime
AlexKirko Aug 15, 2020
30ac949
CLN: alter test_ops to preserve equal element order
AlexKirko Aug 15, 2020
0ac0a3c
CLN: run black
AlexKirko Aug 15, 2020
051c3e7
revert changes to Series and connected tests
AlexKirko Aug 21, 2020
dfdee40
reimplement test fix for Index.sort_values
AlexKirko Aug 21, 2020
acf6170
TST: don't expect same indexer for Index and PeriodIndex
AlexKirko Aug 21, 2020
4b43852
TST: add Index types, switch to fixture
AlexKirko Aug 25, 2020
fb8f703
CLN: fix import order
AlexKirko Aug 25, 2020
fc58c3f
TST: add test for Index and PeriodIndex incompatibility
AlexKirko Aug 25, 2020
1dfd4f2
TST: move fixture to pandas/conftest.py
AlexKirko Aug 27, 2020
65c001a
TST: add na_position=None test case
AlexKirko Aug 27, 2020
dd82cc1
TST: move index type xfails to test_common.py
AlexKirko Aug 27, 2020
a3855ad
DOC: remove unnecessary comment in test_ops
AlexKirko Aug 27, 2020
481de59
DOC: add xref to unstable sort issue
AlexKirko Aug 27, 2020
c7e8240
TST: add prints to debug numpy dev test fails
AlexKirko Aug 31, 2020
ad29fff
Revert "TST: add prints to debug numpy dev test fails"
AlexKirko Aug 31, 2020
6ae6e8a
restart tests
AlexKirko Sep 2, 2020
213b13b
Merge branch 'master' into ind-sort-values
AlexKirko Sep 2, 2020
760bda1
DOC: remove junk from whatsnew
AlexKirko Sep 2, 2020
d8f9061
TST: warp test for debug
AlexKirko Sep 2, 2020
2f8bec6
Revert "TST: warp test for debug"
AlexKirko Sep 2, 2020
29d6849
remove new tests for debug
AlexKirko Sep 2, 2020
8b81e55
Revert "remove new tests for debug"
AlexKirko Sep 3, 2020
5b2a0d3
TST: try deep copy in fixture
AlexKirko Sep 3, 2020
a3310ba
TST: try immediate return instead of deep copy
AlexKirko Sep 3, 2020
e80bc9d
Revert "TST: try immediate return instead of deep copy"
AlexKirko Sep 3, 2020
2007699
REFACT: add immediate returns
AlexKirko Sep 3, 2020
8fd5a77
Merge branch 'master' into ind-sort-values
AlexKirko Sep 3, 2020
2e39294
DOC: add comment to conftest clarifying deep copy
AlexKirko Sep 3, 2020
d78a9a2
restart tests
AlexKirko Sep 3, 2020
5715c31
Merge branch 'master' into ind-sort-values
AlexKirko Sep 3, 2020
bfd2e9c
CLN: remove test-output.xml
AlexKirko Sep 3, 2020
54a6e82
TST: check for MultiIndex through request.param
AlexKirko Sep 5, 2020
f60d2a8
CLN: revert changes to nargsort
AlexKirko Sep 5, 2020
af92fe8
DOC: add issue xref to xfail reason in period/test_ops.py
AlexKirko Sep 5, 2020
6d33657
DOC: clarify xfail reasons and add comments
AlexKirko Sep 5, 2020
4935309
REFACT: switch to isinstance, add blank line
AlexKirko Sep 6, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -267,8 +267,9 @@ Interval

Indexing
^^^^^^^^

- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`)
-
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`)
-

Missing
Expand Down
23 changes: 23 additions & 0 deletions pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -437,6 +437,29 @@ def index(request):
index_fixture2 = index


@pytest.fixture(params=indices_dict.keys())
def index_with_missing(request):
"""
Fixture for indices with missing values
"""
if request.param in ["int", "uint", "range", "empty", "repeats"]:
pytest.xfail("missing values not supported")
# GH 35538. Use deep copy to avoid illusive bug on np-dev
# Azure pipeline that writes into indices_dict despite copy
ind = indices_dict[request.param].copy(deep=True)
vals = ind.values
if type(vals[0]) == tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just check request.parm == 'multindex'? as its more explict

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# For setting missing values in the top level of MultiIndex
vals = ind.tolist()
vals[0] = tuple([None]) + vals[0][1:]
vals[-1] = tuple([None]) + vals[-1][1:]
return MultiIndex.from_tuples(vals)
else:
vals[0] = None
vals[-1] = None
return type(ind)(vals)


# ----------------------------------------------------------------
# Series'
# ----------------------------------------------------------------
Expand Down
27 changes: 22 additions & 5 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@
import pandas.core.missing as missing
from pandas.core.ops import get_op_result_name
from pandas.core.ops.invalid import make_invalid_op
from pandas.core.sorting import ensure_key_mapped
from pandas.core.sorting import ensure_key_mapped, nargsort
from pandas.core.strings import StringMethods

from pandas.io.formats.printing import (
Expand Down Expand Up @@ -4443,7 +4443,11 @@ def asof_locs(self, where, mask):
return result

def sort_values(
self, return_indexer=False, ascending=True, key: Optional[Callable] = None
self,
return_indexer=False,
ascending=True,
na_position: str_t = "last",
key: Optional[Callable] = None,
):
"""
Return a sorted copy of the index.
Expand All @@ -4457,6 +4461,12 @@ def sort_values(
Should the indices that would sort the index be returned.
ascending : bool, default True
Should the index values be sorted in an ascending order.
na_position : {'first' or 'last'}, default 'last'
Argument 'first' puts NaNs at the beginning, 'last' puts NaNs at
the end.

.. versionadded:: 1.2.0

key : callable, optional
If not None, apply the key function to the index values
before sorting. This is similar to the `key` argument in the
Expand Down Expand Up @@ -4497,9 +4507,16 @@ def sort_values(
"""
idx = ensure_key_mapped(self, key)

_as = idx.argsort()
if not ascending:
_as = _as[::-1]
# GH 35584. Sort missing values according to na_position kwarg
# ignore na_position for MutiIndex
if not isinstance(self, ABCMultiIndex):
_as = nargsort(
items=idx, ascending=ascending, na_position=na_position, key=key
)
else:
_as = idx.argsort()
if not ascending:
_as = _as[::-1]

sorted_index = self.take(_as)

Expand Down
9 changes: 8 additions & 1 deletion pandas/core/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -306,10 +306,17 @@ def nargsort(
non_nan_idx = idx[~mask]

nan_idx = np.nonzero(mask)[0]

if not ascending:
non_nans = non_nans[::-1]
non_nan_idx = non_nan_idx[::-1]
indexer = non_nan_idx[non_nans.argsort(kind=kind)]
try:
indexer = non_nan_idx[non_nans.argsort(kind=kind)]
except TypeError:
# For compatibility with Series: fall back to quicksort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a test that hits this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is a remnant from trying to bring Series.sort_values in line with Index.sort_values. Reverted for now. This change doesn't belong in this PR. I'll bring it back in the next PR when I'll be synchronizing Index and Series behavior.

# when mergesort is unavailable for object element type
indexer = non_nan_idx[non_nans.argsort(kind="quicksort")]

if not ascending:
indexer = indexer[::-1]
# Finally, place the NaNs at the end or the beginning according to
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/indexes/interval/test_interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -618,7 +618,7 @@ def test_sort_values(self, closed):
expected = IntervalIndex([Interval(0, 1), Interval(1, 2), np.nan])
tm.assert_index_equal(result, expected)

result = index.sort_values(ascending=False)
result = index.sort_values(ascending=False, na_position="first")
expected = IntervalIndex([np.nan, Interval(1, 2), Interval(0, 1)])
tm.assert_index_equal(result, expected)

Expand Down
16 changes: 13 additions & 3 deletions pandas/tests/indexes/period/test_ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -174,9 +174,6 @@ def _check_freq(index, expected_index):

ordered, indexer = idx.sort_values(return_indexer=True, ascending=False)
tm.assert_index_equal(ordered, expected[::-1])

exp = np.array([2, 1, 3, 4, 0])
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
_check_freq(ordered, idx)

pidx = PeriodIndex(["2011", "2013", "NaT", "2011"], name="pidx", freq="D")
Expand Down Expand Up @@ -333,3 +330,16 @@ def test_freq_setter_deprecated(self):
# warning for setter
with pytest.raises(AttributeError, match="can't set attribute"):
idx.freq = pd.offsets.Day()


@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well its doesn't work :->, do we have an issue tracking this, pls add the refernce in the xfail (and if we don't pls create an issue)

Copy link
Member Author

@AlexKirko AlexKirko Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't find an existing issue, so I opened #35922

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk add the issue number to the xfail message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though I think the comprehensive test below covers this? (not averse to having this as well, just asking)

def test_order_incompat():
# GH 35584. The new implementation of sort_values for Index.sort_values
# is stable when sorting in descending order. PeriodIndex.sort_values
# currently isn't stable. xfail should be removed after
# the implementations' behavior is synchronized (xref GH 35922)
pidx = PeriodIndex(["2011", "2013", "2015", "2012", "2011"], name="pidx", freq="A")
iidx = Index([2011, 2013, 2015, 2012, 2011], name="idx")
ordered1, indexer1 = pidx.sort_values(return_indexer=True, ascending=False)
ordered2, indexer2 = iidx.sort_values(return_indexer=True, ascending=False)
tm.assert_numpy_array_equal(indexer1, indexer2)
46 changes: 45 additions & 1 deletion pandas/tests/indexes/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,14 @@
from pandas.core.dtypes.common import is_period_dtype, needs_i8_conversion

import pandas as pd
from pandas import CategoricalIndex, MultiIndex, RangeIndex
from pandas import (
CategoricalIndex,
DatetimeIndex,
MultiIndex,
PeriodIndex,
RangeIndex,
TimedeltaIndex,
)
import pandas._testing as tm


Expand Down Expand Up @@ -391,3 +398,40 @@ def test_astype_preserves_name(self, index, dtype):
assert result.names == index.names
else:
assert result.name == index.name


@pytest.mark.parametrize("na_position", [None, "middle"])
def test_sort_values_invalid_na_position(index_with_missing, na_position):

if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:
pytest.xfail("stable descending order sort not implemented")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. maybe just enumerate the PeriodIndex case here (do the others have an issue?) pls open if they don't (you can make a single issue with checkboxes)

elif type(index_with_missing) in [CategoricalIndex, MultiIndex]:
pytest.xfail("missing value sorting order not defined for index type")

if na_position not in ["first", "last"]:
with pytest.raises(
ValueError, match=f"invalid na_position: {na_position}",
):
index_with_missing.sort_values(na_position=na_position)


@pytest.mark.parametrize("na_position", ["first", "last"])
def test_sort_values_with_missing(index_with_missing, na_position):
# GH 35584. Test that sort_values works with missing values,
# sort non-missing and place missing according to na_position

if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Sorry, I wasn't at all clear in the xfail messages as to how the tests are different. The one in period/test_ops.py checks whether PeriodIndex and another non-datetime-like subtype sort duplicates into the same order when descending=True. As part of syncing duplicate sorting behavior in the next PR, I'd like to write a general test with an index_with_duplicates fixture to check for sorting stability and replace this test.

The tests in test_common.py xfail, because the na_position kwarg isn't implemented yet for datetime-like indices. Implementing it would go with implementing stable duplicate sorting behavior and break a bunch of tests. I also intend to deal with that in the next PR as I bring sort_values in sync between all Index subtypes and Series.

I've clarified the xfail messages everywhere and added comments with xrefs to the tests in test_common.py. Is that acceptable? I think keeping the tests separate makes sense for now.

Sorry if I misunderstood your question.

pytest.xfail("stable descending order sort not implemented")
elif type(index_with_missing) in [CategoricalIndex, MultiIndex]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you isinstance(....) instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pytest.xfail("missing value sorting order not defined for index type")
missing_count = np.sum(index_with_missing.isna())
not_na_vals = index_with_missing[index_with_missing.notna()].values
sorted_values = np.sort(not_na_vals)
if na_position == "first":
sorted_values = np.concatenate([[None] * missing_count, sorted_values])
else:
sorted_values = np.concatenate([sorted_values, [None] * missing_count])
expected = type(index_with_missing)(sorted_values)

result = index_with_missing.sort_values(na_position=na_position)
tm.assert_index_equal(result, expected)