BUG: allow missing values in Index when calling Index.sort_values #35604

AlexKirko · 2020-08-07T12:15:38Z

closes BUG: Index.sort_values fails with TypeError #35584, xref BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922
tests added 1 / passed 1
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Problem

Index.sort_values breaks when Index includes None. The OP points out that a Series doesn't break in a similar case. In my opinion, there are a number of ways missing values can creep into an Index, and sort_values shouldn't be breaking.
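
For illustration, the same kind of failure can be reproduced without pandas: a plain Python list that mixes strings and None cannot be ordered either. This is only a sketch of the underlying comparison error, assuming an object-dtype Index behaves like such a list when sorted; it does not reproduce the exact traceback from Index.sort_values.

```python
values = ["a", None, "c", None, "e"]

try:
    sorted(values)
except TypeError as err:
    # str and NoneType cannot be compared with "<"
    error_message = str(err)
    print(f"TypeError: {error_message}")
```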

Details

When we do Series.sort_values we shunt the missing values to the end or the beginning of the Series depending on na_position, and then argsort the rest. When we currently do sort_values on an Index, we try to argsort the whole thing, and it expectedly breaks.
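
A minimal sketch of the "shunt the missing values, argsort the rest" idea, written in plain Python rather than against pandas internals. The function name `argsort_with_missing` is hypothetical and this is not the code in this PR; it only shows how na_position and ascending can interact.

```python
def argsort_with_missing(values, ascending=True, na_position="last"):
    """Return positions that sort `values`, keeping None out of comparisons."""
    if na_position not in ("first", "last"):
        raise ValueError(f"invalid na_position: {na_position}")
    # Split positions into missing and non-missing before comparing anything.
    missing = [i for i, v in enumerate(values) if v is None]
    present = [i for i, v in enumerate(values) if v is not None]
    # list.sort is stable, also when reverse=True.
    present.sort(key=lambda i: values[i], reverse=not ascending)
    return missing + present if na_position == "first" else present + missing


values = ["c", None, "a", None, "b"]
order = argsort_with_missing(values, na_position="last")
print([values[i] for i in order])  # ['a', 'b', 'c', None, None]
```

Only the non-missing values are ever compared, so the TypeError cannot occur, and the missing positions are attached to whichever end na_position asks for.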

Solution and alternatives

I propose we do the same thing for Index that Series does, that is, send the missing values to the end or the beginning. Added the na_position kwarg, which works similarly to how it works in Series.

Another possible solution is to explicitly forbid Indices with missing values and raise a clear error. I don't think it's advisable, since there isn't much of a practical difference between non-unique indices, which we allow, and this case.

AlexKirko · 2020-08-07T12:17:03Z

Couldn't find a better home for the test than test_common.py in indexes. Maybe there is a better place for it?

AlexKirko · 2020-08-07T13:04:16Z

Exempted MultiIndex from this fix, as it's completely unclear where you should put missing values in this case.

AlexKirko · 2020-08-08T05:03:52Z

Hm. There are a bunch of tests that expect NaNs to be at the beginning. Will have to add na_position to the function signature.

AlexKirko · 2020-08-11T08:18:41Z

I think this is ready for review. Since this PR changes the API, I put it in 1.2.0 whatsnew.

jreback

looks good. opportunity to clean up some code here.

jreback · 2020-08-12T15:37:54Z

pandas/core/indexes/base.py

+        self,
+        return_indexer=False,
+        ascending=True,
+        key: Optional[Callable] = None,


can you order these the same as in series (na_position before key)

jreback · 2020-08-12T15:41:28Z

pandas/core/indexes/base.py

+                    _as[: np.sum(good)] = _as[: np.sum(good)][::-1]
+            elif na_position == "first":
+                _as = np.concatenate([_as[bad], _as[good][idx[good].argsort()]])
+                if not ascending:


can you share this code with the series impl (might be better / easier to create a helper to actually do this) and put it in pandas/core/sorting.py

@jreback Found such a function in sorting.py: it's called nargsort. With a minor addition to ensure that mergesort falls back to quicksort when sorting object dtype, it can be used for our purposes.

But I've uncovered a bit of a problem when trying to unify Index and Series behavior: when we set ascending=False and the object contains duplicates, in general, we expect different behavior from Series.sort_index and Series.sort_values. We expect sort_values to reverse the order of equal elements, but we expect sort_index to be stable and maintain it. This also creates weirdness in DataFrame.sort_values.

I've switched Series.sort_values to using nargsort and tried adding an argument to nargsort to signify if we want to preserve order or not, but there are still edge cases in the test suite because of this inconsistency.

It is probably best to settle on one convention and carefully alter all existing tests to fit it. I personally think that equal elements should maintain order when we sort in descending order, that is, the descending sort should also be stable. I don't know if we should do this in this PR though, or leave Series as it is and only add NaN-handling to Index.sort_values through nargsort. Then I can make another PR to handle Series there and alter the tests to the convention we agree upon.

What do you think?
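
The two conventions for duplicates discussed above can be sketched in plain Python. This is an illustration of the difference only, not of how nargsort or either sort_values implementation is written.

```python
values = ["b", "a", "b", "a"]
positions = range(len(values))

# Ascending, stable: equal elements keep their original relative order.
ascending = sorted(positions, key=lambda i: values[i])

# Convention 1: reverse the ascending result; ties end up reversed too.
reversed_ascending = ascending[::-1]

# Convention 2: a stable descending sort; ties keep their original order.
stable_descending = sorted(positions, key=lambda i: values[i], reverse=True)

print(ascending)           # [1, 3, 0, 2]
print(reversed_ascending)  # [2, 0, 3, 1]
print(stable_descending)   # [0, 2, 1, 3]
```

Both descending results put "b" before "a", but they disagree on which "b" comes first, which is exactly where the test expectations diverge.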

pep8speaks · 2020-08-13T13:28:07Z

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-06 04:24:45 UTC

AlexKirko · 2020-08-16T05:49:58Z

For some weird reason, we expect different behavior from Index.sort_values and Series.sort_values when ascending=False and the object contains duplicates. I'll give more details once I figure out a candidate solution that passes the tests without altering the expectation of this behavior.

AlexKirko · 2020-08-20T10:52:13Z

Guess I'll limit this PR to implementing missing values support for Index.sort_values using nargsort.

Switching the Series.sort_values to it, I think, belongs in another issue and PR, along with picking a convention for sorting duplicates and enforcing it across our test suite.

jreback

see my comments on the tests; we need to expand to test all index types here (of course can xfail ones which don't work for now)

jreback · 2020-08-21T22:38:18Z

pandas/tests/indexes/test_common.py

@@ -395,3 +395,26 @@ def test_astype_preserves_name(self, index, dtype, copy):
            assert result.names == index.names
        else:
            assert result.name == index.name
+
+
+@pytest.mark.parametrize("idx", [Index(["a", None, "c", None, "e"])])


ideally you can use the index fixture here? (might need to create a derivative one that has missing values), but we want to cover as many indexes as possible (i see your comment about PI), that you can xfail for now.

Made a new fixture with all kinds of indices.

jreback · 2020-08-21T22:40:21Z

Guess I'll limit this PR to implementing missing values support for Index.sort_values using nargsort.

Switching the Series.sort_values to it, I think, belongs in another issue and PR, along with picking a convention for sorting duplicates and enforcing it across our test suite.

yes absolutely. after this PR is merged we for sure want to combine implementations (there even might be some open issues about this); this impacted adding the key arg as well.

AlexKirko · 2020-08-24T10:17:21Z

Great, I'll make the changes tomorrow.

jreback · 2020-08-25T00:19:58Z

pandas/tests/indexes/period/test_ops.py

-            exp = np.array([2, 1, 3, 4, 0])
-            tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
-            _check_freq(ordered, idx)
+            # GH 35584. Index sort is now stable when sorting in descending order


can you create a new test and xfail it

Added the new test to the end of the file.

ok you don't need this comment then

Removed the comment.

AlexKirko · 2020-08-25T15:23:38Z

~~Made the changes to the tests, all green. Please take a look.~~
I wonder what's up with the test fails on Windows. Looks unrelated, will look into it next morning.

AlexKirko · 2020-08-26T08:43:05Z

@jreback
Added a fixture, reworked the tests. All green, please take a look whether this is what you had in mind.

jreback · 2020-08-27T02:16:48Z

pandas/tests/indexes/period/test_ops.py

-            exp = np.array([2, 1, 3, 4, 0])
-            tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
-            _check_freq(ordered, idx)
+            # GH 35584. Index sort is now stable when sorting in descending order


ok you don't need this comment then

jreback · 2020-08-27T02:17:13Z

pandas/tests/indexes/period/test_ops.py

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
        # warning for setter
        with pytest.raises(AttributeError, match="can't set attribute"):
            idx.freq = pd.offsets.Day()
+
+
+@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")


well it doesn't work :->, do we have an issue tracking this? pls add the reference in the xfail (and if we don't pls create an issue)

Couldn't find an existing issue, so I opened #35922

kk add the issue number to the xfail message

though I think the comprehensive test below covers this? (not averse to having this as well, just asking)

jreback · 2020-08-27T02:18:37Z

pandas/tests/indexes/test_common.py

+        tm.makeRangeIndex(10),
+        tm.makeCategoricalIndex(10),
+        tm.makeMultiIndex(10),
+        tm.makeDateIndex(10),


we already have the 'index' fixtures, pls use it.

if you need a derivative one, then you can create it similarly (e.g. use indices_dict). you could just add it in pandas/conftest.py, that is fine.

Sorry, didn't know that we keep all the general fixtures in pandas/conftest.py. I'll add another fixture there that adds missing values to the indices.
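
A rough sketch of what such a derivative fixture has to do, using plain lists as a stand-in for Index objects. The dict contents, the helper name `make_index_with_missing`, and the choice to blank the first and last values are assumptions made for illustration; the real fixture lives in pandas/conftest.py and builds on indices_dict.

```python
import copy

# Hypothetical stand-in for the shared dict of sample indices; plain lists
# are used here instead of pandas Index objects.
indices_dict = {
    "unicode": ["a", "b", "c", "d"],
    "float": [0.5, 1.5, 2.5, 3.5],
}


def make_index_with_missing(name):
    """Return a copy of the named sample with its first and last values missing."""
    # Deep copy so the shared samples are never mutated by a test.
    values = copy.deepcopy(indices_dict[name])
    values[0] = None
    values[-1] = None
    return values


with_missing = make_index_with_missing("unicode")
print(with_missing)             # [None, 'b', 'c', None]
print(indices_dict["unicode"])  # ['a', 'b', 'c', 'd']
```

The copy matters because the samples are shared across the whole test session; writing missing values into them in place would leak into unrelated tests.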

AlexKirko · 2020-08-28T09:46:57Z

Numpy dev fails now, but I don't see Index.sort_values being called in the failing tests. Will investigate why this is happening.

This reverts commit a3310ba.

AlexKirko · 2020-09-03T09:46:07Z

Waiting for the mypy fix (#36085). The py37_np16 appears to be unrelated too, will look into it.

AlexKirko · 2020-09-03T14:13:29Z

@jreback
Finally all green, please take another look.
Apologies for the delay: ran into an elusive bug with Index.copy on one of the pipelines, which I wasn't able to reproduce. Took a while to track down and circumvent.

jreback

looks good, couple more comments

jreback · 2020-09-04T15:18:17Z

pandas/conftest.py

+    # Azure pipeline that writes into indices_dict despite copy
+    ind = indices_dict[request.param].copy(deep=True)
+    vals = ind.values
+    if type(vals[0]) == tuple:


can you just check request.param == 'multindex'? as it's more explicit

jreback · 2020-09-04T15:26:50Z

pandas/core/sorting.py

+    try:
+        indexer = non_nan_idx[non_nans.argsort(kind=kind)]
+    except TypeError:
+        # For compatibility with Series: fall back to quicksort


we have a test that hits this?

Sorry, this is a remnant from trying to bring Series.sort_values in line with Index.sort_values. Reverted for now. This change doesn't belong in this PR. I'll bring it back in the next PR when I'll be synchronizing Index and Series behavior.

jreback · 2020-09-04T15:27:08Z

pandas/tests/indexes/period/test_ops.py

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
        # warning for setter
        with pytest.raises(AttributeError, match="can't set attribute"):
            idx.freq = pd.offsets.Day()
+
+
+@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")


kk add the issue number to the xfail message

jreback · 2020-09-04T15:27:42Z

pandas/tests/indexes/period/test_ops.py

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
        # warning for setter
        with pytest.raises(AttributeError, match="can't set attribute"):
            idx.freq = pd.offsets.Day()
+
+
+@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")


though I think the comprehensive test below covers this? (not averse to having this as well, just asking)

jreback · 2020-09-04T15:28:11Z

pandas/tests/indexes/test_common.py

+def test_sort_values_invalid_na_position(index_with_missing, na_position):
+
+    if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:
+        pytest.xfail("stable descending order sort not implemented")


e.g. maybe just enumerate the PeriodIndex case here (do the others have an issue?) pls open if they don't (you can make a single issue with checkboxes)

jreback · 2020-09-04T15:28:26Z

pandas/tests/indexes/test_common.py

+    # GH 35584. Test that sort_values works with missing values,
+    # sort non-missing and place missing according to na_position
+
+    if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:


@jreback
Sorry, I wasn't at all clear in the xfail messages as to how the tests are different. The one in period/test_ops.py checks whether PeriodIndex and another non-datetime-like subtype sort duplicates into the same order when ascending=False. As part of syncing duplicate sorting behavior in the next PR, I'd like to write a general test with an index_with_duplicates fixture to check for sorting stability and replace this test.

The tests in test_common.py xfail, because the na_position kwarg isn't implemented yet for datetime-like indices. Implementing it would go with implementing stable duplicate sorting behavior and break a bunch of tests. I also intend to deal with that in the next PR as I bring sort_values in sync between all Index subtypes and Series.

I've clarified the xfail messages everywhere and added comments with xrefs to the tests in test_common.py. Is that acceptable? I think keeping the tests separate makes sense for now.

Sorry if I misunderstood your question.

jreback

small comments, ping on green.

jreback · 2020-09-05T18:19:57Z

pandas/tests/indexes/test_common.py

+        # synchronizing duplicate-sorting behavior, because we currently expect
+        # them, other indices, and Series to sort differently (xref 35922)
+        pytest.xfail("sort_values does not support na_position kwarg")
+    elif type(index_with_missing) in [CategoricalIndex, MultiIndex]:


can you isinstance(....) instead
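
For context, the practical difference between the two checks shows up with subclasses. The classes below are empty stand-ins defined only for this sketch (they are not the pandas classes), and `CustomCategoricalIndex` is hypothetical.

```python
class Index:
    pass


class CategoricalIndex(Index):
    pass


class CustomCategoricalIndex(CategoricalIndex):  # hypothetical subclass
    pass


idx = CustomCategoricalIndex()

# Exact type comparison misses subclasses.
exact_match = type(idx) in [CategoricalIndex]

# isinstance accepts the class and anything derived from it.
isinstance_match = isinstance(idx, CategoricalIndex)

print(exact_match)       # False
print(isinstance_match)  # True
```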

AlexKirko · 2020-09-06T05:03:31Z

@jreback
Changes made, all green.

jreback · 2020-09-06T17:48:04Z

thanks @AlexKirko very nice! happy to take more to consolidate the Index/Series code for this :->

AlexKirko · 2020-09-07T05:01:20Z

Thanks!
A bit sick atm, but I'll get back to this in a few days with a new PR.

AlexKirko marked this pull request as draft August 7, 2020 12:50

AlexKirko added the labels Bug, Index (Related to the Index class or subclasses), and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) Aug 7, 2020

AlexKirko marked this pull request as ready for review August 11, 2020 07:05

AlexKirko requested a review from jreback August 12, 2020 07:55

jreback requested changes Aug 12, 2020

jreback requested changes Aug 21, 2020

jreback requested changes Aug 25, 2020

AlexKirko requested a review from jreback August 26, 2020 08:43

jreback requested changes Aug 27, 2020

AlexKirko mentioned this pull request Aug 27, 2020

BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922

Closed

3 tasks

AlexKirko force-pushed the ind-sort-values branch from bd1276b to a968f6e Compare August 31, 2020 09:45

AlexKirko added 5 commits August 31, 2020 12:51

BUG: attempt initial fix

076061b

TST: add test

3fa9351

CLN: run black

333c6e4

CLN: clean up unnecessary print

8414fd0

exempt MultiIndex from handling missing values

fa12898

AlexKirko added 6 commits September 3, 2020 10:25

TST: try immediate return instead of deep copy

a3310ba

Revert "TST: try immediate return instead of deep copy"

e80bc9d

This reverts commit a3310ba.

REFACT: add immediate returns

2007699

Merge branch 'master' into ind-sort-values

8fd5a77

DOC: add comment to conftest clarifying deep copy

2e39294

restart tests

d78a9a2

Merge branch 'master' into ind-sort-values

5715c31

AlexKirko requested a review from jreback September 3, 2020 14:13

CLN: remove test-output.xml

bfd2e9c

jreback requested changes Sep 4, 2020

AlexKirko added 4 commits September 5, 2020 07:24

TST: check for MultiIndex through request.param

54a6e82

CLN: revert changes to nargsort

f60d2a8

DOC: add issue xref to xfail reason in period/test_ops.py

af92fe8

DOC: clarify xfail reasons and add comments

6d33657

jreback requested changes Sep 5, 2020

jreback added this to the 1.2 milestone Sep 5, 2020

REFACT: switch to isinstance, add blank line

4935309

AlexKirko requested a review from jreback September 6, 2020 05:03

jreback approved these changes Sep 6, 2020

jreback merged commit ba552ec into pandas-dev:master Sep 6, 2020

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Sep 8, 2020

BUG: allow missing values in Index when calling Index.sort_values (pa…

366f63c

…ndas-dev#35604)

junjunjunk mentioned this pull request Sep 20, 2020

Index.sort_values puts missing values at the start with ascending=False #31220

Closed

AlexKirko mentioned this pull request Oct 23, 2020

BUG: stabilize sort_values algorithms for Series and time-like Indices #37310

Merged

5 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: allow missing values in Index when calling Index.sort_values (pa…

f2adbee

…ndas-dev#35604)

AlexKirko deleted the ind-sort-values branch January 22, 2022 10:07
