Skip to content

BUG: allow missing values in Index when calling Index.sort_values #35604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 63 commits into from
Sep 6, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Aug 7, 2020

Problem

Index.sort_values breaks when Index includes None. The OP points out that a Series doesn't break in a similar case. In my opinion, there is a number of ways missing values can creep into an Index, and sort_values shoudn't be breaking.

Details

When we do Series.sort_values we shunt the missing values to the end or the beginning of the Series depending on na_position, and then argsort the rest. When we currently do sort_values on an Index, we try to argsort the whole thing, and it expectedly breaks.

Solution and alternatives

I propose we do the same stuff for Index that Series does, that is send the missing values to the end or the beginning. Added the na_position kwarg that works similar to how it works in Series.

Another possible solution is to explicitly forbid Indices with missing values and raise a clear error. I don't think it's advisable, since there isn't much of a practical difference between non-unique indices, which we allow, and this case.

@AlexKirko
Copy link
Member Author

Couldn't find a better home for the test than test_common.py in indixes. Maybe there is a better place for it?

@AlexKirko AlexKirko marked this pull request as draft August 7, 2020 12:50
@AlexKirko
Copy link
Member Author

Exempted MultiIndex from this fix, as it's completely unclear where you should put missing values in this case.

@AlexKirko AlexKirko added Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Aug 7, 2020
@AlexKirko
Copy link
Member Author

Hm. There are a bunch of tests that expect NaNs to be at the beginning. Will have to add na_position to the function signature.

@AlexKirko AlexKirko marked this pull request as ready for review August 11, 2020 07:05
@AlexKirko
Copy link
Member Author

AlexKirko commented Aug 11, 2020

I think this is ready for review. Since this PR changes the API, I put it in 1.2.0 whatsnew.

@AlexKirko AlexKirko requested a review from jreback August 12, 2020 07:55
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. opportunity to clean up some code here.

self,
return_indexer=False,
ascending=True,
key: Optional[Callable] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you order these the same as in series (na_position before key)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_as[: np.sum(good)] = _as[: np.sum(good)][::-1]
elif na_position == "first":
_as = np.concatenate([_as[bad], _as[good][idx[good].argsort()]])
if not ascending:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share this code with the series impl (might be better / easier to create a helper to actually do this) and put it in pandas/core/sorting.py

Copy link
Member Author

@AlexKirko AlexKirko Aug 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Found such a function in sorting.py: it's called nargsort. With a minor addition to insure that mergesort defaults to quicksort when sorting object dtype, it can be used for our purposes.

But I've uncovered a bit of a problem when trying to unify Index and Series behavior: when we set ascending=False and the object containes duplicates, in general, we expect different behavior from Series.sort_index and Series.sort_values. We expect sort_values to reverse the order of equal elements, but we expect sort_index to be stable and maintain it. This also creates weirdness in DataFrame.sort_values.

I've switched Series.sort_values to using nargsort and tried adding an argument to nargsort to signify if we want to preserve order or not, but there are still edge cases in the test sutie because of this inconsistency.

It is probably best to settle on one convention and carefully alter all existing tests to fit it. I personally think that equal elements should maintain order when we sort in descending order, that is, the descending sort should also be stable. I don't know if we should do this in this PR though, or leave Series as it is and only add NaN-handling to Index.sort_values through nargsort. Then I can make another PR to handle Series there and alter the tests to the convention we agree upon.

What do you think?

@pep8speaks
Copy link

pep8speaks commented Aug 13, 2020

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-06 04:24:45 UTC

@AlexKirko
Copy link
Member Author

For some weird reason, we expect different behavior from Index.sort_values and Series.sort_values when ascending=False and the object contains duplicates. I'll give more details once I figure out a candidate solution that passes the tests without altering the expectation of this behavior.

@AlexKirko
Copy link
Member Author

Guess I'll limit this PR to implementing missing values support for Index.sort_values using nargsort.

Switching the Series.sort_values to it, I think, belongs in another issue and PR, along with picking a convention for sorting duplicates and enforcing it across our test suite.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments on the tests; we need to expand to test all index types here (of course can xfail ones which don't work for now)

@@ -395,3 +395,26 @@ def test_astype_preserves_name(self, index, dtype, copy):
assert result.names == index.names
else:
assert result.name == index.name


@pytest.mark.parametrize("idx", [Index(["a", None, "c", None, "e"])])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally you can use the index fix here ? (might need to create a derivative one that has missing values), but we want to cover all indexes as possible (i see your comment about PI), that you can xfail fo rnow.

Copy link
Member Author

@AlexKirko AlexKirko Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a new fixture with all kinds of indices.

@jreback
Copy link
Contributor

jreback commented Aug 21, 2020

Guess I'll limit this PR to implementing missing values support for Index.sort_values using nargsort.

Switching the Series.sort_values to it, I think, belongs in another issue and PR, along with picking a convention for sorting duplicates and enforcing it across our test suite.

yes absolutely. after this PR is merged we for sure want to combine implementations (there even might be some open issues about this); this impacted adding the key arg as well.

@AlexKirko
Copy link
Member Author

Great, I'll make the changes tomorrow.

exp = np.array([2, 1, 3, 4, 0])
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
_check_freq(ordered, idx)
# GH 35584. Index sort is now stable when sorting in descending order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create a new test and xfail it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the new test to the end of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok you don't need this comment then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

@AlexKirko
Copy link
Member Author

AlexKirko commented Aug 25, 2020

Made the changes to the tests, all green. Please take a look.
I wonder what's up with the test fails on Windows. Looks unrelated, will look into it next morning.

@AlexKirko
Copy link
Member Author

@jreback
Added a fixture, reworked the tests. All green, please take a look whether this is what you had in mind.

@AlexKirko AlexKirko requested a review from jreback August 26, 2020 08:43
exp = np.array([2, 1, 3, 4, 0])
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
_check_freq(ordered, idx)
# GH 35584. Index sort is now stable when sorting in descending order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok you don't need this comment then

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
# warning for setter
with pytest.raises(AttributeError, match="can't set attribute"):
idx.freq = pd.offsets.Day()


@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well its doesn't work :->, do we have an issue tracking this, pls add the refernce in the xfail (and if we don't pls create an issue)

Copy link
Member Author

@AlexKirko AlexKirko Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't find an existing issue, so I opened #35922

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk add the issue number to the xfail message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though I think the comprehensive test below covers this? (not averse to having this as well, just asking)

tm.makeRangeIndex(10),
tm.makeCategoricalIndex(10),
tm.makeMultiIndex(10),
tm.makeDateIndex(10),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have the 'index' fixtures, pls use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you need a derivative one, then you can create it similary (e.g. use indices_dict). you could just add it in the pandas/conftest.py is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't know that we keep all the general fixtures in pandas/conftest.py. I'll add another fixture there that adds missing values to the indices.

@AlexKirko
Copy link
Member Author

Numpy dev fails now, but I don't see Index.sort_values being called in the failing tests. Will investigate why this is happening.

@AlexKirko
Copy link
Member Author

AlexKirko commented Sep 3, 2020

Waiting for the mypy fix (#36085). The py37_np16 appears to be unrelated too, will look into it.

@AlexKirko
Copy link
Member Author

AlexKirko commented Sep 3, 2020

@jreback
Finally all green, please take another look.
Apologies for the delay: ran into an elusive bug with Index.copy on one of the pipelines, which I wasn't able to reproduce. Took a while to track down and circumvent.

@AlexKirko AlexKirko requested a review from jreback September 3, 2020 14:13
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, couple more comments

# Azure pipeline that writes into indices_dict despite copy
ind = indices_dict[request.param].copy(deep=True)
vals = ind.values
if type(vals[0]) == tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just check request.parm == 'multindex'? as its more explict

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

try:
indexer = non_nan_idx[non_nans.argsort(kind=kind)]
except TypeError:
# For compatibility with Series: fall back to quicksort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a test that hits this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is a remnant from trying to bring Series.sort_values in line with Index.sort_values. Reverted for now. This change doesn't belong in this PR. I'll bring it back in the next PR when I'll be synchronizing Index and Series behavior.

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
# warning for setter
with pytest.raises(AttributeError, match="can't set attribute"):
idx.freq = pd.offsets.Day()


@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk add the issue number to the xfail message

@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self):
# warning for setter
with pytest.raises(AttributeError, match="can't set attribute"):
idx.freq = pd.offsets.Day()


@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though I think the comprehensive test below covers this? (not averse to having this as well, just asking)

def test_sort_values_invalid_na_position(index_with_missing, na_position):

if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:
pytest.xfail("stable descending order sort not implemented")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. maybe just enumerate the PeriodIndex case here (do the others have an issue?) pls open if they don't (you can make a single issue with checkboxes)

# GH 35584. Test that sort_values works with missing values,
# sort non-missing and place missing according to na_position

if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Sorry, I wasn't at all clear in the xfail messages as to how the tests are different. The one in period/test_ops.py checks whether PeriodIndex and another non-datetime-like subtype sort duplicates into the same order when descending=True. As part of syncing duplicate sorting behavior in the next PR, I'd like to write a general test with an index_with_duplicates fixture to check for sorting stability and replace this test.

The tests in test_common.py xfail, because the na_position kwarg isn't implemented yet for datetime-like indices. Implementing it would go with implementing stable duplicate sorting behavior and break a bunch of tests. I also intend to deal with that in the next PR as I bring sort_values in sync between all Index subtypes and Series.

I've clarified the xfail messages everywhere and added comments with xrefs to the tests in test_common.py. Is that acceptable? I think keeping the tests separate makes sense for now.

Sorry if I misunderstood your question.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments, ping on green.

# synchronizing duplicate-sorting behavior, because we currently expect
# them, other indices, and Series to sort differently (xref 35922)
pytest.xfail("sort_values does not support na_position kwarg")
elif type(index_with_missing) in [CategoricalIndex, MultiIndex]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you isinstance(....) instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback jreback added this to the 1.2 milestone Sep 5, 2020
@AlexKirko
Copy link
Member Author

@jreback
Changes made, all green.

@AlexKirko AlexKirko requested a review from jreback September 6, 2020 05:03
@jreback jreback merged commit ba552ec into pandas-dev:master Sep 6, 2020
@jreback
Copy link
Contributor

jreback commented Sep 6, 2020

thanks @AlexKirko very nice! happy to take more to consolidate the Index/Series code for this :->

@AlexKirko
Copy link
Member Author

Thanks!
A bit sick atm, but I'll get back to this in a few days with a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Index.sort_values fails with TypeError
3 participants