-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: allow missing values in Index when calling Index.sort_values #35604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Couldn't find a better home for the test than |
Exempted MultiIndex from this fix, as it's completely unclear where you should put missing values in this case. |
Hm. There are a bunch of tests that expect NaNs to be at the beginning. Will have to add |
I think this is ready for review. Since this PR changes the API, I put it in 1.2.0 whatsnew. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. opportunity to clean up some code here.
pandas/core/indexes/base.py
Outdated
self, | ||
return_indexer=False, | ||
ascending=True, | ||
key: Optional[Callable] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you order these the same as in series (na_position before key)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/indexes/base.py
Outdated
_as[: np.sum(good)] = _as[: np.sum(good)][::-1] | ||
elif na_position == "first": | ||
_as = np.concatenate([_as[bad], _as[good][idx[good].argsort()]]) | ||
if not ascending: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share this code with the series impl (might be better / easier to create a helper to actually do this) and put it in pandas/core/sorting.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Found such a function in sorting.py
: it's called nargsort
. With a minor addition to insure that mergesort defaults to quicksort when sorting object dtype, it can be used for our purposes.
But I've uncovered a bit of a problem when trying to unify Index
and Series
behavior: when we set ascending=False
and the object containes duplicates, in general, we expect different behavior from Series.sort_index
and Series.sort_values
. We expect sort_values
to reverse the order of equal elements, but we expect sort_index
to be stable and maintain it. This also creates weirdness in DataFrame.sort_values
.
I've switched Series.sort_values
to using nargsort
and tried adding an argument to nargsort
to signify if we want to preserve order or not, but there are still edge cases in the test sutie because of this inconsistency.
It is probably best to settle on one convention and carefully alter all existing tests to fit it. I personally think that equal elements should maintain order when we sort in descending order, that is, the descending sort should also be stable. I don't know if we should do this in this PR though, or leave Series as it is and only add NaN-handling to Index.sort_values
through nargsort. Then I can make another PR to handle Series there and alter the tests to the convention we agree upon.
What do you think?
Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-09-06 04:24:45 UTC |
For some weird reason, we expect different behavior from |
Guess I'll limit this PR to implementing missing values support for Switching the Series.sort_values to it, I think, belongs in another issue and PR, along with picking a convention for sorting duplicates and enforcing it across our test suite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments on the tests; we need to expand to test all index types here (of course can xfail ones which don't work for now)
pandas/tests/indexes/test_common.py
Outdated
@@ -395,3 +395,26 @@ def test_astype_preserves_name(self, index, dtype, copy): | |||
assert result.names == index.names | |||
else: | |||
assert result.name == index.name | |||
|
|||
|
|||
@pytest.mark.parametrize("idx", [Index(["a", None, "c", None, "e"])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally you can use the index fix here ? (might need to create a derivative one that has missing values), but we want to cover all indexes as possible (i see your comment about PI), that you can xfail fo rnow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a new fixture with all kinds of indices.
yes absolutely. after this PR is merged we for sure want to combine implementations (there even might be some open issues about this); this impacted adding the key arg as well. |
Great, I'll make the changes tomorrow. |
exp = np.array([2, 1, 3, 4, 0]) | ||
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) | ||
_check_freq(ordered, idx) | ||
# GH 35584. Index sort is now stable when sorting in descending order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you create a new test and xfail it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the new test to the end of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok you don't need this comment then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the comment.
|
@jreback |
exp = np.array([2, 1, 3, 4, 0]) | ||
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) | ||
_check_freq(ordered, idx) | ||
# GH 35584. Index sort is now stable when sorting in descending order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok you don't need this comment then
@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self): | |||
# warning for setter | |||
with pytest.raises(AttributeError, match="can't set attribute"): | |||
idx.freq = pd.offsets.Day() | |||
|
|||
|
|||
@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well its doesn't work :->, do we have an issue tracking this, pls add the refernce in the xfail (and if we don't pls create an issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't find an existing issue, so I opened #35922
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk add the issue number to the xfail message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though I think the comprehensive test below covers this? (not averse to having this as well, just asking)
pandas/tests/indexes/test_common.py
Outdated
tm.makeRangeIndex(10), | ||
tm.makeCategoricalIndex(10), | ||
tm.makeMultiIndex(10), | ||
tm.makeDateIndex(10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have the 'index' fixtures, pls use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you need a derivative one, then you can create it similary (e.g. use indices_dict). you could just add it in the pandas/conftest.py is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't know that we keep all the general fixtures in pandas/conftest.py
. I'll add another fixture there that adds missing values to the indices.
Numpy dev fails now, but I don't see Index.sort_values being called in the failing tests. Will investigate why this is happening. |
bd1276b
to
a968f6e
Compare
This reverts commit a3310ba.
Waiting for the mypy fix (#36085). The |
@jreback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, couple more comments
pandas/conftest.py
Outdated
# Azure pipeline that writes into indices_dict despite copy | ||
ind = indices_dict[request.param].copy(deep=True) | ||
vals = ind.values | ||
if type(vals[0]) == tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just check request.parm
== 'multindex'? as its more explict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/sorting.py
Outdated
try: | ||
indexer = non_nan_idx[non_nans.argsort(kind=kind)] | ||
except TypeError: | ||
# For compatibility with Series: fall back to quicksort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a test that hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is a remnant from trying to bring Series.sort_values
in line with Index.sort_values
. Reverted for now. This change doesn't belong in this PR. I'll bring it back in the next PR when I'll be synchronizing Index
and Series
behavior.
@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self): | |||
# warning for setter | |||
with pytest.raises(AttributeError, match="can't set attribute"): | |||
idx.freq = pd.offsets.Day() | |||
|
|||
|
|||
@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk add the issue number to the xfail message
@@ -333,3 +334,16 @@ def test_freq_setter_deprecated(self): | |||
# warning for setter | |||
with pytest.raises(AttributeError, match="can't set attribute"): | |||
idx.freq = pd.offsets.Day() | |||
|
|||
|
|||
@pytest.mark.xfail(reason="PeriodIndex.sort_values currently unstable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though I think the comprehensive test below covers this? (not averse to having this as well, just asking)
pandas/tests/indexes/test_common.py
Outdated
def test_sort_values_invalid_na_position(index_with_missing, na_position): | ||
|
||
if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]: | ||
pytest.xfail("stable descending order sort not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. maybe just enumerate the PeriodIndex case here (do the others have an issue?) pls open if they don't (you can make a single issue with checkboxes)
pandas/tests/indexes/test_common.py
Outdated
# GH 35584. Test that sort_values works with missing values, | ||
# sort non-missing and place missing according to na_position | ||
|
||
if type(index_with_missing) in [DatetimeIndex, PeriodIndex, TimedeltaIndex]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Sorry, I wasn't at all clear in the xfail messages as to how the tests are different. The one in period/test_ops.py
checks whether PeriodIndex
and another non-datetime-like subtype sort duplicates into the same order when descending=True
. As part of syncing duplicate sorting behavior in the next PR, I'd like to write a general test with an index_with_duplicates
fixture to check for sorting stability and replace this test.
The tests in test_common.py
xfail, because the na_position
kwarg isn't implemented yet for datetime-like indices. Implementing it would go with implementing stable duplicate sorting behavior and break a bunch of tests. I also intend to deal with that in the next PR as I bring sort_values in sync between all Index subtypes and Series.
I've clarified the xfail messages everywhere and added comments with xrefs to the tests in test_common.py
. Is that acceptable? I think keeping the tests separate makes sense for now.
Sorry if I misunderstood your question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, ping on green.
pandas/tests/indexes/test_common.py
Outdated
# synchronizing duplicate-sorting behavior, because we currently expect | ||
# them, other indices, and Series to sort differently (xref 35922) | ||
pytest.xfail("sort_values does not support na_position kwarg") | ||
elif type(index_with_missing) in [CategoricalIndex, MultiIndex]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you isinstance(....) instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jreback |
thanks @AlexKirko very nice! happy to take more to consolidate the Index/Series code for this :-> |
Thanks! |
Index.sort_values
fails with TypeError #35584, xref BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Problem
Index.sort_values
breaks whenIndex
includesNone
. The OP points out that aSeries
doesn't break in a similar case. In my opinion, there is a number of ways missing values can creep into an Index, andsort_values
shoudn't be breaking.Details
When we do
Series.sort_values
we shunt the missing values to the end or the beginning of the Series depending onna_position
, and thenargsort
the rest. When we currently dosort_values
on anIndex
, we try toargsort
the whole thing, and it expectedly breaks.Solution and alternatives
I propose we do the same stuff for
Index
thatSeries
does, that is send the missing values to the end or the beginning. Added thena_position
kwarg that works similar to how it works inSeries
.Another possible solution is to explicitly forbid Indices with missing values and raise a clear error. I don't think it's advisable, since there isn't much of a practical difference between non-unique indices, which we allow, and this case.