Skip to content

ENH: Support na_position for sort_index and sortlevel #51672

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 8, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ enhancement2

Other enhancements
^^^^^^^^^^^^^^^^^^
- :meth:`MultiIndex.sortlevel` and :meth:`Index.sortlevel` gained a new keyword ``na_position`` (:issue:`51612`)
- Improve error message when setting :class:`DataFrame` with wrong number of columns through :meth:`DataFrame.isetitem` (:issue:`51701`)
-

Expand Down Expand Up @@ -108,6 +109,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.first_valid_index` and :meth:`DataFrame.last_valid_index` for extension array dtypes (:issue:`51549`)
- Performance improvement in :meth:`DataFrame.where` when ``cond`` is backed by an extension dtype (:issue:`51574`)
- Performance improvement in :meth:`read_orc` when reading a remote URI file path. (:issue:`51609`)
- Performance improvement in :meth:`MultiIndex.sortlevel` when ``ascending`` is a list (:issue:`51612`)
- Performance improvement in :meth:`~arrays.ArrowExtensionArray.isna` when array has zero nulls or is all nulls (:issue:`51630`)

.. ---------------------------------------------------------------------------
Expand Down
15 changes: 13 additions & 2 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1893,7 +1893,11 @@ def _get_level_number(self, level) -> int:
return 0

def sortlevel(
self, level=None, ascending: bool | list[bool] = True, sort_remaining=None
self,
level=None,
ascending: bool | list[bool] = True,
sort_remaining=None,
na_position: str_t = "first",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl - any reason to default to first here? With a quick grep I'm seeing we default to last everywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure that this was done for backwards compatibility reasons. Does this break anything?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope - makes sense. I'll experiment with changing the default and see what breaks; depending on that may put up an issue to deprecate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections to deprecate. IIRC this is used internally a bunch, so deprecating might be tricky

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I meant just deprecating the default and changing it to "last" to be consistent with other sorting methods. We can still specify "first" internally so hopefully it will be straight forward unless I'm missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got you. No that makes sense

):
"""
For internal compatibility with the Index API.
Expand All @@ -1904,6 +1908,11 @@ def sortlevel(
----------
ascending : bool, default True
False to sort in descending order
na_position : {'first' or 'last'}, default 'first'
Argument 'first' puts NaNs at the beginning, 'last' puts NaNs at
the end.

.. versionadded:: 2.1.0

level, sort_remaining are compat parameters

Expand All @@ -1925,7 +1934,9 @@ def sortlevel(
if not isinstance(ascending, bool):
raise TypeError("ascending must be a bool value")

return self.sort_values(return_indexer=True, ascending=ascending)
return self.sort_values(
return_indexer=True, ascending=ascending, na_position=na_position
)

def _get_level_values(self, level) -> Index:
"""
Expand Down
42 changes: 14 additions & 28 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,6 @@
from pandas.core.ops.invalid import make_invalid_op
from pandas.core.sorting import (
get_group_index,
indexer_from_factorized,
lexsort_indexer,
)

Expand Down Expand Up @@ -2367,6 +2366,7 @@ def sortlevel(
level: IndexLabel = 0,
ascending: bool | list[bool] = True,
sort_remaining: bool = True,
na_position: str = "first",
) -> tuple[MultiIndex, npt.NDArray[np.intp]]:
"""
Sort MultiIndex at the requested level.
Expand All @@ -2383,6 +2383,11 @@ def sortlevel(
False to sort in descending order.
Can also be a list to specify a directed ordering.
sort_remaining : sort by the remaining levels after level
na_position : {'first' or 'last'}, default 'first'
Argument 'first' puts NaNs at the beginning, 'last' puts NaNs at
the end.

.. versionadded:: 2.1.0

Returns
-------
Expand Down Expand Up @@ -2428,40 +2433,21 @@ def sortlevel(
]
sortorder = None

codes = [self.codes[lev] for lev in level]
# we have a directed ordering via ascending
if isinstance(ascending, list):
if not len(level) == len(ascending):
raise ValueError("level must have same length as ascending")

indexer = lexsort_indexer(
[self.codes[lev] for lev in level], orders=ascending
elif sort_remaining:
codes.extend(
[self.codes[lev] for lev in range(len(self.levels)) if lev not in level]
)

# level ordering
else:
codes = list(self.codes)
shape = list(self.levshape)

# partition codes and shape
primary = tuple(codes[lev] for lev in level)
primshp = tuple(shape[lev] for lev in level)

# Reverse sorted to retain the order of
# smaller indices that needs to be removed
for lev in sorted(level, reverse=True):
codes.pop(lev)
shape.pop(lev)

if sort_remaining:
primary += primary + tuple(codes)
primshp += primshp + tuple(shape)
else:
sortorder = level[0]

indexer = indexer_from_factorized(primary, primshp, compress=False)
sortorder = level[0]

if not ascending:
indexer = indexer[::-1]
indexer = lexsort_indexer(
codes, orders=ascending, na_position=na_position, codes_given=True
)

indexer = ensure_platform_int(indexer)
new_codes = [level_codes.take(indexer) for level_codes in self.codes]
Expand Down
41 changes: 30 additions & 11 deletions pandas/core/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,11 +83,14 @@ def get_indexer_indexer(

if level is not None:
_, indexer = target.sortlevel(
level, ascending=ascending, sort_remaining=sort_remaining
level,
ascending=ascending,
sort_remaining=sort_remaining,
na_position=na_position,
)
elif isinstance(target, ABCMultiIndex):
indexer = lexsort_indexer(
target._get_codes_for_sorting(), orders=ascending, na_position=na_position
target.codes, orders=ascending, na_position=na_position, codes_given=True
)
else:
# Check monotonic-ness before sort an index (GH 11080)
Expand Down Expand Up @@ -302,7 +305,11 @@ def indexer_from_factorized(


def lexsort_indexer(
keys, orders=None, na_position: str = "last", key: Callable | None = None
keys,
orders=None,
na_position: str = "last",
key: Callable | None = None,
codes_given: bool = False,
) -> npt.NDArray[np.intp]:
"""
Performs lexical sorting on a set of keys
Expand All @@ -321,6 +328,8 @@ def lexsort_indexer(
Determines placement of NA elements in the sorted list ("last" or "first")
key : Callable, optional
Callable key function applied to every element in keys before sorting
codes_given: bool, False
Avoid categorical materialization if codes are already provided.

Returns
-------
Expand All @@ -338,15 +347,27 @@ def lexsort_indexer(
keys = [ensure_key_mapped(k, key) for k in keys]

for k, order in zip(keys, orders):
cat = Categorical(k, ordered=True)

if na_position not in ["last", "first"]:
raise ValueError(f"invalid na_position: {na_position}")

n = len(cat.categories)
codes = cat.codes.copy()
if codes_given:
mask = k == -1
codes = k.copy()
n = len(codes)
mask_n = n
if mask.any():
n -= 1

else:
cat = Categorical(k, ordered=True)
n = len(cat.categories)
codes = cat.codes.copy()
mask = cat.codes == -1
if mask.any():
mask_n = n + 1
else:
mask_n = n

mask = cat.codes == -1
if order: # ascending
if na_position == "last":
codes = np.where(mask, n, codes)
Expand All @@ -357,10 +378,8 @@ def lexsort_indexer(
codes = np.where(mask, n, n - codes - 1)
elif na_position == "first":
codes = np.where(mask, 0, n - codes)
if mask.any():
n += 1

shape.append(n)
shape.append(mask_n)
labels.append(codes)

return indexer_from_factorized(labels, tuple(shape))
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/frame/methods/test_sort_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -907,3 +907,10 @@ def test_sort_index_multiindex_sparse_column(self):
result = expected.sort_index(level=0)

tm.assert_frame_equal(result, expected)

def test_sort_index_na_position(self):
# GH#51612
df = DataFrame([1, 2], index=MultiIndex.from_tuples([(1, 1), (1, pd.NA)]))
expected = df.copy()
result = df.sort_index(level=[0, 1], na_position="last")
tm.assert_frame_equal(result, expected)
8 changes: 8 additions & 0 deletions pandas/tests/indexes/multi/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,14 @@ def test_sortlevel_deterministic():
assert sorted_idx.equals(expected[::-1])


def test_sortlevel_na_position():
# GH#51612
midx = MultiIndex.from_tuples([(1, np.nan), (1, 1)])
result = midx.sortlevel(level=[0, 1], na_position="last")[0]
expected = MultiIndex.from_tuples([(1, 1), (1, np.nan)])
tm.assert_index_equal(result, expected)


def test_numpy_argsort(idx):
result = np.argsort(idx)
expected = idx.argsort()
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/indexes/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1262,6 +1262,13 @@ def test_sortlevel(self):
result = index.sortlevel(ascending=False)
tm.assert_index_equal(result[0], expected)

def test_sortlevel_na_position(self):
# GH#51612
idx = Index([1, np.nan])
result = idx.sortlevel(na_position="first")[0]
expected = Index([np.nan, 1])
tm.assert_index_equal(result, expected)


class TestMixedIntIndex(Base):
# Mostly the tests from common.py for which the results differ
Expand Down