Commit dec736f

ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting (pandas-dev#27237)
1 parent 51018f6 commit dec736f

19 files changed, +909 -97 lines changed

doc/source/user_guide/basics.rst

+58
@@ -1781,6 +1781,31 @@ used to sort a pandas object by its index levels.
    # Series
    unsorted_df['three'].sort_index()
 
+.. _basics.sort_index_key:
+
+.. versionadded:: 1.1.0
+
+Sorting by index also supports a ``key`` parameter that takes a callable
+function to apply to the index being sorted. For `MultiIndex` objects,
+the key is applied per-level to the levels specified by `level`.
+
+.. ipython:: python
+
+   s1 = pd.DataFrame({
+       "a": ['B', 'a', 'C'],
+       "b": [1, 2, 3],
+       "c": [2, 3, 4]
+   }).set_index(list("ab"))
+   s1
+
+.. ipython:: python
+
+   s1.sort_index(level="a")
+   s1.sort_index(level="a", key=lambda idx: idx.str.lower())
+
+For information on key sorting by value, see :ref:`value sorting
+<basics.sort_value_key>`.
+
 .. _basics.sort_values:
 
 By values
@@ -1813,6 +1838,39 @@ argument:
    s.sort_values()
    s.sort_values(na_position='first')
 
+.. _basics.sort_value_key:
+
+.. versionadded:: 1.1.0
+
+Sorting also supports a ``key`` parameter that takes a callable function
+to apply to the values being sorted.
+
+.. ipython:: python
+
+   s1 = pd.Series(['B', 'a', 'C'])
+
+.. ipython:: python
+
+   s1.sort_values()
+   s1.sort_values(key=lambda x: x.str.lower())
+
+`key` will be given the :class:`Series` of values and should return a ``Series``
+or array of the same shape with the transformed values. For `DataFrame` objects,
+the key is applied per column, so the key should still expect a Series and return
+a Series, e.g.
+
+.. ipython:: python
+
+   df = pd.DataFrame({"a": ['B', 'a', 'C'], "b": [1, 2, 3]})
+
+.. ipython:: python
+
+   df.sort_values(by='a')
+   df.sort_values(by='a', key=lambda col: col.str.lower())
+
+The name or type of each column can be used to apply different functions to
+different columns.
+
 .. _basics.sort_indexes_and_values:
 
 By indexes and values
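
The closing remark in the added `sort_values` text, that a column's name can select a different transformation, could look like the minimal sketch below. The frame contents and the helper name `by_column` are illustrative assumptions and do not appear in the commit.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the commit).
df = pd.DataFrame({"a": ["B", "a", "C", "a"], "b": [2, 1, 3, -4]})


def by_column(col: pd.Series) -> pd.Series:
    """Hypothetical key: pick a transformation from the column name."""
    if col.name == "a":
        return col.str.lower()  # case-insensitive ordering for strings
    if col.name == "b":
        return col.abs()  # order numbers by magnitude
    return col  # any other column is sorted as-is


# The key is called once per column listed in `by`.
result = df.sort_values(by=["a", "b"], key=by_column)
print(result)
```

The same dispatch could be written against `col.dtype` instead of `col.name` when the rule depends on the column's type rather than its label.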

doc/source/whatsnew/v1.1.0.rst

+47
@@ -36,6 +36,53 @@ For example:
    ser["2014"]
    ser.loc["May 2015"]
 
+.. _whatsnew_110.key_sorting:
+
+Sorting with keys
+^^^^^^^^^^^^^^^^^
+
+We've added a ``key`` argument to the DataFrame and Series sorting methods, including
+:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
+and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
+column-by-column to each column used for sorting, before sorting is performed (:issue:`27237`).
+See :ref:`sort_values with keys <basics.sort_value_key>` and :ref:`sort_index with keys
+<basics.sort_index_key>` for more information.
+
+.. ipython:: python
+
+   s = pd.Series(['C', 'a', 'B'])
+   s
+
+.. ipython:: python
+
+   s.sort_values()
+
+
+Note how this is sorted with capital letters first. If we apply the :meth:`Series.str.lower`
+method, we get
+
+.. ipython:: python
+
+   s.sort_values(key=lambda x: x.str.lower())
+
+
+When applied to a `DataFrame`, the key is applied per-column to all columns or a subset if
+`by` is specified, e.g.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
+                      'b': [1, 2, 3, 4, 5, 6]})
+   df
+
+.. ipython:: python
+
+   df.sort_values(by=['a'], key=lambda col: col.str.lower())
+
+
+For more details, see examples and documentation in :meth:`DataFrame.sort_values`,
+:meth:`Series.sort_values`, and :meth:`~DataFrame.sort_index`.
+
 .. _whatsnew_110.timestamp_fold_support:
 
 Fold argument support in Timestamp constructor

pandas/_typing.py

+6
@@ -75,7 +75,13 @@
 
 # to maintain type information across generic functions and parametrization
 T = TypeVar("T")
+
 # used in decorators to preserve the signature of the function it decorates
 # see https://mypy.readthedocs.io/en/stable/generics.html#declaring-decorators
 FuncType = Callable[..., Any]
 F = TypeVar("F", bound=FuncType)
+
+# types of vectorized key functions for DataFrame::sort_values and
+# DataFrame::sort_index, among others
+ValueKeyFunc = Optional[Callable[["Series"], Union["Series", AnyArrayLike]]]
+IndexKeyFunc = Optional[Callable[["Index"], Union["Index", AnyArrayLike]]]
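
To show what shape of callable the two aliases describe, here is a small sketch. It defines local stand-ins (`ValueKey`, `IndexKey`) rather than importing from the private `pandas._typing` module, and it substitutes `np.ndarray` for `AnyArrayLike`; both choices are assumptions made for illustration only.

```python
from typing import Callable, Optional, Union

import numpy as np
import pandas as pd

# Local stand-ins for the aliases in the diff (assumed, simplified forms).
ValueKey = Optional[Callable[[pd.Series], Union[pd.Series, np.ndarray]]]
IndexKey = Optional[Callable[[pd.Index], Union[pd.Index, np.ndarray]]]


def lower_values(values: pd.Series) -> pd.Series:
    # Receives the whole Series at once and returns one of the same shape.
    return values.str.lower()


def lower_labels(labels: pd.Index) -> pd.Index:
    # Receives the whole Index at once and returns one of the same shape.
    return labels.str.lower()


value_key: ValueKey = lower_values
index_key: IndexKey = lower_labels

by_value = pd.Series(["B", "a", "C"]).sort_values(key=value_key)
by_label = pd.Series([1, 2, 3], index=["b", "A", "c"]).sort_index(key=index_key)
```

Both aliases are `Optional`, which matches the `key=None` default in the method signatures.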

pandas/conftest.py

+9
@@ -1189,3 +1189,12 @@ def tick_classes(request):
     Fixture for Tick based datetime offsets available for a time series.
     """
     return request.param
+
+
+@pytest.fixture(params=[None, lambda x: x])
+def sort_by_key(request):
+    """
+    Simple fixture for testing keys in sorting methods.
+    Tests None (no key) and the identity key.
+    """
+    return request.param
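
The property this fixture is meant to exercise is that `None` and the identity key give the same ordering as a plain sort. The sketch below checks that with an ordinary loop instead of pytest parametrization; the sample Series and the name `sort_by_key_params` are assumptions for illustration, not code from the test suite.

```python
import pandas as pd

# Stand-in for the fixture's params: None (no key) and the identity key.
sort_by_key_params = [None, lambda x: x]

s = pd.Series([3, 1, 2], index=["c", "a", "b"])
expected_values = s.sort_values()
expected_index = s.sort_index()

for key in sort_by_key_params:
    # Neither key should change the result of either sorting method.
    pd.testing.assert_series_equal(s.sort_values(key=key), expected_values)
    pd.testing.assert_series_equal(s.sort_index(key=key), expected_index)
```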

pandas/core/arrays/categorical.py

+3-1
@@ -1495,7 +1495,9 @@ def argsort(self, ascending=True, kind="quicksort", **kwargs):
         """
         return super().argsort(ascending=ascending, kind=kind, **kwargs)
 
-    def sort_values(self, inplace=False, ascending=True, na_position="last"):
+    def sort_values(
+        self, inplace: bool = False, ascending: bool = True, na_position: str = "last",
+    ):
         """
         Sort the Categorical by category value returning a new
         Categorical by default.

pandas/core/frame.py

+42-5
@@ -47,9 +47,11 @@
     Axis,
     Dtype,
     FilePathOrBuffer,
+    IndexKeyFunc,
     Label,
     Level,
     Renamer,
+    ValueKeyFunc,
 )
 from pandas.compat import PY37
 from pandas.compat._optional import import_optional_dependency
@@ -139,6 +141,7 @@
 )
 from pandas.core.ops.missing import dispatch_fill_zeros
 from pandas.core.series import Series
+from pandas.core.sorting import ensure_key_mapped
 
 from pandas.io.common import get_filepath_or_buffer
 from pandas.io.formats import console, format as fmt
@@ -5054,10 +5057,10 @@ def f(vals):
 
     # ----------------------------------------------------------------------
     # Sorting
-
+    # TODO: Just move the sort_values doc here.
     @Substitution(**_shared_doc_kwargs)
     @Appender(NDFrame.sort_values.__doc__)
-    def sort_values(
+    def sort_values(  # type: ignore[override] # NOQA # issue 27237
         self,
         by,
         axis=0,
@@ -5066,6 +5069,7 @@ def sort_values(
         kind="quicksort",
         na_position="last",
         ignore_index=False,
+        key: ValueKeyFunc = None,
     ):
         inplace = validate_bool_kwarg(inplace, "inplace")
         axis = self._get_axis_number(axis)
@@ -5080,19 +5084,30 @@ def sort_values(
             from pandas.core.sorting import lexsort_indexer
 
             keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
-            indexer = lexsort_indexer(keys, orders=ascending, na_position=na_position)
+
+            # need to rewrap columns in Series to apply key function
+            if key is not None:
+                keys = [Series(k, name=name) for (k, name) in zip(keys, by)]
+
+            indexer = lexsort_indexer(
+                keys, orders=ascending, na_position=na_position, key=key
+            )
             indexer = ensure_platform_int(indexer)
         else:
             from pandas.core.sorting import nargsort
 
             by = by[0]
             k = self._get_label_or_level_values(by, axis=axis)
 
+            # need to rewrap column in Series to apply key function
+            if key is not None:
+                k = Series(k, name=by)
+
             if isinstance(ascending, (tuple, list)):
                 ascending = ascending[0]
 
             indexer = nargsort(
-                k, kind=kind, ascending=ascending, na_position=na_position
+                k, kind=kind, ascending=ascending, na_position=na_position, key=key
             )
 
         new_data = self._mgr.take(
@@ -5118,6 +5133,7 @@ def sort_index(
         na_position: str = "last",
         sort_remaining: bool = True,
         ignore_index: bool = False,
+        key: IndexKeyFunc = None,
     ):
         """
         Sort object by labels (along an axis).
@@ -5153,6 +5169,16 @@ def sort_index(
 
             .. versionadded:: 1.0.0
 
+        key : callable, optional
+            If not None, apply the key function to the index values
+            before sorting. This is similar to the `key` argument in the
+            builtin :meth:`sorted` function, with the notable difference that
+            this `key` function should be *vectorized*. It should expect an
+            ``Index`` and return an ``Index`` of the same shape. For MultiIndex
+            inputs, the key is applied *per level*.
+
+            .. versionadded:: 1.1.0
+
         Returns
         -------
         DataFrame
@@ -5186,6 +5212,17 @@ def sort_index(
         100  1
         29   2
         1    4
+
+        A key function can be specified which is applied to the index before
+        sorting. For a ``MultiIndex`` this is applied to each level separately.
+
+        >>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
+        >>> df.sort_index(key=lambda x: x.str.lower())
+           a
+        A  1
+        b  2
+        C  3
+        d  4
         """
         # TODO: this can be combined with Series.sort_index impl as
         # almost identical
@@ -5194,12 +5231,12 @@ def sort_index(
 
         axis = self._get_axis_number(axis)
         labels = self._get_axis(axis)
+        labels = ensure_key_mapped(labels, key, levels=level)
 
         # make sure that the axis is lexsorted to start
         # if not we need to reconstruct to get the correct indexer
         labels = labels._sort_levels_monotonic()
         if level is not None:
-
             new_axis, indexer = labels.sortlevel(
                 level, ascending=ascending, sort_remaining=sort_remaining
             )
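
The `sort_index` docstring contrasts the new `key` with the one accepted by the builtin `sorted`: the builtin calls its key once per element, while the pandas key receives the whole `Index` in a single call. A minimal sketch of that difference follows, using deliberately unsorted labels chosen for illustration (they are not from the commit).

```python
import pandas as pd

labels = ["b", "A", "d", "C"]

# Builtin sorted: the key is called once per element (scalar in, scalar out).
builtin_order = sorted(labels, key=str.lower)

# pandas: the key is called once with the whole Index (vectorized) and
# should return an Index or array of the same shape.
df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=labels)
pandas_sorted = df.sort_index(key=lambda idx: idx.str.lower())

print(builtin_order)
print(pandas_sorted)
```

Both approaches yield the same case-insensitive order here; the difference lies only in how the key function is invoked.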
