ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting #27237

Merged
merged 64 commits into from
Apr 27, 2020
Changes from 34 commits
Commits
eddd918
ENH: added df/series.sort_values(key=...) and df/series.sort_index(ke…
jacobaustin123 Jul 4, 2019
e05462a
fixed a few small bugs
jacobaustin123 Jan 28, 2020
0f33c5c
bug fixes
jacobaustin123 Jan 28, 2020
b7d76cd
fixed
jacobaustin123 Jan 28, 2020
cf1fb5a
Merge branch 'master' of http://github.com/pandas-dev/pandas
jacobaustin123 Jan 28, 2020
8343f76
fixed
jacobaustin123 Jan 28, 2020
94281d3
fixed
jacobaustin123 Jan 28, 2020
c505dd9
updated docstrings
jacobaustin123 Jan 28, 2020
ecb6910
fixed documentation
jacobaustin123 Jan 28, 2020
55c444e
fixed
jacobaustin123 Jan 29, 2020
9d6762b
merged with master
jacobaustin123 Feb 11, 2020
d774b15
updated docs
jacobaustin123 Feb 11, 2020
64e70b4
linting
jacobaustin123 Feb 11, 2020
03d6573
fixed tests
jacobaustin123 Feb 11, 2020
9f5209e
merged
jacobaustin123 Mar 22, 2020
81c0172
reformatted
jacobaustin123 Mar 22, 2020
6d0d725
fixed linting issue
jacobaustin123 Mar 22, 2020
ef72542
fixed conflicts
jacobaustin123 Mar 27, 2020
0aabf56
fixed formatting
jacobaustin123 Mar 27, 2020
210df50
ENH: made sort_index apply the key to each level separately
jacobaustin123 Mar 28, 2020
b40a963
fixed a bug with duplicate names
jacobaustin123 Mar 28, 2020
90e2cfe
fixed strange bug with duplicate column names
jacobaustin123 Mar 28, 2020
8e12404
Merge branch 'master' of http://github.com/pandas-dev/pandas
jacobaustin123 Mar 28, 2020
447c48f
fixed bug
jacobaustin123 Mar 28, 2020
46171f0
fixed linting
jacobaustin123 Mar 28, 2020
a44a999
fixed linting issues
jacobaustin123 Mar 28, 2020
94b795c
disabled tests temporarily
jacobaustin123 Mar 28, 2020
6e651c0
fixed linting
jacobaustin123 Mar 28, 2020
5a92484
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Mar 31, 2020
fbdfc1e
reverted changes due to 33134
jacobaustin123 Mar 31, 2020
c56dbd6
updated documentation
jacobaustin123 Apr 1, 2020
77f44bf
fixed merge conflict
jacobaustin123 Apr 7, 2020
2106d86
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 7, 2020
620f57a
updated docs
jacobaustin123 Apr 7, 2020
6a5bc32
fixed linting issue
jacobaustin123 Apr 7, 2020
5b244fb
try to recover from invalid type in output
jacobaustin123 Apr 7, 2020
6f15e66
fixed linting issue
jacobaustin123 Apr 7, 2020
7d2037b
added more tests
jacobaustin123 Apr 7, 2020
5048944
added some more tests
jacobaustin123 Apr 8, 2020
3b2d176
merged
jacobaustin123 Apr 10, 2020
0e239c8
fixed linting issue
jacobaustin123 Apr 10, 2020
bc44d0d
major documentation additions, removed key for Categorical
jacobaustin123 Apr 10, 2020
07d903c
doc linting issue
jacobaustin123 Apr 10, 2020
ecdbf4c
another linting fix
jacobaustin123 Apr 10, 2020
c376a74
fixed linting actually
jacobaustin123 Apr 10, 2020
f5e5808
moved apply_key to sorting.py
jacobaustin123 Apr 11, 2020
1058839
fixed tests
jacobaustin123 Apr 11, 2020
c87a527
satisfied mypy
jacobaustin123 Apr 11, 2020
e6026d6
fixed isort issues
jacobaustin123 Apr 11, 2020
ab0b887
fixed a doc issue
jacobaustin123 Apr 11, 2020
364cc5e
wow linting is hard
jacobaustin123 Apr 11, 2020
8db09d0
updated whatsnew
jacobaustin123 Apr 13, 2020
7477fd1
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 13, 2020
1d0319c
cleaned up sorting.py
jacobaustin123 Apr 13, 2020
1f60689
fixed indentation
jacobaustin123 Apr 13, 2020
2957e60
removed trailing whitespace
jacobaustin123 Apr 13, 2020
7c6c2f0
linting
jacobaustin123 Apr 13, 2020
ad745c4
fixed small bug with datetimelike, updated docs
jacobaustin123 Apr 13, 2020
3ad3358
fixed trailing whitespace
jacobaustin123 Apr 13, 2020
e87a9a9
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 13, 2020
a5d5c6d
reverted and updated documentation
jacobaustin123 Apr 27, 2020
56f73ba
merged and updated
jacobaustin123 Apr 27, 2020
4250e31
fixed linting issue and added comments
jacobaustin123 Apr 27, 2020
4d5ba53
fixed small issue in tests
jacobaustin123 Apr 27, 2020
12 changes: 12 additions & 0 deletions doc/source/user_guide/basics.rst
@@ -1813,6 +1813,18 @@ argument:
   s.sort_values()
   s.sort_values(na_position='first')

Sorting also supports a ``key`` parameter that takes a callable function
to apply to the values being sorted.

.. ipython:: python

   s1 = pd.Series(['B', 'a', 'C'])
   s1.sort_values()
   s1.sort_values(key=lambda x: x.str.lower())

``key`` is given the :class:`Series` of values and should return a ``Series``
or array of the same shape with the transformed values.
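As a quick check of the behaviour documented in this hunk, the sketch below compares the default ordering with a case-insensitive one. It assumes pandas 1.1.0 or later, where `key` is available. The names `default_order` and `keyed_order` are only for illustration.

```python
import pandas as pd

s1 = pd.Series(["B", "a", "C"])

# Default sort compares the raw strings, so uppercase letters come first.
default_order = s1.sort_values().tolist()

# The key receives the whole Series and returns a transformed Series.
# Only the ordering comes from the transformed values; the result still
# holds the original, untransformed strings.
keyed_order = s1.sort_values(key=lambda x: x.str.lower()).tolist()

print(default_order)  # ['B', 'C', 'a']
print(keyed_order)    # ['a', 'B', 'C']
```

The key only influences the order; it does not modify the data that is returned.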

.. _basics.sort_indexes_and_values:

By indexes and values
36 changes: 36 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -36,6 +36,42 @@ For example:
   ser["2014"]
   ser.loc["May 2015"]

.. _whatsnew_110.key_sorting:

Sorting with keys
^^^^^^^^^^^^^^^^^

We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
to each column of a DataFrame before sorting is performed (:issue:`27237`).

.. ipython:: python

   s = pd.Series(['C', 'a', 'B'])
   s.sort_values()


Note how this is sorted with capital letters first. If we apply the ``str.lower()`` method via ``key``, we get

.. ipython:: python

   s.sort_values(key=lambda x: x.str.lower())


When applied to a ``DataFrame``, the key is applied per-column to all columns or a subset if
``by`` is specified, e.g.

.. ipython:: python

   df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'], 'b': [1, 2, 3, 4, 5, 6]})
   df.sort_values(by=['a', 'b'], key=lambda col: col.str.lower() if col.name == 'a' else -col)


For :meth:`DataFrame.sort_index` with `MultiIndex`, the key function is applied per level. For
more details, see examples and documentation in :meth:`DataFrame.sort_values`, :meth:`Series.sort_values`,
and :meth:`~DataFrame.sort_index`.
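A runnable version of the DataFrame example from this whatsnew entry, as a sketch assuming pandas 1.1.0 or later. The helper name `mixed_key` is hypothetical. It lowercases column `a` and negates column `b`, so rows are ordered case-insensitively by `a` and then by descending `b`.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": ["C", "C", "a", "a", "B", "B"], "b": [1, 2, 3, 4, 5, 6]}
)


def mixed_key(col):
    # Hypothetical helper: the key is called once per column listed in
    # `by`, and each column arrives as a Series carrying its name.
    if col.name == "a":
        return col.str.lower()
    return -col


result = df.sort_values(by=["a", "b"], key=mixed_key)

print(result.index.tolist())  # [3, 2, 5, 4, 1, 0]
print(result["a"].tolist())   # ['a', 'a', 'B', 'B', 'C', 'C']
print(result["b"].tolist())   # [4, 3, 6, 5, 2, 1]
```

Branching on `col.name` is one way to give different columns different transformations with a single callable.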

.. _whatsnew_110.timestamp_fold_support:

Fold argument support in Timestamp constructor
5 changes: 5 additions & 0 deletions pandas/_typing.py
@@ -75,3 +75,8 @@

# to maintain type information across generic functions and parametrization
T = TypeVar("T")

# types of vectorized key functions for DataFrame::sort_values and
# DataFrame::sort_index, among others
ValueKeyFunc = Optional[Callable[["Series"], Union["Series", AnyArrayLike]]]
IndexKeyFunc = Optional[Callable[["Index"], Union["Index", AnyArrayLike]]]
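To illustrate the shape of callable these aliases describe, here is a sketch that defines a local alias rather than importing from the private `pandas._typing` module. `ValueKey` and `by_abs` are hypothetical names, and `np.ndarray` stands in for `AnyArrayLike`.

```python
from typing import Callable, Optional, Union

import numpy as np
import pandas as pd

# Local stand-in for the alias added in this hunk (an assumption, not the
# pandas definition): a function from Series to Series or array, or None.
ValueKey = Optional[Callable[[pd.Series], Union[pd.Series, np.ndarray]]]


def by_abs(values: pd.Series) -> pd.Series:
    # Vectorized: operates on the whole Series at once.
    return values.abs()


key: ValueKey = by_abs
s = pd.Series([-3, 1, -2])
result = s.sort_values(key=key).tolist()

print(result)  # [1, -2, -3]
```

Because the alias is `Optional`, passing `key=None` is also valid and leaves the sort unchanged.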
9 changes: 9 additions & 0 deletions pandas/conftest.py
@@ -1221,3 +1221,12 @@ def tick_classes(request):
    Fixture for Tick based datetime offsets available for a time series.
    """
    return request.param


@pytest.fixture(params=[None, lambda x: x])
def sort_by_key(request):
    """
    Simple fixture for testing keys in sorting methods.
    Tests None (no key) and the identity key.
    """
    return request.param
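The property this fixture is meant to exercise can be sketched without pytest: sorting with no key and sorting with the identity key should give the same result. The loop below stands in for the fixture's two parameters and assumes pandas 1.1.0 or later.

```python
import pandas as pd

s = pd.Series([3, 1, 2])
expected = s.sort_values()

# Mirror the fixture's params: None (no key) and the identity key.
results = []
for key in (None, lambda x: x):
    result = s.sort_values(key=key)
    # Raises AssertionError if values, index, or dtype differ.
    pd.testing.assert_series_equal(result, expected)
    results.append(result.tolist())

print(results)  # [[1, 2, 3], [1, 2, 3]]
```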
23 changes: 20 additions & 3 deletions pandas/core/arrays/categorical.py
@@ -1,6 +1,6 @@
import operator
from shutil import get_terminal_size
from typing import Dict, Hashable, List, Type, Union, cast
from typing import Callable, Dict, Hashable, List, Optional, Type, Union, cast
from warnings import warn

import numpy as np
@@ -1532,7 +1532,13 @@ def argsort(self, ascending=True, kind="quicksort", **kwargs):
        """
        return super().argsort(ascending=ascending, kind=kind, **kwargs)

    def sort_values(self, inplace=False, ascending=True, na_position="last"):
    def sort_values(
        self,
        inplace=False,
        ascending=True,
        na_position="last",
        key: Optional[Callable] = None,
    ):
        """
        Sort the Categorical by category value returning a new
        Categorical by default.
@@ -1554,6 +1560,15 @@ def sort_values(self, inplace=False, ascending=True, na_position="last"):
        na_position : {'first', 'last'} (optional, default='last')
            'first' puts NaNs at the beginning
            'last' puts NaNs at the end
        key : callable, optional
            Apply the key function to the values before sorting.
Contributor:

you might want to give a sample of the key function here

Contributor Author:

Thinking more about this, I'm inclined to remove the key function here. I can't think of a use case for Categorical because it supports basically no vectorized operations and cat.map doesn't work well with key sorting because it just transforms the codes.

            This is similar to the `key` argument in the builtin
            :meth:`sorted` function, with the notable difference that
            this `key` function should be *vectorized*. It should expect
            a ``Categorical`` and return an object with the same shape
            as the input.

            .. versionadded:: 1.1.0

        Returns
        -------
@@ -1610,7 +1625,9 @@ def sort_values(self, inplace=False, ascending=True, na_position="last"):
        if na_position not in ["last", "first"]:
            raise ValueError(f"invalid na_position: {repr(na_position)}")

        sorted_idx = nargsort(self, ascending=ascending, na_position=na_position)
        sorted_idx = nargsort(
            self, ascending=ascending, na_position=na_position, key=key
        )

        if inplace:
            self._codes = self._codes[sorted_idx]
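The docstring in this file contrasts the new `key` with the `key` of the builtin `sorted`. The sketch below shows that difference using a `Series` rather than a `Categorical`, since the review thread above leans toward dropping `key` for `Categorical`. `vectorized_lower` and `seen_types` are hypothetical names, and pandas 1.1.0 or later is assumed.

```python
import pandas as pd

words = ["B", "a", "C"]

# Builtin sorted: the key is called once per element with a single str.
builtin_result = sorted(words, key=str.lower)

seen_types = []


def vectorized_lower(values):
    # pandas hands the key the whole Series, not one element at a time,
    # so the key must use vectorized operations such as the .str accessor.
    seen_types.append(type(values).__name__)
    return values.str.lower()


pandas_result = pd.Series(words).sort_values(key=vectorized_lower).tolist()

print(builtin_result)  # ['a', 'B', 'C']
print(pandas_result)   # ['a', 'B', 'C']
```

Passing an elementwise function such as `str.lower` directly as `key` to the pandas methods would not work as intended, because it would be called on the `Series` object itself.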
51 changes: 45 additions & 6 deletions pandas/core/frame.py
@@ -40,7 +40,17 @@
from pandas._config import get_option

from pandas._libs import algos as libalgos, lib, properties
from pandas._typing import Axes, Axis, Dtype, FilePathOrBuffer, Label, Level, Renamer
from pandas._typing import (
    Axes,
    Axis,
    Dtype,
    FilePathOrBuffer,
    IndexKeyFunc,
    Label,
    Level,
    Renamer,
    ValueKeyFunc,
)
from pandas.compat import PY37
from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
@@ -129,6 +139,7 @@
)
from pandas.core.ops.missing import dispatch_fill_zeros
from pandas.core.series import Series
from pandas.core.sorting import ensure_key_mapped

from pandas.io.common import get_filepath_or_buffer
from pandas.io.formats import console, format as fmt
@@ -4746,10 +4757,10 @@ def f(vals):

    # ----------------------------------------------------------------------
    # Sorting

    # TODO: Just move the sort_values doc here.
    @Substitution(**_shared_doc_kwargs)
    @Appender(NDFrame.sort_values.__doc__)
    def sort_values(
    def sort_values(  # type: ignore[override] # NOQA # issue 27237
        self,
        by,
        axis=0,
@@ -4758,6 +4769,7 @@ def sort_values(
        kind="quicksort",
        na_position="last",
        ignore_index=False,
        key: ValueKeyFunc = None,
    ):
        inplace = validate_bool_kwarg(inplace, "inplace")
        axis = self._get_axis_number(axis)
@@ -4772,19 +4784,30 @@ def sort_values(
            from pandas.core.sorting import lexsort_indexer

            keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
            indexer = lexsort_indexer(keys, orders=ascending, na_position=na_position)

            # need to rewrap columns in Series to apply key function
            if key is not None:
                keys = [Series(k, name=name) for (k, name) in zip(keys, by)]

            indexer = lexsort_indexer(
                keys, orders=ascending, na_position=na_position, key=key
            )
            indexer = ensure_platform_int(indexer)
        else:
            from pandas.core.sorting import nargsort

            by = by[0]
            k = self._get_label_or_level_values(by, axis=axis)

            # need to rewrap column in Series to apply key function
            if key is not None:
                k = Series(k)

            if isinstance(ascending, (tuple, list)):
                ascending = ascending[0]

            indexer = nargsort(
                k, kind=kind, ascending=ascending, na_position=na_position
                k, kind=kind, ascending=ascending, na_position=na_position, key=key
            )

        new_data = self._mgr.take(
@@ -4810,6 +4833,7 @@ def sort_index(
        na_position: str = "last",
        sort_remaining: bool = True,
        ignore_index: bool = False,
        key: IndexKeyFunc = None,
    ):
        """
        Sort object by labels (along an axis).
@@ -4845,6 +4869,16 @@ def sort_index(

            .. versionadded:: 1.0.0

        key : callable, optional
            If not None, apply the key function to the index values
            before sorting. This is similar to the `key` argument in the
            builtin :meth:`sorted` function, with the notable difference that
            this `key` function should be *vectorized*. It should expect an
            ``Index`` and return an ``Index`` of the same shape. For MultiIndex
            inputs, the key is applied *per level*.

            .. versionadded:: 1.1.0

        Returns
        -------
        DataFrame
@@ -4887,11 +4921,16 @@ def sort_index(
        axis = self._get_axis_number(axis)
        labels = self._get_axis(axis)

        # apply key to each level separately and create a new index
        if isinstance(labels, ABCMultiIndex):
            labels = labels.apply_key(key, level=level)
        else:
            labels = ensure_key_mapped(labels, key)

        # make sure that the axis is lexsorted to start
        # if not we need to reconstruct to get the correct indexer
        labels = labels._sort_levels_monotonic()
        if level is not None:

            new_axis, indexer = labels.sortlevel(
                level, ascending=ascending, sort_remaining=sort_remaining
            )
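The `sort_index` docstring above says the key receives an `Index` and, for a `MultiIndex`, is applied per level. The sketch below exercises both cases through the public `DataFrame.sort_index` only; it does not call the internal helpers shown in this diff. It assumes pandas 1.1.0 or later, and `per_level_key` is a hypothetical helper.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Flat index: the key is called once with the whole Index.
df = pd.DataFrame({"v": [1, 2, 3]}, index=["b", "C", "a"])
plain = df.sort_index().index.tolist()
keyed = df.sort_index(key=lambda idx: idx.str.lower()).index.tolist()

print(plain)  # ['C', 'a', 'b']
print(keyed)  # ['a', 'b', 'C']

# MultiIndex: the key is called separately for each level, so it has to
# cope with every level's dtype.
mi = pd.MultiIndex.from_tuples([("b", 2), ("A", 1), ("a", 3)])
df_mi = pd.DataFrame({"v": [10, 20, 30]}, index=mi)


def per_level_key(level):
    # Leave numeric levels alone, lowercase string levels.
    return level if is_numeric_dtype(level) else level.str.lower()


mi_sorted = df_mi.sort_index(key=per_level_key).index.tolist()

print(mi_sorted)  # [('A', 1), ('a', 3), ('b', 2)]
```

Checking the dtype inside the key is one way to handle mixed levels; restricting the sort with `level=` is another option.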