Commit 7663a63

jonmmease authored and jreback committed
ENH: Support sorting frames by a combo of columns and index levels (GH 14353) (#17361)
1 parent d0477b2 commit 7663a63

11 files changed, +291 -74 lines changed

doc/source/basics.rst

+69 -27
@@ -226,11 +226,11 @@ We can also do elementwise :func:`divmod`:
 Missing data / operations with fill values
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In Series and DataFrame, the arithmetic functions have the option of inputting
-a *fill_value*, namely a value to substitute when at most one of the values at
-a location are missing. For example, when adding two DataFrame objects, you may
-wish to treat NaN as 0 unless both DataFrames are missing that value, in which
-case the result will be NaN (you can later replace NaN with some other value
+In Series and DataFrame, the arithmetic functions have the option of inputting
+a *fill_value*, namely a value to substitute when at most one of the values at
+a location are missing. For example, when adding two DataFrame objects, you may
+wish to treat NaN as 0 unless both DataFrames are missing that value, in which
+case the result will be NaN (you can later replace NaN with some other value
 using ``fillna`` if you wish).
 
 .. ipython:: python
@@ -260,8 +260,8 @@ arithmetic operations described above:
    df.gt(df2)
    df2.ne(df)
 
-These operations produce a pandas object of the same type as the left-hand-side
-input that is of dtype ``bool``. These ``boolean`` objects can be used in
+These operations produce a pandas object of the same type as the left-hand-side
+input that is of dtype ``bool``. These ``boolean`` objects can be used in
 indexing operations, see the section on :ref:`Boolean indexing<indexing.boolean>`.
 
 .. _basics.reductions:
@@ -452,7 +452,7 @@ So, for instance, to reproduce :meth:`~DataFrame.combine_first` as above:
 Descriptive statistics
 ----------------------
 
-There exists a large number of methods for computing descriptive statistics and
+There exists a large number of methods for computing descriptive statistics and
 other related operations on :ref:`Series <api.series.stats>`, :ref:`DataFrame
 <api.dataframe.stats>`, and :ref:`Panel <api.panel.stats>`. Most of these
 are aggregations (hence producing a lower-dimensional result) like
@@ -540,7 +540,7 @@ will exclude NAs on Series input by default:
    np.mean(df['one'])
    np.mean(df['one'].values)
 
-:meth:`Series.nunique` will return the number of unique non-NA values in a
+:meth:`Series.nunique` will return the number of unique non-NA values in a
 Series:
 
 .. ipython:: python
@@ -852,7 +852,7 @@ Aggregation API
 The aggregation API allows one to express possibly multiple aggregation operations in a single concise way.
 This API is similar across pandas objects, see :ref:`groupby API <groupby.aggregate>`, the
 :ref:`window functions API <stats.aggregate>`, and the :ref:`resample API <timeseries.aggregate>`.
-The entry point for aggregation is :meth:`DataFrame.aggregate`, or the alias
+The entry point for aggregation is :meth:`DataFrame.aggregate`, or the alias
 :meth:`DataFrame.agg`.
 
 We will use a similar starting frame from above:
@@ -864,8 +864,8 @@ We will use a similar starting frame from above:
    tsdf.iloc[3:7] = np.nan
    tsdf
 
-Using a single function is equivalent to :meth:`~DataFrame.apply`. You can also
-pass named methods as strings. These will return a ``Series`` of the aggregated
+Using a single function is equivalent to :meth:`~DataFrame.apply`. You can also
+pass named methods as strings. These will return a ``Series`` of the aggregated
 output:
 
 .. ipython:: python
@@ -887,7 +887,7 @@ Single aggregations on a ``Series`` this will return a scalar value:
 Aggregating with multiple functions
 +++++++++++++++++++++++++++++++++++
 
-You can pass multiple aggregation arguments as a list.
+You can pass multiple aggregation arguments as a list.
 The results of each of the passed functions will be a row in the resulting ``DataFrame``.
 These are naturally named from the aggregation function.
 
@@ -1430,7 +1430,7 @@ Series can also be used:
    df.rename(columns={'one': 'foo', 'two': 'bar'},
              index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
 
-If the mapping doesn't include a column/index label, it isn't renamed. Note that
+If the mapping doesn't include a column/index label, it isn't renamed. Note that
 extra labels in the mapping don't throw an error.
 
 .. versionadded:: 0.21.0
@@ -1740,19 +1740,26 @@ description.
 Sorting
 -------
 
-There are two obvious kinds of sorting that you may be interested in: sorting
-by label and sorting by actual values.
+Pandas supports three kinds of sorting: sorting by index labels,
+sorting by column values, and sorting by a combination of both.
+
+.. _basics.sort_index:
 
 By Index
 ~~~~~~~~
 
-The primary method for sorting axis
-labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index()`` methods.
+The :meth:`Series.sort_index` and :meth:`DataFrame.sort_index` methods are
+used to sort a pandas object by its index levels.
 
 .. ipython:: python
 
+   df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
+                      'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
+                      'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
+
    unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                             columns=['three', 'two', 'one'])
+   unsorted_df
 
    # DataFrame
    unsorted_df.sort_index()
@@ -1762,20 +1769,22 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index(
    # Series
    unsorted_df['three'].sort_index()
 
+.. _basics.sort_values:
+
 By Values
 ~~~~~~~~~
 
-The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` are the entry points for **value** sorting (i.e. the values in a column or row).
-:meth:`DataFrame.sort_values` can accept an optional ``by`` argument for ``axis=0``
-which will use an arbitrary vector or a column name of the DataFrame to
-determine the sort order:
+The :meth:`Series.sort_values` method is used to sort a `Series` by its values. The
+:meth:`DataFrame.sort_values` method is used to sort a `DataFrame` by its column or row values.
+The optional ``by`` parameter to :meth:`DataFrame.sort_values` may be used to specify one or more columns
+to use to determine the sorted order.
 
 .. ipython:: python
 
    df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
    df1.sort_values(by='two')
 
-The ``by`` argument can take a list of column names, e.g.:
+The ``by`` parameter can take a list of column names, e.g.:
 
 .. ipython:: python
 
@@ -1790,6 +1799,39 @@ argument:
    s.sort_values()
    s.sort_values(na_position='first')
 
+.. _basics.sort_indexes_and_values:
+
+By Indexes and Values
+~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.23.0
+
+Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may
+refer to either columns or index level names.
+
+.. ipython:: python
+
+   # Build MultiIndex
+   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
+                                    ('b', 2), ('b', 1), ('b', 1)])
+   idx.names = ['first', 'second']
+
+   # Build DataFrame
+   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
+                           index=idx)
+   df_multi
+
+Sort by 'second' (index) and 'A' (column)
+
+.. ipython:: python
+
+   df_multi.sort_values(by=['second', 'A'])
+
+.. note::
+
+   If a string matches both a column name and an index level name then a
+   warning is issued and the column takes precedence. This will result in an
+   ambiguity error in a future version.
 
 .. _basics.searchsorted:
 
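The note in the hunk above says that a name shared by a column and an index level is ambiguous. One way to avoid relying on the precedence rule is to rename the index level before sorting. The following is a minimal sketch under that assumption; the frame and the name ``A_level`` are made up for illustration and do not come from the commit.

```python
import pandas as pd

# Contrived frame: the index level and the only column are both named 'A'.
df = pd.DataFrame({'A': [3, 1, 2]},
                  index=pd.Index(['x', 'y', 'z'], name='A'))

# Renaming the index level removes the clash, so by='A' can only mean the column.
unambiguous = df.rename_axis('A_level')
by_column = unambiguous.sort_values(by='A')

# The renamed level is still usable as a sort key in its own right.
by_level = unambiguous.sort_values(by='A_level', ascending=False)

print(by_column)
print(by_level)
```

Renaming first keeps the call independent of whether a given pandas version warns or raises on the ambiguous name.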

@@ -1881,7 +1923,7 @@ The main types stored in pandas objects are ``float``, ``int``, ``bool``,
 ``int64`` and ``int32``. See :ref:`Series with TZ <timeseries.timezone_series>`
 for more detail on ``datetime64[ns, tz]`` dtypes.
 
-A convenient :attr:`~DataFrame.dtypes` attribute for DataFrame returns a Series
+A convenient :attr:`~DataFrame.dtypes` attribute for DataFrame returns a Series
 with the data type of each column.
 
 .. ipython:: python
@@ -1902,8 +1944,8 @@ On a ``Series`` object, use the :attr:`~Series.dtype` attribute.
 
    dft['A'].dtype
 
-If a pandas object contains data with multiple dtypes *in a single column*, the
-dtype of the column will be chosen to accommodate all of the data types
+If a pandas object contains data with multiple dtypes *in a single column*, the
+dtype of the column will be chosen to accommodate all of the data types
 (``object`` is the most general).
 
 .. ipython:: python
@@ -1941,7 +1983,7 @@ defaults
 ~~~~~~~~
 
 By default integer types are ``int64`` and float types are ``float64``,
-*regardless* of platform (32-bit or 64-bit).
+*regardless* of platform (32-bit or 64-bit).
 The following will all result in ``int64`` dtypes.
 
 .. ipython:: python

doc/source/whatsnew/v0.23.0.txt

+26
@@ -62,6 +62,32 @@ levels <merging.merge_on_columns_and_levels>` documentation section.
 
    left.merge(right, on=['key1', 'key2'])
 
+.. _whatsnew_0230.enhancements.sort_by_columns_and_levels:
+
+Sorting by a combination of columns and index levels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Strings passed to :meth:`DataFrame.sort_values` as the ``by`` parameter may
+now refer to either column names or index level names. This enables sorting
+``DataFrame`` instances by a combination of index levels and columns without
+resetting indexes. See the :ref:`Sorting by Indexes and Values
+<basics.sort_indexes_and_values>` documentation section.
+(:issue:`14353`)
+
+.. ipython:: python
+
+   # Build MultiIndex
+   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
+                                    ('b', 2), ('b', 1), ('b', 1)])
+   idx.names = ['first', 'second']
+
+   # Build DataFrame
+   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
+                           index=idx)
+   df_multi
+
+   # Sort by 'second' (index) and 'A' (column)
+   df_multi.sort_values(by=['second', 'A'])
 
 .. _whatsnew_0230.enhancements.ran_inf:
 
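To make "without resetting indexes" concrete, the sketch below compares the direct call from the whatsnew entry with the reset, sort, and restore round trip it replaces. It reuses the ``df_multi`` frame from the entry. The round trip is an assumption about how one might have worked around the limitation earlier, not something the commit itself shows.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                 ('b', 2), ('b', 1), ('b', 1)],
                                names=['first', 'second'])
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)}, index=idx)

# Direct form: mix an index level name and a column name in ``by``.
direct = df_multi.sort_values(by=['second', 'A'])

# Round trip: move the levels into columns, sort, then restore the index.
round_trip = (df_multi.reset_index()
                      .sort_values(by=['second', 'A'])
                      .set_index(['first', 'second']))

print(direct)
```

Both paths order the rows identically here; the direct form skips the two index manipulations.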

pandas/core/frame.py

+14 -16
@@ -113,7 +113,15 @@
     axes_single_arg="{0 or 'index', 1 or 'columns'}",
     optional_by="""
         by : str or list of str
-            Name or list of names which refer to the axis items.""",
+            Name or list of names to sort by.
+
+            - if `axis` is 0 or `'index'` then `by` may contain index
+              levels and/or column labels
+            - if `axis` is 1 or `'columns'` then `by` may contain column
+              levels and/or index labels
+
+            .. versionmodified:: 0.23.0
+               Allow specifying index or column level names.""",
     versionadded_to_excel='',
     optional_labels="""labels : array-like, optional
         New labels / index to conform the axis specified by 'axis' to.""",
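
The docstring above covers both axes. The ``axis=1`` case is less familiar, so here is a small sketch of it: ``by`` names an index label and the columns are reordered by the values in that row. The frame and its labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 9], 'y': [1, 8], 'z': [2, 7]},
                  index=['r0', 'r1'])

# With axis=1, ``by`` refers to an index label; the columns are
# reordered according to the values found in row 'r0'.
reordered = df.sort_values(by='r0', axis=1)

print(reordered)
```

Row ``r0`` holds 3, 1, 2 for ``x``, ``y``, ``z``, so the columns come back in the order ``y``, ``z``, ``x``.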
@@ -3623,7 +3631,7 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
                     kind='quicksort', na_position='last'):
         inplace = validate_bool_kwarg(inplace, 'inplace')
         axis = self._get_axis_number(axis)
-        other_axis = 0 if axis == 1 else 1
+        stacklevel = 2  # Number of stack levels from df.sort_values
 
         if not isinstance(by, list):
             by = [by]
@@ -3635,10 +3643,8 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
 
             keys = []
             for x in by:
-                k = self.xs(x, axis=other_axis).values
-                if k.ndim == 2:
-                    raise ValueError('Cannot sort by duplicate column %s' %
-                                     str(x))
+                k = self._get_label_or_level_values(x, axis=axis,
+                                                    stacklevel=stacklevel)
                 keys.append(k)
             indexer = lexsort_indexer(keys, orders=ascending,
                                       na_position=na_position)
@@ -3647,17 +3653,9 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
             from pandas.core.sorting import nargsort
 
             by = by[0]
-            k = self.xs(by, axis=other_axis).values
-            if k.ndim == 2:
-
-                # try to be helpful
-                if isinstance(self.columns, MultiIndex):
-                    raise ValueError('Cannot sort by column %s in a '
-                                     'multi-index you need to explicitly '
-                                     'provide all the levels' % str(by))
+            k = self._get_label_or_level_values(by, axis=axis,
+                                                stacklevel=stacklevel)
 
-                raise ValueError('Cannot sort by duplicate column %s' %
-                                 str(by))
             if isinstance(ascending, (tuple, list)):
                 ascending = ascending[0]
 
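
The change above routes every sort key through the private ``_get_label_or_level_values`` helper, whose body is not part of this diff. As a rough illustration of the idea only, the sketch below resolves each key to either a column or an index level and then combines the keys with ``np.lexsort``. The names ``resolve_sort_key`` and ``sort_rows`` are hypothetical, and the sketch raises on an ambiguous name instead of warning. It ignores ``ascending``, ``na_position``, and ``axis=1``, all of which the real method handles.

```python
import numpy as np
import pandas as pd


def resolve_sort_key(frame, key):
    """Hypothetical stand-in: return the values ``key`` refers to for a row-wise sort."""
    in_columns = key in frame.columns
    in_levels = key in frame.index.names
    if in_columns and in_levels:
        raise ValueError(f"{key!r} is both a column label and an index level")
    if in_columns:
        return frame[key].to_numpy()
    if in_levels:
        return frame.index.get_level_values(key).to_numpy()
    raise KeyError(key)


def sort_rows(frame, by):
    """Sort rows by several keys, each a column label or an index level name."""
    keys = [resolve_sort_key(frame, k) for k in by]
    # np.lexsort treats the LAST key as the primary one, so reverse the list.
    order = np.lexsort(keys[::-1])
    return frame.iloc[order]


idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                 ('b', 2), ('b', 1), ('b', 1)],
                                names=['first', 'second'])
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)}, index=idx)

sketch_result = sort_rows(df_multi, ['second', 'A'])
pandas_result = df_multi.sort_values(by=['second', 'A'])
print(sketch_result)
```

On this numeric example the sketch and ``DataFrame.sort_values`` agree; the comparison is only meant to show how a level name and a column name can feed the same multi-key sort.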
