ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361
Changes from 1 commit
6e05de5
10b4e24
5712269
89a7f5f
42d5ec3
7c7edfe
a6dfd0a
acb13a4
14baf33
4a05ffa
ceacad4
85f0363
85cafb6
de0f336
bbbda0f
bbda441
3ba4ef6
71748b6
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1726,8 +1726,9 @@ Sorting | |
The sorting API is substantially changed in 0.17.0, see :ref:`here <whatsnew_0170.api_breaking.sorting>` for these changes. | ||
In particular, all sorting methods now return a new object by default, and **DO NOT** operate in-place (except by passing ``inplace=True``). | ||
|
||
There are two obvious kinds of sorting that you may be interested in: sorting | ||
by label and sorting by actual values. | ||
There are three obvious kinds of sorting that you may be interested in: sorting | ||
by labels (indexes), sorting by values (columns), and sorting by a | ||
combination of both. | ||
|
||
By Index | ||
~~~~~~~~ | ||
|
@@ -1737,8 +1738,13 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index( | |
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']), | ||
'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']), | ||
'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])}) | ||
|
||
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'], | ||
columns=['three', 'two', 'one']) | ||
unsorted_df | ||
|
||
# DataFrame | ||
unsorted_df.sort_index() | ||
|
@@ -1751,7 +1757,8 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index( | |
By Values | ||
~~~~~~~~~ | ||
|
||
The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` are the entry points for **value** sorting (that is the values in a column or row). | ||
The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are | ||
the entry points for **value** sorting (that is the values in a column or row). | ||
:meth:`DataFrame.sort_values` can accept an optional ``by`` argument for ``axis=0`` | ||
which will use an arbitrary vector or a column name of the DataFrame to | ||
determine the sort order: | ||
|
@@ -1776,6 +1783,34 @@ argument: | |
s.sort_values() | ||
s.sort_values(na_position='first') | ||
|
||
By Indexes and Values | ||
~~~~~~~~~~~~~~~~~~~~~ | ||
.. versionadded:: 0.21 | ||
Not sure you need this version-added tag here since you add it below.
I left this versionadded tag and removed the one below in the note. This will help people quickly realize that the feature is new and it should be clear that the note below wouldn't apply without the feature. |
||
Strings passed as the ``by`` argument to :meth:`DataFrame.sort_values` may | ||
refer to either columns or index levels. | ||
Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to? Side-question, does this work on a regular (non-multiindex) index?
I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.
Changed to "index level names". And yes, this does work for both the |
||
|
||
.. ipython:: python | ||
|
||
# Build MultiIndex | ||
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2), | ||
('b', 2), ('b', 1), ('b', 1)]) | ||
idx.names = ['first', 'second'] | ||
|
||
# Build DataFrame | ||
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)}, | ||
index=idx) | ||
df_multi | ||
|
||
# Sort by 'second' (index) and 'A' (column) | ||
I would put this as normal text between two code blocks instead of as a comment (in general that makes it clearer IMO if you want a comment to stand out).
Sounds good |
||
df_multi.sort_values(by=['second', 'A']) | ||
|
||
.. note:: | ||
|
||
.. versionadded:: 0.21 | ||
|
||
If a string matches both a column name and an index level name then a | ||
warning is issued and the column takes precedence. This will result in an | ||
ambiguity error in a future version. | ||
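The mixed sort described in this section can also be checked against the route the new tests use as their reference: move the index levels into columns, sort, and restore the index. The following is a minimal sketch, assuming a pandas release that includes this feature; the `ascending` list is added here for illustration and does not appear in the documentation example.

```python
import numpy as np
import pandas as pd

# Same frame as the documentation example: two named index levels, one column.
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('a', 2), ('b', 2), ('b', 1), ('b', 1)],
    names=['first', 'second'])
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)}, index=idx)

# Sort by the 'second' index level (descending), then the 'A' column (ascending).
result = df_multi.sort_values(by=['second', 'A'], ascending=[False, True])

# Reference route: turn the levels into columns, sort, then restore the index.
expected = (df_multi.reset_index()
            .sort_values(by=['second', 'A'], ascending=[False, True])
            .set_index(['first', 'second']))

print(result)
```

Both routes should give the same frame; the first avoids the round trip through `reset_index` and `set_index`.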
|
||
.. _basics.searchsorted: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3436,6 +3436,36 @@ def f(vals): | |
|
||
# ---------------------------------------------------------------------- | ||
# Sorting | ||
def _get_column_or_level_values(self, key, axis=1, | ||
op_description='retrieve'): | ||
if (is_integer(key) or | ||
(axis == 1 and key in self) or | ||
(axis == 0 and key in self.index)): | ||
|
||
if axis == 1 and key in self.index.names: | ||
warnings.warn( | ||
("'%s' is both a column name and an index level.\n" | ||
"Defaulting to column but " | ||
"this will raise an ambiguity error in a " | ||
"future version") % key, | ||
FutureWarning, stacklevel=2) | ||
|
||
k = self.xs(key, axis=axis)._values | ||
if k.ndim == 2: | ||
|
||
# try to be helpful | ||
if isinstance(self.columns, MultiIndex): | ||
raise ValueError('Cannot %s column "%s" in a multi-index. ' | ||
'All levels must be provided explicitly' | ||
% (op_description, str(key))) | ||
|
||
raise ValueError('Cannot %s duplicate column "%s"' % | ||
(op_description, str(key))) | ||
elif key in self.index.names: | ||
k = self.index.get_level_values(key).values | ||
else: | ||
raise KeyError(key) | ||
return k | ||
|
||
@Appender(_shared_docs['sort_values'] % _shared_doc_kwargs) | ||
def sort_values(self, by, axis=0, ascending=True, inplace=False, | ||
|
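The lookup order in `_get_column_or_level_values` (column first, then index level, with a `FutureWarning` when a name matches both) can be sketched without pandas. This is an illustration only: `resolve_sort_key` and the plain dicts standing in for columns and index levels are hypothetical and are not part of the PR.

```python
import warnings


def resolve_sort_key(key, columns, index_levels):
    """Hypothetical stand-in for the column-or-level lookup in the diff above.

    `columns` and `index_levels` are plain dicts mapping names to values.
    """
    if key in columns:
        if key in index_levels:
            # Ambiguous name: warn, then fall back to the column.
            warnings.warn(
                "'%s' is both a column name and an index level. "
                "Defaulting to column." % key,
                FutureWarning, stacklevel=2)
        return columns[key]
    if key in index_levels:
        return index_levels[key]
    raise KeyError(key)


columns = {'A': [6, 5, 4], 'idx': [3, 2, 1]}
index_levels = {'idx': [1, 2, 3], 'outer': ['a', 'a', 'b']}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    ambiguous = resolve_sort_key('idx', columns, index_levels)     # column wins
    level_only = resolve_sort_key('outer', columns, index_levels)  # index level
```

Only the ambiguous lookup emits a warning; a name found in neither mapping raises `KeyError`, mirroring the final `else` branch of the helper.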
@@ -3459,10 +3489,8 @@ def trans(v): | |
|
||
keys = [] | ||
for x in by: | ||
k = self.xs(x, axis=other_axis).values | ||
if k.ndim == 2: | ||
raise ValueError('Cannot sort by duplicate column %s' % | ||
str(x)) | ||
k = self._get_column_or_level_values(x, axis=other_axis, | ||
op_description="sort by") | ||
keys.append(trans(k)) | ||
indexer = lexsort_indexer(keys, orders=ascending, | ||
na_position=na_position) | ||
|
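The multi-key branch collects one array per entry in `by` and hands them to `lexsort_indexer` together with the `ascending` flags. The idea of a lexicographic sort with a direction per key can be sketched in pure Python as below. This is not pandas' implementation: `lexsort_positions` is a hypothetical helper that relies on the stability of Python's built-in sort.

```python
def lexsort_positions(keys, ascending):
    """Return row positions ordered by several keys, first key most significant.

    Hypothetical sketch: stable sorts are applied from the least to the most
    significant key, so earlier keys take priority and ties keep their order.
    """
    if len(keys) != len(ascending):
        raise ValueError('Length of ascending (%d) != length of by (%d)'
                         % (len(ascending), len(keys)))
    positions = list(range(len(keys[0])))
    for key, asc in reversed(list(zip(keys, ascending))):
        positions.sort(key=lambda i: key[i], reverse=not asc)
    return positions


# Values taken from the test frame: 'inner' and 'outer' index levels.
inner = [1, 2, 2, 2, 1, 1]
outer = ['a', 'a', 'a', 'b', 'b', 'b']

# 'inner' descending, then 'outer' ascending.
order = lexsort_positions([inner, outer], ascending=[False, True])
print(order)
```

Because each key is just a sequence of values, it makes no difference to the sort whether a key came from a column or from an index level, which is why the helper above can be dropped into the existing loop.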
@@ -3471,17 +3499,9 @@ def trans(v): | |
from pandas.core.sorting import nargsort | ||
|
||
by = by[0] | ||
k = self.xs(by, axis=other_axis).values | ||
if k.ndim == 2: | ||
|
||
# try to be helpful | ||
if isinstance(self.columns, MultiIndex): | ||
raise ValueError('Cannot sort by column %s in a ' | ||
'multi-index you need to explicitly ' | ||
'provide all the levels' % str(by)) | ||
See comment below as well, so it would be nice to somehow keep this message |
||
k = self._get_column_or_level_values(by, axis=other_axis, | ||
op_description="sort by") | ||
|
||
raise ValueError('Cannot sort by duplicate column %s' % | ||
str(by)) | ||
if isinstance(ascending, (tuple, list)): | ||
ascending = ascending[0] | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,7 +67,8 @@ | |
args_transpose='axes to permute (int or label for object)', | ||
optional_by=""" | ||
by : str or list of str | ||
Name or list of names which refer to the axis items.""") | ||
Name or list of names which refer to the axis items or index | ||
levels.""") | ||
Not blocking: My OCD really wants you to not "orphan" the "levels" word at the end of this sentence 😄 . However, that's largely a me problem, but if you do happen to come up with a wording that doesn't do this orphaning, by all means go for it.
Heh, now it bothers me too! I reworked it to fit on one line. "Name or list of names matching axis items or index levels." |
||
|
||
|
||
def _single_replace(self, to_replace, method, inplace, limit): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,7 @@ | |
import pandas as pd | ||
from pandas.compat import lrange | ||
from pandas import (DataFrame, Series, MultiIndex, Timestamp, | ||
date_range, NaT, IntervalIndex) | ||
Index, date_range, NaT, IntervalIndex) | ||
|
||
from pandas.util.testing import assert_series_equal, assert_frame_equal | ||
|
||
|
@@ -85,6 +85,13 @@ def test_sort_values(self): | |
expected = frame.reindex(columns=['C', 'B', 'A']) | ||
assert_frame_equal(sorted_df, expected) | ||
|
||
# by row (axis=1) with string index | ||
frame = DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]}, | ||
index=['row1', 'row2']) | ||
sorted_df = frame.sort_values(by='row2', axis=1) | ||
expected = frame.reindex(columns=['B', 'A', 'C']) | ||
assert_frame_equal(sorted_df, expected) | ||
|
||
msg = r'Length of ascending \(5\) != length of by \(2\)' | ||
with tm.assert_raises_regex(ValueError, msg): | ||
frame.sort_values(by=['A', 'B'], axis=0, ascending=[True] * 5) | ||
|
@@ -552,3 +559,185 @@ def test_sort_index_intervalindex(self): | |
closed='right') | ||
result = result.columns.levels[1].categories | ||
tm.assert_index_equal(result, expected) | ||
|
||
def test_sort_index_and_column(self): | ||
# Build MultiIndex | ||
add the issue number as a comment |
||
idx = MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2), | ||
('b', 2), ('b', 1), ('b', 1)]) | ||
idx.names = ['outer', 'inner'] | ||
|
||
# Build DataFrames | ||
df_multi = DataFrame({'A': np.arange(6, 0, -1), | ||
'B': ['one', 'one', 'two', | ||
'two', 'one', 'one']}, | ||
index=idx) | ||
df_single = df_multi.reset_index('outer') | ||
df_none = df_multi.reset_index() | ||
|
||
# Sort by single index | ||
# - On single index frame | ||
expected = df_none.sort_values('inner').set_index('inner') | ||
result = df_single.sort_values('inner') | ||
assert_frame_equal(result, expected) | ||
# - Descending | ||
blank line |
||
expected = df_none.sort_values('inner', | ||
ascending=False).set_index('inner') | ||
result = df_single.sort_values('inner', ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - On multi index frame | ||
expected = df_none.sort_values('inner').set_index(['outer', 'inner']) | ||
|
||
result = df_multi.sort_values('inner') | ||
assert_frame_equal(result, expected) | ||
# - Descending | ||
same |
||
expected = df_none.sort_values('inner', | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values('inner', ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# Sort by multiple indexes | ||
# - Ascending | ||
expected = df_none.sort_values(['inner', 'outer'] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer']) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Descending | ||
expected = df_none.sort_values(['inner', 'outer'], | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer'], | ||
ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Mixed | ||
expected = df_none.sort_values(['inner', 'outer'], | ||
ascending=[False, True] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer'], | ||
ascending=[False, True]) | ||
assert_frame_equal(result, expected) | ||
|
||
# Sort by single index and single column | ||
# - Ascending | ||
expected = df_none.sort_values(['outer', 'B'] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['outer', 'B']) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Descending | ||
expected = df_none.sort_values(['outer', 'B'], | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['outer', 'B'], ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Mixed | ||
expected = df_none.sort_values(['outer', 'B'], | ||
ascending=[False, True] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['outer', 'B'], | ||
ascending=[False, True]) | ||
assert_frame_equal(result, expected) | ||
|
||
# Sort by single column and single index | ||
# - Ascending | ||
expected = df_none.sort_values(['B', 'outer'] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['B', 'outer']) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Descending | ||
expected = df_none.sort_values(['B', 'outer'], | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['B', 'outer'], ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Mixed | ||
expected = df_none.sort_values(['B', 'outer'], | ||
ascending=[False, True] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['B', 'outer'], | ||
ascending=[False, True]) | ||
assert_frame_equal(result, expected) | ||
|
||
# Sort by multiple indexes and a single column | ||
# - Ascending | ||
expected = df_none.sort_values(['inner', 'outer', 'A'] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'A']) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Descending | ||
expected = df_none.sort_values(['inner', 'outer', 'A'], | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'A'], | ||
ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Mixed | ||
expected = df_none.sort_values(['inner', 'outer', 'A'], | ||
ascending=[True, True, False] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'A'], | ||
ascending=[True, True, False]) | ||
assert_frame_equal(result, expected) | ||
|
||
# Sort by multiple indexes and multiple columns | ||
# - Ascending | ||
expected = df_none.sort_values(['inner', 'outer', 'B', 'A'] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'B', 'A']) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Descending | ||
expected = df_none.sort_values(['inner', 'outer', 'B', 'A'], | ||
ascending=False | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'B', 'A'], | ||
ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# - Mixed | ||
expected = df_none.sort_values(['inner', 'outer', 'B', 'A'], | ||
ascending=[False, True, True, False] | ||
).set_index(['outer', 'inner']) | ||
result = df_multi.sort_values(['inner', 'outer', 'B', 'A'], | ||
ascending=[False, True, True, False]) | ||
assert_frame_equal(result, expected) | ||
|
||
def test_sort_values_column_index_level_precedence(self): | ||
# GH 14355, when a string passed as the `by` parameter | ||
# matches a column and an index level the column takes | ||
# precedence | ||
|
||
# Construct DataFrame with index and column named 'idx' | ||
idx = Index(np.arange(1, 7), name='idx') | ||
df = DataFrame({'A': np.arange(11, 17), | ||
'idx': np.arange(6, 0, -1)}, | ||
index=idx) | ||
|
||
# Sorting by 'idx' should sort by the idx column and raise a | ||
# FutureWarning | ||
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): | ||
result = df.sort_values(by='idx') | ||
|
||
# This should be equivalent to sorting by the 'idx' index level in | ||
# descending order | ||
expected = df.sort_index(level='idx', ascending=False) | ||
assert_frame_equal(result, expected) | ||
|
||
# Perform same test with MultiIndex | ||
df_multi = df.set_index('A', append=True) | ||
|
||
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): | ||
result = df_multi.sort_values(by='idx') | ||
|
||
expected = df_multi.sort_index(level='idx', ascending=False) | ||
assert_frame_equal(result, expected) |
I know this word was in the original doc, but I'm a little uneasy about judgmental words like "obvious." I think we can just remove it without impacting much.
Agreed, I've reworded it.