ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Merged
merged 18 commits into from
Jan 5, 2018
41 changes: 38 additions & 3 deletions doc/source/basics.rst
@@ -1726,8 +1726,9 @@ Sorting
The sorting API is substantially changed in 0.17.0, see :ref:`here <whatsnew_0170.api_breaking.sorting>` for these changes.
In particular, all sorting methods now return a new object by default, and **DO NOT** operate in-place (except by passing ``inplace=True``).

-There are two obvious kinds of sorting that you may be interested in: sorting
-by label and sorting by actual values.
+There are three obvious kinds of sorting that you may be interested in: sorting
Member

@gfyoung Aug 28, 2017

I know this word was in the original doc, but I'm a little uneasy about judgmental words like "obvious." I think we can just remove it without impacting much.

Contributor Author

Agreed, I've reworded it.

+by labels (indexes), sorting by values (columns), and sorting by a
+combination of both.

By Index
~~~~~~~~
@@ -1737,8 +1738,13 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index(

.. ipython:: python

   df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                      'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                      'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

   unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                            columns=['three', 'two', 'one'])
   unsorted_df

   # DataFrame
   unsorted_df.sort_index()
@@ -1751,7 +1757,8 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index(
By Values
~~~~~~~~~

-The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` are the entry points for **value** sorting (that is the values in a column or row).
+The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are
+the entry points for **value** sorting (that is the values in a column or row).
:meth:`DataFrame.sort_values` can accept an optional ``by`` argument for ``axis=0``
which will use an arbitrary vector or a column name of the DataFrame to
determine the sort order:
@@ -1776,6 +1783,34 @@ argument:
s.sort_values()
s.sort_values(na_position='first')

By Indexes and Values
~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 0.21
Member

Not sure you need this version-added tag here since you add it below.

Contributor Author

I left this versionadded tag and removed the one below in the note. This will help people quickly realize that the feature is new and it should be clear that the note below wouldn't apply without the feature.

Strings passed as the ``by`` argument to :meth:`DataFrame.sort_values` may
refer to either columns or index levels.
Contributor

Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to?

Side-question, does this work on a regular (non-multiindex) index?

Contributor

I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.

Contributor Author

Changed to "index level names".

And yes, this does work for both the Index and MultiIndex cases.
@TomAugspurger
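
To illustrate the reply above, here is a minimal sketch, assuming a pandas release that includes this change. It sorts one frame by the name of a regular `Index` and another by a `MultiIndex` level name combined with a column. The variable names and the single-index data are made up for illustration; the MultiIndex data mirrors the doc example in this PR.

```python
import pandas as pd

# Regular (non-MultiIndex) index named 'key'
df_single = pd.DataFrame({'A': [3, 1, 2]},
                         index=pd.Index(['c', 'a', 'b'], name='key'))
sorted_single = df_single.sort_values(by='key')

# MultiIndex with level names 'first' and 'second'
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                 ('b', 2), ('b', 1), ('b', 1)],
                                names=['first', 'second'])
df_multi = pd.DataFrame({'A': [6, 5, 4, 3, 2, 1]}, index=idx)

# 'second' is an index level name, 'A' is a column name
sorted_multi = df_multi.sort_values(by=['second', 'A'])

print(sorted_single)
print(sorted_multi)
```

In both cases the index is carried along with the rows, so the result keeps its original index structure.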


.. ipython:: python

   # Build MultiIndex
   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                    ('b', 2), ('b', 1), ('b', 1)])
   idx.names = ['first', 'second']

   # Build DataFrame
   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                           index=idx)
   df_multi

   # Sort by 'second' (index) and 'A' (column)
Member

I would put this as normal text between two code blocks instead of as comment (in general that makes it clearer IMO if you want a comment to stand out)

Contributor Author

Sounds good

   df_multi.sort_values(by=['second', 'A'])

.. note::

   .. versionadded:: 0.21

   If a string matches both a column name and an index level name then a
   warning is issued and the column takes precedence. This will result in an
   ambiguity error in a future version.

.. _basics.searchsorted:

48 changes: 34 additions & 14 deletions pandas/core/frame.py
@@ -3436,6 +3436,36 @@ def f(vals):

# ----------------------------------------------------------------------
# Sorting
    def _get_column_or_level_values(self, key, axis=1,
                                    op_description='retrieve'):
        if (is_integer(key) or
                (axis == 1 and key in self) or
                (axis == 0 and key in self.index)):

            if axis == 1 and key in self.index.names:
                warnings.warn(
                    ("'%s' is both a column name and an index level.\n"
                     "Defaulting to column but "
                     "this will raise an ambiguity error in a "
                     "future version") % key,
                    FutureWarning, stacklevel=2)

            k = self.xs(key, axis=axis)._values
            if k.ndim == 2:

                # try to be helpful
                if isinstance(self.columns, MultiIndex):
                    raise ValueError('Cannot %s column "%s" in a multi-index. '
                                     'All levels must be provided explicitly'
                                     % (op_description, str(key)))

                raise ValueError('Cannot %s duplicate column "%s"' %
                                 (op_description, str(key)))
        elif key in self.index.names:
            k = self.index.get_level_values(key).values
        else:
            raise KeyError(key)
        return k
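
The lookup order in the helper above can be modelled without pandas. The sketch below is a simplified, hypothetical stand-in: `resolve_sort_key` and its dict arguments are not part of pandas. It checks columns first, warns when the same name is also an index level, falls back to index levels, and raises `KeyError` otherwise. The `axis`, integer-key, and duplicate-column branches of the real helper are deliberately left out.

```python
import warnings


def resolve_sort_key(key, columns, index_levels):
    """Hypothetical model of the column-then-level lookup.

    `columns` and `index_levels` are plain dicts mapping a name to a
    list of values; they stand in for a frame's columns and its named
    index levels.
    """
    if key in columns:
        if key in index_levels:
            # Same name on both sides: warn, then use the column.
            warnings.warn(
                "'%s' is both a column name and an index level.\n"
                "Defaulting to column but this will raise an ambiguity "
                "error in a future version" % key,
                FutureWarning, stacklevel=2)
        return columns[key]
    elif key in index_levels:
        return index_levels[key]
    raise KeyError(key)


columns = {'A': [6, 5, 4], 'idx': [3, 2, 1]}
index_levels = {'idx': [1, 2, 3], 'second': [1, 2, 2]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    from_column = resolve_sort_key('A', columns, index_levels)
    from_level = resolve_sort_key('second', columns, index_levels)
    ambiguous = resolve_sort_key('idx', columns, index_levels)

print(from_column, from_level, ambiguous, len(caught))
```

Only the ambiguous name `'idx'` triggers a warning here, and the column values win, which matches the precedence described in the PR.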

@Appender(_shared_docs['sort_values'] % _shared_doc_kwargs)
def sort_values(self, by, axis=0, ascending=True, inplace=False,
@@ -3459,10 +3489,8 @@ def trans(v):

keys = []
for x in by:
-k = self.xs(x, axis=other_axis).values
-if k.ndim == 2:
-raise ValueError('Cannot sort by duplicate column %s' %
-str(x))
+k = self._get_column_or_level_values(x, axis=other_axis,
+op_description="sort by")
keys.append(trans(k))
indexer = lexsort_indexer(keys, orders=ascending,
na_position=na_position)
@@ -3471,17 +3499,9 @@ def trans(v):
from pandas.core.sorting import nargsort

by = by[0]
-k = self.xs(by, axis=other_axis).values
-if k.ndim == 2:
-
-# try to be helpful
-if isinstance(self.columns, MultiIndex):
-raise ValueError('Cannot sort by column %s in a '
-'multi-index you need to explicitly '
-'provide all the levels' % str(by))
Member

See comment below as well, so it would be nice to somehow keep this message

+k = self._get_column_or_level_values(by, axis=other_axis,
+op_description="sort by")

-raise ValueError('Cannot sort by duplicate column %s' %
-str(by))
if isinstance(ascending, (tuple, list)):
ascending = ascending[0]

3 changes: 2 additions & 1 deletion pandas/core/generic.py
@@ -67,7 +67,8 @@
args_transpose='axes to permute (int or label for object)',
optional_by="""
by : str or list of str
-Name or list of names which refer to the axis items.""")
+Name or list of names which refer to the axis items or index
+levels.""")
Member

@gfyoung Aug 28, 2017

Not blocking: My OCD really wants you to not "orphan" the "levels" word at the end of this sentence 😄 . However, that's largely a me problem, but if you do happen to come up with a wording that doesn't do this orphaning, by all means go for it.

Contributor Author

Heh, now it bothers me too! I reworked it to fit on one line. "Name or list of names matching axis items or index levels."



def _single_replace(self, to_replace, method, inplace, limit):
191 changes: 190 additions & 1 deletion pandas/tests/frame/test_sorting.py
@@ -9,7 +9,7 @@
import pandas as pd
from pandas.compat import lrange
from pandas import (DataFrame, Series, MultiIndex, Timestamp,
-date_range, NaT, IntervalIndex)
+Index, date_range, NaT, IntervalIndex)

from pandas.util.testing import assert_series_equal, assert_frame_equal

@@ -85,6 +85,13 @@ def test_sort_values(self):
expected = frame.reindex(columns=['C', 'B', 'A'])
assert_frame_equal(sorted_df, expected)

# by row (axis=1) with string index
frame = DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]},
index=['row1', 'row2'])
sorted_df = frame.sort_values(by='row2', axis=1)
expected = frame.reindex(columns=['B', 'A', 'C'])
assert_frame_equal(sorted_df, expected)

msg = r'Length of ascending \(5\) != length of by \(2\)'
with tm.assert_raises_regex(ValueError, msg):
frame.sort_values(by=['A', 'B'], axis=0, ascending=[True] * 5)
@@ -552,3 +559,185 @@ def test_sort_index_intervalindex(self):
closed='right')
result = result.columns.levels[1].categories
tm.assert_index_equal(result, expected)

def test_sort_index_and_column(self):
# Build MultiIndex
Contributor

add the issue number as a comment

idx = MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
('b', 2), ('b', 1), ('b', 1)])
idx.names = ['outer', 'inner']

# Build DataFrames
df_multi = DataFrame({'A': np.arange(6, 0, -1),
'B': ['one', 'one', 'two',
'two', 'one', 'one']},
index=idx)
df_single = df_multi.reset_index('outer')
df_none = df_multi.reset_index()

# Sort by single index
# - On single index frame
expected = df_none.sort_values('inner').set_index('inner')
result = df_single.sort_values('inner')
assert_frame_equal(result, expected)
# - Descending
Contributor

blank line

expected = df_none.sort_values('inner',
ascending=False).set_index('inner')
result = df_single.sort_values('inner', ascending=False)
assert_frame_equal(result, expected)

# - On multi index frame
expected = df_none.sort_values('inner').set_index(['outer', 'inner'])

result = df_multi.sort_values('inner')
assert_frame_equal(result, expected)
# - Descending
Contributor

same

expected = df_none.sort_values('inner',
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values('inner', ascending=False)
assert_frame_equal(result, expected)

# Sort by multiple indexes
# - Ascending
expected = df_none.sort_values(['inner', 'outer']
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer'])
assert_frame_equal(result, expected)

# - Descending
expected = df_none.sort_values(['inner', 'outer'],
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer'],
ascending=False)
assert_frame_equal(result, expected)

# - Mixed
expected = df_none.sort_values(['inner', 'outer'],
ascending=[False, True]
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer'],
ascending=[False, True])
assert_frame_equal(result, expected)

# Sort by single index and single column
# - Ascending
expected = df_none.sort_values(['outer', 'B']
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['outer', 'B'])
assert_frame_equal(result, expected)

# - Descending
expected = df_none.sort_values(['outer', 'B'],
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['outer', 'B'], ascending=False)
assert_frame_equal(result, expected)

# - Mixed
expected = df_none.sort_values(['outer', 'B'],
ascending=[False, True]
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['outer', 'B'],
ascending=[False, True])
assert_frame_equal(result, expected)

# Sort by single column and single index
# - Ascending
expected = df_none.sort_values(['B', 'outer']
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['B', 'outer'])
assert_frame_equal(result, expected)

# - Descending
expected = df_none.sort_values(['B', 'outer'],
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['B', 'outer'], ascending=False)
assert_frame_equal(result, expected)

# - Mixed
expected = df_none.sort_values(['B', 'outer'],
ascending=[False, True]
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['B', 'outer'],
ascending=[False, True])
assert_frame_equal(result, expected)

# Sort by multiple indexes and a single column
# - Ascending
expected = df_none.sort_values(['inner', 'outer', 'A']
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'A'])
assert_frame_equal(result, expected)

# - Descending
expected = df_none.sort_values(['inner', 'outer', 'A'],
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'A'],
ascending=False)
assert_frame_equal(result, expected)

# - Mixed
expected = df_none.sort_values(['inner', 'outer', 'A'],
ascending=[True, True, False]
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'A'],
ascending=[True, True, False])
assert_frame_equal(result, expected)

# Sort by multiple indexes and multiple columns
# - Ascending
expected = df_none.sort_values(['inner', 'outer', 'B', 'A']
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'B', 'A'])
assert_frame_equal(result, expected)

# - Descending
expected = df_none.sort_values(['inner', 'outer', 'B', 'A'],
ascending=False
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'B', 'A'],
ascending=False)
assert_frame_equal(result, expected)

# - Mixed
expected = df_none.sort_values(['inner', 'outer', 'B', 'A'],
ascending=[False, True, True, False]
).set_index(['outer', 'inner'])
result = df_multi.sort_values(['inner', 'outer', 'B', 'A'],
ascending=[False, True, True, False])
assert_frame_equal(result, expected)
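
The test above repeats one pattern: sorting a frame by index level names should give the same rows as sorting the fully reset frame by the equivalent columns and then restoring the index. A compact sketch of that pattern as a loop follows, assuming a pandas release that includes this change. The `key_lists` selection and the `checked` counter are illustrative choices, not part of the PR; every key list ends in `'A'`, whose values are unique, so tie-breaking between equal keys does not affect the comparison.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                 ('b', 2), ('b', 1), ('b', 1)],
                                names=['outer', 'inner'])
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1),
                         'B': ['one', 'one', 'two',
                               'two', 'one', 'one']},
                        index=idx)
# Same data with both index levels turned into ordinary columns
df_none = df_multi.reset_index()

# Mixes of index level names ('outer', 'inner') and columns ('A', 'B')
key_lists = [['outer', 'A'],
             ['B', 'outer', 'A'],
             ['inner', 'outer', 'B', 'A']]

checked = 0
for keys in key_lists:
    for ascending in (True, False):
        expected = (df_none.sort_values(keys, ascending=ascending)
                    .set_index(['outer', 'inner']))
        result = df_multi.sort_values(keys, ascending=ascending)
        pd.testing.assert_frame_equal(result, expected)
        checked += 1

print(checked)
```

A loop like this trades the explicit per-case comments of the test for brevity; a parametrized test would be another way to express the same cases.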

def test_sort_values_column_index_level_precedence(self):
# GH 14353, when a string passed as the `by` parameter
# matches a column and an index level the column takes
# precedence

# Construct DataFrame with index and column named 'idx'
idx = Index(np.arange(1, 7), name='idx')
df = DataFrame({'A': np.arange(11, 17),
'idx': np.arange(6, 0, -1)},
index=idx)

# Sorting by 'idx' should sort by the idx column and raise a
# FutureWarning
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
result = df.sort_values(by='idx')

# This should be equivalent to sorting by the 'idx' index level in
# descending order
expected = df.sort_index(level='idx', ascending=False)
assert_frame_equal(result, expected)

# Perform same test with MultiIndex
df_multi = df.set_index('A', append=True)

with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
result = df_multi.sort_values(by='idx')

expected = df_multi.sort_index(level='idx', ascending=False)
assert_frame_equal(result, expected)
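
Since the PR notes that the ambiguous case is expected to become an error in a future version, code that has a column and an index level with the same name may want to avoid the ambiguous call altogether. The sketch below shows two ways to be explicit, using the same data as the test above. The new level name `'idx_level'` is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

# Index and column share the name 'idx'
idx = pd.Index(np.arange(1, 7), name='idx')
df = pd.DataFrame({'A': np.arange(11, 17),
                   'idx': np.arange(6, 0, -1)},
                  index=idx)

# Explicitly sort by the index level
by_level = df.sort_index(level='idx', ascending=False)

# Explicitly sort by the column: rename the index level first so that
# 'idx' can only refer to the column
by_column = df.rename_axis('idx_level').sort_values(by='idx')

print(by_level)
print(by_column)
```

Both results order the rows the same way here, which is the equivalence the test relies on: the `'idx'` column in ascending order corresponds to the `'idx'` index level in descending order.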