ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Merged
merged 18 commits into from
Jan 5, 2018
Changes from 9 commits
57 changes: 48 additions & 9 deletions doc/source/basics.rst
@@ -1738,19 +1738,26 @@ description.
Sorting
-------

There are two obvious kinds of sorting that you may be interested in: sorting
by label and sorting by actual values.
Pandas supports three kinds of sorting: sorting by index levels,
Member

"index levels" -> "index labels" ?

Contributor Author

Maybe this isn't correct. But I was trying to be consistent in using level to refer to the names of the indexes or columns (df.index.names, df.columns.names) and label to refer to the values referenced by the levels (df.index.get_level_values(lv1), df.columns.get_level_values(lv1)). Is that in line with accepted vocabulary? @jorisvandenbossche

Member

Personally I think "index level" is the equivalent of "column", and from there it would follow that the equivalent of "column values" is "index labels of one level / index level labels / index level values" (but none of those sound that good) and the equivalent of "column names/labels" would then be "index level name".
But it's not that we do that very consistently currently, or that this is necessarily agreed upon vocabulary. But I think it is in line more or less with what you say?

The main reason I commented it here, is because the next line uses "column values", and as you say "label to refer to the values" then to be consistent it would be "index labels".
And a second reason I commented is that I think "index level" is more complicated, as then a user needs to be familiar with the concept of MultiIndexes, while for this section about sorting, that is not needed.

Contributor Author

Thanks for clarifying. After looking at it again, I do agree that "index labels" is the better parallel to "column values".
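
The vocabulary in this thread can be made concrete with a small sketch. It assumes a hypothetical two-level frame (the names `first`, `second`, and `A` are illustrative only): level names come from `df.index.names`, and the labels of one level come from `df.index.get_level_values`.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the terminology
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                names=['first', 'second'])
df = pd.DataFrame({'A': [10, 20, 30]}, index=idx)

# "Index level names": the counterpart of column names
level_names = list(df.index.names)

# "Index labels" of one level: the counterpart of column values
second_labels = list(df.index.get_level_values('second'))

print(level_names)
print(second_labels)
```

Read this way, sorting "by index labels" orders rows by the labels of a level, in the same way that sorting by a column orders rows by that column's values.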

sorting by column values, and sorting by a combination of both.

.. _basics.sort_index:

By Index
~~~~~~~~

The primary method for sorting axis
labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index()`` methods.
The :meth:`Series.sort_index` and :meth:`DataFrame.sort_index` methods are
used to sort a pandas object by its index levels.

.. ipython:: python

   df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                      'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                      'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

   unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                            columns=['three', 'two', 'one'])
   unsorted_df

   # DataFrame
   unsorted_df.sort_index()
@@ -1760,20 +1767,22 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index(
   # Series
   unsorted_df['three'].sort_index()

.. _basics.sort_values:

By Values
~~~~~~~~~

The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` are the entry points for **value** sorting (that is the values in a column or row).
:meth:`DataFrame.sort_values` can accept an optional ``by`` argument for ``axis=0``
which will use an arbitrary vector or a column name of the DataFrame to
determine the sort order:
The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are
used to sort a pandas object by its values. The optional ``by`` parameter to
Member

I would not use just "values" but indicate "column values" or "columns". Values in the index are also values of the object.

Member

Ah, but that is because it can also be rows I suppose :-) Then I would keep the explicit "column or row values" like before.

Contributor Author

I split this into two sentences so that a Series is sorted by "values" and a DataFrame is sorted by "column or row values".
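
The distinction drawn here (a Series is sorted by its values, a DataFrame by column or row values) can be sketched as follows. The frame, the Series, and the labels `r1` to `r3` are made up for illustration.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['x', 'y', 'z'])
df = pd.DataFrame({'one': [9, 1, 3], 'two': [2, 8, 7]},
                  index=['r1', 'r2', 'r3'])

# Series: sorted by its values
by_values = s.sort_values()

# DataFrame, axis=0 (default): rows reordered by the values of column 'one'
by_column = df.sort_values(by='one')

# DataFrame, axis=1: columns reordered by the values of row 'r1'
by_row = df.sort_values(by='r1', axis=1)

print(by_values)
print(by_column)
print(by_row)
```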

:meth:`DataFrame.sort_values` may be used to specify one or more columns to
use to determine the sorted order.

.. ipython:: python

   df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
   df1.sort_values(by='two')

The ``by`` argument can take a list of column names, e.g.:
The ``by`` parameter can take a list of column names, e.g.:

.. ipython:: python

@@ -1788,6 +1797,36 @@ argument:
   s.sort_values()
   s.sort_values(na_position='first')

.. _basics.sort_indexes_and_values:

By Indexes and Values
~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.22.0

Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may
refer to either columns or index levels.
Contributor

Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to?

Side-question, does this work on a regular (non-multiindex) index?

Contributor

I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.

Contributor Author

Changed to "index level names".

And yes, this does work for both the Index and MultiIndex cases.
@TomAugspurger
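
For the regular (non-MultiIndex) case asked about above, a minimal sketch could look like the following. It assumes a pandas release that includes this change; the index name `key` and the data are hypothetical. The result is compared against the older workaround of resetting the index, sorting, and setting it again.

```python
import pandas as pd

# Frame with a single named Index (not a MultiIndex)
df = pd.DataFrame({'A': [1, 2, 1, 2]},
                  index=pd.Index([4, 3, 2, 1], name='key'))

# Sort by column 'A', then by the index level named 'key'
direct = df.sort_values(by=['A', 'key'])

# Workaround that was needed before: move the index into the columns first
workaround = (df.reset_index()
                .sort_values(by=['A', 'key'])
                .set_index('key'))

print(direct)
print(direct.equals(workaround))
```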


.. ipython:: python

   # Build MultiIndex
   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                    ('b', 2), ('b', 1), ('b', 1)])
   idx.names = ['first', 'second']

   # Build DataFrame
   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                           index=idx)
   df_multi

   # Sort by 'second' (index) and 'A' (column)
Member

I would put this as normal text between two code blocks instead of as comment (in general that makes it clearer IMO if you want a comment to stand out)

Contributor Author

Sounds good

   df_multi.sort_values(by=['second', 'A'])

.. note::

   If a string matches both a column name and an index level name then a
   warning is issued and the column takes precedence. This will result in an
   ambiguity error in a future version.

.. _basics.searchsorted:

25 changes: 25 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
@@ -65,6 +65,31 @@ levels <merging.merge_on_columns_and_levels>` documentation section.

.. _whatsnew_0220.enhancements.other:

Sorting by a combination of columns and index levels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Strings passed to :meth:`DataFrame.sort_values` as the ``by`` parameter may
now refer to either column names or index level names. This enables sorting
``DataFrame`` instances by a combination of index levels and columns without
resetting indexes. See the :ref:`Sorting by Indexes and Values
<basics.sort_indexes_and_values>` documentation section.
(:issue:`14353`)

.. ipython:: python

   # Build MultiIndex
   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                    ('b', 2), ('b', 1), ('b', 1)])
   idx.names = ['first', 'second']

   # Build DataFrame
   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                           index=idx)
   df_multi

   # Sort by 'second' (index) and 'A' (column)
   df_multi.sort_values(by=['second', 'A'])

Other Enhancements
^^^^^^^^^^^^^^^^^^

27 changes: 11 additions & 16 deletions pandas/core/frame.py
@@ -113,7 +113,15 @@
    axes_single_arg="{0 or 'index', 1 or 'columns'}",
    optional_by="""
        by : str or list of str
            Name or list of names which refer to the axis items.""",
            Name or list of names matching axis levels or off-axis labels.
Member

I find "off-axis labels" very complex. I mean I know what you mean and theoretically it is correct, but I think novice users will not understand this (the previous "axis items" was also not that good).

I am still thinking of a better wording (the annoying thing is that you cannot just say "column labels or index level names" which is probably the common case how this is used)

Contributor

How about "Name or list of names to sort by"? And then things are clarified below.

Contributor Author

Sounds good. Going with "Name or list of names to sort by"


            - if `axis` is 0 or `'index'` then `by` may contain index
              levels and/or column labels
            - if `axis` is 1 or `'columns'` then `by` may contain column
              levels and/or index labels

            Support for specifying index/column levels was added in
Contributor

This could be a `.. versionchanged::` directive:

.. versionchanged:: 0.22.0
   Allow specifying index or column level names.

Contributor Author

done
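
The two `axis` cases listed in the `by` docstring above can be sketched on a small, made-up frame. This assumes a pandas release that includes this change; the names `first`, `second`, and `A` are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 2), ('a', 1), ('b', 1)],
                                names=['first', 'second'])
df = pd.DataFrame({'A': [3, 2, 1]}, index=idx)

# axis=0: `by` mixes an index level ('second') and a column label ('A')
rows = df.sort_values(by=['second', 'A'], axis=0)

# axis=1: on the transposed frame, 'second' is a column level and 'A'
# is an index label, so the same `by` reorders the columns instead
cols = df.T.sort_values(by=['second', 'A'], axis=1)

print(rows)
print(cols)
```

The transposed result has the same ordering as `rows.T`, which mirrors how the PR's `axis=1` test builds its expected frame.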

            version 0.22.0""",
    versionadded_to_excel='',
    optional_labels="""labels : array-like, optional
        New labels / index to conform the axis specified by 'axis' to.""",
@@ -3612,7 +3620,6 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
                    kind='quicksort', na_position='last'):
        inplace = validate_bool_kwarg(inplace, 'inplace')
        axis = self._get_axis_number(axis)
        other_axis = 0 if axis == 1 else 1

        if not isinstance(by, list):
            by = [by]
@@ -3624,10 +3631,7 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,

            keys = []
            for x in by:
                k = self.xs(x, axis=other_axis).values
                if k.ndim == 2:
                    raise ValueError('Cannot sort by duplicate column %s' %
                                     str(x))
                k = self._get_label_or_level_values(x, axis=axis)
                keys.append(k)
            indexer = lexsort_indexer(keys, orders=ascending,
                                      na_position=na_position)
@@ -3636,17 +3640,8 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
            from pandas.core.sorting import nargsort

            by = by[0]
            k = self.xs(by, axis=other_axis).values
            if k.ndim == 2:

                # try to be helpful
                if isinstance(self.columns, MultiIndex):
                    raise ValueError('Cannot sort by column %s in a '
                                     'multi-index you need to explicitly '
                                     'provide all the levels' % str(by))
Member

See comment below as well, so it would be nice to somehow keep this message

            k = self._get_label_or_level_values(by, axis=axis)

                raise ValueError('Cannot sort by duplicate column %s' %
                                 str(by))
            if isinstance(ascending, (tuple, list)):
                ascending = ascending[0]

4 changes: 2 additions & 2 deletions pandas/core/generic.py
@@ -69,7 +69,7 @@
    args_transpose='axes to permute (int or label for object)',
    optional_by="""
        by : str or list of str
            Name or list of names which refer to the axis items.""")
            Name or list of names matching axis levels or off-axis labels.""")


def _single_replace(self, to_replace, method, inplace, limit):
@@ -2932,7 +2932,7 @@ def add_suffix(self, suffix):
        Parameters
        ----------%(optional_by)s
        axis : %(axes_single_arg)s, default 0
            Axis to direct sorting
            Axis to be sorted
        ascending : bool or list of bool, default True
            Sort ascending vs. descending. Specify list for multiple sort
            orders. If this is a list of bools, must match the length of
122 changes: 122 additions & 0 deletions pandas/tests/frame/test_sort_values_level_as_str.py
@@ -0,0 +1,122 @@
import numpy as np
import pytest

from pandas import DataFrame, Index
from pandas.errors import PerformanceWarning
from pandas.util import testing as tm
from pandas.util.testing import assert_frame_equal


@pytest.fixture
def df_none():
    return DataFrame({
        'outer': ['a', 'a', 'a', 'b', 'b', 'b'],
        'inner': [1, 2, 2, 2, 1, 1],
        'A': np.arange(6, 0, -1),
        ('B', 5): ['one', 'one', 'two', 'two', 'one', 'one']})


@pytest.fixture(params=[
    ['outer'],
    ['outer', 'inner']
])
def df_idx(request, df_none):
    levels = request.param
    return df_none.set_index(levels)


@pytest.fixture(params=[
    'inner',  # index level
    ['outer'],  # list of index level
    'A',  # column
    [('B', 5)],  # list of column
    ['inner', 'outer'],  # two index levels
    [('B', 5), 'outer'],  # index level and column
    ['A', ('B', 5)],  # Two columns
    ['inner', 'outer']  # two index levels and column
])
def sort_names(request):
    return request.param


@pytest.fixture(params=[True, False])
def ascending(request):
    return request.param


def test_sort_index_level_and_column_label(
        df_none, df_idx, sort_names, ascending):

    # Get index levels from df_idx
    levels = df_idx.index.names

    # Compute expected by sorting on columns and then setting the index
    expected = df_none.sort_values(by=sort_names,
                                   ascending=ascending,
                                   axis=0).set_index(levels)

    # Compute result by sorting on a mix of columns and index levels
    result = df_idx.sort_values(by=sort_names,
                                ascending=ascending,
                                axis=0)

    assert_frame_equal(result, expected)


def test_sort_column_level_and_index_label(
        df_none, df_idx, sort_names, ascending):

    # Get levels from df_idx
    levels = df_idx.index.names

    # Compute expected by sorting on axis=0, setting index levels, and then
    # transposing. For some cases this will result in a frame with
    # multiple column levels
    expected = df_none.sort_values(by=sort_names,
                                   ascending=ascending,
                                   axis=0).set_index(levels).T

    # Compute result by transposing and sorting on axis=1.
    result = df_idx.T.sort_values(by=sort_names,
                                  ascending=ascending,
                                  axis=1)

    if len(levels) > 1:
        # Accessing multi-level columns that are not lexsorted raises a
        # performance warning
        with tm.assert_produces_warning(PerformanceWarning,
                                        check_stacklevel=False):
            assert_frame_equal(result, expected)
    else:
        assert_frame_equal(result, expected)


def test_sort_values_column_index_level_precedence():
    # GH 14353, when a string passed as the `by` parameter
    # matches a column and an index level the column takes
    # precedence

    # Construct DataFrame with index and column named 'idx'
    idx = Index(np.arange(1, 7), name='idx')
    df = DataFrame({'A': np.arange(11, 17),
                    'idx': np.arange(6, 0, -1)},
                   index=idx)

    # Sorting by 'idx' should sort by the idx column and raise a
    # FutureWarning
    with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
Member

Can you see if you can get this working without check_stacklevel=False ?

Contributor Author

Sure. Could you offer any insight into what check_stacklevel is actually checking? I've had trouble understanding that @jorisvandenbossche

Member

See #9584 and #10676.
The stacklevel is specified in the FutureWarning in _check_label_or_level_ambiguity, and is currently set at 2, which is not correct. This should be higher, depending on how many call steps there are between sort_values and calling that function.
The main problem here will probably be that _check_label_or_level_ambiguity is used in multiple places, so the needed stacklevel might differ for the different usages.

Contributor Author

Thanks, that's helpful

Contributor

The goal is to report the user's line of code that triggered the warning. In our code, when you emit the FutureWarning there's an optional stacklevel parameter that controls which line of code the warning is attributed to. It's supposed to be the number of function calls from user code to our method emitting the warning.

It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.

Contributor Author

Ok, I was able to remove check_stacklevel=False for these tests, and for the related tests for groupby and merge. I added a stacklevel parameter to _check_label_or_level_ambiguity and _get_label_or_level_values that behaves like the stacklevel parameter of warnings.warn itself (stacklevel=1 blames the caller, stacklevel=2 blames the caller's parent, etc.).

Does this look like a good way to handle it? @TomAugspurger @jorisvandenbossche
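
The stacklevel convention described in this thread can be sketched with the standard `warnings` module alone. The functions `_check_ambiguity`, `public_method`, and `user_code` are hypothetical stand-ins, not the pandas internals. The helper's `stacklevel=1` is taken to mean "blame my caller", so the helper adds one before calling `warnings.warn`.

```python
import warnings


def _check_ambiguity(stacklevel=1):
    # Hypothetical stand-in for the internal helper discussed above.
    # stacklevel=1 should blame this function's caller, so add one to
    # skip this function's own frame.
    warnings.warn("'idx' is both a column and an index level",
                  FutureWarning, stacklevel=stacklevel + 1)


def public_method():
    # One call step sits between user code and the helper, so ask the
    # helper to blame its caller's parent.
    _check_ambiguity(stacklevel=2)


def user_code():
    public_method()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    user_code()

# The warning should be attributed to the call inside user_code
reported_line = caught[0].lineno
expected_line = user_code.__code__.co_firstlineno + 1
print(reported_line == expected_line)
```

As noted above, a single fixed value cannot be right when the helper is reached through call chains of different depth, which is why the depth is passed in by each caller.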

        result = df.sort_values(by='idx')

    # This should be equivalent to sorting by the 'idx' index level in
    # descending order
    expected = df.sort_index(level='idx', ascending=False)
    assert_frame_equal(result, expected)

    # Perform same test with MultiIndex
    df_multi = df.set_index('A', append=True)

    with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
        result = df_multi.sort_values(by='idx')

    expected = df_multi.sort_index(level='idx', ascending=False)
    assert_frame_equal(result, expected)
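
The ambiguous case exercised by the test above can also be probed outside the test suite. This sketch makes no assumption about which behaviour the installed pandas has: the version targeted by this PR is expected to emit a FutureWarning and sort by the column, and versions where the ambiguity has become an error are expected to raise a ValueError. Both outcomes are recorded rather than asserted.

```python
import warnings

import numpy as np
import pandas as pd

# Frame whose index level and one column share the name 'idx'
df = pd.DataFrame({'A': np.arange(11, 17), 'idx': np.arange(6, 0, -1)},
                  index=pd.Index(np.arange(1, 7), name='idx'))

try:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        result = df.sort_values(by='idx')
    warned = any(issubclass(w.category, FutureWarning) for w in caught)
    outcome = 'warning' if warned else 'silent'
except ValueError:
    # Ambiguity is an error in this pandas version
    outcome = 'error'

print(outcome)
```

Renaming either the column or the index level removes the ambiguity in both cases.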
16 changes: 8 additions & 8 deletions pandas/tests/frame/test_sorting.py
@@ -455,38 +455,38 @@ def test_sort_index_duplicates(self):
        df = DataFrame([lrange(5, 9), lrange(4)],
                       columns=['a', 'a', 'b', 'b'])

        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            # use .sort_values #9816
            with tm.assert_produces_warning(FutureWarning):
                df.sort_index(by='a')
        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            df.sort_values(by='a')

        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            # use .sort_values #9816
            with tm.assert_produces_warning(FutureWarning):
                df.sort_index(by=['a'])
        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            df.sort_values(by=['a'])

        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            # use .sort_values #9816
            with tm.assert_produces_warning(FutureWarning):
                # multi-column 'by' is separate codepath
                df.sort_index(by=['a', 'b'])
        with tm.assert_raises_regex(ValueError, 'duplicate'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            # multi-column 'by' is separate codepath
            df.sort_values(by=['a', 'b'])

        # with multi-index
        # GH4370
        df = DataFrame(np.random.randn(4, 2),
                       columns=MultiIndex.from_tuples([('a', 0), ('a', 1)]))
        with tm.assert_raises_regex(ValueError, 'levels'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
Member

I think for this case the old error message was more informative, so it would be nice if we could keep a separate message for the case of incomplete key for MI compared to just non-unique key

Contributor Author

Ok, sure. The new error message will read as follows (for the MultiIndex case):

ValueError: The column label 'a' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.

Does that sound clear? @jorisvandenbossche
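
The situation that message describes can be reproduced with a small sketch, assuming a pandas release that includes this change. The data are made up. An incomplete key (`'a'`) selects more than one column and is rejected, while a full tuple key identifies exactly one column.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([('a', 0), ('a', 1)])
df = pd.DataFrame([[2, 30], [1, 10], [3, 20]], columns=cols)

# Incomplete key: 'a' matches both ('a', 0) and ('a', 1)
try:
    df.sort_values(by='a')
    message = None
except ValueError as err:
    message = str(err)

# Complete key: one tuple element per column level
full_key = df.sort_values(by=[('a', 1)])

print(message)
print(full_key)
```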

            # use .sort_values #9816
            with tm.assert_produces_warning(FutureWarning):
                df.sort_index(by='a')
        with tm.assert_raises_regex(ValueError, 'levels'):
        with tm.assert_raises_regex(ValueError, 'not unique'):
            df.sort_values(by='a')

        # convert tuples to a list of tuples