DOC: improve docs to clarify MultiIndex indexing #19507

Merged
merged 4 commits into from
Feb 15, 2018
86 changes: 59 additions & 27 deletions doc/source/advanced.rst
@@ -113,7 +113,13 @@ of the index is up to you:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

We've "sparsified" the higher levels of the indexes to make the console output a
bit easier on the eyes.
bit easier on the eyes. Note that how the index is displayed can be controlled using the
``display.multi_sparse`` option, set with ``pandas.set_option()`` or temporarily with
``pandas.option_context()``:

.. ipython:: python

   with pd.option_context('display.multi_sparse', False):
       df
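
As a rough, self-contained sketch of what the option does (the small frame below is a made-up stand-in for the ``df`` used in the docs), the following compares the text output with and without sparsified index levels. ``pd.option_context`` restores the previous setting when the block exits.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame; the docs' own `df` is larger.
arrays = [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

default = pd.get_option('display.multi_sparse')

# Inside the context manager every row repeats its outer-level label.
with pd.option_context('display.multi_sparse', False):
    inside = pd.get_option('display.multi_sparse')
    dense = repr(df)

# Outside, the option is back to its previous value and repeated
# outer-level labels are blanked out again.
after = pd.get_option('display.multi_sparse')
sparse = repr(df)

print(dense)
print(sparse)
```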

It's worth keeping in mind that there's nothing preventing you from using
tuples as atomic labels on an axis:
@@ -129,15 +135,6 @@ can find yourself working with hierarchically-indexed data without creating a
``MultiIndex`` explicitly yourself. However, when loading data from a file, you
may wish to generate your own ``MultiIndex`` when preparing the data set.

Note that how the index is displayed by be controlled using the
``multi_sparse`` option in ``pandas.set_options()``:

.. ipython:: python

pd.set_option('display.multi_sparse', False)
df
pd.set_option('display.multi_sparse', True)

.. _advanced.get_level_values:

Reconstructing the level labels
@@ -180,14 +177,13 @@ For example:

.. ipython:: python

  # original MultiIndex
  df.columns
  df.columns # original MultiIndex
Contributor

put comments on a separate line

Contributor Author

Actually I put them next to the commands because otherwise they look like they don't belong to the code (since the prompts are also shown). See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels for an example how it looks right now.

Member

I agree in the actual html output it might be clearer to have it on a single line (in general we should avoid comments in long code blocks, and just put that as text between multiple code-blocks, but in this case I think it is fine)

Contributor Author

@jreback OK with putting the comments on the same lines?


# sliced
df[['foo','qux']].columns
df[['foo','qux']].columns # sliced

This is done to avoid a recomputation of the levels in order to make slicing
highly performant. If you want to see the actual used levels.
highly performant. If you want to see only the used levels, you can use the
:meth:`MultiIndex.get_level_values` method.

.. ipython:: python

@@ -196,7 +192,7 @@ highly performant. If you want to see the actual used levels.
# for a specific level
df[['foo','qux']].columns.get_level_values(0)

To reconstruct the ``MultiIndex`` with only the used levels, the
``remove_unused_levels`` method may be used.

.. versionadded:: 0.20.0
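
The difference between defined levels, used levels, and a trimmed ``MultiIndex`` can be sketched as follows. The frame is an illustrative stand-in for the docs' ``df``, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level column index.
columns = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                      ['one', 'two']])
df = pd.DataFrame(np.zeros((3, 8)), columns=columns)

sliced = df[['foo', 'qux']].columns

# The level definition is not recomputed on slicing, so all four
# first-level labels are still listed.
all_levels = list(sliced.levels[0])

# get_level_values reports only labels that actually occur.
used = list(sliced.get_level_values(0).unique())

# remove_unused_levels rebuilds the index with trimmed levels.
trimmed = list(sliced.remove_unused_levels().levels[0])

print(all_levels, used, trimmed)
```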
@@ -231,15 +227,33 @@ Advanced indexing with hierarchical index
-----------------------------------------

Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc`` is a
bit challenging, but we've made every effort to do so. For example the
following works as you would expect:
bit challenging, but we've made every effort to do so. In general, MultiIndex
keys take the form of tuples. For example, the following works as you would expect:

.. ipython:: python

df = df.T
df
df.loc['bar']
df.loc['bar', 'two']
df.loc[('bar', 'two'),]

Note that ``df.loc['bar', 'two']`` would also work in this example, but this shorthand
notation can lead to ambiguity in general.

If you also want to index a specific column with ``.loc``, you must use a tuple
like this:

.. ipython:: python

df.loc[('bar', 'two'), 'A']

You don't have to specify all levels of the ``MultiIndex``: you can pass only the
first elements of the tuple. For example, you can use "partial" indexing to
get all elements with ``bar`` in the first level as follows:

.. ipython:: python

   df.loc['bar']

This is a shortcut for the slightly more verbose notation ``df.loc[('bar',),]`` (equivalent
to ``df.loc['bar',]`` in this example).
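
A minimal sketch of the three access patterns described above, using a small made-up frame rather than the docs' ``df``:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows are (first, second) pairs, columns A, B, C.
index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=index,
                  columns=['A', 'B', 'C'])

# Complete key as a tuple; the trailing comma leaves the columns unspecified.
row = df.loc[('bar', 'two'),]

# Complete key plus a column label selects a single value.
cell = df.loc[('bar', 'two'), 'A']

# Partial key: only the first level is given, so every row under 'bar'
# is returned, indexed by the remaining level.
partial = df.loc['bar']

print(row)
print(cell)
print(partial)
```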

"Partial" slicing also works quite nicely.

@@ -260,6 +274,24 @@ Passing a list of labels or tuples works similar to reindexing:

df.loc[[('bar', 'two'), ('qux', 'one')]]

.. note::

    It is important to note that tuples and lists are not treated identically
    in pandas when it comes to indexing. Whereas a tuple is interpreted as one
    multi-level key, a list is used to specify several keys. Or in other words,
    tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
whereas a tuple of lists refers to several values within a level:

.. ipython:: python

   s = pd.Series([1, 2, 3, 4, 5, 6],
                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
   s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
   s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists


.. _advanced.mi_slicers:

Using slicers
@@ -317,7 +349,7 @@ Basic multi-index slicing using slices, lists, and labels.
dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]


You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax
using ``:``, rather than using ``slice(None)``.
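
As an illustration of the equivalence, the sketch below selects the same rows once with explicit ``slice`` objects and once with ``pd.IndexSlice``. The frame ``dfmi`` here is a simplified stand-in built for this example, not the one constructed in the docs.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level, lexsorted row index.
index = pd.MultiIndex.from_product([['A0', 'A1', 'A2', 'A3'],
                                    ['B0', 'B1'],
                                    ['C0', 'C1', 'C2', 'C3']])
dfmi = pd.DataFrame(np.arange(len(index) * 2).reshape(len(index), 2),
                    index=index, columns=['x', 'y'])

# Explicit slice objects, one entry per level.
explicit = dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]

# The same selection written with the more natural ``:`` syntax.
idx = pd.IndexSlice
natural = dfmi.loc[idx['A1':'A3', :, ['C1', 'C3']], :]

print(explicit.equals(natural))
```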

.. ipython:: python
@@ -626,7 +658,7 @@ Index Types
-----------

We have discussed ``MultiIndex`` in the previous sections pretty extensively. ``DatetimeIndex`` and ``PeriodIndex``
are shown :ref:`here <timeseries.overview>`, and information about
``TimedeltaIndex`` is found :ref:`here <timedeltas.timedeltas>`.

In the following sub-sections we will highlight some other index types.
@@ -671,9 +703,9 @@ The ``CategoricalIndex`` is **preserved** after indexing:

df2.loc['a'].index

Sorting the index will sort by the order of the categories (Recall that we
created the index with ``CategoricalDtype(list('cab'))``, so the sorted
order is ``cab``.).
Sorting the index will sort by the order of the categories (recall that we
created the index with ``CategoricalDtype(list('cab'))``, so the sorted
order is ``cab``).
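
A small sketch of this behaviour, assuming a frame similar to the ``df2`` used in this section (the data values here are made up):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data; the categories are declared in the order c, a, b.
df = pd.DataFrame({'A': range(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df2 = df.set_index('B')

# sort_index follows the category order, not alphabetical order.
sorted_labels = list(df2.sort_index().index)

print(sorted_labels)
```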

.. ipython:: python

@@ -726,7 +758,7 @@ Int64Index and RangeIndex

Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see :ref:`here <whatsnew_0180.float_indexers>`.

``Int64Index`` is a fundamental basic index in pandas.
This is an immutable array implementing an ordered, sliceable set.
Prior to 0.18.0, the ``Int64Index`` would provide the default index for all ``NDFrame`` objects.

@@ -765,7 +797,7 @@ The only positional indexing is via ``iloc``.
sf.iloc[3]

A scalar index that is not found will raise a ``KeyError``.
Slicing is primarily on the values of the index when using ``[],ix,loc``, and
**always** positional when using ``iloc``. The exception is when the slice is
boolean, in which case it will always be positional.
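
The contrast between value-based and positional slicing can be sketched as follows. The series is an assumed stand-in for the ``sf`` used above, and the sketch is limited to ``loc`` and ``iloc``.

```python
import pandas as pd

# Hypothetical float-indexed series; values are simply 0..4.
sf = pd.Series(range(5), index=[1.5, 2, 3, 4.5, 5])

# loc slices on the index values: labels between 2 and 4, inclusive.
by_label = sf.loc[2:4]

# iloc slices on positions: the third and fourth elements.
by_pos = sf.iloc[2:4]

print(by_label)
print(by_pos)
```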
