DOC: improve docs to clarify MultiIndex indexing (#19507)

cbrnr · jorisvandenbossche · commit 405ed25b2147 · 2018-02-15T10:00:32.000+01:00
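
The distinction this change documents (a tuple is one multi-level key, a list is several keys) can be checked with a minimal sketch. It reuses the ``Series`` from the added example; the names ``by_keys``, ``by_levels``, and ``partial`` are illustrative only and do not appear in the patch.

```python
import pandas as pd

# Same data as the example added in the diff: two levels, six rows.
index = pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]])
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# List of tuples: each tuple is one complete MultiIndex key.
by_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: one list of labels per level, combined across levels.
by_levels = s.loc[(["A", "B"], ["c", "d"])]

# Partial indexing: a single label selects on the first level only.
partial = s.loc["A"]

print(by_keys.tolist())        # [1, 5]
print(by_levels.tolist())      # [1, 2, 4, 5]
print(partial.index.tolist())  # ['c', 'd', 'e']
```

The list of tuples returns exactly the two named rows, whereas the tuple of lists returns every combination of the listed labels that exists in the index.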
diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
@@ -113,7 +113,13 @@ of the index is up to you:
    pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
 
 We've "sparsified" the higher levels of the indexes to make the console output a
-bit easier on the eyes.
+bit easier on the eyes. Note that how the index is displayed can be controlled using the
+``multi_sparse`` option in ``pandas.set_option()``:
+
+.. ipython:: python
+
+   with pd.option_context('display.multi_sparse', False):
+       df
 
 It's worth keeping in mind that there's nothing preventing you from using
 tuples as atomic labels on an axis:
@@ -129,15 +135,6 @@ can find yourself working with hierarchically-indexed data without creating a
 ``MultiIndex`` explicitly yourself. However, when loading data from a file, you
 may wish to generate your own ``MultiIndex`` when preparing the data set.
 
-Note that how the index is displayed by be controlled using the
-``multi_sparse`` option in ``pandas.set_options()``:
-
-.. ipython:: python
-
-   pd.set_option('display.multi_sparse', False)
-   df
-   pd.set_option('display.multi_sparse', True)
-
 .. _advanced.get_level_values:
 
 Reconstructing the level labels
@@ -180,14 +177,13 @@ For example:
 
 .. ipython:: python
 
-   # original MultiIndex
-   df.columns
+   df.columns  # original MultiIndex
 
-   # sliced
-   df[['foo','qux']].columns
+   df[['foo','qux']].columns  # sliced
 
 This is done to avoid a recomputation of the levels in order to make slicing
-highly performant. If you want to see the actual used levels.
+highly performant. If you want to see only the used levels, you can use the
+:meth:`MultiIndex.get_level_values` method.
 
 .. ipython:: python
 
@@ -196,7 +192,7 @@ highly performant. If you want to see the actual used levels.
    # for a specific level
    df[['foo','qux']].columns.get_level_values(0)
 
-To reconstruct the ``MultiIndex`` with only the used levels, the 
+To reconstruct the ``MultiIndex`` with only the used levels, the
 ``remove_unused_levels`` method may be used.
 
 .. versionadded:: 0.20.0
@@ -231,15 +227,33 @@ Advanced indexing with hierarchical index
 -----------------------------------------
 
 Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc`` is a
-bit challenging, but we've made every effort to do so. For example the
-following works as you would expect:
+bit challenging, but we've made every effort to do so. In general, ``MultiIndex``
+keys take the form of tuples. For example, the following works as you would expect:
 
 .. ipython:: python
 
    df = df.T
    df
-   df.loc['bar']
-   df.loc['bar', 'two']
+   df.loc[('bar', 'two'),]
+
+Note that ``df.loc['bar', 'two']`` would also work in this example, but this shorthand
+notation can lead to ambiguity in general.
+
+If you also want to index a specific column with ``.loc``, you must use a tuple
+like this:
+
+.. ipython:: python
+
+   df.loc[('bar', 'two'), 'A']
+
+You don't have to specify all levels of the ``MultiIndex``: passing only the
+first elements of the tuple is enough. For example, you can use "partial" indexing to
+get all elements with ``bar`` in the first level as follows::
+
+   df.loc['bar']
+
+This is a shortcut for the slightly more verbose notation ``df.loc[('bar',),]`` (equivalent
+to ``df.loc['bar',]`` in this example).
 
 "Partial" slicing also works quite nicely.
 
@@ -260,6 +274,24 @@ Passing a list of labels or tuples works similar to reindexing:
 
    df.loc[[('bar', 'two'), ('qux', 'one')]]
 
+.. note::
+
+   It is important to note that tuples and lists are not treated identically
+   in pandas when it comes to indexing. Whereas a tuple is interpreted as one
+   multi-level key, a list is used to specify several keys. Or in other words,
+   tuples go horizontally (traversing levels), lists go vertically (scanning levels).
+
+Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
+whereas a tuple of lists refers to several values within a level:
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3, 4, 5, 6],
+                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
+   s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
+   s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
+
+
 .. _advanced.mi_slicers:
 
 Using slicers
@@ -317,7 +349,7 @@ Basic multi-index slicing using slices, lists, and labels.
    dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]
 
 
-You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax 
+You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax
 using ``:``, rather than using ``slice(None)``.
 
 .. ipython:: python
@@ -626,7 +658,7 @@ Index Types
 -----------
 
 We have discussed ``MultiIndex`` in the previous sections pretty extensively. ``DatetimeIndex`` and ``PeriodIndex``
-are shown :ref:`here <timeseries.overview>`, and information about 
+are shown :ref:`here <timeseries.overview>`, and information about
 `TimedeltaIndex`` is found :ref:`here <timedeltas.timedeltas>`.
 
 In the following sub-sections we will highlight some other index types.
@@ -671,9 +703,9 @@ The ``CategoricalIndex`` is **preserved** after indexing:
 
    df2.loc['a'].index
 
-Sorting the index will sort by the order of the categories (Recall that we 
-created the index with ``CategoricalDtype(list('cab'))``, so the sorted 
-order is ``cab``.). 
+Sorting the index will sort by the order of the categories (recall that we
+created the index with ``CategoricalDtype(list('cab'))``, so the sorted
+order is ``cab``).
 
 .. ipython:: python
 
@@ -726,7 +758,7 @@ Int64Index and RangeIndex
 
    Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see :ref:`here <whatsnew_0180.float_indexers>`.
 
-``Int64Index`` is a fundamental basic index in pandas. 
+``Int64Index`` is a fundamental basic index in pandas.
 This is an Immutable array implementing an ordered, sliceable set.
 Prior to 0.18.0, the ``Int64Index`` would provide the default index for all ``NDFrame`` objects.
 
@@ -765,7 +797,7 @@ The only positional indexing is via ``iloc``.
    sf.iloc[3]
 
 A scalar index that is not found will raise a ``KeyError``.
-Slicing is primarily on the values of the index when using ``[],ix,loc``, and 
+Slicing is primarily on the values of the index when using ``[],ix,loc``, and
 **always** positional when using ``iloc``. The exception is when the slice is
 boolean, in which case it will always be positional.