From 80ee7c31d9e03edb290bf7ffde760c5ba0b6b964 Mon Sep 17 00:00:00 2001
From: Clemens Brunner
Date: Fri, 2 Feb 2018 09:25:26 +0100
Subject: [PATCH 1/4] Improve docs to clarify MultiIndex indexing

---
 doc/source/advanced.rst | 74 ++++++++++++++++++++++++++++-------------
 1 file changed, 51 insertions(+), 23 deletions(-)

diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
index ca903dadc6eb1..be115ca08d44e 100644
--- a/doc/source/advanced.rst
+++ b/doc/source/advanced.rst
@@ -113,7 +113,14 @@ of the index is up to you:
    pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
 
 We've "sparsified" the higher levels of the indexes to make the console output a
-bit easier on the eyes.
+bit easier on the eyes. Note that how the index is displayed can be controlled using the
+``multi_sparse`` option in ``pandas.set_options()``:
+
+.. ipython:: python
+
+   pd.set_option('display.multi_sparse', False)
+   df
+   pd.set_option('display.multi_sparse', True)
 
 It's worth keeping in mind that there's nothing preventing you from using
 tuples as atomic labels on an axis:
@@ -129,15 +136,6 @@ can find yourself working with hierarchically-indexed data without creating a
 ``MultiIndex`` explicitly yourself. However, when loading data from a file, you
 may wish to generate your own ``MultiIndex`` when preparing the data set.
 
-Note that how the index is displayed by be controlled using the
-``multi_sparse`` option in ``pandas.set_options()``:
-
-.. ipython:: python
-
-   pd.set_option('display.multi_sparse', False)
-   df
-   pd.set_option('display.multi_sparse', True)
-
 .. _advanced.get_level_values:
 
 Reconstructing the level labels
@@ -180,14 +178,13 @@ For example:
 
 .. ipython:: python
 
-   # original MultiIndex
-   df.columns
+   df.columns # original MultiIndex
 
-   # sliced
-   df[['foo','qux']].columns
+   df[['foo','qux']].columns # sliced
 
 This is done to avoid a recomputation of the levels in order to make slicing
-highly performant. If you want to see the actual used levels.
+highly performant. If you want to see only the used levels, you can use the
+`get_level_values()` method.
 
 .. ipython:: python
 
@@ -196,7 +193,7 @@ highly performant. If you want to see the actual used levels.
    # for a specific level
    df[['foo','qux']].columns.get_level_values(0)
 
-To reconstruct the ``MultiIndex`` with only the used levels, the 
+To reconstruct the ``MultiIndex`` with only the used levels, the
 ``remove_unused_levels`` method may be used.
 
 .. versionadded:: 0.20.0
@@ -231,16 +228,31 @@ Advanced indexing with hierarchical index
 -----------------------------------------
 
 Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc`` is a
-bit challenging, but we've made every effort to do so. For example the
-following works as you would expect:
+bit challenging, but we've made every effort to do so. In general, MultiIndex
+keys take the form of tuples. For example, the following works as you would expect:
 
 .. ipython:: python
 
    df = df.T
    df
-   df.loc['bar']
    df.loc['bar', 'two']
 
+If you also want to index a specific column with ``.loc``, you have to use
+parentheses around the tuple like this:
+
+.. ipython:: python
+
+   df.loc[('bar', 'two'), 'A']
+
+You don't have to specify all levels of the ``MultiIndex`` by passing only the
+first elements of the tuple. For example, you can use this partially indexing to
+get all elements in the ``bar`` level as follows:
+
+df.loc['bar']
+
+This is identical to the slightly more verbose notation ``df.loc['bar',]`` using
+a tuple with one element.
+
 "Partial" slicing also works quite nicely.
 
 .. ipython:: python
@@ -260,6 +272,22 @@ Passing a list of labels or tuples works similar to reindexing:
 
    df.loc[[('bar', 'two'), ('qux', 'one')]]
 
+.. warning::
+
+   It is important to note that tuples and lists are not treated identically
+   in pandas.
+
+Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
+whereas a tuple of lists refer to several values within a level:
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3, 4],
+                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d"]]))
+   s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
+   s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
+
+
 .. _advanced.mi_slicers:
 
 Using slicers
@@ -317,7 +345,7 @@ Basic multi-index slicing using slices, lists, and labels.
 
    dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]
 
-You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax 
+You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax
 using ``:``, rather than using ``slice(None)``.
 
 .. ipython:: python
@@ -626,7 +654,7 @@ Index Types
 -----------
 
 We have discussed ``MultiIndex`` in the previous sections pretty extensively. ``DatetimeIndex`` and ``PeriodIndex``
-are shown :ref:`here `, and information about 
+are shown :ref:`here `, and information about
 `TimedeltaIndex`` is found :ref:`here `.
 
 In the following sub-sections we will highlight some other index types.
@@ -726,7 +754,7 @@ Int64Index and RangeIndex
 
    Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see :ref:`here `.
 
-``Int64Index`` is a fundamental basic index in pandas. 
+``Int64Index`` is a fundamental basic index in pandas.
 This is an Immutable array implementing an ordered, sliceable set.
 Prior to 0.18.0, the ``Int64Index`` would provide the default index for all ``NDFrame`` objects.
 
@@ -765,7 +793,7 @@ The only positional indexing is via ``iloc``.
    sf.iloc[3]
 
 A scalar index that is not found will raise a ``KeyError``.
-Slicing is primarily on the values of the index when using ``[],ix,loc``, and 
+Slicing is primarily on the values of the index when using ``[],ix,loc``, and
 **always** positional when using ``iloc``. The exception is when the slice is
 boolean, in which case it will always be positional.
 
From 14d770c15d979853b70cc43a2ce570f6305e5349 Mon Sep 17 00:00:00 2001
From: Clemens Brunner
Date: Fri, 2 Feb 2018 10:01:53 +0100
Subject: [PATCH 2/4] Address comments

---
 doc/source/advanced.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
index be115ca08d44e..eb0c60df2c4d8 100644
--- a/doc/source/advanced.rst
+++ b/doc/source/advanced.rst
@@ -245,13 +245,13 @@ parentheses around the tuple like this:
    df.loc[('bar', 'two'), 'A']
 
 You don't have to specify all levels of the ``MultiIndex`` by passing only the
-first elements of the tuple. For example, you can use this partially indexing to
-get all elements in the ``bar`` level as follows:
+first elements of the tuple. For example, you can use *partial* indexing to
+get all elements with ``bar`` in the first level as follows:
 
 df.loc['bar']
 
-This is identical to the slightly more verbose notation ``df.loc['bar',]`` using
-a tuple with one element.
+This is a shortcut for the slightly more verbose notation ``df.loc['bar',]`` (equivalent
+to ``df.loc[('bar',)]``).
 
 "Partial" slicing also works quite nicely.
 
@@ -275,7 +275,8 @@ Passing a list of labels or tuples works similar to reindexing:
 .. warning::
 
    It is important to note that tuples and lists are not treated identically
-   in pandas.
+   in pandas. Whereas a tuple is interpreted as one multi-level key, a list is
+   used to specify several keys.
 
 Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
 whereas a tuple of lists refer to several values within a level:

From e9ba3dac2c48639be5f9f9ff8e11f377e585f983 Mon Sep 17 00:00:00 2001
From: Clemens Brunner
Date: Tue, 13 Feb 2018 09:49:24 +0100
Subject: [PATCH 3/4] Address comments

---
 doc/source/advanced.rst | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
index eb0c60df2c4d8..c583509bef98e 100644
--- a/doc/source/advanced.rst
+++ b/doc/source/advanced.rst
@@ -118,9 +118,8 @@ bit easier on the eyes. Note that how the index is displayed can be controlled u
 
 .. ipython:: python
 
-   pd.set_option('display.multi_sparse', False)
-   df
-   pd.set_option('display.multi_sparse', True)
+   with pd.option_context('display.multi_sparse', False):
+       df
 
 It's worth keeping in mind that there's nothing preventing you from using
 tuples as atomic labels on an axis:
@@ -184,7 +183,7 @@ For example:
 
 This is done to avoid a recomputation of the levels in order to make slicing
 highly performant. If you want to see only the used levels, you can use the
-`get_level_values()` method.
+:func:`MultiIndex.get_level_values` method.
 
 .. ipython:: python
 
@@ -235,23 +234,26 @@ keys take the form of tuples. For example, the following works as you would expe
 
    df = df.T
    df
-   df.loc['bar', 'two']
+   df.loc[('bar', 'two'),]
+
+Note that ``df.loc['bar', 'two']`` would also work in this example, but this shorthand
+notation can lead to ambiguity in general.
 
-If you also want to index a specific column with ``.loc``, you have to use
-parentheses around the tuple like this:
+If you also want to index a specific column with ``.loc``, you must use a tuple
+like this:
 
 .. ipython:: python
 
    df.loc[('bar', 'two'), 'A']
 
 You don't have to specify all levels of the ``MultiIndex`` by passing only the
-first elements of the tuple. For example, you can use *partial* indexing to
+first elements of the tuple. For example, you can use "partial" indexing to
 get all elements with ``bar`` in the first level as follows:
 
 df.loc['bar']
 
-This is a shortcut for the slightly more verbose notation ``df.loc['bar',]`` (equivalent
-to ``df.loc[('bar',)]``).
+This is a shortcut for the slightly more verbose notation ``df.loc[('bar',),]`` (equivalent
+to ``df.loc['bar',]`` in this example).
 
 "Partial" slicing also works quite nicely.
 
@@ -272,11 +274,12 @@ Passing a list of labels or tuples works similar to reindexing:
 
    df.loc[[('bar', 'two'), ('qux', 'one')]]
 
-.. warning::
+.. info::
 
    It is important to note that tuples and lists are not treated identically
-   in pandas. Whereas a tuple is interpreted as one multi-level key, a list is
-   used to specify several keys.
+   in pandas when it comes to indexing. Whereas a tuple is interpreted as one
+   multi-level key, a list is used to specify several keys. Or in other words,
+   tuples go horizontally (traversing levels), lists go vertically (scanning levels).
 
 Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
 whereas a tuple of lists refer to several values within a level:

From 7cef2d37b9383d492230ea9a1245261dd17c88b6 Mon Sep 17 00:00:00 2001
From: Clemens Brunner
Date: Tue, 13 Feb 2018 11:11:23 +0100
Subject: [PATCH 4/4] Update example and fix typo

---
 doc/source/advanced.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
index c583509bef98e..c455fbb8d0687 100644
--- a/doc/source/advanced.rst
+++ b/doc/source/advanced.rst
@@ -286,8 +286,8 @@ whereas a tuple of lists refer to several values within a level:
 
 .. ipython:: python
 
-   s = pd.Series([1, 2, 3, 4],
-                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d"]]))
+   s = pd.Series([1, 2, 3, 4, 5, 6],
+                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
    s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
    s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
 
@@ -703,9 +703,9 @@ The ``CategoricalIndex`` is **preserved** after indexing:
 
    df2.loc['a'].index
 
-Sorting the index will sort by the order of the categories (Recall that we
-created the index with ``CategoricalDtype(list('cab'))``, so the sorted
-order is ``cab``.).
+Sorting the index will sort by the order of the categories (recall that we
+created the index with ``CategoricalDtype(list('cab'))``, so the sorted
+order is ``cab``).
 
 .. ipython:: python
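
Reviewer note (not part of the patch): the list-of-tuples versus tuple-of-lists distinction that this series documents can be checked with a small standalone script. This is a minimal sketch that assumes a reasonably recent pandas is installed. It reuses the six-element Series from PATCH 4/4; the variable names ``by_keys`` and ``by_levels`` are illustrative only and do not appear in the patch.

```python
import pandas as pd

# Same Series as in the final revision of the documentation example.
s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)

# List of tuples: each tuple is one complete MultiIndex key,
# so only the rows ("A", "c") and ("B", "d") are selected.
by_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: each list names labels within one level,
# so every combination of the listed labels is selected.
by_levels = s.loc[(["A", "B"], ["c", "d"])]

print(by_keys.tolist())    # [1, 5]
print(by_levels.tolist())  # [1, 2, 4, 5]
```

With the original four-element Series from PATCH 1/4 the tuple-of-lists call returns every row, which hides the difference; that seems to be why the example was extended with a third label ``"e"`` in the last patch.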