DOC: improve docs to clarify MultiIndex indexing #19507

Merged
merged 4 commits into from
Feb 15, 2018
Conversation

cbrnr
Contributor

@cbrnr cbrnr commented Feb 2, 2018

As per our discussion in #16943. Let me know what you think. I'm not quite happy with the new warning box, ideas how to improve the message are welcome.

df.loc[('bar', 'two'), 'A']

You don't have to specify all levels of the ``MultiIndex`` by passing only the
first elements of the tuple. For example, you can use this partially indexing to

Member

"this partially indexing" -> "partial indexing"?

Member

(maybe I would also put "partial" between quotes, or in italic)


You don't have to specify all levels of the ``MultiIndex`` by passing only the
first elements of the tuple. For example, you can use this partially indexing to
get all elements in the ``bar`` level as follows:

Member

with bar in the first level?


df.loc['bar']

This is identical to the slightly more verbose notation ``df.loc['bar',]`` using

Member

Maybe "This is a shortcut for the slightly more verbose notation df.loc['bar',] (equivalent to df.loc[('bar',)])"

.. warning::

It is important to note that tuples and lists are not treated identically
in pandas.

Member

I think the wording is OK, but I would add another sentence stating the different role (multi-level key vs. list of keys).

@cbrnr
Contributor Author

cbrnr commented Feb 2, 2018

@toobaz done

@codecov

codecov bot commented Feb 2, 2018

Codecov Report

Merging #19507 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #19507      +/-   ##
==========================================
+ Coverage   91.57%    91.6%   +0.02%     
==========================================
  Files         150      150              
  Lines       48817    48817              
==========================================
+ Hits        44704    44718      +14     
+ Misses       4113     4099      -14
Flag Coverage Δ
#multiple 89.97% <ø> (+0.02%) ⬆️
#single 41.72% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Member

jorisvandenbossche left a comment

Nice work! Added some comments


.. ipython:: python

   df = df.T
   df
   df.loc['bar']
   df.loc['bar', 'two']

Member

this is also a bit of a dubious example IMO (try df.loc['bar', 'A'] ...). Should we maybe recommend df.loc[('bar', 'two')] as good practice? (I know in practice it does not matter, as df.loc[('bar', 'A')] works just as well, but it "looks" clearer. Unless we recommend df.loc[('bar', 'two'),] (extra comma), which is not ambiguous I think.)

Contributor

YES. showing lists just perpetuates the confusion.

Contributor Author

Good point, this is really not obvious. It gets confusing once we mix row and column indices inside loc. Also, we're getting a bit into Python syntax issues here. Parentheses are not required to specify a tuple, so df.loc[('bar', 'two')] is exactly identical to df.loc['bar', 'two']. I think we should try to relate MultiIndex usage to the standard use case of loc where you specify rows, columns. In fact, we should probably include the worst case scenario where both rows and columns have a MultiIndex...

Not sure what to do, let's discuss.
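
To illustrate the Python syntax point above without involving pandas at all, here is a minimal sketch. `KeyRecorder` is a hypothetical helper (not part of pandas or this PR) whose `__getitem__` simply returns whatever key Python hands to it, which makes it visible that the parentheses are optional while a trailing comma or a list changes the key.

```python
class KeyRecorder:
    """Hypothetical helper: returns the key exactly as Python passes it to []."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# Parentheses are optional: both spellings produce the same 2-tuple.
without_parens = rec["bar", "two"]
with_parens = rec[("bar", "two")]

# A trailing comma wraps the tuple in another, one-element tuple.
nested = rec[("bar", "two"),]

# A list stays a list, so an indexer can tell it apart from a tuple.
as_list = rec[["bar", "two"]]
```

So any distinction between `df.loc['bar', 'two']` and `df.loc[('bar', 'two')]` cannot come from pandas; only the nested tuple and the list arrive as different objects.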

Contributor

Correct, they are not required (and that's the problem). However, it is much better to be explicit when using multi-indexes.

Contributor Author

OK, let's recommend the use of parens around tuples then. But I'm not sure recommending a nested tuple adds much to resolve the confusion (i.e. df.loc[('bar', 'two'),] is more confusing than df.loc[('bar', 'two')], because I thought that we now distinguish between tuples and lists).

Member

I wouldn't even personally put it in terms of recommendations... I would say that the "default" is df.loc[('bar', 'two'),], and that df.loc['bar', 'two'] is a shortcut which however can lead to ambiguity.

Member

... although then we need to mention that ambiguity is resolved in favour of multiple levels, rather than multiple axes (yeah, there are exceptions currently, but they are bugs)

Member

Yep, +1 on using df.loc[('bar', 'two'),] in the example itself, and mentioning df.loc['bar', 'two'] gives the same in this case but can lead to ambiguity.

Contributor Author

Done
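
For readers following the thread, the spellings discussed here can be compared side by side. This is only a sketch: the small frame below is made up for illustration (it is not the `df` from the docs), and it assumes a reasonably recent pandas.

```python
import pandas as pd

# Illustrative frame with a two-level row index (not the docs' df).
index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# Explicit form: the row key is a tuple, the trailing comma ends the row indexer.
explicit = df.loc[("bar", "two"),]

# Shortcut: gives the same row here, but the key could also be read as
# (row, column), which is the ambiguity discussed above.
shortcut = df.loc["bar", "two"]

# Selecting a column as well requires the row key to be a tuple.
cell = df.loc[("bar", "two"), "A"]

# Partial indexing: only the first level is given.
partial = df.loc["bar"]
```

In this frame `'two'` is not a column label, so the shortcut is resolved as a two-level row key.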

df.loc['bar']

This is a shortcut for the slightly more verbose notation ``df.loc['bar',]`` (equivalent
to ``df.loc[('bar',)]``).

Member

add a final comma to be fully explicit? df.loc[('bar', ),]

Contributor Author

Hm, see my reply above...

Contributor Author

Done

.. warning::

It is important to note that tuples and lists are not treated identically
in pandas. Whereas a tuple is interpreted as one multi-level key, a list is

Member

in pandas "when it comes to multi-indexing" ? to make clear that in many other places we don't make this distinction (or is that already clear from the context?)

Member

I would say "when it comes to indexing", since pd.Series(range(3)).loc[(1,2)] doesn't work (and I think this is good).

Contributor Author

Done
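
The flat-index case mentioned above can be checked with a short sketch. The exception type is deliberately not pinned down here (that is an assumption left open), so the example catches broadly and only records that the tuple lookup fails while the list lookup succeeds.

```python
import pandas as pd

s = pd.Series(range(3))

# A list is a collection of separate keys, so this selects labels 1 and 2.
picked = s.loc[[1, 2]]

# A tuple is read as one multi-level key; on a flat index there is only one
# level, so this lookup is expected to fail rather than act like a list.
try:
    s.loc[(1, 2)]
    tuple_failed = False
except Exception:  # exact exception class intentionally not asserted
    tuple_failed = True
```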

used to specify several keys.

Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
whereas a tuple of lists refer to several values within a level:

Member

I think we need to explain this a bit more in detail

Member

Yeah, it could also be just an info box. It is more an advanced example than something you should really know.

Contributor Author

I'm OK with an info box, but I wish I had really known this in my analysis (which I think isn't that advanced). Anyway, any kind of box is good to attract attention.

Member

I don't know whether it is "serious" enough to be written in the docs, but when I happen to discuss this in talks, I always say "tuples go horizontally [traversing levels], lists go vertically [scanning levels]".

Contributor Author

Done


.. ipython:: python

   pd.set_option('display.multi_sparse', False)

Contributor

use these with the context manager, e.g. pd.option_context('display.multi_sparse', False)

Contributor Author

Done
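
As a sketch of the context-manager suggestion above: `pd.option_context` applies the option only inside the `with` block and restores the previous value afterwards, so later examples on the page are not affected. The frame is made up for illustration.

```python
import pandas as pd

# Illustrative frame with a two-level row index.
index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

before = pd.get_option("display.multi_sparse")

with pd.option_context("display.multi_sparse", False):
    inside = pd.get_option("display.multi_sparse")
    # With sparsification off, the outer label is repeated on every row.
    dense_repr = repr(df)

# The previous setting is back once the block exits.
after = pd.get_option("display.multi_sparse")
```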

@@ -180,14 +178,13 @@ For example:

.. ipython:: python

-  # original MultiIndex
-  df.columns
+  df.columns # original MultiIndex

Contributor

put comments on a separate line

Contributor Author

Actually I put them next to the commands because otherwise they look like they don't belong to the code (since the prompts are also shown). See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels for an example how it looks right now.

Member

I agree in the actual html output it might be clearer to have it on a single line (in general we should avoid comments in long code blocks, and just put that as text between multiple code-blocks, but in this case I think it is fine)

Contributor Author

@jreback OK with putting the comments on the same lines?


This is done to avoid a recomputation of the levels in order to make slicing
-highly performant. If you want to see the actual used levels.
+highly performant. If you want to see only the used levels, you can use the
+`get_level_values()` method.

Contributor

if you reference a method, use :func:`MultiIndex.get_level_values`

Contributor Author

Done
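
The behaviour this doc passage describes can be sketched as follows. The frame is illustrative, not the one from the docs: after selecting a subset of columns, `levels` still lists every defined label, while `get_level_values` reports only the labels actually in use.

```python
import pandas as pd

# Illustrative frame with a two-level column index.
columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]]
)
df = pd.DataFrame([[0] * 8] * 3, columns=columns)

sub = df[["foo", "qux"]]

# The levels are not recomputed after slicing, so all four labels remain.
all_levels = list(sub.columns.levels[0])

# get_level_values reflects only the labels that are still used.
used_labels = list(sub.columns.get_level_values(0).unique())
```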


df.loc['bar', 'two']

If you also want to index a specific column with ``.loc``, you have to use

Contributor

This is not very clear. "You must use a tuple" is more explicit.

Contributor Author

OK!

@jorisvandenbossche
Member

@cbrnr Can you update based on the comments?

@cbrnr
Contributor Author

cbrnr commented Feb 13, 2018

I've updated the docs based on all comments. Please check if it is OK now.

Member

jorisvandenbossche left a comment

Looks good to me!
Added one minor comment

.. ipython:: python

   s = pd.Series([1, 2, 3, 4],
                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d"]]))

Member

maybe add here a "e" in ["c", "d", "e"], so the second example below actually does a selection

Contributor Author

OK, done!
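
With the extra `"e"` label in the second level, the difference between a list of tuples and a tuple of lists becomes visible. The sketch below assumes the Series is extended to six values so that it matches the 2 x 3 product index.

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)

# List of tuples: each tuple is one complete key, so exactly two rows match.
by_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: one list of allowed labels per level, so every combination
# of {A, B} x {c, d} is selected and the "e" rows are dropped.
by_levels = s.loc[(["A", "B"], ["c", "d"])]
```

Without `"e"`, the second selection would return the whole Series, which is why the extra label makes the example more telling.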

@cbrnr
Contributor Author

cbrnr commented Feb 13, 2018

CircleCI error is unrelated - could someone restart it please?

@jorisvandenbossche jorisvandenbossche changed the title Improve docs to clarify MultiIndex indexing DOC: improve docs to clarify MultiIndex indexing Feb 15, 2018
@jorisvandenbossche jorisvandenbossche merged commit 405ed25 into pandas-dev:master Feb 15, 2018
@jorisvandenbossche
Member

@cbrnr Thanks a lot!
(circle ci error was indeed unrelated, connectivity issue)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Feb 15, 2018
@cbrnr cbrnr deleted the multiindex_docs branch February 15, 2018 09:06
harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
4 participants