DOC: improve docs to clarify MultiIndex indexing #19507
doc/source/advanced.rst
   df.loc[('bar', 'two'), 'A']

You don't have to specify all levels of the ``MultiIndex`` by passing only the
first elements of the tuple. For example, you can use this partially indexing to
"this partially indexing" -> "partial indexing"?
(maybe I would also put "partial" between quotes, or in italic)
doc/source/advanced.rst
You don't have to specify all levels of the ``MultiIndex`` by passing only the
first elements of the tuple. For example, you can use this partially indexing to
get all elements in the ``bar`` level as follows:
with `bar` in the first level?
doc/source/advanced.rst
   df.loc['bar']

This is identical to the slightly more verbose notation ``df.loc['bar',]`` using
Maybe "This is a shortcut for the slightly more verbose notation `df.loc['bar',]` (equivalent to `df.loc[('bar',)]`)"
doc/source/advanced.rst
.. warning::

   It is important to note that tuples and lists are not treated identically
   in pandas.
I think the wording is OK, but I would add another sentence stating the different role (multi-level key vs. list of keys).
@toobaz done
Codecov Report
@@ Coverage Diff @@
## master #19507 +/- ##
==========================================
+ Coverage 91.57% 91.6% +0.02%
==========================================
Files 150 150
Lines 48817 48817
==========================================
+ Hits 44704 44718 +14
+ Misses 4113 4099 -14
Nice work! Added some comments
doc/source/advanced.rst
.. ipython:: python

   df = df.T
   df
   df.loc['bar']
   df.loc['bar', 'two']
This is also a bit of a dubious example IMO (try `df.loc['bar', 'A']`...). Should we maybe recommend `df.loc[('bar', 'two')]` as good practice? (I know in practice it does not matter, as `df.loc[('bar', 'A')]` works just as well, but it "looks" clearer. Unless we recommend `df.loc[('bar', 'two'),]` (extra comma), which is not ambiguous I think.)
YES. showing lists just perpetuates the confusion.
Good point, this is really not obvious. It gets confusing once we mix row and column indices inside `loc`. Also, we're getting a bit into Python syntax issues here. Parentheses are not required to specify a tuple, so `df.loc[('bar', 'two')]` is exactly identical to `df.loc['bar', 'two']`. I think we should try to relate MultiIndex usage to the standard use case of `loc` where you specify rows, columns. In fact, we should probably include the worst case scenario where both rows and columns have a MultiIndex...
Not sure what to do, let's discuss.
Correct, they are not required (and that's the problem). However, it is much better to be explicit when using multi-indexes.
OK, then let's recommend the use of parens around tuples. But I'm not sure recommending a nested tuple adds much to resolve the confusion (i.e. `df.loc[('bar', 'two'),]` is more confusing than `df.loc[('bar', 'two')]`, because I thought that we now distinguish between tuples and lists).
I wouldn't even personally put it in terms of recommendations... I would say that the "default" is `df.loc[('bar', 'two'),]`, and that `df.loc['bar', 'two']` is a shortcut which however can lead to ambiguity.
Yep, +1 on using `df.loc[('bar', 'two'),]` in the example itself, and mentioning `df.loc['bar', 'two']` gives the same in this case but can lead to ambiguity.
Done
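To make the ambiguity discussed in this thread concrete, here is a minimal sketch. The frame below is a hypothetical stand-in for the `df` in the docs (its values and the column names `A` and `B` are assumptions), and the comments describe the behavior expected from recent pandas versions rather than a guarantee for every release.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: two-level row index, plain columns
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# Explicit form: the tuple is the complete row key, nothing is said about columns
row = df.loc[("bar", "two"),]

# Shortcut: 'two' is found in the second index level, so this is the same row
shortcut = df.loc["bar", "two"]

# Same syntax, but 'A' is not in the second index level, so it is
# treated as a column label instead
col = df.loc["bar", "A"]

print(row)
print(col)
```

The last two calls look alike but select along different axes, which is the reason given above for preferring the explicit tuple in the docs.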
doc/source/advanced.rst
   df.loc['bar']

This is a shortcut for the slightly more verbose notation ``df.loc['bar',]`` (equivalent
to ``df.loc[('bar',)]``).
add a final comma to be fully explicit? df.loc[('bar', ),]
Hm, see my reply above...
Done
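As a quick check of the spellings mentioned in this thread, the sketch below compares them on a small hypothetical frame (the data and level names are assumptions, not the `df` used in the docs).

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with a two-level row index
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

short = df.loc["bar"]            # partial indexing on the first level
with_comma = df.loc["bar",]      # Python builds the tuple ('bar',) here
explicit = df.loc[("bar",)]      # the same tuple, written out
fully_explicit = df.loc[("bar",),]  # row key as a tuple, columns left out

print(short)
```

All four are expected to return the `bar` rows with the first level dropped; they differ only in how explicitly the row key is marked as a tuple.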
doc/source/advanced.rst
.. warning::

   It is important to note that tuples and lists are not treated identically
   in pandas. Whereas a tuple is interpreted as one multi-level key, a list is
in pandas "when it comes to multi-indexing" ? to make clear that in many other places we don't make this distinction (or is that already clear from the context?)
I would say "when it comes to indexing", since `pd.Series(range(3)).loc[(1,2)]` doesn't work (and I think this is good).
Done
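The flat-index case mentioned above can be sketched as follows. The exact exception class raised for the tuple is not relied on here (catching `Exception` is a simplification for illustration), since it may differ between pandas versions.

```python
import pandas as pd

s = pd.Series(range(3))

# A list is a collection of keys, so this selects two rows
picked = s.loc[[1, 2]]

# A tuple is read as one key with two levels, which a flat index cannot satisfy
try:
    s.loc[(1, 2)]
    raised = False
except Exception as exc:  # broad on purpose; see the note above
    raised = True
    print(type(exc).__name__, exc)

print(picked)
```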
doc/source/advanced.rst
   used to specify several keys.

Importantly, a list of tuples indexes several complete ``MultiIndex`` keys,
whereas a tuple of lists refer to several values within a level:
I think we need to explain this a bit more in detail
Yeah, it could also be just an info box. It is more an advanced example than something you should really know.
I'm OK with an info box, but I wish I had really known this in my analysis (which I think isn't that advanced). Anyway, any kind of box is good to attract attention.
I don't know whether it is "serious" enough to be written in the docs, but when I happen to discuss this in talks, I always say "tuples go horizontally [traversing levels], lists go vertically [scanning levels]".
Done
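The "tuples go horizontally, lists go vertically" rule of thumb from this thread can be illustrated with a small Series. This is a sketch with made-up values; the variable names are for illustration only.

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)

# List of tuples: each tuple is one complete key, so two elements are selected
by_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: one list of labels per level, so every combination
# of {A, B} x {c, d} is selected
by_levels = s.loc[(["A", "B"], ["c", "d"])]

print(by_keys)
print(by_levels)
```

Here the list of tuples returns two elements while the tuple of lists returns four, even though both use the same labels.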
doc/source/advanced.rst
.. ipython:: python

   pd.set_option('display.multi_sparse', False)
use these with the context manager, e.g. pd.option_context('display.multi_sparse', False)
Done
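A minimal sketch of the context-manager form suggested above, using a hypothetical Series. The option is changed only inside the `with` block and restored afterwards, so later examples in the same document are not affected.

```python
import pandas as pd

# Hypothetical example data with a two-level index
index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series(range(4), index=index)

before = pd.get_option("display.multi_sparse")

with pd.option_context("display.multi_sparse", False):
    inside = pd.get_option("display.multi_sparse")
    dense = repr(s)  # outer labels are repeated on every row

after = pd.get_option("display.multi_sparse")

print(dense)
```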
@@ -180,14 +178,13 @@ For example:

 .. ipython:: python

-   # original MultiIndex
-   df.columns
+   df.columns  # original MultiIndex
put comments on a separate line
Actually I put them next to the commands because otherwise they look like they don't belong to the code (since the prompts are also shown). See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels for an example how it looks right now.
I agree in the actual html output it might be clearer to have it on a single line (in general we should avoid comments in long code blocks, and just put that as text between multiple code-blocks, but in this case I think it is fine)
@jreback OK with putting the comments on the same lines?
doc/source/advanced.rst
 This is done to avoid a recomputation of the levels in order to make slicing
-highly performant. If you want to see the actual used levels.
+highly performant. If you want to see only the used levels, you can use the
+`get_level_values()` method.
if you reference a method, use :func:`MultiIndex.get_level_values`
Done
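The difference between the stored levels and the levels actually in use, which this hunk documents, can be sketched as follows. The frame is a hypothetical example; `remove_unused_levels()` is shown only as a related option and is not part of the diff under review.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two-level columns
columns = pd.MultiIndex.from_product([["bar", "baz", "foo"], ["one", "two"]])
df = pd.DataFrame(np.zeros((2, 6)), columns=columns)

sub = df[["bar", "baz"]]

# The levels are not recomputed after selection, so 'foo' is still listed
stored = list(sub.columns.levels[0])

# get_level_values reports only the labels that are actually present
used = sub.columns.get_level_values(0).unique().tolist()

# Optionally rebuild the index so the levels match what is in use
trimmed = list(sub.columns.remove_unused_levels().levels[0])

print(stored, used, trimmed)
```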
doc/source/advanced.rst
   df.loc['bar', 'two']

If you also want to index a specific column with ``.loc``, you have to use
This is not very clear. "You must use a tuple" is more explicit.
OK!
@cbrnr Can you update based on the comments?
I've updated the docs based on all comments. Please check if it is OK now.
Looks good to me!
Added one minor comment
doc/source/advanced.rst
.. ipython:: python

   s = pd.Series([1, 2, 3, 4],
                 index=pd.MultiIndex.from_product([["A", "B"], ["c", "d"]]))
maybe add here a "e" in ["c", "d", "e"], so the second example below actually does a selection
OK, done!
CircleCI error is unrelated - could someone restart it please?
@cbrnr Thanks a lot!
As per our discussion in #16943. Let me know what you think. I'm not quite happy with the new warning box, ideas how to improve the message are welcome.