DOC: update the DataFrame.loc[] docstring #20229
pandas/core/indexing.py
Outdated
|
||
``.loc[]`` is primarily label based, but may also be used with a | ||
boolean array. | ||
boolean array. Note that if no row or column labels are specified |
I don't think this sentence with "Note that..." is correct. I would remove it.
pandas/core/indexing.py
Outdated
@@ -1426,14 +1429,53 @@ class _LocIndexer(_LocationIndexer): | |||
- A list or array of labels, e.g. ``['a', 'b', 'c']``. | |||
- A slice object with labels, e.g. ``'a':'f'`` (note that contrary | |||
to usual python slices, **both** the start and the stop are included!). | |||
- A boolean array. | |||
- A boolean array, e.g. [True, False, True]. |
specify that this should be the same length as the axis being sliced.
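A minimal sketch of the length requirement raised above, using a small made-up frame (the labels `r0`..`r2` and `c0`..`c2` are illustrative). It assumes a recent pandas release, where a boolean list that is shorter than the axis is rejected with an error; older releases may behave differently.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)

# One boolean per row: keeps rows r0 and r2.
kept = df.loc[[True, False, True]]

# A mask shorter than the row axis is rejected rather than padded
# (assumption: recent pandas versions raise here).
try:
    df.loc[[True, False]]
    mismatch_rejected = False
except (IndexError, ValueError, KeyError):
    mismatch_rejected = True
```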
pandas/core/indexing.py
Outdated
- A ``callable`` function with one argument (the calling Series, DataFrame | ||
or Panel) and that returns valid output for indexing (one of the above) | ||
|
||
``.loc`` will raise a ``KeyError`` when the items are not found. | ||
|
||
See more at :ref:`Selection by Label <indexing.label>` | ||
|
||
See Also | ||
-------- | ||
at : Selects a single value for a row/column label pair |
`DataFrame.` for all these.
pandas/core/indexing.py
Outdated
r2 10 20 30 | ||
>>> df.loc[df['c1'] > 10, ['c0', 'c2']] | ||
c0 c2 | ||
r2 10 30 |
Can you make a second example Series or DataFrame where the index values are integers, but not `0-len(df)`? And then show how `.loc` uses the labels and not the positions?
In that second example, could you also show a slice like `df.loc[2:5]` and show that it's closed on the right, so the label 5 is included?
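The request above can be sketched as follows. The frame and its integer labels 7, 8, 9 are illustrative, chosen so that labels and positions never coincide.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=[7, 8, 9],
    columns=["c0", "c1"],
)

# .loc looks up the label 7, which happens to be the first row.
row = df.loc[7]

# Label slice: both endpoints are included, so label 8 is part of the result.
by_label = df.loc[7:8]

# Positional slice for comparison: the stop is excluded, as in plain Python.
by_position = df.iloc[0:1]
```

Here `by_label` has two rows (labels 7 and 8) while `by_position` has only one (label 7).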
pandas/core/indexing.py
Outdated
r0 12 | ||
r1 0 | ||
Name: c0, dtype: int64 | ||
>>> df.loc[[False, False, True]] |
Would be nice to have small bits of text breaking these up. Like "Indexing with a boolean array."
pandas/core/indexing.py
Outdated
r0 12 2 3 | ||
r1 0 4 1 | ||
r2 10 20 30 | ||
>>> df.loc['r1'] |
blank lines in between cases
Thank you for the comments! I added commentary on each of the different examples. I also added another example
pandas/core/indexing.py
Outdated
@@ -1413,7 +1413,8 @@ def _get_slice_axis(self, slice_obj, axis=None): | |||
|
|||
|
|||
class _LocIndexer(_LocationIndexer): | |||
"""Purely label-location based indexer for selection by label. | |||
""" | |||
Selects a group of rows and columns by label(s) or a boolean array. |
"Select" is not fully correct as it is not only for getting, but also for setting?
Or is that general enough? (@jreback @TomAugspurger ) To set you of course also need to select the location to set ..
The "see also" is using "access" now instead of "select"
Ah yes, I had meant to switch to using `access` all around, although I'm not sure if that gets around the problem you mentioned. Perhaps it's enough to mention you can use `loc` to get and set in the extended summary? I'm struggling to think of a good word that implies both getting and setting... but I will keep thinking on it.
And I can also add some examples of using `loc` for setting values below.
yes, that would be good
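As a rough illustration of the setting use case discussed in this thread (not the examples that ended up in the docstring), the same indexer forms can appear on the left-hand side of an assignment. The frame and values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1"],
)

# Set a single cell.
df.loc["r1", "c1"] = 50

# Set one column for several rows at once.
df.loc[["r0", "r2"], "c0"] = 0

# Set values for the rows matched by a boolean Series (only r1 has c1 > 10).
df.loc[df["c1"] > 10, "c0"] = -1
```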
pandas/core/indexing.py
Outdated
c2 1 | ||
Name: r1, dtype: int64 | ||
|
||
|
only a single blank line (below as well)
pandas/core/indexing.py
Outdated
See more at :ref:`Selection by Label <indexing.label>` | ||
|
||
See Also | ||
-------- | ||
DateFrame.at |
add Series.loc
Added below this
pandas/core/indexing.py
Outdated
r1 0 4 1 | ||
r2 10 20 30 | ||
|
||
Single label for row (note it would be faster to use ``DateFrame.at`` in |
don't need the note about perf
you can mention that this returns a Series and using `[[]]` returns a DataFrame (or maybe just show that as an example right after)
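A small sketch of the distinction mentioned above, on an illustrative frame: a single label returns the row as a Series, while a one-element list keeps the result two-dimensional.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)

# Single label: the row comes back as a Series named after the label.
row_as_series = df.loc["r1"]

# List with a single label: the result stays a DataFrame with one row.
row_as_frame = df.loc[["r1"]]
```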
pandas/core/indexing.py
Outdated
c2 1 | ||
Name: r1, dtype: int64 | ||
|
||
Single label for row and column (note it would be faster to use |
same, remove the commentary
pandas/core/indexing.py
Outdated
>>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]], | ||
... index=[7, 8, 9], columns=['c0', 'c1', 'c2']) | ||
>>> df | ||
c0 c1 c2 |
nice examples! can you add one using a MultiIndex for the index, and show selecting with tuples. sure this is getting long, but these examples are useful.
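One possible shape for the MultiIndex example requested above. The level names `outer` and `inner` and all values are hypothetical; the examples actually added in the PR may differ.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["outer", "inner"],
)
df = pd.DataFrame({"c0": [10, 20, 30, 40], "c1": [1, 2, 3, 4]}, index=index)

# Partial key: all rows under outer label 'a'; the outer level is dropped.
outer_a = df.loc["a"]

# Full key as a tuple: a single row, returned as a Series.
one_row = df.loc[("a", 2)]

# Full key plus a column label: a scalar.
cell = df.loc[("b", 1), "c0"]
```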
pandas/core/indexing.py
Outdated
-------- | ||
DateFrame.at | ||
Access a single value for a row/column label pair | ||
DateFrame.iat |
you can remove .iat from here (leave the .iloc though)
Thank you again for the comments. I added a number of other examples with a DataFrame that has a MultiIndex. Let me know if it was overkill, or if I missed an important use case (or included one that is not very useful).
pandas/core/indexing.py
Outdated
DateFrame.at | ||
Access a single value for a row/column label pair | ||
DateFrame.iloc | ||
Access group of rows and columns by integer position(s) |
Same comment here as on the other PR:
DateFrame.iloc : explanation ..
pandas/core/indexing.py
Outdated
r1 0 4 1 | ||
r2 10 20 30 | ||
|
||
Single label. Note this returns a Series. |
Maybe clarify "Note this returns a Series." even further as "Note this returns the row as a Series."
Some great examples you added! Given that there are quite a few, can you add some subsection titles? (You can use bold for that (`**Sub section**`), as done in the docstring guide.)
For example, subsections for getting, setting, and MultiIndex (you can see how it can best be divided; it is meant to give the long Examples section a bit more structure).
pandas/core/indexing.py
Outdated
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]], |
I would maybe number the values consecutively, so in the output it is easier to see which row was returned
pandas/core/indexing.py
Outdated
|
||
Callable that returns valid output for indexing | ||
|
||
>>> df.loc[df['c1'] > 10] |
This is actually not a callable but a boolean Series, same for the example below. I think this is a nice example to keep though, but I would explain it a bit differently (frame it as a boolean Series that is calculated from the frame itself)
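To make the distinction concrete, here is a short sketch on an illustrative frame. `df['c1'] > 10` is evaluated first and hands `.loc` a boolean Series, whereas a callable is passed unevaluated and `.loc` calls it with the frame.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 20, 30]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)

# Boolean Series computed from the frame itself.
mask = df["c1"] > 10
by_series = df.loc[mask]

# Callable: .loc calls it with df and indexes with whatever it returns.
by_callable = df.loc[lambda frame: frame["c1"] > 10]
```

Both forms select the same rows here; the callable form is mainly useful in method chains where the intermediate frame has no name.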
Codecov Report
@@ Coverage Diff @@
## master #20229 +/- ##
==========================================
+ Coverage 91.7% 91.76% +0.05%
==========================================
Files 150 150
Lines 49168 49151 -17
==========================================
+ Hits 45090 45102 +12
+ Misses 4078 4049 -29
@jorisvandenbossche Thank you for the recommendation on using subsections within the examples. I think that's a great idea. Please let me know if you think there is anything else I should add or modify.
just a few comments. looks pretty good! @jorisvandenbossche @TomAugspurger @datapythonista
pandas/core/indexing.py
Outdated
c2 6 | ||
Name: r1, dtype: int64 | ||
|
||
List with a single label. Note using ``[[]]`` returns a DataFrame. |
this is slightly redundant as you are showing the example with a list below
pandas/core/indexing.py
Outdated
|
||
Boolean list with the same length as the row axis | ||
|
||
>>> df.loc[[False, False, True]] |
this is not a very common thing to do (directly), the boolean indexing right below is MUCH more important.
pandas/core/indexing.py
Outdated
|
||
Boolean list | ||
|
||
>>> df.loc[[True, False, True, False, True, True]] |
same as above (remove this example)
pandas/core/indexing.py
Outdated
Raises | ||
------ | ||
KeyError: | ||
when items are not found |
when any items are not found
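A minimal sketch of the `Raises` entry for the simplest case, a single missing label on an illustrative frame. How a list containing some missing labels is treated has varied across pandas versions, so that case is deliberately left out here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["r0", "r1"],
    columns=["c0", "c1"],
)

# A label that is not in the index raises KeyError.
try:
    df.loc["missing"]
    raised = False
except KeyError:
    raised = True
```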
Great job for a quite difficult class to document. Added couple of comments with ideas.
pandas/core/indexing.py
Outdated
-------- | ||
DateFrame.at : Access a single value for a row/column label pair | ||
DateFrame.iloc : Access group of rows and columns by integer position(s) | ||
Series.loc : Access group of values using labels |
I'd add `DataFrame.xs` too.
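For context on why `DataFrame.xs` is a natural cross-reference, here is a hedged sketch comparing it with `.loc` on a MultiIndex. The level names and values are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["outer", "inner"],
)
df = pd.DataFrame({"c0": [10, 20, 30, 40]}, index=index)

# xs takes a cross-section on a named level and drops that level.
inner_one = df.xs(1, level="inner")

# The same rows via .loc; here the full MultiIndex is kept.
same_rows = df.loc[(slice(None), 1), :]
```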
pandas/core/indexing.py
Outdated
**Getting values** | ||
|
||
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], | ||
... index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2']) |
I think in this case it makes more sense to use a DataFrame with data that looks more real. It's just an opinion, but I'd understand `.loc['falcon', 'max_speed']` easier/faster than `.loc['r1', 'c2']`. I'd also use just 2 columns; I think it should be enough and makes things simpler.
I agree. I have added some more meaningful labels. Let me know if you like it or have any other feedback on this matter.
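To illustrate the suggestion, here is a sketch with descriptive labels and two columns. The rows, the second column name `num_legs`, and every value are invented for this example and are not the data used in the PR.

```python
import pandas as pd

# Hypothetical data: labels and numbers are made up for illustration.
df = pd.DataFrame(
    {"max_speed": [389.0, 24.0], "num_legs": [2, 2]},
    index=["falcon", "parrot"],
)

# Row label and column label read naturally at the call site.
speed = df.loc["falcon", "max_speed"]
```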
pandas/core/indexing.py
Outdated
@@ -1426,14 +1427,231 @@ class _LocIndexer(_LocationIndexer): | |||
- A list or array of labels, e.g. ``['a', 'b', 'c']``. | |||
- A slice object with labels, e.g. ``'a':'f'`` (note that contrary | |||
to usual python slices, **both** the start and the stop are included!). |
Maybe we could use the `.. warning::` directive for this comment (instead of a note in brackets ending with an exclamation mark).
lgtm, really clear and useful examples.
This looks really good. Question: do we want to mention / document indexing with duplicates? Having duplicates in the index is such a mess that I'm fine with avoiding it, but curious what others think.
I think we need to discuss a bit to what extent the docstring of methods should be the full reference, or more the basic usage with the full reference and its many examples living in the reference guide (as it is now). But that's a more general discussion, as then we would also need a better way to structure the docstrings with subsections, I think. So given that, I would leave this docstring as it is for now (keeping it to basic usage). If we start to discuss duplicates as well, it will get much longer.
👍 Thanks all!
Thank you all once again!
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks:
If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.