DOC: update the DataFrame.loc[] docstring #20229

Merged
merged 8 commits into from
Mar 14, 2018

@akosel akosel commented Mar 10, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

# paste output of "scripts/validate_docstrings.py <your-function-or-method>" here
# between the "```" (remove this comment, but keep the "```")
################################################################################
####################### Docstring (pandas.DataFrame.loc) #######################
################################################################################

Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

.. warning:: Note that contrary to usual python slices, **both** the start
    and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access group of rows and columns by integer position(s)
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the
    Series/DataFrame.
Series.loc : Access group of values using labels

Examples
--------
**Getting values**

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for row and column

>>> df.loc['cobra', 'shield']
2

Slice with labels for row and single label for column. As mentioned
above, note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis

>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series

>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series with column labels specified

>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7

Callable that returns a boolean Series

>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8

**Setting values**

Set value for all items matching the list of labels

>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

Set value for an entire row

>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50

Set value for an entire column

>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50

Set value for rows matching a boolean condition

>>> df.loc[df['shield'] > 35] = 0
>>> df
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0

**Getting values on a DataFrame with an index that has integer labels**

Another example using integers for the index

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8

Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.

>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8

**Getting values with a MultiIndex**

A number of examples using a DataFrame with a MultiIndex

>>> tuples = [
...    ('cobra', 'mark i'), ('cobra', 'mark ii'),
...    ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...    ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...         [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Single label. Note this returns a DataFrame with a single index.

>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4

Single index tuple. Note this returns a Series.

>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64

Single label for row and column. Similar to passing in a tuple, this
returns a Series.

>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64

Single tuple. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4

Single tuple for the index with a single label for the column

>>> df.loc[('cobra', 'mark i'), 'shield']
2

Slice from index tuple to single label

>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Slice from index tuple to index tuple

>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1

Raises
------
KeyError:
    when any items are not found

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	No returns section found

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.

@TomAugspurger added the "Docs" and "Indexing" (related to indexing on series/frames, not to indexes themselves) labels Mar 10, 2018

``.loc[]`` is primarily label based, but may also be used with a
boolean array.
boolean array. Note that if no row or column labels are specified
Contributor

I don't think this sentence with "Note that..." is correct. I would remove it.

@@ -1426,14 +1429,53 @@ class _LocIndexer(_LocationIndexer):
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'`` (note that contrary
to usual python slices, **both** the start and the stop are included!).
- A boolean array.
- A boolean array, e.g. [True, False, True].
Contributor

specify that this should be the same length as the axis being sliced.

- A ``callable`` function with one argument (the calling Series, DataFrame
or Panel) and that returns valid output for indexing (one of the above)

``.loc`` will raise a ``KeyError`` when the items are not found.

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
at : Selects a single value for a row/column label pair
Contributor

Prefix all of these with ``DataFrame.``.

r2 10 20 30
>>> df.loc[df['c1'] > 10, ['c0', 'c2']]
c0 c2
r2 10 30
Contributor

Can you make a second example series or DataFrame where the index values are integers, but not 0-len(df)? And then show how .loc uses the labels and not the positions?

In that second example, could you also show a slice like df.loc[2:5] and show that it's closed on the right, so the label 5 is included?
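
For readers following the thread, here is a minimal sketch of the behaviour this comment asks the docstring to show. The frame and its labels are illustrative assumptions, not taken from the PR; `.iloc` is included only for contrast.

```python
import pandas as pd

# Illustrative frame (not from the PR): integer labels that are not 0..len(df)-1.
df = pd.DataFrame({"max_speed": [1, 4, 7, 10]}, index=[2, 3, 5, 8])

# .loc[2, ...] looks up the *label* 2, which is the first row here.
by_label = df.loc[2, "max_speed"]

# .iloc[2, ...] is positional and reads the third row instead.
by_position = df.iloc[2, 0]

# Label slices are closed on both ends, so the row labelled 5 is included.
sliced = df.loc[2:5]
print(by_label, by_position, list(sliced.index))
```

The same label `2` therefore means different rows to `.loc` and `.iloc`, which is why the docstring stresses that integers are never treated as positions by `.loc`.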

r0 12
r1 0
Name: c0, dtype: int64
>>> df.loc[[False, False, True]]
Contributor

Would be nice to have small bits of text breaking these up. Like "Indexing with a boolean array."

r0 12 2 3
r1 0 4 1
r2 10 20 30
>>> df.loc['r1']
Contributor

blank lines in between cases

@akosel
Contributor Author

akosel commented Mar 10, 2018

Thank you for the comments! I added commentary on each of the different examples. I also added another example df with integers set for the index labels. Please let me know if you have any other changes you'd like to see.

@@ -1413,7 +1413,8 @@ def _get_slice_axis(self, slice_obj, axis=None):


class _LocIndexer(_LocationIndexer):
"""Purely label-location based indexer for selection by label.
"""
Selects a group of rows and columns by label(s) or a boolean array.
Member

"Select" is not fully correct as it is not only for getting, but also for setting?
Or is that general enough? (@jreback @TomAugspurger ) To set you of course also need to select the location to set ..
The "see also" is using "access" now instead of "select"

Contributor Author

Ah yes, I had meant to switch to using access all around, although I'm not sure if that gets around the problem you mentioned. Perhaps, it's enough to mention you can use loc to get and set in the extended summary? I'm struggling to think of a good word that implies both getting and setting...but I will keep thinking on it.

Contributor Author

And I can also add some examples of using loc for setting values below.

Member

yes, that would be good

c2 1
Name: r1, dtype: int64


Member

only a single blank line (below as well)

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
DataFrame.at
Contributor

add Series.loc

Contributor Author

@akosel akosel Mar 12, 2018

Added below this

r1 0 4 1
r2 10 20 30

Single label for row (note it would be faster to use ``DataFrame.at`` in
Contributor

don't need the note about perf

Contributor

you can mention that this returns a Series and using [[]] returns a DataFrame (or maybe just show that as an example right after)

c2 1
Name: r1, dtype: int64

Single label for row and column (note it would be faster to use
Contributor

same, remove the commentary

>>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]],
... index=[7, 8, 9], columns=['c0', 'c1', 'c2'])
>>> df
c0 c1 c2
Contributor

nice examples! can you add one using a MultiIndex for the index, and show selecting with tuples. sure this is getting long, but these examples are useful.

--------
DataFrame.at
Access a single value for a row/column label pair
DataFrame.iat
Contributor

you can remove .iat from here (leave the .iloc though)

@akosel
Contributor Author

akosel commented Mar 11, 2018

Thank you again for the comments. I added a number of other examples with DataFrame that has a MultiIndex. Let me know if it was overkill, or if I missed an important use case (or included one that is not very useful).

DataFrame.at
Access a single value for a row/column label pair
DataFrame.iloc
Access group of rows and columns by integer position(s)
Member

Same comment here as on the other PR:

DataFrame.iloc : explanation ..

r1 0 4 1
r2 10 20 30

Single label. Note this returns a Series.
Member

Maybe clarify "Note this returns a Series." even further as "Note this returns the row as a Series."

Member

@jorisvandenbossche jorisvandenbossche left a comment

Some great examples you added! Given that there are quite a few, can you add some subsection titles? (You can use bold for that (**Sub section**), as done in the docstring guide.)
For example, subsections for getting, setting, and MultiIndex (see how it divides best; it is meant to give the long Examples section a bit more structure).


Examples
--------
>>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]],
Member

I would maybe number the values consecutively, so in the output it is easier to see which row was returned.


Callable that returns valid output for indexing

>>> df.loc[df['c1'] > 10]
Member

This is actually not a callable but a boolean Series, same for the example below. I think this is a nice example to keep though, but I would explain it a bit differently (frame it as a boolean Series that is calculated from the frame itself).
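
To make the distinction in this comment concrete, a small sketch follows. It assumes a frame shaped like the one in the final docstring; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
    index=["cobra", "viper", "sidewinder"],
)

# A boolean Series: the comparison is evaluated first, and .loc receives the mask.
mask = df["shield"] > 6
via_mask = df.loc[mask]

# A callable: .loc calls the function with the frame and indexes with its result.
via_callable = df.loc[lambda d: d["shield"] > 6]

print(via_mask.equals(via_callable))
```

Both forms select the same rows here; the callable form is mainly useful in method chains, where the intermediate frame has no name to build a mask from.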

@codecov

codecov bot commented Mar 13, 2018

Codecov Report

Merging #20229 into master will increase coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20229      +/-   ##
==========================================
+ Coverage    91.7%   91.76%   +0.05%     
==========================================
  Files         150      150              
  Lines       49168    49151      -17     
==========================================
+ Hits        45090    45102      +12     
+ Misses       4078     4049      -29
Flag Coverage Δ
#multiple 90.14% <ø> (+0.05%) ⬆️
#single 41.9% <ø> (+0.03%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexing.py 93.02% <ø> (ø) ⬆️
pandas/core/window.py 96.3% <0%> (-0.03%) ⬇️
pandas/core/algorithms.py 94.17% <0%> (-0.01%) ⬇️
pandas/core/resample.py 96.43% <0%> (ø) ⬆️
pandas/core/series.py 93.85% <0%> (ø) ⬆️
pandas/core/strings.py 98.32% <0%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.72% <0%> (ø) ⬆️
pandas/errors/__init__.py 92.3% <0%> (ø) ⬆️
pandas/core/groupby.py 92.14% <0%> (ø) ⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update df2e361...a23a8e9.

@akosel
Contributor Author

akosel commented Mar 13, 2018

@jorisvandenbossche Thank you for the recommendation on using subsections within the examples--I think that's a great idea. Please let me know if you think there is anything else I should add or modify.

Contributor

@jreback jreback left a comment

Just a few comments. Looks pretty good! @jorisvandenbossche @TomAugspurger @datapythonista

c2 6
Name: r1, dtype: int64

List with a single label. Note using ``[[]]`` returns a DataFrame.
Contributor

this is slightly redundant as you are showing the example with a list below


Boolean list with the same length as the row axis

>>> df.loc[[False, False, True]]
Contributor

this is not a very common thing to do (directly), the boolean indexing right below is MUCH more important.


Boolean list

>>> df.loc[[True, False, True, False, True, True]]
Contributor

same as above (remove this example)

Raises
------
KeyError:
when items are not found
Contributor

when any items are not found
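
As a hedged illustration of the Raises entry under discussion, the sketch below looks up a single label that is absent from the index. The frame and the label `'falcon'` are assumptions for the example; how a list containing only some missing labels is handled has varied between pandas versions, so only the single-label case is shown.

```python
import pandas as pd

df = pd.DataFrame(
    {"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
    index=["cobra", "viper", "sidewinder"],
)

# 'falcon' is not a row label, so the lookup fails with KeyError.
try:
    df.loc["falcon"]
    raised = False
except KeyError:
    raised = True

print(raised)
```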

@jreback jreback added this to the 0.23.0 milestone Mar 13, 2018
Member

@datapythonista datapythonista left a comment

Great job on a quite difficult class to document. Added a couple of comments with ideas.

--------
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access group of rows and columns by integer position(s)
Series.loc : Access group of values using labels
Member

I'd add DataFrame.xs too.

**Getting values**

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
... index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2'])
Member

I think in this case it makes more sense to use a dataframe with more realistic-looking data. It's just an opinion, but I'd understand .loc['falcon', 'max_speed'] more easily and quickly than .loc['r1', 'c2']. I'd also use just 2 columns; I think that should be enough, and it makes things simpler.

Contributor Author

I agree. I have added some more meaningful labels. Let me know if you like them or have any other feedback on this matter.

@@ -1426,14 +1427,231 @@ class _LocIndexer(_LocationIndexer):
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'`` (note that contrary
to usual python slices, **both** the start and the stop are included!).
Member

Maybe we could use the .. warning:: directive for this comment (instead of a note in brackets ending with an exclamation mark).

@jreback
Contributor

jreback commented Mar 14, 2018

lgtm. @TomAugspurger @jorisvandenbossche

Member

@datapythonista datapythonista left a comment

lgtm, really clear and useful examples.

@TomAugspurger
Contributor

This looks really good.

Question: do we want to mention / document indexing with duplicates? Having duplicates in the index is such a mess that I'm fine with avoiding it, but curious what others think.
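
For context on why duplicates are awkward, here is a minimal sketch, assuming a hypothetical frame with a repeated index label (not taken from the PR). It shows only that the return type of a single-label lookup depends on whether the label is repeated.

```python
import pandas as pd

# Hypothetical frame with a repeated index label.
df = pd.DataFrame({"max_speed": [1, 4, 7]}, index=["cobra", "cobra", "viper"])

# A label that occurs once gives the row back as a Series.
unique_hit = df.loc["viper"]

# A label that occurs twice gives a DataFrame holding both rows.
duplicate_hit = df.loc["cobra"]

print(type(unique_hit).__name__, type(duplicate_hit).__name__)
```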

@jorisvandenbossche
Member

> Question: do we want to mention / document indexing with duplicates? Having duplicates in the index is such a mess that I'm fine with avoiding it, but curious what others think

I think we need to discuss a bit to what extent the docstring of a method should be the full reference, or rather cover the basic usage, with the full reference and its many examples living in the reference guide (as it is now). But that's a more general discussion, as we would then also need a better way to structure the docstrings with subsections, I think.

So given that, I would leave this docstring as it is now (keeping it to basic usage). If we start discussing duplicates as well, it will get much longer.

@TomAugspurger
Contributor

👍

Thanks all!

@TomAugspurger TomAugspurger merged commit 28eb190 into pandas-dev:master Mar 14, 2018
@akosel
Contributor Author

akosel commented Mar 14, 2018

Thank you all once again!
