DOC: update the DataFrame.loc[] docstring #20229

akosel · 2018-03-10T20:08:39Z

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

PR title is "DOC: update the docstring"
The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
The html version looks good: python doc/make.py --single <your-function-or-method>
It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

# paste output of "scripts/validate_docstrings.py <your-function-or-method>" here
# between the "```" (remove this comment, but keep the "```")
################################################################################
####################### Docstring (pandas.DataFrame.loc) #######################
################################################################################

Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

.. warning:: Note that contrary to usual python slices, **both** the start
    and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access group of rows and columns by integer position(s)
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the
    Series/DataFrame.
Series.loc : Access group of values using labels

Examples
--------
**Getting values**

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for row and column

>>> df.loc['cobra', 'shield']
2

Slice with labels for row and single label for column. As mentioned
above, note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis

>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series

>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series with column labels specified

>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7

Callable that returns a boolean Series

>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8

**Setting values**

Set value for all items matching the list of labels

>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

Set value for an entire row

>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50

Set value for an entire column

>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50

Set value for rows matching a boolean condition

>>> df.loc[df['shield'] > 35] = 0
>>> df
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0

**Getting values on a DataFrame with an index that has integer labels**

Another example using integers for the index

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8

Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.

>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8

**Getting values with a MultiIndex**

A number of examples using a DataFrame with a MultiIndex

>>> tuples = [
...    ('cobra', 'mark i'), ('cobra', 'mark ii'),
...    ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...    ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...         [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Single label. Note this returns a DataFrame with a single index.

>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4

Single index tuple. Note this returns a Series.

>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64

Two labels, one for each level of the row index. Similar to passing in
a tuple, this returns a Series.

>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64

Single tuple. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4

Single tuple for the index with a single label for the column

>>> df.loc[('cobra', 'mark i'), 'shield']
2

Slice from index tuple to single label

>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Slice from index tuple to index tuple

>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1

Raises
------
KeyError:
    when any items are not found

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	No returns section found

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.

TomAugspurger · 2018-03-10T20:12:07Z

pandas/core/indexing.py


    ``.loc[]`` is primarily label based, but may also be used with a
-    boolean array.
+    boolean array. Note that if no row or column labels are specified


I don't think this sentence with "Note that..." is correct. I would remove it.

TomAugspurger · 2018-03-10T20:12:32Z

pandas/core/indexing.py

@@ -1426,14 +1429,53 @@ class _LocIndexer(_LocationIndexer):
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'`` (note that contrary
      to usual python slices, **both** the start and the stop are included!).
-    - A boolean array.
+    - A boolean array, e.g. [True, False, True].


specify that this should be the same length as the axis being sliced.

TomAugspurger · 2018-03-10T20:12:45Z

pandas/core/indexing.py

    - A ``callable`` function with one argument (the calling Series, DataFrame
      or Panel) and that returns valid output for indexing (one of the above)

    ``.loc`` will raise a ``KeyError`` when the items are not found.

    See more at :ref:`Selection by Label <indexing.label>`

+    See Also
+    --------
+    at : Selects a single value for a row/column label pair


Add the DataFrame. prefix to all of these.

TomAugspurger · 2018-03-10T20:14:26Z

pandas/core/indexing.py

+    r2  10  20  30
+    >>> df.loc[df['c1'] > 10, ['c0', 'c2']]
+        c0  c2
+    r2  10  30


Can you make a second example series or DataFrame where the index values are integers, but not 0-len(df)? And then show how .loc uses the labels and not the positions?

In that second example, could you also show a slice like df.loc[2:5] and show that it's closed on the right, so the label 5 is included?
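
A minimal sketch of the behaviour requested here, assuming a small frame whose integer index labels are 7, 8 and 9 (the frame and variable names are illustrative, not taken from the PR diff). It contrasts label-based `.loc` with position-based `.iloc` and shows that a label slice includes its stop value.

```python
import pandas as pd

# Illustrative frame: integer index labels that do not start at 0.
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[7, 8, 9], columns=['max_speed', 'shield'])

# .loc treats 7 as a label, so this selects the first row.
by_label = df.loc[7]

# .iloc treats 7 as a position, which does not exist in a 3-row frame.
try:
    df.iloc[7]
    position_7_exists = True
except IndexError:
    position_7_exists = False

# A label slice includes both endpoints; a positional slice excludes the stop.
label_slice = df.loc[7:8]
position_slice = df.iloc[0:1]

print(label_slice.index.tolist())     # [7, 8]
print(position_slice.index.tolist())  # [7]
```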

TomAugspurger · 2018-03-10T20:15:01Z

pandas/core/indexing.py

+    r0    12
+    r1     0
+    Name: c0, dtype: int64
+    >>> df.loc[[False, False, True]]


Would be nice to have small bits of text breaking these up. Like "Indexing with a boolean array."

jreback · 2018-03-10T21:07:31Z

pandas/core/indexing.py

+    r0  12   2   3
+    r1   0   4   1
+    r2  10  20  30
+    >>> df.loc['r1']


blank lines in between cases

akosel · 2018-03-10T22:30:20Z

Thank you for the comments! I added commentary on each of the different examples. I also added another example df with integers set for the index labels. Please let me know if you have any other changes you'd like to see.

jorisvandenbossche · 2018-03-10T22:57:04Z

pandas/core/indexing.py

@@ -1413,7 +1413,8 @@ def _get_slice_axis(self, slice_obj, axis=None):


 class _LocIndexer(_LocationIndexer):
-    """Purely label-location based indexer for selection by label.
+    """
+    Selects a group of rows and columns by label(s) or a boolean array.


"Select" is not fully correct as it is not only for getting, but also for setting?
Or is that general enough? (@jreback @TomAugspurger ) To set you of course also need to select the location to set ..
The "see also" is using "access" now instead of "select"

Ah yes, I had meant to switch to using access all around, although I'm not sure if that gets around the problem you mentioned. Perhaps, it's enough to mention you can use loc to get and set in the extended summary? I'm struggling to think of a good word that implies both getting and setting...but I will keep thinking on it.

And I can also add some examples of using loc for setting values below.

yes, that would be good

jorisvandenbossche · 2018-03-10T22:57:45Z

pandas/core/indexing.py

+    c2    1
+    Name: r1, dtype: int64
+
+


only a single blank line (below as well)

jreback · 2018-03-11T14:25:23Z

pandas/core/indexing.py

    See more at :ref:`Selection by Label <indexing.label>`

+    See Also
+    --------
+    DateFrame.at


add Series.loc

Added below this

jreback · 2018-03-11T14:25:50Z

pandas/core/indexing.py

+    r1   0   4   1
+    r2  10  20  30
+
+    Single label for row (note it would be faster to use ``DateFrame.at`` in


don't need the note about perf

you can mention that this returns a Series and using [[]] returns a DataFrame (or maybe just show that as an example right after)
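
The difference in return type can be checked directly. This is only a sketch, using the snake-themed frame that the docstring eventually adopted; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

row = df.loc['viper']      # single label: the row comes back as a Series
frame = df.loc[['viper']]  # list with one label: a one-row DataFrame

print(type(row).__name__)    # Series
print(type(frame).__name__)  # DataFrame
print(frame.shape)           # (1, 2)
```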

jreback · 2018-03-11T14:26:03Z

pandas/core/indexing.py

+    c2    1
+    Name: r1, dtype: int64
+
+    Single label for row and column (note it would be faster to use


same, remove the commentary

jreback · 2018-03-11T14:27:17Z

pandas/core/indexing.py

+    >>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]],
+    ...      index=[7, 8, 9], columns=['c0', 'c1', 'c2'])
+    >>> df
+        c0  c1  c2


nice examples! can you add one using a MultiIndex for the index, and show selecting with tuples. sure this is getting long, but these examples are useful.

jreback · 2018-03-11T14:29:25Z

pandas/core/indexing.py

+    --------
+    DateFrame.at
+        Access a single value for a row/column label pair
+    DateFrame.iat


you can remove .iat from here (leave the .iloc though)

akosel · 2018-03-11T23:14:20Z

Thank you again for the comments. I added a number of other examples with DataFrame that has a MultiIndex. Let me know if it was overkill, or if I missed an important use case (or included one that is not very useful).

jorisvandenbossche · 2018-03-12T13:21:59Z

pandas/core/indexing.py

+    DateFrame.at
+        Access a single value for a row/column label pair
+    DateFrame.iloc
+        Access group of rows and columns by integer position(s)


Same comment here as on the other PR:

DateFrame.iloc : explanation ..

jorisvandenbossche · 2018-03-12T13:22:35Z

pandas/core/indexing.py

+    r1   0   4   1
+    r2  10  20  30
+
+    Single label. Note this returns a Series.


Maybe clarify "Note this returns a Series." even further as "Note this returns the row as a Series."

jorisvandenbossche

Some great examples you added! Given that there are quite a few, can you add some subsection titles? (You can use bold for that (**Sub section**), as done in the docstring guide.)
For example subsection for getting, setting, MultiIndex (you can see a bit how it can be divided, it is meant to give the long Examples section a bit more structure)

jorisvandenbossche · 2018-03-12T13:24:01Z

pandas/core/indexing.py

+
+    Examples
+    --------
+    >>> df = pd.DataFrame([[12, 2, 3], [0, 4, 1], [10, 20, 30]],


I would maybe number the values consecutively, so in the output it is easier to see which row was returned

jorisvandenbossche · 2018-03-12T13:25:42Z

pandas/core/indexing.py

+
+    Callable that returns valid output for indexing
+
+    >>> df.loc[df['c1'] > 10]


This is actually not a callable but a boolean Series, same for the example below. I think this is a nice example to keep though, but I would explain it a bit differently (frame it as a boolean Series that is calculated from the frame itself).
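
To make the distinction concrete, here is a small sketch (the frame and variable names are illustrative). The first form passes `.loc` a boolean Series that was already computed; the second passes a function that `.loc` calls with the frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# Boolean Series calculated from the frame itself, then used as the indexer.
mask = df['shield'] > 6
via_mask = df.loc[mask]

# Callable: .loc calls it with the frame and indexes with the returned value.
via_callable = df.loc[lambda frame: frame['shield'] > 6]

print(via_mask.equals(via_callable))  # True
```

The callable form is mostly useful in method chains, where the intermediate frame has no name to build a mask from.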

codecov · 2018-03-13T04:54:14Z

Codecov Report

Merging #20229 into master will increase coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20229      +/-   ##
==========================================
+ Coverage    91.7%   91.76%   +0.05%     
==========================================
  Files         150      150              
  Lines       49168    49151      -17     
==========================================
+ Hits        45090    45102      +12     
+ Misses       4078     4049      -29

Flag	Coverage Δ
#multiple	`90.14% <ø> (+0.05%)`	⬆️
#single	`41.9% <ø> (+0.03%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexing.py	`93.02% <ø> (ø)`	⬆️
pandas/core/window.py	`96.3% <0%> (-0.03%)`	⬇️
pandas/core/algorithms.py	`94.17% <0%> (-0.01%)`	⬇️
pandas/core/resample.py	`96.43% <0%> (ø)`	⬆️
pandas/core/series.py	`93.85% <0%> (ø)`	⬆️
pandas/core/strings.py	`98.32% <0%> (ø)`	⬆️
pandas/core/indexes/datetimelike.py	`96.72% <0%> (ø)`	⬆️
pandas/errors/__init__.py	`92.3% <0%> (ø)`	⬆️
pandas/core/groupby.py	`92.14% <0%> (ø)`	⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update df2e361...a23a8e9.

akosel · 2018-03-13T04:55:40Z

@jorisvandenbossche Thank you for the recommendation on using subsections within the examples--I think that's a great idea. Please let me know if you think there is anything else I should add or modify.

jreback

just a few comments. looks pretty good! @jorisvandenbossche @TomAugspurger @datapythonista

jreback · 2018-03-13T10:25:47Z

pandas/core/indexing.py

+    c2    6
+    Name: r1, dtype: int64
+
+    List with a single label. Note using ``[[]]`` returns a DataFrame.


this is slightly redundant as you are showing the example with a list below

jreback · 2018-03-13T10:26:19Z

pandas/core/indexing.py

+
+    Boolean list with the same length as the row axis
+
+    >>> df.loc[[False, False, True]]


this is not a very common thing to do (directly), the boolean indexing right below is MUCH more important.

jreback · 2018-03-13T10:27:35Z

pandas/core/indexing.py

+
+    Boolean list
+
+    >>> df.loc[[True, False, True, False, True, True]]


same as above (remove this example)

jreback · 2018-03-13T10:27:47Z

pandas/core/indexing.py

+    Raises
+    ------
+    KeyError:
+        when items are not found


when any items are not found
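
A sketch of the single-label case, assuming a label (`'falcon'`, chosen for illustration) that is not in the index. How a list containing only some missing labels is handled has differed between pandas versions, so this example sticks to one missing label.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# 'falcon' is not an index label, so the lookup raises KeyError.
try:
    df.loc['falcon']
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```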

datapythonista

Great job on a quite difficult class to document. Added a couple of comments with ideas.

datapythonista · 2018-03-13T14:53:21Z

pandas/core/indexing.py

+    --------
+    DateFrame.at : Access a single value for a row/column label pair
+    DateFrame.iloc : Access group of rows and columns by integer position(s)
+    Series.loc : Access group of values using labels


I'd add DataFrame.xs too.
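
For context on why `DataFrame.xs` is a natural cross-reference, here is a sketch on an illustrative MultiIndex frame. Selecting on the outer level gives the same result with either accessor, while `xs` with `level=` reaches an inner level directly.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('cobra', 'mark i'), ('cobra', 'mark ii'),
    ('viper', 'mark ii'), ('viper', 'mark iii'),
])
df = pd.DataFrame([[12, 2], [0, 4], [7, 1], [16, 36]],
                  index=index, columns=['max_speed', 'shield'])

# Outer level: .loc and .xs return the same two rows.
same = df.loc['cobra'].equals(df.xs('cobra'))

# Inner level: xs selects every row whose second-level label is 'mark ii'.
mark_ii = df.xs('mark ii', level=1)

print(same)                    # True
print(mark_ii.index.tolist())  # ['cobra', 'viper']
```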

datapythonista · 2018-03-13T14:57:20Z

pandas/core/indexing.py

+    **Getting values**
+
+    >>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
+    ...      index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2'])


I think in this case it makes more sense to use a DataFrame with more realistic-looking data. It's just an opinion, but I'd understand .loc['falcon', 'max_speed'] more easily and quickly than .loc['r1', 'c2']. I'd also use just 2 columns; I think that should be enough and it makes things simpler.

I agree. I have added some more meaningful labels. Let me know if you like it or have any other feedback on this matter.

datapythonista · 2018-03-13T15:04:13Z

pandas/core/indexing.py

@@ -1426,14 +1427,231 @@ class _LocIndexer(_LocationIndexer):
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'`` (note that contrary
      to usual python slices, **both** the start and the stop are included!).


Maybe we could use the .. warning:: directive for this comment (instead of a note in brackets ending with an exclamation mark).

jreback · 2018-03-14T11:01:22Z

lgtm. @TomAugspurger @jorisvandenbossche

datapythonista

lgtm, really clear and useful examples.

TomAugspurger · 2018-03-14T12:49:01Z

This looks really good.

Question: do we want to mention / document indexing with duplicates? Having duplicates in the index is such a mess that I'm fine with avoiding it, but curious what others think.
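
One concrete way duplicates complicate things, sketched on an illustrative frame in which the label `'viper'` appears twice: the return type of a single-label lookup depends on how many rows match.

```python
import pandas as pd

# Illustrative frame with a repeated index label.
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'viper'],
                  columns=['max_speed', 'shield'])

unique_label = df.loc['cobra']    # one matching row: Series
repeated_label = df.loc['viper']  # two matching rows: DataFrame

print(type(unique_label).__name__)    # Series
print(type(repeated_label).__name__)  # DataFrame
print(len(repeated_label))            # 2
```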

jorisvandenbossche · 2018-03-14T12:58:32Z

Question: do we want to mention / document indexing with duplicates? Having duplicates in the index is such a mess that I'm fine with avoiding it, but curious what others think

I think we need to discuss a bit to what extent the docstring of methods should be the full reference, or more the basic usage and having the full reference with the many examples in the reference guide (as it is now). But that's a more general discussion, as then we should also need a better way to structure the docstrings with subsections I think.

So given that, I would leave this docstring as it is for now (keeping it to basic usage). If we start discussing duplicates as well, it will get much longer.

TomAugspurger · 2018-03-14T13:00:32Z

👍

Thanks all!

akosel · 2018-03-14T13:30:57Z

Thank you all once again!

DOC: update the DataFrame.loc[] docstring (2f359b9)

TomAugspurger added the Docs and Indexing ("Related to indexing on series/frames, not to indexes themselves") labels Mar 10, 2018

TomAugspurger requested changes Mar 10, 2018

jreback requested changes Mar 10, 2018

More labels for examples. More examples. Update wording based on PR comments (1a93d2a)

jorisvandenbossche reviewed Mar 10, 2018

akosel added 2 commits March 10, 2018 17:08

Remove lines and be consistent with wording (a3238d9)

Add examples of setting values with loc (78f342c)

jreback requested changes Mar 11, 2018

jreback reviewed Mar 11, 2018

Update See Also. Add more examples and specifies return type (c28a796)

akosel force-pushed the docstring_loc branch from 2e32ba3 to c28a796 March 11, 2018 23:12

jorisvandenbossche reviewed Mar 12, 2018

Better labeling of subsections within docs (64c698b)

jreback requested changes Mar 13, 2018

jreback added this to the 0.23.0 milestone Mar 13, 2018

datapythonista reviewed Mar 13, 2018

Update based on feedback (0902b36)

jreback approved these changes Mar 14, 2018

datapythonista approved these changes Mar 14, 2018

rst formatting (a23a8e9)

jorisvandenbossche approved these changes Mar 14, 2018

TomAugspurger merged commit 28eb190 into pandas-dev:master Mar 14, 2018

