DOC: Harmonize column selection to bracket notation #27562

Merged · 8 commits · Aug 26, 2019
Changes from 6 commits
4 changes: 2 additions & 2 deletions doc/source/getting_started/10min.rst
@@ -170,7 +170,7 @@ Getting
~~~~~~~

Selecting a single column, which yields a ``Series``,
equivalent to ``df.A``:
equivalent to ``df.A``:

.. ipython:: python

@@ -278,7 +278,7 @@ Using a single column's values to select data.

.. ipython:: python

df[df.A > 0]
df[df['A'] > 0]

Selecting values from a DataFrame where a boolean condition is met.

12 changes: 6 additions & 6 deletions doc/source/getting_started/basics.rst
@@ -926,7 +926,7 @@ Single aggregations on a ``Series`` this will return a scalar value:

.. ipython:: python

tsdf.A.agg('sum')
tsdf['A'].agg('sum')


Aggregating with multiple functions
@@ -950,13 +950,13 @@ On a ``Series``, multiple functions return a ``Series``, indexed by the function

.. ipython:: python

tsdf.A.agg(['sum', 'mean'])
tsdf['A'].agg(['sum', 'mean'])

Passing a ``lambda`` function will yield a ``<lambda>`` named row:

.. ipython:: python

tsdf.A.agg(['sum', lambda x: x.mean()])
tsdf['A'].agg(['sum', lambda x: x.mean()])

Passing a named function will yield that name for the row:

@@ -965,7 +965,7 @@ Passing a named function will yield that name for the row:
def mymean(x):
return x.mean()

tsdf.A.agg(['sum', mymean])
tsdf['A'].agg(['sum', mymean])
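For reviewers wanting to confirm the bracket form behaves identically here, a minimal sketch (`tsdf` below is a hypothetical stand-in for the docs' frame):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the docs' `tsdf`.
tsdf = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=['A', 'B', 'C'])

# Aggregating a single column with a list of functions returns a
# Series indexed by the function names.
out = tsdf['A'].agg(['sum', 'mean'])
```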

Aggregating with a dict
+++++++++++++++++++++++
@@ -1065,7 +1065,7 @@ Passing a single function to ``.transform()`` with a ``Series`` will yield a sin

.. ipython:: python

tsdf.A.transform(np.abs)
tsdf['A'].transform(np.abs)


Transform with multiple functions
@@ -1084,7 +1084,7 @@ resulting column names will be the transforming functions.

.. ipython:: python

tsdf.A.transform([np.abs, lambda x: x + 1])
tsdf['A'].transform([np.abs, lambda x: x + 1])
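A minimal sketch of the multi-function ``transform`` behaviour referenced in this hunk (the ``Series`` here is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 2.0, -3.0], name='A')

# Transforming with a list of functions yields a DataFrame with one
# column per function, named after the transforming functions.
out = s.transform([np.abs, lambda x: x + 1])
```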


Transforming with a dict
8 changes: 4 additions & 4 deletions doc/source/getting_started/comparison/comparison_with_r.rst
@@ -81,7 +81,7 @@ R pandas
=========================================== ===========================================
``select(df, col_one = col1)`` ``df.rename(columns={'col1': 'col_one'})['col_one']``
``rename(df, col_one = col1)`` ``df.rename(columns={'col1': 'col_one'})``
``mutate(df, c=a-b)`` ``df.assign(c=df.a-df.b)``
``mutate(df, c=a-b)`` ``df.assign(c=df['a']-df['b'])``
=========================================== ===========================================
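The ``mutate`` row above can be checked with a short sketch (frame values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 5], 'b': [1, 2]})

# assign() returns a NEW frame with the extra column;
# the original frame is left unchanged, mirroring dplyr's mutate.
out = df.assign(c=df['a'] - df['b'])
```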


@@ -258,8 +258,8 @@ index/slice as well as standard boolean indexing:

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
df.query('a <= b')
df[df.a <= df.b]
df.loc[df.a <= df.b]
df[df['a'] <= df['b']]
df.loc[df['a'] <= df['b']]

For more details and examples see :ref:`the query documentation
<indexing.query>`.
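The three spellings in this hunk (``query``, plain bracket masking, and ``.loc``) select the same rows, which a sketch like the following can confirm (random data, seeded for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(10),
                   'b': rng.standard_normal(10)})

via_query = df.query('a <= b')
via_mask = df[df['a'] <= df['b']]
via_loc = df.loc[df['a'] <= df['b']]
```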
@@ -284,7 +284,7 @@ In ``pandas`` the equivalent expression, using the

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
df.eval('a + b')
df.a + df.b # same as the previous expression
df['a'] + df['b'] # same as the previous expression

In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than
evaluation in pure Python. For more details and examples see :ref:`the eval
2 changes: 1 addition & 1 deletion doc/source/user_guide/advanced.rst
@@ -738,7 +738,7 @@ and allows efficient indexing and storage of an index with a large number of dup
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
df['B'].cat.categories
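A self-contained sketch of the categorical accessor touched in this hunk (the frame is illustrative; note the category order comes from the dtype, not from the data):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'B': list('abca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

# .cat.categories reflects the order given to CategoricalDtype,
# not the order of appearance in the column.
cats = df['B'].cat.categories
```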

Setting the index will create a ``CategoricalIndex``.

6 changes: 3 additions & 3 deletions doc/source/user_guide/cookbook.rst
@@ -592,8 +592,8 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
.. ipython:: python

df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
df.A.groupby((df.A != df.A.shift()).cumsum()).groups
df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
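The shift/cumsum idiom in this hunk is worth unpacking: comparing each value to its predecessor marks the start of every run, and ``cumsum()`` turns those marks into distinct run labels. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

# True wherever a value differs from its predecessor, i.e. a run starts;
# cumsum() then assigns each consecutive run a distinct integer label.
run_id = (df['A'] != df['A'].shift()).cumsum()

# Cumulative sum restarts inside each run.
within_run = df['A'].groupby(run_id).cumsum()
```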

Expanding data
**************
@@ -719,7 +719,7 @@ Rolling Apply to multiple columns where function calculates a Series before a Sc
df

def gm(df, const):
v = ((((df.A + df.B) + 1).cumprod()) - 1) * const
v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
return v.iloc[-1]

s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
12 changes: 6 additions & 6 deletions doc/source/user_guide/enhancingperf.rst
@@ -393,15 +393,15 @@ Consider the following toy example of doubling each observation:
.. code-block:: ipython

# Custom function without numba
In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
In [5]: %timeit df['col1_doubled'] = df['a'].apply(double_every_value_nonumba) # noqa E501
1000 loops, best of 3: 797 us per loop

# Standard implementation (faster than a custom function)
In [6]: %timeit df['col1_doubled'] = df.a * 2
In [6]: %timeit df['col1_doubled'] = df['a'] * 2
1000 loops, best of 3: 233 us per loop

# Custom function with numba
In [7]: %timeit (df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
In [7]: %timeit df['col1_doubled'] = double_every_value_withnumba(df['a'].to_numpy())
1000 loops, best of 3: 145 us per loop
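Setting timing aside, the apply and vectorized variants in this hunk produce identical results, which a minimal sketch can verify (the frame and the function body here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0]})

def double_every_value_nonumba(x):
    # Illustrative per-element function, as in the docs example.
    return x * 2

# Element-wise apply incurs per-element Python overhead;
# the vectorized form dispatches to a single array operation.
via_apply = df['a'].apply(double_every_value_nonumba)
via_vector = df['a'] * 2
```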

Caveats
@@ -643,8 +643,8 @@ The equivalent in standard Python would be
.. ipython:: python

df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
df['c'] = df.a + df.b
df['d'] = df.a + df.b + df.c
df['c'] = df['a'] + df['b']
df['d'] = df['a'] + df['b'] + df['c']
df['a'] = 1
df

@@ -688,7 +688,7 @@ name in an expression.

a = np.random.randn()
df.query('@a < a')
df.loc[a < df.a] # same as the previous expression
df.loc[a < df['a']] # same as the previous expression
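The ``@`` prefix matters here because the local variable and the column share a name. A sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.5, 0.9]})
a = 0.4  # local variable shadowing the column name 'a'

# Inside query(), '@a' refers to the local variable and bare 'a'
# to the column; outside query() the bracket form disambiguates.
via_query = df.query('@a < a')
via_loc = df.loc[a < df['a']]
```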

With :func:`pandas.eval` you cannot use the ``@`` prefix *at all*, because it
isn't defined in that context. ``pandas`` will let you know this if you try to
39 changes: 21 additions & 18 deletions doc/source/user_guide/indexing.rst
@@ -210,7 +210,7 @@ as an attribute:
See `here for an explanation of valid identifiers
<https://docs.python.org/3/reference/lexical_analysis.html#identifiers>`__.

- The attribute will not be available if it conflicts with an existing method name, e.g. ``s.min`` is not allowed.
- The attribute will not be available if it conflicts with an existing method name, e.g. ``s.min`` is not allowed, but ``s['min']`` is possible.

- Similarly, the attribute will not be available if it conflicts with any of the following list: ``index``,
``major_axis``, ``minor_axis``, ``items``.
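The method-name conflict described in the list above can be demonstrated in a couple of lines (the ``Series`` is illustrative):

```python
import pandas as pd

s = pd.Series([3, 1], index=['min', 'x'])

# s.min resolves to the Series method, so attribute access cannot
# reach the 'min' label; bracket notation always can.
assert callable(s.min)
assert s['min'] == 3
```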
@@ -540,7 +540,7 @@ The ``callable`` must be a function with one argument (the calling Series or Dat
columns=list('ABCD'))
df1

df1.loc[lambda df: df.A > 0, :]
df1.loc[lambda df: df['A'] > 0, :]
df1.loc[:, lambda df: ['A', 'B']]

df1.iloc[:, lambda df: [0, 1]]
@@ -552,7 +552,7 @@ You can use callable indexing in ``Series``.

.. ipython:: python

df1.A.loc[lambda s: s > 0]
df1['A'].loc[lambda s: s > 0]

Using these methods / indexers, you can chain data selection operations
without using a temporary variable.
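A minimal sketch of such a chain, using an illustrative frame rather than the baseball CSV from the docs:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, 2, 3], 'r': [50, 150, 200]})

# Each lambda receives the frame produced by the previous step,
# so no temporary variable is needed.
out = (df.loc[lambda d: d['A'] > 0]
         .loc[lambda d: d['r'] > 100])
```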
@@ -561,7 +561,7 @@ without using a temporary variable.

bb = pd.read_csv('data/baseball.csv', index_col='id')
(bb.groupby(['year', 'team']).sum()
.loc[lambda df: df.r > 100])
.loc[lambda df: df['r'] > 100])

.. _indexing.deprecate_ix:

@@ -871,9 +871,9 @@ Boolean indexing
Another common operation is the use of boolean vectors to filter the data.
The operators are: ``|`` for ``or``, ``&`` for ``and``, and ``~`` for ``not``.
These **must** be grouped by using parentheses, since by default Python will
evaluate an expression such as ``df.A > 2 & df.B < 3`` as
``df.A > (2 & df.B) < 3``, while the desired evaluation order is
``(df.A > 2) & (df.B < 3)``.
evaluate an expression such as ``df['A'] > 2 & df['B'] < 3`` as
``df['A'] > (2 & df['B']) < 3``, while the desired evaluation order is
``(df['A'] > 2) & (df['B'] < 3)``.
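The correctly parenthesized form can be sketched as follows (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 2, 4]})

# Parentheses force the comparisons to run before '&';
# without them Python would evaluate '2 & df["B"]' first.
mask = (df['A'] > 2) & (df['B'] < 3)
```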

Using a boolean vector to index a Series works exactly as in a NumPy ndarray:

@@ -1134,7 +1134,7 @@ between the values of columns ``a`` and ``c``. For example:
df

# pure python
df[(df.a < df.b) & (df.b < df.c)]
df[(df['a'] < df['b']) & (df['b'] < df['c'])]

# query
df.query('(a < b) & (b < c)')
@@ -1241,7 +1241,7 @@ Full numpy-like syntax:
df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
df
df.query('(a < b) & (b < c)')
df[(df.a < df.b) & (df.b < df.c)]
df[(df['a'] < df['b']) & (df['b'] < df['c'])]

Slightly nicer by removing the parentheses (comparison operators bind
tighter than ``&`` and ``|``).
@@ -1279,12 +1279,12 @@ The ``in`` and ``not in`` operators
df.query('a in b')

# How you'd do it in pure Python
df[df.a.isin(df.b)]
df[df['a'].isin(df['b'])]

df.query('a not in b')

# pure Python
df[~df.a.isin(df.b)]
df[~df['a'].isin(df['b'])]


You can combine this with other expressions for very succinct queries:
@@ -1297,7 +1297,7 @@ You can combine this with other expressions for very succinct queries:
df.query('a in b and c < d')

# pure Python
df[df.b.isin(df.a) & (df.c < df.d)]
df[df['b'].isin(df['a']) & (df['c'] < df['d'])]


.. note::
@@ -1326,7 +1326,7 @@ to ``in``/``not in``.
df.query('b == ["a", "b", "c"]')

# pure Python
df[df.b.isin(["a", "b", "c"])]
df[df['b'].isin(["a", "b", "c"])]

df.query('c == [1, 2]')

@@ -1338,7 +1338,7 @@ to ``in``/``not in``.
df.query('[1, 2] not in c')

# pure Python
df[df.c.isin([1, 2])]
df[df['c'].isin([1, 2])]


Boolean operators
@@ -1352,7 +1352,7 @@ You can negate boolean expressions with the word ``not`` or the ``~`` operator.
df['bools'] = np.random.rand(len(df)) > 0.5
df.query('~bools')
df.query('not bools')
df.query('not bools') == df[~df.bools]
df.query('not bools') == df[~df['bools']]

Of course, expressions can be arbitrarily complex too:

@@ -1362,7 +1362,10 @@ Of course, expressions can be arbitrarily complex too:
shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
longer = df[(df['a'] < df['b'])
& (df['b'] < df['c'])
& (~df['bools'])
| (df['bools'] > 2)]

shorter
longer
@@ -1835,14 +1838,14 @@ chained indexing expression, you can set the :ref:`option <options>`

# This will show the SettingWithCopyWarning
# but the frame values will be set
dfb['c'][dfb.a.str.startswith('o')] = 42
dfb['c'][dfb['a'].str.startswith('o')] = 42

This however is operating on a copy and will not work.

::

>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
>>> dfb[dfb['a'].str.startswith('o')]['c'] = 42
Traceback (most recent call last)
...
SettingWithCopyWarning:
10 changes: 5 additions & 5 deletions doc/source/user_guide/reshaping.rst
@@ -469,7 +469,7 @@ If ``crosstab`` receives only two Series, it will provide a frequency table.
'C': [1, 1, np.nan, 1, 1]})
df

pd.crosstab(df.A, df.B)
pd.crosstab(df['A'], df['B'])
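A minimal frequency-table sketch for the bracket form above (illustrative data, not the docs' frame):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['u', 'v', 'u']})

# With two Series, crosstab counts co-occurrences:
# rows are the unique values of A, columns those of B.
freq = pd.crosstab(df['A'], df['B'])
```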

Any input passed containing ``Categorical`` data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
@@ -489,21 +489,21 @@ using the ``normalize`` argument:

.. ipython:: python

pd.crosstab(df.A, df.B, normalize=True)
pd.crosstab(df['A'], df['B'], normalize=True)

``normalize`` can also normalize values within each row or within each column:

.. ipython:: python

pd.crosstab(df.A, df.B, normalize='columns')
pd.crosstab(df['A'], df['B'], normalize='columns')

``crosstab`` can also be passed a third ``Series`` and an aggregation function
(``aggfunc``) that will be applied to the values of the third ``Series`` within
each group defined by the first two ``Series``:

.. ipython:: python

pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum)

Adding margins
~~~~~~~~~~~~~~
@@ -512,7 +512,7 @@ Finally, one can also add margins or normalize this output.

.. ipython:: python

pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum, normalize=True,
margins=True)

.. _reshaping.tile: