Commit 1d0eceb

rhshadrach authored and yehoshuadimarsky committed

DOC: Fix deprecation warnings in docs for groupby nuisance columns (pandas-dev#47065)

1 parent: bd5cb28

File tree: 8 files changed, +35 -24 lines

doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst (+2 -2)

@@ -154,11 +154,11 @@ The apply and combine steps are typically done together in pandas.

 In the previous example, we explicitly selected the 2 columns first. If
 not, the ``mean`` method is applied to each column containing numerical
-columns:
+columns by passing ``numeric_only=True``:

 .. ipython:: python

-    titanic.groupby("Sex").mean()
+    titanic.groupby("Sex").mean(numeric_only=True)

 It does not make much sense to get the average value of the ``Pclass``.
 If we are only interested in the average age for each gender, the
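The ``numeric_only=True`` pattern in the change above can be sketched on a small hypothetical frame (the tutorial itself uses the Titanic dataset, which is not reproduced here):

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's Titanic data.
df = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male"],
        "Age": [29.0, 35.0, 58.0, 40.0],
        "Name": ["A", "B", "C", "D"],  # non-numeric "nuisance" column
    }
)

# numeric_only=True restricts the aggregation to numeric columns, so the
# string column "Name" is excluded instead of triggering the deprecated
# silent drop.
means = df.groupby("Sex").mean(numeric_only=True)
print(means)
```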

doc/source/user_guide/10min.rst (+1 -1)

@@ -532,7 +532,7 @@ groups:

 .. ipython:: python

-    df.groupby("A").sum()
+    df.groupby("A")[["C", "D"]].sum()

 Grouping by multiple columns forms a hierarchical index, and again we can
 apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
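Selecting the columns of interest before aggregating, as in the change above, can be sketched like this (the frame below is a hypothetical stand-in for the guide's ``df``):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],  # non-numeric column we don't want summed
        "C": [1, 2, 3, 4],
        "D": [10, 20, 30, 40],
    }
)

# Selecting ["C", "D"] up front makes the intent explicit and avoids the
# deprecated behavior of silently dropping non-numeric columns.
result = df.groupby("A")[["C", "D"]].sum()
print(result)
```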

doc/source/user_guide/groupby.rst (+16 -10)

@@ -477,7 +477,7 @@ An obvious one is aggregation via the

 .. ipython:: python

     grouped = df.groupby("A")
-    grouped.aggregate(np.sum)
+    grouped[["C", "D"]].aggregate(np.sum)

     grouped = df.groupby(["A", "B"])
     grouped.aggregate(np.sum)

@@ -492,7 +492,7 @@ changed by using the ``as_index`` option:

     grouped = df.groupby(["A", "B"], as_index=False)
     grouped.aggregate(np.sum)

-    df.groupby("A", as_index=False).sum()
+    df.groupby("A", as_index=False)[["C", "D"]].sum()

 Note that you could use the ``reset_index`` DataFrame function to achieve the
 same result as the column names are stored in the resulting ``MultiIndex``:

@@ -730,7 +730,7 @@ optimized Cython implementations:

 .. ipython:: python

-    df.groupby("A").sum()
+    df.groupby("A")[["C", "D"]].sum()
     df.groupby(["A", "B"]).mean()

 Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above

@@ -1159,13 +1159,12 @@ Again consider the example DataFrame we've been looking at:

 Suppose we wish to compute the standard deviation grouped by the ``A``
 column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. If the passed
-aggregation function can't be applied to some columns, the troublesome columns
-will be (silently) dropped. Thus, this does not pose any problems:
+column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
+columns by specifying ``numeric_only=True``:

 .. ipython:: python

-    df.groupby("A").std()
+    df.groupby("A").std(numeric_only=True)

 Note that ``df.groupby('A').colname.std().`` is more efficient than
 ``df.groupby('A').std().colname``, so if the result of an aggregation function

@@ -1180,7 +1179,14 @@ is only interesting over one column (here ``colname``), it may be filtered

 If you do wish to include decimal or object columns in an aggregation with
 other non-nuisance data types, you must do so explicitly.

+.. warning::
+   The automatic dropping of nuisance columns has been deprecated and will be removed
+   in a future version of pandas. If columns are included that cannot be operated
+   on, pandas will instead raise an error. In order to avoid this, either select
+   the columns you wish to operate on or specify ``numeric_only=True``.
+
 .. ipython:: python
+   :okwarning:

     from decimal import Decimal

@@ -1304,7 +1310,7 @@ Groupby a specific column with the desired frequency. This is like resampling.

 .. ipython:: python

-    df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()

 You have an ambiguous specification in that you have a named index and a column
 that could be potential groupers.

@@ -1313,9 +1319,9 @@ that could be potential groupers.

     df = df.set_index("Date")
     df["Date"] = df.index + pd.offsets.MonthEnd(2)
-    df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()

-    df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"]).sum()
+    df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()

 Taking the first rows of each group
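The nuisance-column change above, where ``std`` gains ``numeric_only=True``, can be sketched on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with a string "nuisance" column B.
df = pd.DataFrame(
    {
        "A": ["g1", "g1", "g2", "g2"],
        "B": ["u", "v", "w", "x"],
        "C": [1.0, 3.0, 5.0, 9.0],
    }
)

# numeric_only=True excludes column B up front, so no deprecation
# warning about silently dropping nuisance columns is emitted.
stds = df.groupby("A").std(numeric_only=True)
print(stds)
```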

doc/source/user_guide/indexing.rst (+1 -1)

@@ -583,7 +583,7 @@ without using a temporary variable.

 .. ipython:: python

     bb = pd.read_csv('data/baseball.csv', index_col='id')
-    (bb.groupby(['year', 'team']).sum()
+    (bb.groupby(['year', 'team']).sum(numeric_only=True)
      .loc[lambda df: df['r'] > 100])
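The chained form above can be sketched without the baseball CSV (the frame below is a hypothetical stand-in for ``data/baseball.csv``):

```python
import pandas as pd

# Hypothetical stand-in for data/baseball.csv.
bb = pd.DataFrame(
    {
        "year": [2000, 2000, 2001],
        "team": ["NYA", "NYA", "BOS"],
        "player": ["a", "b", "c"],  # non-numeric column
        "r": [60, 55, 90],
    }
)

# sum(numeric_only=True) keeps the chain warning-free, and the lambda in
# .loc filters aggregated groups without a temporary variable.
out = (bb.groupby(["year", "team"]).sum(numeric_only=True)
         .loc[lambda df: df["r"] > 100])
print(out)
```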

doc/source/user_guide/reshaping.rst (+10 -5)

@@ -414,12 +414,11 @@ We can produce pivot tables from this data very easily:

 The result object is a :class:`DataFrame` having potentially hierarchical indexes on the
 rows and columns. If the ``values`` column name is not given, the pivot table
-will include all of the data that can be aggregated in an additional level of
-hierarchy in the columns:
+will include all of the data in an additional level of hierarchy in the columns:

 .. ipython:: python

-    pd.pivot_table(df, index=["A", "B"], columns=["C"])
+    pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])

 Also, you can use :class:`Grouper` for ``index`` and ``columns`` keywords. For detail of :class:`Grouper`, see :ref:`Grouping with a Grouper specification <groupby.specify>`.

@@ -432,7 +431,7 @@ calling :meth:`~DataFrame.to_string` if you wish:

 .. ipython:: python

-    table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
+    table = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])
     print(table.to_string(na_rep=""))

 Note that :meth:`~DataFrame.pivot_table` is also available as an instance method on DataFrame,

@@ -449,7 +448,13 @@ rows and columns:

 .. ipython:: python

-    table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+    table = df.pivot_table(
+        index=["A", "B"],
+        columns="C",
+        values=["D", "E"],
+        margins=True,
+        aggfunc=np.std
+    )
     table

 Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
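The explicit ``values=`` argument added above can be sketched on a hypothetical frame mirroring the guide's A/B/C/D/E layout:

```python
import pandas as pd

# Hypothetical frame mirroring the guide's A/B/C/D/E columns.
df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "B": ["a", "b", "a", "b"],
        "C": ["x", "x", "y", "y"],
        "D": [1.0, 2.0, 3.0, 4.0],
        "E": [5.0, 6.0, 7.0, 8.0],
    }
)

# Passing values= explicitly limits the table to the numeric columns we
# want, so pandas never has to silently drop anything (the deprecated
# behavior this commit works around).
table = pd.pivot_table(df, index=["A", "B"], columns="C", values=["D", "E"])
print(table)
```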

doc/source/user_guide/timeseries.rst (+2 -2)

@@ -1821,15 +1821,15 @@ to resample based on datetimelike column in the frame, it can passed to the

         ),
     )
     df
-    df.resample("M", on="date").sum()
+    df.resample("M", on="date")[["a"]].sum()

 Similarly, if you instead want to resample by a datetimelike
 level of ``MultiIndex``, its name or location can be passed to the
 ``level`` keyword.

 .. ipython:: python

-    df.resample("M", level="d").sum()
+    df.resample("M", level="d")[["a"]].sum()

 .. _timeseries.iterating-label:
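The column selection before ``resample(...).sum()`` in the change above can be sketched like this (the frame below is hypothetical, not the guide's ``df``):

```python
import pandas as pd

# Hypothetical frame with a datetimelike column and a non-numeric column.
df = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=4, freq="15D"),
        "a": [1, 2, 3, 4],
        "note": list("wxyz"),
    }
)

# Selecting ["a"] before summing keeps the non-numeric "note" column out
# of the monthly totals, matching the pattern used in the diff.
monthly = df.resample("M", on="date")[["a"]].sum()
print(monthly)
```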

doc/source/whatsnew/v0.18.1.rst (+1 -1)

@@ -166,7 +166,7 @@ without using temporary variable.

 .. ipython:: python

     bb = pd.read_csv("data/baseball.csv", index_col="id")
-    (bb.groupby(["year", "team"]).sum().loc[lambda df: df.r > 100])
+    (bb.groupby(["year", "team"]).sum(numeric_only=True).loc[lambda df: df.r > 100])

 .. _whatsnew_0181.partial_string_indexing:

doc/source/whatsnew/v0.19.0.rst (+2 -2)

@@ -497,8 +497,8 @@ Other enhancements

         ),
     )
     df
-    df.resample("M", on="date").sum()
-    df.resample("M", level="d").sum()
+    df.resample("M", on="date")[["a"]].sum()
+    df.resample("M", level="d")[["a"]].sum()

 - The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__. See the docs for more details (:issue:`13577`).
 - The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raising a ``NonExistentTimeError`` (:issue:`13057`)
