Commit dcb8b6a

VincentLa authored and jreback committed
DOC: Enhancing pivot / reshape docs (#21038)
1 parent bb32564 commit dcb8b6a

File tree

2 files changed

+151
-31
lines changed

doc/source/reshaping.rst

+104 -6
Original file line number | Diff line number | Diff line change
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
1717
Reshaping by pivoting DataFrame objects
1818
---------------------------------------
1919

20+
.. image:: _static/reshaping_pivot.png
21+
2022
.. ipython::
2123
:suppress:
2224

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects
3335

3436
In [3]: df = unpivot(tm.makeTimeDataFrame())
3537

36-
Data is often stored in CSV files or databases in so-called "stacked" or
37-
"record" format:
38+
Data is often stored in so-called "stacked" or "record" format:
3839

3940
.. ipython:: python
4041
@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:
6667
6768
df[df['variable'] == 'A']
6869
69-
.. image:: _static/reshaping_pivot.png
70-
7170
But suppose we wish to do time series operations with the variables. A better
7271
representation would be where the ``columns`` are the unique variables and an
7372
``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
8786
.. ipython:: python
8887
8988
df['value2'] = df['value'] * 2
90-
pivoted = df.pivot('date', 'variable')
89+
pivoted = df.pivot(index='date', columns='variable')
9190
pivoted
9291
9392
You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
9998
Note that this returns a view on the underlying data in the case where the data
10099
are homogeneously-typed.
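As a minimal sketch of the keyword-based ``pivot`` call and the subset selection described above, the following uses a small hand-made frame (the dates and values are illustrative assumptions, not the ``tm.makeTimeDataFrame()`` data used in the docs). With no ``values`` argument, the result has hierarchical columns, and indexing by the top-level name returns one block:

```python
import pandas as pd

# Hypothetical toy data in "stacked" format; not the data from the docs.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04", "2000-01-04"],
    "variable": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
df["value2"] = df["value"] * 2

# No ``values`` given, so both value columns are pivoted and the
# resulting columns form a two-level index: (value column, variable).
pivoted = df.pivot(index="date", columns="variable")

# Select the block belonging to one of the original value columns.
subset = pivoted["value2"]
print(subset)
```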
101100

101+
.. note::
102+
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
103+
entries, cannot reshape`` if the index/column pair is not unique. In this
104+
case, consider using :func:`~pandas.pivot_table` which is a generalization
105+
of pivot that can handle duplicate values for one index/column pair.
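The difference can be sketched with a small frame in which one index/column pair occurs twice. The data below is a made-up assumption for illustration; ``pivot`` should raise, while ``pivot_table`` aggregates the duplicates (here with the mean):

```python
import pandas as pd

# Hypothetical toy data: the ("2000-01-03", "A") pair appears twice.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04"],
    "variable": ["A", "A", "A"],
    "value": [1.0, 3.0, 5.0],
})

try:
    df.pivot(index="date", columns="variable", values="value")
    pivot_failed = False
except ValueError as exc:
    # pivot cannot decide which of the duplicate values to keep.
    pivot_failed = True
    print(exc)

# pivot_table resolves the duplicates by aggregating them.
table = df.pivot_table(
    index="date", columns="variable", values="value", aggfunc="mean")
print(table)
```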
106+
102107
.. _reshaping.stacking:
103108

104109
Reshaping by stacking and unstacking
@@ -704,10 +709,103 @@ handling of NaN:
704709
In [3]: np.unique(x, return_inverse=True)[::-1]
705710
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
706711
707-
708712
.. note::
709713
If you just want to handle one column as a categorical variable (like R's factor),
710714
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
711715
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
712716
see the :ref:`Categorical introduction <categorical>` and the
713717
:ref:`API documentation <api.categorical>`.
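A short sketch of the two spellings mentioned in the note, using a made-up ``col`` column (the column names and values are assumptions for illustration). Both produce a categorical column whose integer codes refer to the sorted unique values:

```python
import pandas as pd

# Hypothetical toy column of string labels.
df = pd.DataFrame({"col": ["b", "a", "b", "c"]})

# Two ways to obtain a categorical column from an existing one.
df["cat_col"] = df["col"].astype("category")
df["cat_col2"] = pd.Categorical(df["col"])

# The categories are the sorted unique labels; the codes index into them.
print(df["cat_col"].cat.categories)
print(df["cat_col"].cat.codes)
```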
718+
719+
Examples
720+
--------
721+
722+
In this section, we will review frequently asked questions and examples. The
723+
column names and relevant column values are named to correspond with how this
724+
DataFrame will be pivoted in the answers below.
725+
726+
.. ipython:: python
727+
728+
np.random.seed([3, 1415])
729+
n = 20
730+
731+
cols = np.array(['key', 'row', 'item', 'col'])
732+
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
733+
df.columns = cols
734+
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
735+
736+
df
737+
738+
Pivoting with Single Aggregations
739+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
740+
741+
Suppose we wanted to pivot ``df`` such that the ``col`` values are columns,
742+
``row`` values are the index, and the mean of ``val0`` supplies the values. In
743+
particular, the resulting DataFrame should look like:
744+
745+
.. code-block:: ipython
746+
747+
col col0 col1 col2 col3 col4
748+
row
749+
row0 0.77 0.605 NaN 0.860 0.65
750+
row2 0.13 NaN 0.395 0.500 0.25
751+
row3 NaN 0.310 NaN 0.545 NaN
752+
row4 NaN 0.100 0.395 0.760 0.24
753+
754+
This solution uses :func:`~pandas.pivot_table`. Also note that
755+
``aggfunc='mean'`` is the default. It is included here to be explicit.
756+
757+
.. ipython:: python
758+
759+
df.pivot_table(
760+
values='val0', index='row', columns='col', aggfunc='mean')
761+
762+
Note that we can also replace the missing values by using the ``fill_value``
763+
parameter.
764+
765+
.. ipython:: python
766+
767+
df.pivot_table(
768+
values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
769+
770+
We can pass in other aggregation functions as well. For example,
771+
we can also pass in ``sum``.
772+
773+
.. ipython:: python
774+
775+
df.pivot_table(
776+
values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
777+
778+
Another aggregation we can do is calculate the frequency with which the columns
779+
and rows occur together, a.k.a. "cross tabulation". To do this, we can pass
780+
``size`` to the ``aggfunc`` parameter.
781+
782+
.. ipython:: python
783+
784+
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
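For comparison, the same frequency table can be sketched with :func:`~pandas.crosstab`, which counts co-occurrences directly. The frame below is a small made-up stand-in for ``df`` (an assumption for illustration), chosen so that some row/column pairs never occur and ``fill_value=0`` has something to fill:

```python
import pandas as pd

# Hypothetical toy data; ("row2", "col1") and ("row3", "col0") never occur.
df = pd.DataFrame({
    "row": ["row0", "row0", "row2", "row2", "row3"],
    "col": ["col0", "col1", "col0", "col0", "col1"],
    "val0": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Group sizes per (row, col) pair, with absent pairs filled with 0.
sizes = df.pivot_table(index="row", columns="col", aggfunc="size", fill_value=0)

# crosstab computes the same frequency table from the two columns.
counts = pd.crosstab(df["row"], df["col"])

print(sizes)
print(counts)
```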
785+
786+
Pivoting with Multiple Aggregations
787+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
788+
789+
We can also perform multiple aggregations. For example, to perform both a
790+
``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
791+
792+
.. ipython:: python
793+
794+
df.pivot_table(
795+
values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
796+
797+
To aggregate over multiple value columns, we can pass in a list to the
798+
``values`` parameter.
799+
800+
.. ipython:: python
801+
802+
df.pivot_table(
803+
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
804+
805+
To subdivide over multiple columns, we can pass in a list to the
806+
``columns`` parameter.
807+
808+
.. ipython:: python
809+
810+
df.pivot_table(
811+
values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])

pandas/core/frame.py

+47 -25
Original file line number | Diff line number | Diff line change
@@ -5518,50 +5518,72 @@ def pivot(self, index=None, columns=None, values=None):
55185518
... "C": ["small", "large", "large", "small",
55195519
... "small", "large", "small", "small",
55205520
... "large"],
5521-
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
5521+
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
5522+
... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
55225523
>>> df
5523-
A B C D
5524-
0 foo one small 1
5525-
1 foo one large 2
5526-
2 foo one large 2
5527-
3 foo two small 3
5528-
4 foo two small 3
5529-
5 bar one large 4
5530-
6 bar one small 5
5531-
7 bar two small 6
5532-
8 bar two large 7
5524+
A B C D E
5525+
0 foo one small 1 2
5526+
1 foo one large 2 4
5527+
2 foo one large 2 5
5528+
3 foo two small 3 5
5529+
4 foo two small 3 6
5530+
5 bar one large 4 6
5531+
6 bar one small 5 8
5532+
7 bar two small 6 9
5533+
8 bar two large 7 9
5534+
5535+
This first example aggregates values by taking the sum.
55335536
55345537
>>> table = pivot_table(df, values='D', index=['A', 'B'],
55355538
... columns=['C'], aggfunc=np.sum)
55365539
>>> table
55375540
C large small
55385541
A B
5539-
bar one 4.0 5.0
5540-
two 7.0 6.0
5541-
foo one 4.0 1.0
5542-
two NaN 6.0
5542+
bar one 4 5
5543+
two 7 6
5544+
foo one 4 1
5545+
two NaN 6
5546+
5547+
We can also fill missing values using the `fill_value` parameter.
55435548
55445549
>>> table = pivot_table(df, values='D', index=['A', 'B'],
5545-
... columns=['C'], aggfunc=np.sum)
5550+
... columns=['C'], aggfunc=np.sum, fill_value=0)
55465551
>>> table
55475552
C large small
55485553
A B
5549-
bar one 4.0 5.0
5550-
two 7.0 6.0
5551-
foo one 4.0 1.0
5552-
two NaN 6.0
5554+
bar one 4 5
5555+
two 7 6
5556+
foo one 4 1
5557+
two 0 6
5558+
5559+
The next example aggregates by taking the mean across multiple columns.
5560+
5561+
>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
5562+
... aggfunc={'D': np.mean,
5563+
... 'E': np.mean})
5564+
>>> table
5565+
D E
5566+
mean mean
5567+
A C
5568+
bar large 5.500000 7.500000
5569+
small 5.500000 8.500000
5570+
foo large 2.000000 4.500000
5571+
small 2.333333 4.333333
5572+
5573+
We can also calculate multiple types of aggregations for any given
5574+
value column.
55535575
55545576
>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
55555577
... aggfunc={'D': np.mean,
55565578
... 'E': [min, max, np.mean]})
55575579
>>> table
55585580
D E
5559-
mean max median min
5581+
mean max mean min
55605582
A C
5561-
bar large 5.500000 16 14.5 13
5562-
small 5.500000 15 14.5 14
5563-
foo large 2.000000 10 9.5 9
5564-
small 2.333333 12 11.0 8
5583+
bar large 5.500000 9 7.500000 6
5584+
small 5.500000 9 8.500000 8
5585+
foo large 2.000000 5 4.500000 4
5586+
small 2.333333 6 4.333333 2
55655587
55665588
Returns
55675589
-------
