DOC: Enhancing pivot / reshape docs #21038
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables | |
Reshaping by pivoting DataFrame objects | ||
--------------------------------------- | ||
|
||
.. image:: _static/reshaping_pivot.png | ||
|
||
.. ipython:: | ||
:suppress: | ||
|
||
|
@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects | |
|
||
In [3]: df = unpivot(tm.makeTimeDataFrame()) | ||
|
||
Data is often stored in CSV files or databases in so-called "stacked" or | ||
"record" format: | ||
Data is often stored in so-called "stacked" or "record" format: | ||
|
||
.. ipython:: python | ||
|
||
|
@@ -60,8 +61,6 @@ To select out everything for variable ``A`` we could do: | |
|
||
df[df['variable'] == 'A'] | ||
|
||
.. image:: _static/reshaping_pivot.png | ||
|
||
But suppose we wish to do time series operations with the variables. A better | ||
representation would be where the ``columns`` are the unique variables and an | ||
``index`` of dates identifies individual observations. To reshape the data into | ||
|
@@ -81,7 +80,7 @@ column: | |
.. ipython:: python | ||
|
||
df['value2'] = df['value'] * 2 | ||
pivoted = df.pivot('date', 'variable') | ||
pivoted = df.pivot(index='date', columns='variable') | ||
pivoted | ||
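The keyword form introduced in this hunk can be tried on a small frame. This is only a sketch: the four-row frame below is made up for illustration and is not the `unpivot(tm.makeTimeDataFrame())` data used in the docs.

```python
import pandas as pd

# Hypothetical stand-in for the stacked "record" format shown in the docs.
df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "variable": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
df["value2"] = df["value"] * 2

# With no ``values`` argument, every remaining column is pivoted,
# so the result has two column levels: (value | value2, A | B).
pivoted = df.pivot(index="date", columns="variable")

# Selecting the top-level key returns the sub-frame for that value column.
subset = pivoted["value2"]
print(pivoted)
print(subset)
```

Spelling out `index=` and `columns=` also avoids relying on positional argument order.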
|
||
You can then select subsets from the pivoted ``DataFrame``: | ||
|
@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``: | |
Note that this returns a view on the underlying data in the case where the data | ||
are homogeneously-typed. | ||
|
||
.. note:: | ||
``pandas.pivot`` will error with a ``ValueError: Index contains duplicate | ||
entries, cannot reshape`` if the index/column pair is not unique. In this | ||
case, consider using :func:`~pandas.pivot_table` which is a generalization | ||
of pivot that can handle duplicate values for one index/column pair. | ||
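The behaviour described in this note can be reproduced with a minimal sketch. The three-row frame is invented for illustration; only `pivot`, `pivot_table`, and the `ValueError` come from the text above.

```python
import pandas as pd

# Hypothetical data: the pair ("d1", "A") appears twice, so it is not unique.
df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "variable": ["A", "A", "B"],
    "value": [1.0, 3.0, 5.0],
})

# pivot cannot decide which of the two values to place in the ("d1", "A") cell.
try:
    df.pivot(index="date", columns="variable", values="value")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# pivot_table aggregates the duplicates instead (mean is the default).
table = df.pivot_table(index="date", columns="variable", values="value")
print(pivot_error)
print(table)
```

Here the duplicated cell ends up holding the mean of 1.0 and 3.0.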
|
||
.. _reshaping.stacking: | ||
|
||
Reshaping by stacking and unstacking | ||
|
@@ -698,10 +703,124 @@ handling of NaN: | |
In [3]: np.unique(x, return_inverse=True)[::-1] | ||
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object)) | ||
|
||
|
||
.. note:: | ||
If you just want to handle one column as a categorical variable (like R's factor), | ||
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or | ||
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`, | ||
see the :ref:`Categorical introduction <categorical>` and the | ||
:ref:`API documentation <api.categorical>`. | ||
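As a quick check of the two spellings mentioned in this note, the sketch below applies both to a made-up column and compares the resulting dtypes. The column names `cat_a` and `cat_b` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "a", "c"]})

# Both forms infer the categories from the unique values of the column.
df["cat_a"] = pd.Categorical(df["col"])
df["cat_b"] = df["col"].astype("category")

same_dtype = df["cat_a"].dtype == df["cat_b"].dtype
categories = list(df["cat_a"].cat.categories)
print(same_dtype, categories)
```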
|
||
Frequently Asked Questions (and Examples) | ||
------------------ | ||
Review comment: The underline needs to be the same length as the title (how about just making this title "Examples"?).
||
|
||
In this section, we will review frequently asked questions and examples. The | ||
Review comment: Since we got rid of the Q and A format, we don't need this intro.
||
column names and relevant column values are named to correspond with how this | ||
DataFrame will be pivoted in the answers below. | ||
|
||
.. ipython:: python | ||
|
||
np.random.seed([3,1415]) | ||
Review comment: Any reason to choose a list as a seed here? Not saying there's anything wrong with it per se, just have always seen an int literal like
Reply: No real reason. I'm basing these examples off of the StackOverflow post originally linked in the issue: #19089.
||
n = 20 | ||
|
||
cols = np.array(['key', 'row', 'item', 'col']) | ||
Review comment: you can just do
Reply: Thanks! Refactored a bit.
||
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str) | ||
|
||
df = pd.DataFrame(np.core.defchararray.add(cols, arr1), columns=cols).join( | ||
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val') | ||
) | ||
|
||
df | ||
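On the seed question raised in the review thread above: `np.random.seed` accepts either an int or a one-dimensional sequence of ints, so the list form is valid. The sketch below only checks that reseeding with the same list reproduces the same draws; the variable names are illustrative.

```python
import numpy as np

# Seeding with a list of ints, as in the example under review.
np.random.seed([3, 1415])
first = np.random.randint(5, size=4)

# Reseeding with the same list restarts the same stream.
np.random.seed([3, 1415])
second = np.random.randint(5, size=4)

# A plain int seed is the more common spelling and works the same way.
np.random.seed(31415)
third = np.random.randint(5, size=4)

print(first, second, third)
```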
|
||
Review comment: You don't need to have these as "Question"; rather, just make an informative title.
||
Question 1 | ||
~~~~~~~~~~ | ||
|
||
How do I pivot ``df`` such that the ``col`` values are columns, | ||
Review comment: Double backticks are for literals, where single backticks are for inline code / argument refs. Can you change any argument reference (i.e.
Reply: @WillAyd while that makes sense, this seems inconsistent with how the single backticks and double backticks are being used elsewhere in this doc. It also seems like the double backticks look better in the docs itself.
Reply: Hmm OK - that's a fair point. @TomAugspurger do you know if there's an official stance on this? Worth updating in a separate issue?
||
``row`` values are the index, and mean of ``val0`` are the values? In | ||
Review comment: ", and the mean"
||
particular, the resulting DataFrame should look like: | ||
|
||
.. code-block:: ipython | ||
|
||
    col    col0   col1   col2   col3  col4
    row
    row0   0.77  0.605    NaN  0.860  0.65
    row2   0.13    NaN  0.395  0.500  0.25
    row3    NaN  0.310    NaN  0.545   NaN
    row4    NaN  0.100  0.395  0.760  0.24

Review comment: This should be an ipython block.
|
||
**Answer** | ||
This solution uses :func:`~pandas.pivot_table`. Also note that | ||
``aggfunc='mean'`` is the default. It is included here to be explicit. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values='val0', index='row', columns='col', aggfunc='mean') | ||
|
||
Note that we can also replace the missing values by using the ``fill_value`` | ||
parameter. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values='val0', index='row', columns='col', aggfunc='mean', fill_value=0) | ||
|
||
Also note that we can pass in other aggregation functions as well. For example, | ||
we can also pass in ``sum``. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values='val0', index='row', columns='col', aggfunc='sum', fill_value=0) | ||
|
||
Question 2 | ||
~~~~~~~~~~ | ||
|
||
How can I perform multiple aggregations at the same time? For example, what if | ||
I wanted to perform both a ``sum`` and ``mean`` aggregation? | ||
|
||
**Answer** | ||
We can pass in a list to the ``aggfunc`` argument. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values='val0', index='row', columns='col', aggfunc=['mean', 'sum']) | ||
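To see what the list form of `aggfunc` produces, here is a sketch on a small hand-made frame. It is not the random `df` from the docs, so the numbers differ from the rendered output.

```python
import pandas as pd

# Hypothetical four-row frame reusing the column names from the example.
df = pd.DataFrame({
    "row": ["row0", "row0", "row1", "row1"],
    "col": ["col0", "col0", "col1", "col0"],
    "val0": [1.0, 3.0, 5.0, 7.0],
})

# Each function in the list becomes a top-level column label,
# with the ``col`` values nested underneath it.
table = df.pivot_table(
    values="val0", index="row", columns="col", aggfunc=["mean", "sum"])
print(table)
```

The result has two column levels, so a single cell is addressed with a tuple such as `("mean", "col0")`.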
|
||
Question 3 | ||
~~~~~~~~~~ | ||
|
||
How can I aggregate over multiple value columns? | ||
|
||
**Answer** | ||
We can pass in a list to the ``values`` parameter. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean']) | ||
|
||
Question 4 | ||
~~~~~~~~~~ | ||
|
||
How can I Group By over multiple columns? | ||
Review comment: Not sure "Group By" is the right terminology to use here - any reason in particular you went with that? Wondering if in general it wouldn't be more concise and easier to word if we did away with the "Q/A" format and just preceded each example with something like "Multiple values can be used at once", leaving it to the example to highlight the effect of that.
||
|
||
**Answer** | ||
We can pass in a list to the ``columns`` parameter. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table( | ||
values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean']) | ||
|
||
Question 5 | ||
~~~~~~~~~~ | ||
|
||
How can I aggregate the frequency in which the columns and rows occur together | ||
a.k.a. "cross tabulation"? | ||
|
||
**Answer** | ||
We can pass ``size`` to the ``aggfunc`` parameter. | ||
|
||
.. ipython:: python | ||
|
||
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size') |
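Since the question names cross tabulation, it may help to compare the `size` approach with `pd.crosstab`. This is a sketch on invented data; using `crosstab` as a cross-check is an addition here and not part of the diff.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: every (row, col) pair occurs at least once.
df = pd.DataFrame({
    "row": ["row0", "row0", "row1", "row1", "row1"],
    "col": ["col0", "col1", "col0", "col0", "col1"],
    "val0": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Count how often each (row, col) pair occurs.
counts = df.pivot_table(index="row", columns="col", fill_value=0, aggfunc="size")

# pd.crosstab computes the same frequency table directly from two columns.
xtab = pd.crosstab(df["row"], df["col"])

print(counts)
print(xtab)
```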
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1375,6 +1375,7 @@ Reshaping | |
- Bug in :func:`isna`, which cannot handle ambiguous typed lists (:issue:`20675`) | ||
- Bug in :func:`concat` which raises an error when concatenating TZ-aware dataframes and all-NaT dataframes (:issue:`12396`) | ||
- Bug in :func:`concat` which raises an error when concatenating empty TZ-aware series (:issue:`18447`) | ||
- Updated :func:`~pandas.pivot_table` with more comprehensive examples. Also updated Reshaping and Pivot Tables documentation with a Frequently Asked Questions example (:issue:`19089`) | |
Review comment: We don't typically add a whatsnew note for documentation-only updates, so you can remove this.
||
|
||
Other | ||
^^^^^ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5233,50 +5233,73 @@ def pivot(self, index=None, columns=None, values=None): | |
... "C": ["small", "large", "large", "small", | ||
... "small", "large", "small", "small", | ||
... "large"], | ||
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]}) | ||
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7], | ||
... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]}) | ||
>>> df | ||
A B C D | ||
0 foo one small 1 | ||
1 foo one large 2 | ||
2 foo one large 2 | ||
3 foo two small 3 | ||
4 foo two small 3 | ||
5 bar one large 4 | ||
6 bar one small 5 | ||
7 bar two small 6 | ||
8 bar two large 7 | ||
A B C D E | ||
0 foo one small 1 2 | ||
1 foo one large 2 4 | ||
2 foo one large 2 5 | ||
3 foo two small 3 5 | ||
4 foo two small 3 6 | ||
5 bar one large 4 6 | ||
6 bar one small 5 8 | ||
7 bar two small 6 9 | ||
8 bar two large 7 9 | ||
|
||
This first example aggregates values by taking the sum. | ||
|
||
>>> table = pivot_table(df, values='D', index=['A', 'B'], | ||
... columns=['C'], aggfunc=np.sum) | ||
>>> table | ||
C large small | ||
A B | ||
bar one 4.0 5.0 | ||
two 7.0 6.0 | ||
foo one 4.0 1.0 | ||
two NaN 6.0 | ||
bar one 4 5 | ||
two 7 6 | ||
foo one 4 1 | ||
two NaN 6 | ||
|
||
We can also fill missing values using the `fill_value` parameter. | ||
Review comment: Worth calling out in this example that providing the fill_value has preserved the
||
|
||
>>> table = pivot_table(df, values='D', index=['A', 'B'], | ||
... columns=['C'], aggfunc=np.sum) | ||
... columns=['C'], aggfunc=np.sum, fill_value=0) | ||
>>> table | ||
C large small | ||
A B | ||
bar one 4.0 5.0 | ||
two 7.0 6.0 | ||
foo one 4.0 1.0 | ||
two NaN 6.0 | ||
bar one 4 5 | ||
two 7 6 | ||
foo one 4 1 | ||
two 0 6 | ||
|
||
The next example aggregates by taking the mean using values for | ||
Review comment: "mean across multiple columns" reads a little easier than "mean using values for multiple columns" IMO
||
multiple columns. | ||
|
||
>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'], | ||
... aggfunc={'D': np.mean, | ||
... 'E': np.mean}) | ||
>>> table | ||
D E | ||
mean mean | ||
A C | ||
bar large 5.500000 7.500000 | ||
small 5.500000 8.500000 | ||
foo large 2.000000 4.500000 | ||
small 2.333333 4.333333 | ||
|
||
We can also calculate multiple types of aggregations for any given | ||
value column. | ||
|
||
>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'], | ||
... aggfunc={'D': np.mean, | ||
... 'E': [min, max, np.mean]}) | ||
>>> table | ||
D E | ||
mean max median min | ||
mean max mean min | ||
A C | ||
bar large 5.500000 16 14.5 13 | ||
small 5.500000 15 14.5 14 | ||
foo large 2.000000 10 9.5 9 | ||
small 2.333333 12 11.0 8 | ||
bar large 5.500000 9 7.500000 6 | ||
small 5.500000 9 8.500000 8 | ||
foo large 2.000000 5 4.500000 4 | ||
small 2.333333 6 4.333333 2 | ||
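The last docstring example can be run end to end as the sketch below. It rebuilds the frame from the `df` display in this hunk and passes the aggregations as string names (`'mean'`, `'min'`, `'max'`) rather than `np.mean`, `min`, and `max`. That swap is an assumption made here to sidestep version-specific warnings, not what the diff uses.

```python
import pandas as pd

# Frame reconstructed from the ``>>> df`` output shown in the docstring.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
})

# A dict maps each value column to one function or a list of functions.
table = pd.pivot_table(
    df, values=["D", "E"], index=["A", "C"],
    aggfunc={"D": "mean", "E": ["min", "max", "mean"]})
print(table)
```

Individual cells can then be read with tuple keys on both axes, for example `table.loc[("bar", "large"), ("E", "max")]`.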
|
||
Returns | ||
------- | ||
|
Review comment: Can we convert pandas.pivot into an inline reference, similar to how you have pandas.pivot_table?