@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
 Reshaping by pivoting DataFrame objects
 ---------------------------------------

+.. image:: _static/reshaping_pivot.png
+
 .. ipython::
    :suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

    In [3]: df = unpivot(tm.makeTimeDataFrame())

-Data is often stored in CSV files or databases in so-called "stacked" or
-"record" format:
+Data is often stored in so-called "stacked" or "record" format:

 .. ipython:: python

@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:

    df[df['variable'] == 'A']

-.. image:: _static/reshaping_pivot.png
-
 But suppose we wish to do time series operations with the variables. A better
 representation would be where the ``columns`` are the unique variables and an
 ``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
 .. ipython:: python

    df['value2'] = df['value'] * 2
-   pivoted = df.pivot('date', 'variable')
+   pivoted = df.pivot(index='date', columns='variable')
    pivoted

 You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
 Note that this returns a view on the underlying data in the case where the data
 are homogeneously-typed.

+.. note::
+   :func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
+   entries, cannot reshape`` if the index/column pair is not unique. In this
+   case, consider using :func:`~pandas.pivot_table` which is a generalization
+   of pivot that can handle duplicate values for one index/column pair.
+
 .. _reshaping.stacking:

 Reshaping by stacking and unstacking
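
The note added in this hunk can be illustrated with a minimal sketch. The small DataFrame below is hypothetical (it is not the ``unpivot`` data used in the docs) and is built so that one date/variable pair occurs twice; the column names are assumptions chosen to match the surrounding examples.

```python
import pandas as pd

# Hypothetical long-format data: the ("2000-01-03", "A") pair appears twice.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-03", "2000-01-04"],
    "variable": ["A", "A", "B", "A"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# pivot cannot decide which of the duplicate values to keep, so it raises.
try:
    df.pivot(index="date", columns="variable", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# pivot_table aggregates the duplicates (mean by default) instead of raising.
table = df.pivot_table(index="date", columns="variable", values="value")
print(table)
```

Here the two ``A`` values for the first date are averaged into a single cell, while the missing ``B`` observation for the second date is left as ``NaN``.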
@@ -704,10 +709,103 @@ handling of NaN:
    In [3]: np.unique(x, return_inverse=True)[::-1]
    Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))

-
 .. note::
    If you just want to handle one column as a categorical variable (like R's factor),
    you can use  ``df["cat_col"] = pd.Categorical(df["col"])`` or
    ``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
    see the :ref:`Categorical introduction <categorical>` and the
    :ref:`API documentation <api.categorical>`.
+
+Examples
+--------
+
+In this section, we will review frequently asked questions and examples. The
+column names and relevant column values are named to correspond with how this
+DataFrame will be pivoted in the answers below.
+
+.. ipython:: python
+
+   np.random.seed([3, 1415])
+   n = 20
+
+   cols = np.array(['key', 'row', 'item', 'col'])
+   df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
+   df.columns = cols
+   df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+
+   df
+
+Pivoting with Single Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose we want to pivot ``df`` such that the ``col`` values are the columns,
+the ``row`` values are the index, and the means of ``val0`` are the values. In
+particular, the resulting DataFrame should look like:
+
+.. code-block:: ipython
+
+   col   col0   col1   col2   col3  col4
+   row
+   row0  0.77  0.605    NaN  0.860  0.65
+   row2  0.13    NaN  0.395  0.500  0.25
+   row3   NaN  0.310    NaN  0.545   NaN
+   row4   NaN  0.100  0.395  0.760  0.24
+
+This solution uses :func:`~pandas.pivot_table`. Also note that
+``aggfunc='mean'`` is the default. It is included here to be explicit.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean')
+
+Note that we can also replace the missing values by using the ``fill_value``
+parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+
+Also note that we can pass in other aggregation functions as well. For example,
+we can also pass in ``sum``.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+
+Another aggregation we can do is calculate the frequency with which the columns
+and rows occur together, a.k.a. "cross tabulation". To do this, we can pass
+``size`` to the ``aggfunc`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+
+Pivoting with Multiple Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We can also perform multiple aggregations. For example, to perform both a
+``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+
+To aggregate over multiple value columns, we can pass in a list to the
+``values`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+
+To subdivide over multiple columns, we can pass in a list to the
+``columns`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
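
Two points from the added examples may be easier to see on a tiny frame: passing ``'size'`` as ``aggfunc`` gives the same counts as ``pd.crosstab``, and passing a list to ``aggfunc`` adds an outer column level named after each function. The sketch below uses hypothetical toy data that only borrows the ``row``, ``col`` and ``val0`` column names; it is not the random ``df`` built in the docs.

```python
import pandas as pd

# Hypothetical toy data mirroring the column names used in the examples above.
df = pd.DataFrame({
    "row": ["row0", "row0", "row0", "row2", "row2"],
    "col": ["col0", "col0", "col1", "col0", "col1"],
    "val0": [0.2, 0.4, 0.5, 0.1, 0.9],
})

# Counting co-occurrences: aggfunc='size' versus pd.crosstab.
counts = df.pivot_table(index="row", columns="col", aggfunc="size", fill_value=0)
xtab = pd.crosstab(df["row"], df["col"])
print(counts)

# A list of aggregations adds an outer column level, one entry per function.
multi = df.pivot_table(
    values="val0", index="row", columns="col", aggfunc=["mean", "sum"])
print(multi["mean"])           # sub-table holding only the means
print(multi[("sum", "col0")])  # one column: sums of val0 where col == 'col0'
```

Selecting by the outer level (``multi["mean"]``) returns an ordinary pivoted frame, which is one way to work with the hierarchical columns produced by multiple aggregations.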