.. _10min_tut_06_stats:

{{ header }}

.. ipython:: python

    import pandas as pd

.. raw:: html

    <div class="card gs-data">
        <div class="card-header">
            <div class="gs-data-title">
                Data used for this tutorial:
            </div>
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item">
                <div data-toggle="collapse" href="#collapsedata" role="button" aria-expanded="false" aria-controls="collapsedata">
                    <span class="badge badge-dark">Titanic data</span>
                </div>
                <div class="collapse" id="collapsedata">
                    <div class="card-body">
                        <p class="card-text">

This tutorial uses the Titanic data set, stored as CSV. The data
consists of the following data columns:

-  PassengerId: Id of every passenger.
-  Survived: This feature have value 0 and 1. 0 for not survived and 1
   for survived.
-  Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
-  Name: Name of passenger.
-  Sex: Gender of passenger.
-  Age: Age of passenger.
-  SibSp: Indication that passenger have siblings and spouse.
-  Parch: Whether a passenger is alone or have family.
-  Ticket: Ticket number of passenger.
-  Fare: Indicating the fare.
-  Cabin: The cabin of passenger.
-  Embarked: The embarked category.

.. raw:: html

                        </p>
                    <a href="https://github.com/pandas-dev/pandas/tree/master/doc/data/titanic.csv" class="btn btn-dark btn-sm">To raw data</a>
                </div>
            </div>

.. ipython:: python

    titanic = pd.read_csv("data/titanic.csv")
    titanic.head()

.. raw:: html

        </li>
    </ul>
    </div>

How to calculate summary statistics?
------------------------------------

Aggregating statistics
~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_aggregate.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the average age of the Titanic passengers?

.. ipython:: python

    titanic["Age"].mean()

.. raw:: html

        </li>
    </ul>

Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate
across rows by default.

.. image:: ../../_static/schemas/06_reduction.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the median age and ticket fare price of the Titanic passengers?

.. ipython:: python

    titanic[["Age", "Fare"]].median()

The statistic applied to multiple columns of a ``DataFrame`` (the selection of two columns
return a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset>`) is calculated for each numeric column.

.. raw:: html

        </li>
    </ul>

The aggregating statistic can be calculated for multiple columns at the
same time. Remember the ``describe`` function from :ref:`first tutorial <10min_tut_01_tableoriented>` tutorial?

.. ipython:: python

    titanic[["Age", "Fare"]].describe()

Instead of the predefined statistics, specific combinations of
aggregating statistics for given columns can be defined using the
:func:`DataFrame.agg` method:

.. ipython:: python

    titanic.agg({'Age': ['min', 'max', 'median', 'skew'],
                 'Fare': ['min', 'max', 'median', 'mean']})

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

Details about descriptive statistics are provided in the user guide section on :ref:`descriptive statistics <basics.stats>`.

.. raw:: html

   </div>


Aggregating statistics grouped by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_groupby.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the average age for male versus female Titanic passengers?

.. ipython:: python

    titanic[["Sex", "Age"]].groupby("Sex").mean()

As our interest is the average age for each gender, a subselection on
these two columns is made first: ``titanic[["Sex", "Age"]]``. Next, the
:meth:`~DataFrame.groupby` method is applied on the ``Sex`` column to make a group per
category. The average age *for each gender* is calculated and
returned.

.. raw:: html

        </li>
    </ul>

Calculating a given statistic (e.g. ``mean`` age) *for each category in
a column* (e.g. male/female in the ``Sex`` column) is a common pattern.
The ``groupby`` method is used to support this type of operations. More
general, this fits in the more general ``split-apply-combine`` pattern:

-  **Split** the data into groups
-  **Apply** a function to each group independently
-  **Combine** the results into a data structure

The apply and combine steps are typically done together in pandas.

In the previous example, we explicitly selected the 2 columns first. If
not, the ``mean`` method is applied to each column containing numerical
columns:

.. ipython:: python

    titanic.groupby("Sex").mean()

It does not make much sense to get the average value of the ``Pclass``.
if we are only interested in the average age for each gender, the
selection of columns (rectangular brackets ``[]`` as usual) is supported
on the grouped data as well:

.. ipython:: python

    titanic.groupby("Sex")["Age"].mean()

.. image:: ../../_static/schemas/06_groupby_select_detail.svg
   :align: center

.. note::
    The ``Pclass`` column contains numerical data but actually
    represents 3 categories (or factors) with respectively the labels ‘1’,
    ‘2’ and ‘3’. Calculating statistics on these does not make much sense.
    Therefore, pandas provides a ``Categorical`` data type to handle this
    type of data. More information is provided in the user guide
    :ref:`categorical` section.

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the mean ticket fare price for each of the sex and cabin class combinations?

.. ipython:: python

    titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Grouping can be done by multiple columns at the same time. Provide the
column names as a list to the :meth:`~DataFrame.groupby` method.

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

A full description on the split-apply-combine approach is provided in the user guide section on :ref:`groupby operations <groupby>`.

.. raw:: html

   </div>

Count number of records by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_valuecounts.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the number of passengers in each of the cabin classes?

.. ipython:: python

    titanic["Pclass"].value_counts()

The :meth:`~Series.value_counts` method counts the number of records for each
category in a column.

.. raw:: html

        </li>
    </ul>

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
within each group:

.. ipython:: python

    titanic.groupby("Pclass")["Pclass"].count()

.. note::
    Both ``size`` and ``count`` can be used in combination with
    ``groupby``. Whereas ``size`` includes ``NaN`` values and just provides
    the number of rows (size of the table), ``count`` excludes the missing
    values. In the ``value_counts`` method, use the ``dropna`` argument to
    include or exclude the ``NaN`` values.

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

The user guide has a dedicated section on ``value_counts`` , see page on :ref:`discretization <basics.discretization>`.

.. raw:: html

   </div>

.. raw:: html

    <div class="shadow gs-callout gs-callout-remember">
        <h4>REMEMBER</h4>

-  Aggregation statistics can be calculated on entire columns or rows
-  ``groupby`` provides the power of the *split-apply-combine* pattern
-  ``value_counts`` is a convenient shortcut to count the number of
   entries in each category of a variable

.. raw:: html

   </div>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

A full description on the split-apply-combine approach is provided in the user guide pages about :ref:`groupby operations <groupby>`.

.. raw:: html

   </div>