Skip to content

Latest commit

 

History

History
377 lines (241 loc) · 9.79 KB

03_subset_data.rst

File metadata and controls

377 lines (241 loc) · 9.79 KB

{{ header }}

.. ipython:: python

    import pandas as pd

Data used for this tutorial:
  • .. ipython:: python
    
        titanic = pd.read_csv("data/titanic.csv")
        titanic.head()
    
    

How do I select a subset of a DataFrame?

How do I select specific columns from a DataFrame?

  • I’m interested in the age of the Titanic passengers.

    .. ipython:: python
    
        ages = titanic["Age"]
        ages.head()
    
    

    To select a single column, use square brackets [] with the column name of the column of interest.

Each column in a :class:`DataFrame` is a :class:`Series`. As a single column is selected, the returned object is a pandas :class:`Series`. We can verify this by checking the type of the output:

.. ipython:: python

    type(titanic["Age"])

And have a look at the shape of the output:

.. ipython:: python

    titanic["Age"].shape

:attr:`DataFrame.shape` is an attribute (remember :ref:`tutorial on reading and writing <10min_tut_02_read_write>`, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.

  • I’m interested in the age and sex of the Titanic passengers.

    .. ipython:: python
    
        age_sex = titanic[["Age", "Sex"]]
        age_sex.head()
    
    

    To select multiple columns, use a list of column names within the selection brackets [].

Note

The inner square brackets define a :ref:`Python list <python:tut-morelists>` with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame as seen in the previous example.

The returned data type is a pandas DataFrame:

.. ipython:: python

    type(titanic[["Age", "Sex"]])

.. ipython:: python

    titanic[["Age", "Sex"]].shape

The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

To user guide

For basic information on indexing, see the user guide section on :ref:`indexing and selecting data <indexing.basics>`.

How do I filter specific rows from a DataFrame?

  • I’m interested in the passengers older than 35 years.

    .. ipython:: python
    
        above_35 = titanic[titanic["Age"] > 35]
        above_35.head()
    
    

    To select rows based on a conditional expression, use a condition inside the selection brackets [].

The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value larger than 35:

.. ipython:: python

    titanic["Age"] > 35

The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.

We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:

.. ipython:: python

    above_35.shape

  • I’m interested in the Titanic passengers from cabin class 2 and 3.

    .. ipython:: python
    
        class_23 = titanic[titanic["Pclass"].isin([2, 3])]
        class_23.head()
    
    

    Similar to the conditional expression, the :func:`~Series.isin` conditional function returns a True for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:

.. ipython:: python

    class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
    class_23.head()

Note

When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you can not use or/and but need to use the or operator | and the and operator &.

To user guide

See the dedicated section in the user guide about :ref:`boolean indexing <indexing.boolean>` or about the :ref:`isin function <indexing.basics.indexing_isin>`.

  • I want to work with passenger data for which the age is known.

    .. ipython:: python
    
        age_no_na = titanic[titanic["Age"].notna()]
        age_no_na.head()
    
    

    The :meth:`~Series.notna` conditional function returns a True for each row the values are not a Null value. As such, this can be combined with the selection brackets [] to filter the data table.

You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:

.. ipython:: python

    age_no_na.shape

To user guide

For more dedicated functions on missing values, see the user guide section about :ref:`handling missing data <missing_data>`.

How do I select specific rows and columns from a DataFrame?

  • I’m interested in the names of the passengers older than 35 years.

    .. ipython:: python
    
        adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
        adult_names.head()
    
    

    In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.

  • I’m interested in rows 10 till 25 and columns 3 to 5.

    .. ipython:: python
    
        titanic.iloc[9:25, 2:5]
    
    

    Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the fourth column:

.. ipython:: python

    titanic.iloc[0:3, 3] = "anonymous"
    titanic.iloc[:5, 3]

To user guide

See the user guide section on :ref:`different choices for indexing <indexing.choice>` to get more insight into the usage of loc and iloc.

REMEMBER

  • When selecting subsets of data, square brackets [] are used.
  • Inside these square brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
  • Use loc for label-based selection (using row/column names).
  • Use iloc for position-based selection (using table positions).
  • You can assign new values to a selection based on loc/iloc.
To user guide

A full overview of indexing is provided in the user guide pages on :ref:`indexing and selecting data <indexing>`.