{{ header }}
.. ipython:: python import pandas as pd
.. ipython:: python titanic = pd.read_csv("data/titanic.csv") titanic.head()
I’m interested in the age of the Titanic passengers.
.. ipython:: python ages = titanic["Age"] ages.head()
To select a single column, use square brackets
[]
with the column name of the column of interest.
Each column in a :class:`DataFrame` is a :class:`Series`. As a single column is selected, the returned object is a pandas :class:`Series`. We can verify this by checking the type of the output:
.. ipython:: python type(titanic["Age"])
And have a look at the shape
of the output:
.. ipython:: python titanic["Age"].shape
:attr:`DataFrame.shape` is an attribute (remember :ref:`tutorial on reading and writing <10min_tut_02_read_write>`, do not use parentheses for attributes) of a
pandas Series
and DataFrame
containing the number of rows and
columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only
the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.
.. ipython:: python age_sex = titanic[["Age", "Sex"]] age_sex.head()
To select multiple columns, use a list of column names within the selection brackets
[]
.
Note
The inner square brackets define a
:ref:`Python list <python:tut-morelists>` with column names, whereas
the outer square brackets are used to select the data from a pandas
DataFrame
as seen in the previous example.
The returned data type is a pandas DataFrame:
.. ipython:: python type(titanic[["Age", "Sex"]])
.. ipython:: python titanic[["Age", "Sex"]].shape
The selection returned a DataFrame
with 891 rows and 2 columns. Remember, a
DataFrame
is 2-dimensional with both a row and column dimension.
For basic information on indexing, see the user guide section on :ref:`indexing and selecting data <indexing.basics>`.
I’m interested in the passengers older than 35 years.
.. ipython:: python above_35 = titanic[titanic["Age"] > 35] above_35.head()
To select rows based on a conditional expression, use a condition inside the selection brackets
[]
.
The condition inside the selection
brackets titanic["Age"] > 35
checks for which rows the Age
column has a value larger than 35:
.. ipython:: python titanic["Age"] > 35
The output of the conditional expression (>
, but also ==
,
!=
, <
, <=
,… would work) is actually a pandas Series
of
boolean values (either True
or False
) with the same number of
rows as the original DataFrame
. Such a Series
of boolean values
can be used to filter the DataFrame
by putting it in between the
selection brackets []
. Only rows for which the value is True
will be selected.
We know from before that the original Titanic DataFrame
consists of
891 rows. Let’s have a look at the number of rows which satisfy the
condition by checking the shape
attribute of the resulting
DataFrame
above_35
:
.. ipython:: python above_35.shape
I’m interested in the Titanic passengers from cabin class 2 and 3.
.. ipython:: python class_23 = titanic[titanic["Pclass"].isin([2, 3])] class_23.head()
Similar to the conditional expression, the :func:`~Series.isin` conditional function returns a
True
for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets[]
. In this case, the condition inside the selection bracketstitanic["Pclass"].isin([2, 3])
checks for which rows thePclass
column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is
either 2 or 3 and combining the two statements with an |
(or)
operator:
.. ipython:: python class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] class_23.head()
Note
When combining multiple conditional statements, each condition
must be surrounded by parentheses ()
. Moreover, you can not use
or
/and
but need to use the or
operator |
and the and
operator &
.
See the dedicated section in the user guide about :ref:`boolean indexing <indexing.boolean>` or about the :ref:`isin function <indexing.basics.indexing_isin>`.
I want to work with passenger data for which the age is known.
.. ipython:: python age_no_na = titanic[titanic["Age"].notna()] age_no_na.head()
The :meth:`~Series.notna` conditional function returns a
True
for each row the values are not aNull
value. As such, this can be combined with the selection brackets[]
to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
.. ipython:: python age_no_na.shape
For more dedicated functions on missing values, see the user guide section about :ref:`handling missing data <missing_data>`.
I’m interested in the names of the passengers older than 35 years.
.. ipython:: python adult_names = titanic.loc[titanic["Age"] > 35, "Name"] adult_names.head()
In this case, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. Theloc
/iloc
operators are required in front of the selection brackets[]
. When usingloc
/iloc
, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using column names, row labels or a condition expression, use
the loc
operator in front of the selection brackets []
. For both
the part before and after the comma, you can use a single label, a list
of labels, a slice of labels, a conditional expression or a colon. Using
a colon specifies you want to select all rows or columns.
I’m interested in rows 10 till 25 and columns 3 to 5.
.. ipython:: python titanic.iloc[9:25, 2:5]
Again, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use theiloc
operator in front of the selection brackets[]
.
When selecting specific rows and/or columns with loc
or iloc
,
new values can be assigned to the selected data. For example, to assign
the name anonymous
to the first 3 elements of the fourth column:
.. ipython:: python titanic.iloc[0:3, 3] = "anonymous" titanic.iloc[:5, 3]
See the user guide section on :ref:`different choices for indexing <indexing.choice>` to get more insight into the usage of loc
and iloc
.
- When selecting subsets of data, square brackets
[]
are used. - Inside these square brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Use
loc
for label-based selection (using row/column names). - Use
iloc
for position-based selection (using table positions). - You can assign new values to a selection based on
loc
/iloc
.
A full overview of indexing is provided in the user guide pages on :ref:`indexing and selecting data <indexing>`.