{{ header }}
.. ipython:: python import pandas as pd
-
Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
- PassengerId: Id of every passenger.
- Survived: This feature have value 0 and 1. 0 for not survived and 1 for survived.
- Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
- Name: Name of passenger.
- Sex: Gender of passenger.
- Age: Age of passenger.
- SibSp: Indication that passenger have siblings and spouse.
- Parch: Whether a passenger is alone or have family.
- Ticket: Ticket number of passenger.
- Fare: Indicating the fare.
- Cabin: The cabin of passenger.
- Embarked: The embarked category.
I want to analyze the Titanic passenger data, available as a CSV file.
.. ipython:: python titanic = pd.read_csv("data/titanic.csv")
pandas provides the :func:`read_csv` function to read data stored as a csv file into a pandas
DataFrame
. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefixread_*
.
Make sure to always have a check on the data after reading in the
data. When displaying a DataFrame
, the first and last 5 rows will be
shown by default:
.. ipython:: python titanic
I want to see the first 8 rows of a pandas DataFrame.
.. ipython:: python titanic.head(8)
To see the first N rows of a
DataFrame
, use the :meth:`~DataFrame.head` method with the required number of rows (in this case 8) as argument.
Note
Interested in the last N rows instead? pandas also provides a
:meth:`~DataFrame.tail` method. For example, titanic.tail(10)
will return the last
10 rows of the DataFrame.
A check on how pandas interpreted each of the column data types can be
done by requesting the pandas dtypes
attribute:
.. ipython:: python titanic.dtypes
For each of the columns, the used data type is enlisted. The data types
in this DataFrame
are integers (int64
), floats (float64
) and
strings (object
).
Note
When asking for the dtypes
, no brackets are used!
dtypes
is an attribute of a DataFrame
and Series
. Attributes
of DataFrame
or Series
do not need brackets. Attributes
represent a characteristic of a DataFrame
/Series
, whereas a
method (which requires brackets) do something with the
DataFrame
/Series
as introduced in the :ref:`first tutorial <10min_tut_01_tableoriented>`.
My colleague requested the Titanic data as a spreadsheet.
.. ipython:: python titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
Whereas
read_*
functions are used to read data to pandas, theto_*
methods are used to store data. The :meth:`~DataFrame.to_excel` method stores the data as an excel file. In the example here, thesheet_name
is named passengers instead of the default Sheet1. By settingindex=False
the row index labels are not saved in the spreadsheet.
The equivalent read function :meth:`~DataFrame.read_excel` will reload the data to a
DataFrame
:
.. ipython:: python titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers", engine="openpyxl")
.. ipython:: python titanic.head()
.. ipython:: python :suppress: import os os.remove("titanic.xlsx")
I’m interested in a technical summary of a
DataFrame
.. ipython:: python titanic.info()
The method :meth:`~DataFrame.info` provides technical information about a
DataFrame
, so let’s explain the output in more detail:- It is indeed a :class:`DataFrame`.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the
index
) with values ranging from 0 to 890. - The table has 12 columns. Most columns have a value for each of the
rows (all 891 values are
non-null
). Some columns do have missing values and less than 891non-null
values. - The columns
Name
,Sex
,Cabin
andEmbarked
consists of textual data (strings, akaobject
). The other columns are numerical data with some of them whole numbers (akainteger
) and others are real numbers (akafloat
). - The kind of data (characters, integers,…) in the different columns
are summarized by listing the
dtypes
. - The approximate amount of RAM used to hold the DataFrame is provided as well.
- Getting data in to pandas from many different file formats or data
sources is supported by
read_*
functions. - Exporting data out of pandas is provided by different
to_*
methods. - The
head
/tail
/info
methods and thedtypes
attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about :ref:`reader and writer functions <io>`.