{{ header }}
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the :ref:`Cookbook<cookbook>`.
Customarily, we import as follows:
.. ipython:: python

   import numpy as np
   import pandas as pd
See the :ref:`Intro to data structures section <dsintro>`.
Creating a :class:`Series` by passing a list of values, letting pandas create a default integer index:
.. ipython:: python

   s = pd.Series([1, 3, 5, np.nan, 6, 8])
   s
Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index using :func:`date_range` and labeled columns:
.. ipython:: python

   dates = pd.date_range("20130101", periods=6)
   dates
   df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
   df
Creating a :class:`DataFrame` by passing a dictionary of objects that can be converted into a series-like structure:
.. ipython:: python

   df2 = pd.DataFrame(
       {
           "A": 1.0,
           "B": pd.Timestamp("20130102"),
           "C": pd.Series(1, index=list(range(4)), dtype="float32"),
           "D": np.array([3] * 4, dtype="int32"),
           "E": pd.Categorical(["test", "train", "test", "train"]),
           "F": "foo",
       }
   )
   df2
The columns of the resulting :class:`DataFrame` have different :ref:`dtypes <basics.dtypes>`:
.. ipython:: python

   df2.dtypes
If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:
.. ipython::

   @verbatim
   In [1]: df2.<TAB>  # noqa: E225, E999
   df2.A                  df2.bool
   df2.abs                df2.boxplot
   df2.add                df2.C
   df2.add_prefix         df2.clip
   df2.add_suffix         df2.columns
   df2.align              df2.copy
   df2.all                df2.count
   df2.any                df2.combine
   df2.append             df2.D
   df2.apply              df2.describe
   df2.applymap           df2.diff
   df2.B                  df2.duplicated
As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically
tab completed. ``E`` and ``F`` are there as well; the rest of the attributes have been
truncated for brevity.
See the :ref:`Basics section <basics>`.
Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame respectively:
.. ipython:: python

   df.head()
   df.tail(3)
Display the :attr:`DataFrame.index` or :attr:`DataFrame.columns`:
.. ipython:: python

   df.index
   df.columns
:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
Note that this can be an expensive operation when your :class:`DataFrame` has
columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column. When you call
:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold all
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data:
.. ipython:: python

   df.to_numpy()
For ``df2``, the :class:`DataFrame` with multiple dtypes,
:meth:`DataFrame.to_numpy` is relatively expensive:
.. ipython:: python

   df2.to_numpy()
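As a quick check (a small illustrative snippet, not part of the original walkthrough),
the array returned for ``df2`` has ``object`` dtype, so every value has been boxed as a
Python object:

.. code-block:: python

   # the mixed-dtype DataFrame falls back to a NumPy object array
   df2.to_numpy().dtype  # dtype('O')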
.. note::

   :meth:`DataFrame.to_numpy` does not include the index or column labels in the output.
:func:`~DataFrame.describe` shows a quick statistic summary of your data:
.. ipython:: python

   df.describe()
Transposing your data:
.. ipython:: python

   df.T
:meth:`DataFrame.sort_index` sorts by an axis:
.. ipython:: python

   df.sort_index(axis=1, ascending=False)
:meth:`DataFrame.sort_values` sorts by values:
.. ipython:: python

   df.sort_values(by="B")
.. note::

   While standard Python / NumPy expressions for selecting and setting are
   intuitive and come in handy for interactive work, for production code, we
   recommend the optimized pandas data access methods, :meth:`DataFrame.at`,
   :meth:`DataFrame.iat`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc`.
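As a rough illustration (a sketch reusing the ``df`` and ``dates`` objects created above),
a chained lookup and the dedicated accessor return the same scalar, but the accessor does
it in a single, explicit call:

.. code-block:: python

   # chained indexing: two separate lookups; setting values through such a
   # chain can trigger SettingWithCopyWarning
   df["A"][dates[0]]

   # the optimized accessor: one label-based scalar lookup
   df.at[dates[0], "A"]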
See the indexing documentation :ref:`Indexing and Selecting Data <indexing>` and :ref:`MultiIndex / Advanced Indexing <advanced>`.
Selecting a single column, which yields a :class:`Series`, equivalent to ``df.A``:
.. ipython:: python df["A"]
Selecting via ``[]`` (``__getitem__``), which slices the rows:
.. ipython:: python

   df[0:3]
   df["20130102":"20130104"]
See more in :ref:`Selection by Label <indexing.label>` using :meth:`DataFrame.loc` or :meth:`DataFrame.at`.
For getting a cross section using a label:
.. ipython:: python

   df.loc[dates[0]]
Selecting on a multi-axis by label:
.. ipython:: python

   df.loc[:, ["A", "B"]]
Showing label slicing, both endpoints are included:
.. ipython:: python df.loc["20130102":"20130104", ["A", "B"]]
Reduction in the dimensions of the returned object:
.. ipython:: python df.loc["20130102", ["A", "B"]]
For getting a scalar value:
.. ipython:: python df.loc[dates[0], "A"]
For getting fast access to a scalar (equivalent to the prior method):
.. ipython:: python df.at[dates[0], "A"]
See more in :ref:`Selection by Position <indexing.integer>` using :meth:`DataFrame.iloc` or :meth:`DataFrame.iat`.
Select via the position of the passed integers:
.. ipython:: python

   df.iloc[3]
By integer slices, acting similar to NumPy/Python:
.. ipython:: python

   df.iloc[3:5, 0:2]
By lists of integer position locations, similar to the NumPy/Python style:
.. ipython:: python

   df.iloc[[1, 2, 4], [0, 2]]
For slicing rows explicitly:
.. ipython:: python

   df.iloc[1:3, :]
For slicing columns explicitly:
.. ipython:: python

   df.iloc[:, 1:3]
For getting a value explicitly:
.. ipython:: python

   df.iloc[1, 1]
For getting fast access to a scalar (equivalent to the prior method):
.. ipython:: python

   df.iat[1, 1]
Using a single column's values to select data:
.. ipython:: python df[df["A"] > 0]
Selecting values from a DataFrame where a boolean condition is met:
.. ipython:: python

   df[df > 0]
Using the :func:`~Series.isin` method for filtering:
.. ipython:: python

   df2 = df.copy()
   df2["E"] = ["one", "one", "two", "three", "four", "three"]
   df2
   df2[df2["E"].isin(["two", "four"])]
Setting a new column automatically aligns the data by the indexes:
.. ipython:: python

   s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
   s1
   df["F"] = s1
Setting values by label:
.. ipython:: python df.at[dates[0], "A"] = 0
Setting values by position:
.. ipython:: python

   df.iat[0, 1] = 0
Setting by assigning with a NumPy array:
.. ipython:: python
   :okwarning:

   df.loc[:, "D"] = np.array([5] * len(df))
The result of the prior setting operations:
.. ipython:: python

   df
A ``where`` operation with setting:
.. ipython:: python

   df2 = df.copy()
   df2[df2 > 0] = -df2
   df2
pandas primarily uses the value ``np.nan`` to represent missing data. It is by
default not included in computations. See the :ref:`Missing Data section
<missing_data>`.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
.. ipython:: python

   df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
   df1.loc[dates[0] : dates[1], "E"] = 1
   df1
:meth:`DataFrame.dropna` drops any rows that have missing data:
.. ipython:: python

   df1.dropna(how="any")
:meth:`DataFrame.fillna` fills missing data:
.. ipython:: python

   df1.fillna(value=5)
:func:`isna` gets the boolean mask where values are ``nan``:
.. ipython:: python

   pd.isna(df1)
See the :ref:`Basic section on Binary Ops <basics.binop>`.
Operations in general exclude missing data.
Performing a descriptive statistic:
.. ipython:: python

   df.mean()
Same operation on the other axis:
.. ipython:: python

   df.mean(1)
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:
.. ipython:: python

   s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
   s
   df.sub(s, axis="index")
:meth:`DataFrame.apply` applies a user defined function to the data:
.. ipython:: python

   df.apply(np.cumsum)
   df.apply(lambda x: x.max() - x.min())
See more at :ref:`Histogramming and Discretization <basics.discretization>`.
.. ipython:: python

   s = pd.Series(np.random.randint(0, 7, size=10))
   s
   s.value_counts()
Series is equipped with a set of string processing methods in the ``str``
attribute that make it easy to operate on each element of the array, as in the
code snippet below. Note that pattern-matching in ``str`` generally uses regular
expressions by default (and in some cases always uses them). See more at
:ref:`Vectorized String Methods <text.string_methods>`.
.. ipython:: python s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"]) s.str.lower()
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the :ref:`Merging section <merging>`.
Concatenating pandas objects together along an axis with :func:`concat`:
.. ipython:: python

   df = pd.DataFrame(np.random.randn(10, 4))
   df

   # break it into pieces
   pieces = [df[:3], df[3:7], df[7:]]

   pd.concat(pieces)
.. note::

   Adding a column to a :class:`DataFrame` is relatively fast. However, adding
   a row requires a copy, and may be expensive. We recommend passing a
   pre-built list of records to the :class:`DataFrame` constructor instead of
   building a :class:`DataFrame` by iteratively appending records to it.
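For instance, a minimal sketch of the recommended pattern (the column names and values
here are purely illustrative):

.. code-block:: python

   # collect the full list of records first, then construct the DataFrame once
   records = [{"A": i, "B": i ** 2} for i in range(5)]
   pd.DataFrame(records)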
:func:`merge` enables SQL style join types along specific columns. See the :ref:`Database style joining <merging.join>` section.
.. ipython:: python left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]}) right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]}) left right pd.merge(left, right, on="key")
Another example that can be given is:
.. ipython:: python left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]}) right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]}) left right pd.merge(left, right, on="key")
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the :ref:`Grouping section <groupby>`.
.. ipython:: python df = pd.DataFrame( { "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], "B": ["one", "one", "two", "three", "two", "two", "one", "three"], "C": np.random.randn(8), "D": np.random.randn(8), } ) df
Grouping and then applying the :meth:`~pandas.core.groupby.GroupBy.sum` function to the resulting groups:
.. ipython:: python df.groupby("A")[["C", "D"]].sum()
Grouping by multiple columns forms a hierarchical index, and again we can apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
.. ipython:: python df.groupby(["A", "B"]).sum()
See the sections on :ref:`Hierarchical Indexing <advanced.hierarchical>` and :ref:`Reshaping <reshaping.stacking>`.
.. ipython:: python

   tuples = list(
       zip(
           ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
           ["one", "two", "one", "two", "one", "two", "one", "two"],
       )
   )
   index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
   df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
   df2 = df[:4]
   df2
The :meth:`~DataFrame.stack` method "compresses" a level in the DataFrame's columns:
.. ipython:: python

   stacked = df2.stack()
   stacked
With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
index
), the inverse operation of :meth:`~DataFrame.stack` is
:meth:`~DataFrame.unstack`, which by default unstacks the last level:
.. ipython:: python

   stacked.unstack()
   stacked.unstack(1)
   stacked.unstack(0)
See the section on :ref:`Pivot Tables <reshaping.pivot>`.
.. ipython:: python df = pd.DataFrame( { "A": ["one", "one", "two", "three"] * 3, "B": ["A", "B", "C"] * 4, "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2, "D": np.random.randn(12), "E": np.random.randn(12), } ) df
:func:`pivot_table` pivots a :class:`DataFrame` specifying the ``values``, ``index`` and ``columns``:
.. ipython:: python

   pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the :ref:`Time Series section <timeseries>`.
.. ipython:: python rng = pd.date_range("1/1/2012", periods=100, freq="S") ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng) ts.resample("5Min").sum()
:meth:`Series.tz_localize` localizes a time series to a time zone:
.. ipython:: python rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D") ts = pd.Series(np.random.randn(len(rng)), rng) ts ts_utc = ts.tz_localize("UTC") ts_utc
:meth:`Series.tz_convert` converts a timezone-aware time series to another time zone:
.. ipython:: python

   ts_utc.tz_convert("US/Eastern")
Converting between time span representations:
.. ipython:: python rng = pd.date_range("1/1/2012", periods=5, freq="M") ts = pd.Series(np.random.randn(len(rng)), index=rng) ts ps = ts.to_period() ps ps.to_timestamp()
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
.. ipython:: python

   prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
   ts = pd.Series(np.random.randn(len(prng)), prng)
   ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
   ts.head()
pandas can include categorical data in a :class:`DataFrame`. For full docs, see the :ref:`categorical introduction <categorical>` and the :ref:`API documentation <api.arrays.categorical>`.
.. ipython:: python df = pd.DataFrame( {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]} )
Converting the raw grades to a categorical data type:
.. ipython:: python df["grade"] = df["raw_grade"].astype("category") df["grade"]
Rename the categories to more meaningful names (assigning to :meth:`Series.cat.categories` is in place!):
.. ipython:: python df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default):
.. ipython:: python df["grade"] = df["grade"].cat.set_categories( ["very bad", "bad", "medium", "good", "very good"] ) df["grade"]
Sorting is per order in the categories, not lexical order:
.. ipython:: python df.sort_values(by="grade")
Grouping by a categorical column also shows empty categories:
.. ipython:: python df.groupby("grade").size()
See the :ref:`Plotting <visualization>` docs.
We use the standard convention for referencing the matplotlib API:
.. ipython:: python

   import matplotlib.pyplot as plt

   plt.close("all")
The ``plt.close`` method is used to close a figure window:
.. ipython:: python

   ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
   ts = ts.cumsum()

   @savefig series_plot_basic.png
   ts.plot();
If running under Jupyter Notebook, the plot will appear on :meth:`~Series.plot`. Otherwise use :func:`matplotlib.pyplot.show` to show it or :func:`matplotlib.pyplot.savefig` to write it to a file.
.. ipython:: python

   plt.show();
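For example (a small sketch; the filename is illustrative), :func:`matplotlib.pyplot.savefig`
writes the current figure to disk:

.. code-block:: python

   ts.plot()
   plt.savefig("series_plot.png")  # write the active figure to a file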
On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all of the columns with labels:
.. ipython:: python

   df = pd.DataFrame(
       np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
   )
   df = df.cumsum()

   plt.figure();
   df.plot();
   @savefig frame_plot_basic.png
   plt.legend(loc='best');
:ref:`Writing to a csv file: <io.store_in_csv>` using :meth:`DataFrame.to_csv`
.. ipython:: python

   df.to_csv("foo.csv")
:ref:`Reading from a csv file: <io.read_csv_table>` using :func:`read_csv`
.. ipython:: python

   pd.read_csv("foo.csv")
.. ipython:: python
   :suppress:

   import os

   os.remove("foo.csv")
Reading and writing to :ref:`HDFStores <io.hdf5>`.
Writing to an HDF5 Store using :meth:`DataFrame.to_hdf`:
.. ipython:: python

   df.to_hdf("foo.h5", "df")
Reading from an HDF5 Store using :func:`read_hdf`:
.. ipython:: python

   pd.read_hdf("foo.h5", "df")
.. ipython:: python :suppress: os.remove("foo.h5")
Reading and writing to :ref:`Excel <io.excel>`.
Writing to an Excel file using :meth:`DataFrame.to_excel`:
.. ipython:: python

   df.to_excel("foo.xlsx", sheet_name="Sheet1")
Reading from an Excel file using :func:`read_excel`:
.. ipython:: python

   pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
.. ipython:: python :suppress: os.remove("foo.xlsx")
If you are attempting to perform a boolean operation on a :class:`Series` or :class:`DataFrame` you might see an exception like:
.. ipython:: python
   :okexcept:

   if pd.Series([False, True, False]):
       print("I was true")
See :ref:`Comparisons<basics.compare>` and :ref:`Gotchas<gotchas>` for an explanation and what to do.
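For instance (a minimal sketch of one way to avoid the error), reduce the :class:`Series`
to a single boolean explicitly with :meth:`~Series.any`, :meth:`~Series.all`, or
:attr:`~Series.empty`:

.. code-block:: python

   s = pd.Series([False, True, False])
   if s.any():
       print("at least one value was True")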