DOC: Add scaling to large datasets section #28577

Merged
merged 19 commits into from
Oct 1, 2019

Changes from 7 commits

95 changes: 32 additions & 63 deletions doc/source/user_guide/scale.rst
@@ -5,7 +5,9 @@ Scaling to large datasets
*************************

Pandas provides data structures for in-memory analytics, which makes using pandas
to analyze larger than memory datasets somewhat tricky.
to analyze datasets that are larger than memory somewhat tricky. Even datasets
that are a sizable fraction of memory become unwieldy, as some pandas operations need
to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets.
It's a complement to :ref:`enhancingperf`, which focuses on speeding up analysis
@@ -14,7 +16,7 @@ for datasets that fit in memory.
But first, it's worth considering *not using pandas*. Pandas isn't the right
tool for all situations. If you're working with very large datasets and a tool
like PostgreSQL fits your needs, then you should probably be using that.
Assuming you want or need the expressivity and power of pandas, let's carry on.
Assuming you want or need the expressiveness and power of pandas, let's carry on.

.. ipython:: python

@@ -26,54 +28,11 @@ Assuming you want or need the expressivity and power of pandas, let's carry on.

from pandas.util.testing import _make_timeseries


Use more efficient file formats
-------------------------------

Depending on your workload, data loading may be a bottleneck. In these cases you
might consider switching from a slow format like CSV to a faster format like
Parquet. Loading from a file format like Parquet also uses less memory, letting
you load larger datasets into pandas before running out of memory.

.. ipython:: python
:suppress:

# Make a random in-memory dataset
ts = _make_timeseries(freq="30S", seed=0)
ts.to_csv("timeseries.csv")
ts.to_parquet("timeseries.parquet")

For example, suppose we have a dataset like the following::

id name x y
timestamp
2000-01-01 00:00:00 1029 Michael 0.278837 0.247932
2000-01-01 00:00:30 1010 Patricia 0.077144 0.490260
2000-01-01 00:01:00 1001 Victor 0.214525 0.258635
2000-01-01 00:01:30 1018 Alice -0.646866 0.822104
2000-01-01 00:02:00 991 Dan 0.902389 0.466665
... ... ... ... ...
2000-12-30 23:58:00 992 Sarah 0.721155 0.944118
2000-12-30 23:58:30 1007 Ursula 0.409277 0.133227
2000-12-30 23:59:00 1009 Hannah -0.452802 0.184318
2000-12-30 23:59:30 978 Kevin -0.904728 -0.179146
2000-12-31 00:00:00 973 Ingrid -0.370763 -0.794667

That dataset has been stored on disk as CSV and Parquet. We want to
compare the performance of reading those two formats.

.. ipython:: python

col = "timestamp"
%time _ = pd.read_csv("timeseries.csv", index_col=col, parse_dates=[col])

.. ipython:: python

%time _ = pd.read_parquet("timeseries.parquet")

Notice that Parquet is faster to read (and write), and uses less peak memory
while doing so. See :ref:`io` for more.
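
Switching formats is a one-time conversion. A minimal sketch of that step
(assuming the ``timeseries.csv`` file from above and an installed Parquet
engine such as pyarrow):

.. code-block:: python

   import pandas as pd

   # Read the CSV once, then persist it in the faster columnar format.
   df = pd.read_csv("timeseries.csv", index_col="timestamp", parse_dates=["timestamp"])
   df.to_parquet("timeseries.parquet")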

Load less data
--------------
@@ -115,14 +74,16 @@ Option 1 loads in all the data and then filters to what we need.

columns = ['id_0', 'name_0', 'x_0', 'y_0']

%time _ = pd.read_parquet("timeseries_wide.parquet")[columns]
pd.read_parquet("timeseries_wide.parquet")[columns]

Option 2 only loads the columns we request. This is faster and has a lower peak
memory usage since the entire dataset isn't in memory at once.
Option 2 only loads the columns we request.

.. ipython:: python

%time _ = pd.read_parquet("timeseries_wide.parquet", columns=columns)
pd.read_parquet("timeseries_wide.parquet", columns=columns)

If we were to measure the memory usage of the two calls, we'd see that specifying
``columns`` uses about 1/10th the memory in this case.
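
A rough way to check that ratio yourself (a sketch; the exact numbers depend
on the data):

.. code-block:: python

   # In-memory footprint of the full frame vs. just the columns we need.
   full = pd.read_parquet("timeseries_wide.parquet")
   full.memory_usage(deep=True).sum()
   full[columns].memory_usage(deep=True).sum()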

With :func:`pandas.read_csv`, you can specify ``usecols`` to limit the columns
read into memory. Not all file formats that can be read by pandas provide an option
@@ -136,6 +97,14 @@ especially true for low-cardinality text data (columns with relatively few
unique values). By using more efficient data types you can store larger datasets
in memory.

.. ipython:: python

ts = pd.read_parquet("timeseries.parquet")
ts

Now, let's inspect the data types and memory usage to see where we should focus our
attention.

.. ipython:: python

ts.dtypes
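
To give a flavor of what this section builds toward, the conversions look
roughly like this (a sketch, assuming the ``ts`` frame above with its
``name`` and ``id`` columns):

.. code-block:: python

   ts2 = ts.copy()
   # Repeated text stored as a categorical uses a fraction of the memory.
   ts2["name"] = ts2["name"].astype("category")
   # The ids fit in a much smaller unsigned integer type than int64.
   ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
   ts2.memory_usage(deep=True)
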
@@ -186,8 +155,8 @@ Use chunking
------------

Some workloads can be achieved with chunking: splitting a large problem like "convert this
directory of CSVs to parquet" into a bunch of small problems ("convert this individual parquet
file into a CSV. Now repeat that for each file in this directory."). As long as each chunk
directory of CSVs to parquet" into a bunch of small problems ("convert this individual CSV
file into a Parquet file. Now repeat that for each file in this directory."). As long as each chunk
fits in memory, you can work with datasets that are much larger than memory.
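
For a single large CSV, pandas itself can chunk the read via ``chunksize``
(a sketch, assuming the per-chunk results can be combined with a simple
aggregation):

.. code-block:: python

   # Stream the file in bounded-memory pieces instead of loading it whole.
   reader = pd.read_csv("timeseries.csv", index_col="timestamp",
                        parse_dates=["timestamp"], chunksize=100_000)
   counts = pd.Series(dtype=int)
   for chunk in reader:
       counts = counts.add(chunk["name"].value_counts(), fill_value=0)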

.. note::
@@ -240,7 +209,7 @@ work for arbitrary-sized datasets.
.. ipython:: python

%%time
files = list(pathlib.Path("data/timeseries/").glob("ts*.parquet"))
files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
counts = pd.Series(dtype=int)
for path in files:
# Only one dataframe is in memory at a time...
@@ -264,7 +233,7 @@ Use other libraries

Pandas is just one library offering a DataFrame API. Because of its popularity,
pandas' API has become something of a standard that other libraries implement.
The pandas documentation maintains a list of libraries implemetning a DataFrame API
The pandas documentation maintains a list of libraries implementing a DataFrame API
in :ref:`our ecosystem page <ecosystem.out-of-core>`.

For example, `Dask`_, a parallel computing library, has `dask.dataframe`_, a
@@ -337,13 +306,14 @@ parallel. We can also connect to a cluster to distribute the work on many
machines. In this case we'll connect to a local "cluster" made up of several
processes on this single machine.

.. ipython:: python
.. code-block:: python

from dask.distributed import Client, LocalCluster
>>> from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
client
>>> cluster = LocalCluster()
>>> client = Client(cluster)
>>> client
<Client: 'tcp://127.0.0.1:53349' processes=4 threads=8, memory=17.18 GB>

Once this ``client`` is created, all of Dask's computation will take place on
the cluster (which is just processes in this case).
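
A sketch of running a computation against that cluster (assuming the
partitioned ``data/timeseries/`` directory from the chunking example and
``dask.dataframe`` installed):

.. code-block:: python

   >>> import dask.dataframe as dd

   >>> ddf = dd.read_parquet("data/timeseries/")
   >>> ddf["name"].value_counts().compute()  # executed on the cluster's workers
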
@@ -364,6 +334,10 @@ we need to supply the divisions manually.

.. ipython:: python

N = 12
starts = [f'20{i:>02d}-01-01' for i in range(N)]
ends = [f'20{i:>02d}-12-31' for i in range(N)]

divisions = tuple(pd.to_datetime(starts)) + (pd.Timestamp(ends[-1]),)
ddf.divisions = divisions
ddf
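
Once the divisions are known, label-based slicing along the index can skip
partitions that cannot contain the requested dates (a sketch; the dates are
illustrative):

.. code-block:: python

   # Only the partition holding 2002 needs to be read for this slice.
   ddf.loc["2002-01-01":"2002-12-31"]
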
@@ -395,10 +369,5 @@ datasets.

You can see more Dask examples at https://examples.dask.org.

.. ipython:: python
:suppress:

del client, cluster

.. _Dask: https://dask.org
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
7 changes: 5 additions & 2 deletions environment.yml
@@ -35,8 +35,11 @@ dependencies:
- nbconvert>=5.4.1
- nbsphinx
- pandoc
- dask
- distributed
- dask-core
- toolz>=0.7.3
- fsspec>=0.5.1
- partd>=0.3.10
- cloudpickle>=0.2.1

Member commented:

Can you add a comment here that those last 4 ones are just dependencies of dask.dataframe?


# web (jinja2 is also needed, but it's also an optional pandas dependency)
- markdown
3 changes: 2 additions & 1 deletion pandas/tests/groupby/test_categorical.py
@@ -782,7 +782,8 @@ def test_categorical_no_compress():

def test_sort():

# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby # noqa: flake8
# http://stackoverflow.com/questions/23814368/
# sorting-pandas-categorical-labels-after-groupby # noqa: flake8

Member commented:

This is fixed on master (it's also what is causing the merge conflict)

# This should result in a properly sorted Series so that the plot
# has a sorted x axis
# self.cat.groupby(['value_group'])['value_group'].count().plot(kind='bar')
7 changes: 5 additions & 2 deletions requirements-dev.txt
@@ -17,8 +17,11 @@ numpydoc>=0.9.0
nbconvert>=5.4.1
nbsphinx
pandoc
dask
distributed
dask-core
toolz>=0.7.3
fsspec>=0.5.1
partd>=0.3.10
cloudpickle>=0.2.1
markdown
feedparser
pyyaml
2 changes: 1 addition & 1 deletion scripts/generate_pip_deps_from_conda.py
@@ -20,7 +20,7 @@
import yaml

EXCLUDE = {"python=3"}
RENAME = {"pytables": "tables", "pyqt": "pyqt5"}
RENAME = {"pytables": "tables", "pyqt": "pyqt5", "dask-core": "dask"}


def conda_package_to_pip(package):