From ec5ae9cfb77463cd47f4644f59442dfd70b341ea Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Wed, 13 Mar 2024 16:39:07 -0700 Subject: [PATCH 1/8] remove Use Dask adn Use Modin sections --- doc/source/user_guide/scale.rst | 188 -------------------------------- 1 file changed, 188 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 080f8484ce969..61ea07851e28f 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -216,191 +216,3 @@ Manually chunking is an OK option for workflows that don't require too sophisticated of operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you. - -.. _scale.other_libraries: - -Use Dask --------- - -pandas is just one library offering a DataFrame API. Because of its popularity, -pandas' API has become something of a standard that other libraries implement. -The pandas documentation maintains a list of libraries implementing a DataFrame API -in `the ecosystem page `_. - -For example, `Dask`_, a parallel computing library, has `dask.dataframe`_, a -pandas-like API for working with larger than memory datasets in parallel. Dask -can use multiple threads or processes on a single machine, or a cluster of -machines to process data in parallel. - - -We'll import ``dask.dataframe`` and notice that the API feels similar to pandas. -We can use Dask's ``read_parquet`` function, but provide a globstring of files to read in. - -.. ipython:: python - :okwarning: - - import dask.dataframe as dd - - ddf = dd.read_parquet("data/timeseries/ts*.parquet", engine="pyarrow") - ddf - -Inspecting the ``ddf`` object, we see a few things - -* There are familiar attributes like ``.columns`` and ``.dtypes`` -* There are familiar methods like ``.groupby``, ``.sum``, etc. -* There are new attributes like ``.npartitions`` and ``.divisions`` - -The partitions and divisions are how Dask parallelizes computation. A **Dask** -DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a -Dask DataFrame ends up making many pandas method calls, and Dask knows how to -coordinate everything to get the result. - -.. ipython:: python - - ddf.columns - ddf.dtypes - ddf.npartitions - -One major difference: the ``dask.dataframe`` API is *lazy*. If you look at the -repr above, you'll notice that the values aren't actually printed out; just the -column names and dtypes. That's because Dask hasn't actually read the data yet. -Rather than executing immediately, doing operations build up a **task graph**. - -.. ipython:: python - :okwarning: - - ddf - ddf["name"] - ddf["name"].value_counts() - -Each of these calls is instant because the result isn't being computed yet. -We're just building up a list of computation to do when someone needs the -result. Dask knows that the return type of a :class:`pandas.Series.value_counts` -is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version -returns a Dask Series with the same dtype and the same name. - -To get the actual result you can call ``.compute()``. - -.. ipython:: python - :okwarning: - - %time ddf["name"].value_counts().compute() - -At that point, you get back the same thing you'd get with pandas, in this case -a concrete pandas :class:`pandas.Series` with the count of each ``name``. - -Calling ``.compute`` causes the full task graph to be executed. This includes -reading the data, selecting the columns, and doing the ``value_counts``. The -execution is done *in parallel* where possible, and Dask tries to keep the -overall memory footprint small. You can work with datasets that are much larger -than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory. - -By default, ``dask.dataframe`` operations use a threadpool to do operations in -parallel. We can also connect to a cluster to distribute the work on many -machines. In this case we'll connect to a local "cluster" made up of several -processes on this single machine. - -.. code-block:: python - - >>> from dask.distributed import Client, LocalCluster - - >>> cluster = LocalCluster() - >>> client = Client(cluster) - >>> client - - -Once this ``client`` is created, all of Dask's computation will take place on -the cluster (which is just processes in this case). - -Dask implements the most used parts of the pandas API. For example, we can do -a familiar groupby aggregation. - -.. ipython:: python - :okwarning: - - %time ddf.groupby("name")[["x", "y"]].mean().compute().head() - -The grouping and aggregation is done out-of-core and in parallel. - -When Dask knows the ``divisions`` of a dataset, certain optimizations are -possible. When reading parquet datasets written by dask, the divisions will be -known automatically. In this case, since we created the parquet files manually, -we need to supply the divisions manually. - -.. ipython:: python - :okwarning: - - N = 12 - starts = [f"20{i:>02d}-01-01" for i in range(N)] - ends = [f"20{i:>02d}-12-13" for i in range(N)] - - divisions = tuple(pd.to_datetime(starts)) + (pd.Timestamp(ends[-1]),) - ddf.divisions = divisions - ddf - -Now we can do things like fast random access with ``.loc``. - -.. ipython:: python - :okwarning: - - ddf.loc["2002-01-01 12:01":"2002-01-01 12:05"].compute() - -Dask knows to just look in the 3rd partition for selecting values in 2002. It -doesn't need to look at any other data. - -Many workflows involve a large amount of data and processing it in a way that -reduces the size to something that fits in memory. In this case, we'll resample -to daily frequency and take the mean. Once we've taken the mean, we know the -results will fit in memory, so we can safely call ``compute`` without running -out of memory. At that point it's just a regular pandas object. - -.. ipython:: python - :okwarning: - - @savefig dask_resample.png - ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot() - -.. ipython:: python - :suppress: - - import shutil - - shutil.rmtree("data/timeseries") - -These Dask examples have all be done using multiple processes on a single -machine. Dask can be `deployed on a cluster -`_ to scale up to even larger -datasets. - -You see more dask examples at https://examples.dask.org. - -Use Modin ---------- - -Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and -provides the ability to scale pandas workflows across nodes and CPUs available. It is also able -to work with larger than memory datasets. To start working with Modin you just need -to replace a single line of code, namely, the import statement. - -.. code-block:: ipython - - # import pandas as pd - import modin.pandas as pd - -After you have changed the import statement, you can proceed using the well-known pandas API -to scale computation. Modin distributes computation across nodes and CPUs available utilizing -an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported -in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it -along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and -the number of rows. - -For more information refer to `Modin's documentation`_ or the `Modin's tutorials`_. - -.. _Modin: https://github.com/modin-project/modin -.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest -.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution -.. _Ray: https://github.com/ray-project/ray -.. _Dask: https://dask.org -.. _`MPI through unidist`: https://github.com/modin-project/unidist -.. _HDK: https://github.com/intel-ai/hdk -.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html From 1ff37d172ecd55b0f2598cfb467164fdbd1c7412 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Thu, 14 Mar 2024 10:07:21 -0700 Subject: [PATCH 2/8] add a new section: Use Other Libraries and link to Out-of-core section in Ecosystem web page --- doc/source/user_guide/scale.rst | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 61ea07851e28f..29d6b47708bb3 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -216,3 +216,11 @@ Manually chunking is an OK option for workflows that don't require too sophisticated of operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you. + +Use Other Libraries +------------------- + +pandas is just one library offering a DataFrame API. Because of its popularity, +pandas' API has become something of a standard that other libraries implement. +The pandas documentation maintains a list of libraries implementing a DataFrame API +in `the ecosystem page `_. From 08b6ec0db4ff26304e8916a90203973689fa3691 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Thu, 14 Mar 2024 10:12:34 -0700 Subject: [PATCH 3/8] remove dask-expr --- environment.yml | 1 - requirements-dev.txt | 1 - 2 files changed, 2 deletions(-) diff --git a/environment.yml b/environment.yml index edc0eb88eeb0c..5e4c775e70686 100644 --- a/environment.yml +++ b/environment.yml @@ -62,7 +62,6 @@ dependencies: # downstream packages - dask-core<=2024.2.1 - seaborn-base - - dask-expr<=0.5.3 # local testing dependencies - moto diff --git a/requirements-dev.txt b/requirements-dev.txt index 580390b87032f..009dceaf72872 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -49,7 +49,6 @@ xlsxwriter>=3.0.5 zstandard>=0.19.0 dask<=2024.2.1 seaborn -dask-expr<=0.5.3 moto flask asv>=0.6.1 From 504f02fb384e689939bbe07d038c03e4df7c0ad3 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Thu, 14 Mar 2024 13:53:38 -0700 Subject: [PATCH 4/8] remove version pinning from dask and dask-core --- environment.yml | 2 +- requirements-dev.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/environment.yml b/environment.yml index 5e4c775e70686..3528f12c66a8b 100644 --- a/environment.yml +++ b/environment.yml @@ -60,7 +60,7 @@ dependencies: - zstandard>=0.19.0 # downstream packages - - dask-core<=2024.2.1 + - dask-core - seaborn-base # local testing dependencies diff --git a/requirements-dev.txt b/requirements-dev.txt index 009dceaf72872..40c7403cb88e8 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -47,7 +47,7 @@ xarray>=2022.12.0 xlrd>=2.0.1 xlsxwriter>=3.0.5 zstandard>=0.19.0 -dask<=2024.2.1 +dask seaborn moto flask From eb77c42b9342b476c4c3cb4cf5312fcbdc9495fe Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Fri, 15 Mar 2024 09:29:32 -0700 Subject: [PATCH 5/8] put other libraries label back in --- doc/source/user_guide/scale.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 29d6b47708bb3..d2f51bd694879 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -217,6 +217,8 @@ require too sophisticated of operations. Some operations, like :meth:`pandas.Dat much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you. +.. _scale.other_libraries: + Use Other Libraries ------------------- From 1e5bc9f9873a7e2f61053c00c88e30aa71cee6e0 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Fri, 15 Mar 2024 10:14:45 -0700 Subject: [PATCH 6/8] update use other libraries description to have a better transfer to ecosystem page --- doc/source/user_guide/scale.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index d2f51bd694879..2e04997b2414e 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -156,7 +156,7 @@ fits in memory, you can work with datasets that are much larger than memory. Chunking works well when the operation you're performing requires zero or minimal coordination between chunks. For more complicated workflows, you're better off - :ref:`using another library `. + :ref:`using other libraries `. Suppose we have an even larger "logical dataset" on disk that's a directory of parquet files. Each file in the directory represents a different year of the entire dataset. @@ -222,7 +222,7 @@ different library that implements these out-of-core algorithms for you. Use Other Libraries ------------------- -pandas is just one library offering a DataFrame API. Because of its popularity, -pandas' API has become something of a standard that other libraries implement. -The pandas documentation maintains a list of libraries implementing a DataFrame API +There are many other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame, +but can give you the ability to scale your large dataset processing and analytics +by parallel runtime, distributed memory, clustering, etc. You can find more information in `the ecosystem page `_. From 1bd146b549f919a6ec85d741140392af9749a143 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Fri, 15 Mar 2024 12:34:33 -0700 Subject: [PATCH 7/8] change minor sentences for suggestions --- doc/source/user_guide/scale.rst | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 2e04997b2414e..35641d726025e 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -217,12 +217,15 @@ require too sophisticated of operations. Some operations, like :meth:`pandas.Dat much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you. -.. _scale.other_libraries: +.. _scale.other_libraries::q:wq + + + Use Other Libraries ------------------- -There are many other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame, -but can give you the ability to scale your large dataset processing and analytics +There are other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame, +and can give you the ability to scale your large dataset processing and analytics by parallel runtime, distributed memory, clustering, etc. You can find more information in `the ecosystem page `_. From cba7b81e4e5c88a5f144233bd042da2dd9988626 Mon Sep 17 00:00:00 2001 From: yukikitayama Date: Fri, 15 Mar 2024 13:24:00 -0700 Subject: [PATCH 8/8] remove unnecessary characters --- doc/source/user_guide/scale.rst | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 35641d726025e..29df2994fbc35 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -217,10 +217,7 @@ require too sophisticated of operations. Some operations, like :meth:`pandas.Dat much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you. -.. _scale.other_libraries::q:wq - - - +.. _scale.other_libraries: Use Other Libraries -------------------