pandas-dev · mroeschke · Mar 15, 2024 · Mar 13, 2024 · Mar 14, 2024 · Mar 14, 2024
diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
@@ -217,190 +217,10 @@ require too sophisticated of operations. Some operations, like :meth:`pandas.Dat
 much harder to do chunkwise. In these cases, you may be better switching to a
 different library that implements these out-of-core algorithms for you.
 
-.. _scale.other_libraries:
-
-Use Dask
---------
+Use Other Libraries
+-------------------
 
 pandas is just one library offering a DataFrame API. Because of its popularity,
 pandas' API has become something of a standard that other libraries implement.
 The pandas documentation maintains a list of libraries implementing a DataFrame API
-in `the ecosystem page <https://pandas.pydata.org/community/ecosystem.html>`_.
-
-For example, `Dask`_, a parallel computing library, has `dask.dataframe`_, a
-pandas-like API for working with larger than memory datasets in parallel. Dask
-can use multiple threads or processes on a single machine, or a cluster of
-machines to process data in parallel.
-
-
-We'll import ``dask.dataframe`` and notice that the API feels similar to pandas.
-We can use Dask's ``read_parquet`` function, but provide a globstring of files to read in.
-
-.. ipython:: python
-   :okwarning:
-
-   import dask.dataframe as dd
-
-   ddf = dd.read_parquet("data/timeseries/ts*.parquet", engine="pyarrow")
-   ddf
-
-Inspecting the ``ddf`` object, we see a few things
-
-* There are familiar attributes like ``.columns`` and ``.dtypes``
-* There are familiar methods like ``.groupby``, ``.sum``, etc.
-* There are new attributes like ``.npartitions`` and ``.divisions``
-
-The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a
-Dask DataFrame ends up making many pandas method calls, and Dask knows how to
-coordinate everything to get the result.
-
-.. ipython:: python
-
-   ddf.columns
-   ddf.dtypes
-   ddf.npartitions
-
-One major difference: the ``dask.dataframe`` API is *lazy*. If you look at the
-repr above, you'll notice that the values aren't actually printed out; just the
-column names and dtypes. That's because Dask hasn't actually read the data yet.
-Rather than executing immediately, doing operations build up a **task graph**.
-
-.. ipython:: python
-   :okwarning:
-
-   ddf
-   ddf["name"]
-   ddf["name"].value_counts()
-
-Each of these calls is instant because the result isn't being computed yet.
-We're just building up a list of computation to do when someone needs the
-result. Dask knows that the return type of a :class:`pandas.Series.value_counts`
-is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
-returns a Dask Series with the same dtype and the same name.
-
-To get the actual result you can call ``.compute()``.
-
-.. ipython:: python
-   :okwarning:
-
-   %time ddf["name"].value_counts().compute()
-
-At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas :class:`pandas.Series` with the count of each ``name``.
-
-Calling ``.compute`` causes the full task graph to be executed. This includes
-reading the data, selecting the columns, and doing the ``value_counts``. The
-execution is done *in parallel* where possible, and Dask tries to keep the
-overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
-
-By default, ``dask.dataframe`` operations use a threadpool to do operations in
-parallel. We can also connect to a cluster to distribute the work on many
-machines. In this case we'll connect to a local "cluster" made up of several
-processes on this single machine.
-
-.. code-block:: python
-
-   >>> from dask.distributed import Client, LocalCluster
-
-   >>> cluster = LocalCluster()
-   >>> client = Client(cluster)
-   >>> client
-   <Client: 'tcp://127.0.0.1:53349' processes=4 threads=8, memory=17.18 GB>
-
-Once this ``client`` is created, all of Dask's computation will take place on
-the cluster (which is just processes in this case).
-
-Dask implements the most used parts of the pandas API. For example, we can do
-a familiar groupby aggregation.
-
-.. ipython:: python
-   :okwarning:
-
-   %time ddf.groupby("name")[["x", "y"]].mean().compute().head()
-
-The grouping and aggregation is done out-of-core and in parallel.
-
-When Dask knows the ``divisions`` of a dataset, certain optimizations are
-possible. When reading parquet datasets written by dask, the divisions will be
-known automatically. In this case, since we created the parquet files manually,
-we need to supply the divisions manually.
-
-.. ipython:: python
-   :okwarning:
-
-   N = 12
-   starts = [f"20{i:>02d}-01-01" for i in range(N)]
-   ends = [f"20{i:>02d}-12-13" for i in range(N)]
-
-   divisions = tuple(pd.to_datetime(starts)) + (pd.Timestamp(ends[-1]),)
-   ddf.divisions = divisions
-   ddf
-
-Now we can do things like fast random access with ``.loc``.
-
-.. ipython:: python
-   :okwarning:
-
-   ddf.loc["2002-01-01 12:01":"2002-01-01 12:05"].compute()
-
-Dask knows to just look in the 3rd partition for selecting values in 2002. It
-doesn't need to look at any other data.
-
-Many workflows involve a large amount of data and processing it in a way that
-reduces the size to something that fits in memory. In this case, we'll resample
-to daily frequency and take the mean. Once we've taken the mean, we know the
-results will fit in memory, so we can safely call ``compute`` without running
-out of memory. At that point it's just a regular pandas object.
-
-.. ipython:: python
-   :okwarning:
-
-   @savefig dask_resample.png
-   ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()
-
-.. ipython:: python
-   :suppress:
-
-   import shutil
-
-   shutil.rmtree("data/timeseries")
-
-These Dask examples have all be done using multiple processes on a single
-machine. Dask can be `deployed on a cluster
-<https://docs.dask.org/en/latest/setup.html>`_ to scale up to even larger
-datasets.
-
-You see more dask examples at https://examples.dask.org.
-
-Use Modin
----------
-
-Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and
-provides the ability to scale pandas workflows across nodes and CPUs available. It is also able
-to work with larger than memory datasets. To start working with Modin you just need
-to replace a single line of code, namely, the import statement.
-
-.. code-block:: ipython
-
-   # import pandas as pd
-   import modin.pandas as pd
-
-After you have changed the import statement, you can proceed using the well-known pandas API
-to scale computation. Modin distributes computation across nodes and CPUs available utilizing
-an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported
-in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it
-along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
-the number of rows.
-
-For more information refer to `Modin's documentation`_ or the `Modin's tutorials`_.
-
-.. _Modin: https://github.com/modin-project/modin
-.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
-.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
-.. _Ray: https://github.com/ray-project/ray
-.. _Dask: https://dask.org
-.. _`MPI through unidist`: https://github.com/modin-project/unidist
-.. _HDK: https://github.com/intel-ai/hdk
-.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
+in `the ecosystem page <https://pandas.pydata.org/community/ecosystem.html#out-of-core>`_.
diff --git a/environment.yml b/environment.yml
@@ -60,9 +60,8 @@ dependencies:
   - zstandard>=0.19.0
 
   # downstream packages
-  - dask-core<=2024.2.1
+  - dask-core
   - seaborn-base
-  - dask-expr<=0.5.3
 
   # local testing dependencies
   - moto

diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -47,9 +47,8 @@ xarray>=2022.12.0
 xlrd>=2.0.1
 xlsxwriter>=3.0.5
 zstandard>=0.19.0
-dask<=2024.2.1
+dask
 seaborn
-dask-expr<=0.5.3
 moto
 flask
 asv>=0.6.1