
DOC-#57585: Add Use Modin section on Scaling to large datasets page #57586


Merged
merged 8 commits into from
Mar 7, 2024
102 changes: 102 additions & 0 deletions doc/source/user_guide/scale.rst
@@ -374,5 +374,107 @@ datasets.

You can see more Dask examples at https://examples.dask.org.

Use Modin
Contributor:

Just my two cents. I would just simply remove both Dask and Modin here, add a link to

A lot of the information here is duplicated between Dask and Modin, and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements; they all offer out-of-core execution across multiple nodes, partitioning, and scalability.

Contributor:

Or, I would at least deduplicate them here if we want to keep those sections. e.g.,

# Using third-party libraries

multiple nodes blah blah..

## Dask

oneliner explanation

Link

## Modin

oneliner explanation

Link

## PySpark: Pandas API on Spark

oneliner explanation

Link


Contributor Author:

A lot of information here is duplicated with Dask vs Modin and others too such as Pandas API on Spark (Koalas). They all aim drop-in replacement, they all out-of-core in multiple nodes, partitioning, scalability.

I would argue against the statement "They all aim drop-in replacement". Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks that the Scaling to large datasets page should only contain pandas-related stuff, I can go ahead and remove the Modin and Dask related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?

Contributor:

I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.

---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement for pandas and
Member:

"has a drop-in" -> "aims to be a drop-in" would be more accurate

Contributor Author:

@jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using the pandas API after replacing the import statement.

Member:

In my experience working on Modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these mismatches.

Contributor Author:

I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.

provides the ability to scale pandas workflows across the nodes and CPUs available. It is also able
Member:

Small thing, but maybe you can add the part about execution engines here, since in the next paragraph you say Modin distributes computation across nodes and CPUs available, which seems to pretty much repeat the same thing again; that would contextualize this part.

Contributor Author:

I wasn't able to rephrase the first and second paragraphs following your comment. To me, the current state of the paragraphs is pretty concise.

Member:

To me, this intro paragraph aims to be a high level description of Modin. With that in mind, getting into execution engines here seems undesirable.

to work with larger than memory datasets. To start working with Modin you just need
to replace a single line of code, namely, the import statement.

.. code-block:: ipython

# import pandas as pd
import modin.pandas as pd

After you have changed the import statement, you can proceed using the well-known pandas API
to scale computation. Modin distributes computation across the nodes and CPUs available, utilizing
an execution engine it runs on. As of Modin 0.27.0, the following execution engines are supported:
Ray_, Dask_, `MPI through unidist`_, and HDK_. Modin partitions a DataFrame along both columns and
rows, which gives it flexibility and scalability in both the number of columns and the number of rows.
Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas
and perform a simple operation on the data.

.. code-block:: ipython

import pandas
import modin.pandas as pd
import numpy as np

array = np.random.uniform(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
filename = "example.csv"
np.savetxt(filename, array, delimiter=",")

%time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
Wall time: 52.5 s
%time pandas_df = pandas_df.map(lambda x: x + 0.01)
CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
Wall time: 56.5 s
Member:

I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

Contributor Author:

I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

Applied .. code-block:: ipython. Thanks.

In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

I understand that the binary operation is much more efficient than applying a function with map. Since the user may apply any function on the data, I wanted to show how it performs with pandas and Modin. Do you mind if we leave this as is?

Member (@rhshadrach, Feb 26, 2024):

Do you mind we leave this as is?

It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas" - a statement which is not true. In fact, I'm seeing 0.26 seconds when doing pandas_df + 0.01 on my machine. Because of this - I do mind.

I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin - this could be stated up front explicitly and a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.

Contributor Author:

Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.


%time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s
Wall time: 17.5 s
%time modin_df = modin_df.map(lambda x: x + 0.01)
CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s
Wall time: 2.54 s

We can see that Modin was able to perform the operations much faster than pandas due to distributing execution.
Even though Modin aims to speed up every single pandas operation, there are cases when pandas outperforms it.
This might happen if the data size is relatively small or if Modin hasn't yet implemented a certain operation
in an efficient way. Also, for-loops are an antipattern for Modin, since Modin was initially designed to efficiently
handle heavy tasks rather than a large number of small ones. Still, Modin is actively working on eliminating all these
drawbacks. What you can do for now in such cases is to use pandas where it is more beneficial than Modin.

.. code-block:: ipython

from modin.pandas.io import to_pandas

%%time
pandas_subset = pandas_df.iloc[:100000]
for col in pandas_subset.columns:
pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms
Wall time: 293 ms

%%time
modin_subset = modin_df.iloc[:100000]
for col in modin_subset.columns:
modin_subset[col] = modin_subset[col] / modin_subset[col].sum()
CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s
Wall time: 20.9 s

%%time
pandas_subset = to_pandas(modin_df.iloc[:100000])
for col in pandas_subset.columns:
pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
CPU times: user 566 ms, sys: 279 ms, total: 845 ms
Wall time: 731 ms

You could also rewrite this code a bit to get the same result with much less execution time.

.. code-block:: ipython

%%time
pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum(axis=0)
CPU times: user 105 ms, sys: 97.5 ms, total: 202 ms
Wall time: 72.2 ms

%%time
modin_subset = modin_df.iloc[:100000]
modin_subset = modin_subset / modin_subset.sum(axis=0)
CPU times: user 531 ms, sys: 134 ms, total: 666 ms
Wall time: 374 ms
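As a sanity check (not part of the original diff), the loop-based and vectorized normalizations above compute the same result; a small pure-pandas sketch:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the subset used in the timings above.
df = pd.DataFrame(np.random.rand(100, 4), columns=[f"col{i}" for i in range(4)])

# Column-wise loop, as in the first snippet.
looped = df.copy()
for col in looped.columns:
    looped[col] = looped[col] / looped[col].sum()

# Single vectorized expression, as in the rewritten snippet.
vectorized = df / df.sum(axis=0)

# Both approaches normalize every column to sum to 1.
assert np.allclose(looped.to_numpy(), vectorized.to_numpy())
assert np.allclose(vectorized.sum(axis=0), 1.0)
```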

For more information refer to `Modin's documentation`_ or `Modin's tutorials`_.
Member:

Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)

In my opinion, just: "For more information refer to Modin's documentation_ or the Modin's tutorials_. " would feel more appropriate.

Member:

+1

Contributor Author:

Agree with you. Changed.


.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html