Commit 4580197

DOC-#57585: Add Use Modin section on Scaling to large datasets page
Signed-off-by: Igoshev, Iaroslav <[email protected]>
1 parent 2493681 commit 4580197

1 file changed: 96 insertions(+), 0 deletions(-)


doc/source/user_guide/scale.rst

@@ -374,5 +374,101 @@ datasets.

You can find more Dask examples at https://examples.dask.org.


Use Modin
---------

Modin_ is a scalable dataframe library that provides a drop-in replacement for the pandas
API and makes it possible to scale pandas workflows across the available nodes and CPUs,
as well as to work with larger-than-memory datasets. To start working with Modin you only
need to replace a single line of code: the import statement.

.. ipython:: python

   # import pandas as pd
   import modin.pandas as pd
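
Modin selects an execution engine automatically based on what is installed, but you can
also choose one explicitly. A minimal sketch, assuming Modin and the chosen engine are
installed (the ``MODIN_ENGINE`` variable and the ``modin.config`` API come from Modin's
documentation; double-check them against your Modin version):

.. code-block:: python

   import os

   # Option 1: set the engine before importing modin.pandas
   os.environ["MODIN_ENGINE"] = "dask"  # or "ray", "unidist"

   # Option 2 (equivalent): use Modin's config API
   # import modin.config as cfg
   # cfg.Engine.put("dask")

   # import modin.pandas as pd  # now runs on the chosen engine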

After you have changed the import statement, you can proceed using the familiar pandas
API, now at scale. Modin distributes computation across the available nodes and CPUs
through the execution engine it runs on. As of Modin 0.27.0, the following execution
engines are supported: Ray_, Dask_, `MPI through unidist`_, and HDK_. A Modin DataFrame
is partitioned along both columns and rows, which gives Modin flexibility and scalability
in both the number of columns and the number of rows. Let's take a look at how we can
read data from a CSV file with Modin the same way as with pandas and perform a simple
operation on the data.

.. code-block:: ipython

   import pandas
   import modin.pandas as pd
   import numpy as np

   array = np.random.uniform(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
   filename = "example.csv"
   np.savetxt(filename, array, delimiter=",")

   %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
   Wall time: 52.5 s
   %time pandas_df = pandas_df.map(lambda x: x + 0.01)
   CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
   Wall time: 56.5 s

   %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s
   Wall time: 17.5 s
   %time modin_df = modin_df.map(lambda x: x + 0.01)
   CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s
   Wall time: 2.54 s
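
The two-axis partitioning mentioned above can be illustrated with a small pure-Python
sketch (a toy model for intuition only, not Modin's actual internals):

.. code-block:: python

   # Toy sketch of 2D block partitioning: split a matrix into
   # row x column blocks, the way Modin partitions a DataFrame
   # along both axes.
   def partition_2d(matrix, row_parts, col_parts):
       n_rows, n_cols = len(matrix), len(matrix[0])
       row_step = -(-n_rows // row_parts)  # ceiling division
       col_step = -(-n_cols // col_parts)
       blocks = []
       for r in range(0, n_rows, row_step):
           row_of_blocks = []
           for c in range(0, n_cols, col_step):
               row_of_blocks.append(
                   [row[c:c + col_step] for row in matrix[r:r + row_step]]
               )
           blocks.append(row_of_blocks)
       return blocks

   matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
   blocks = partition_2d(matrix, 2, 2)
   print(len(blocks), len(blocks[0]))  # 2 2
   print(blocks[0][0])                 # [[0, 1], [4, 5]]

Each block could then be processed by a separate worker, which is why the partitioning
scheme scales in both the number of rows and the number of columns.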

We can see that Modin performed these operations much faster than pandas thanks to
distributed execution. Even though Modin aims to speed up every single pandas operation,
there are cases where pandas still outperforms it. This can happen when the data size is
relatively small or when Modin hasn't yet implemented a certain operation efficiently.
Also, for-loops are an antipattern for Modin, since it was designed to efficiently handle
a small number of heavy tasks rather than a large number of small ones. Modin is actively
working on eliminating these drawbacks; for now, you can simply fall back to pandas in
the cases where it is more beneficial than Modin.

.. code-block:: ipython

   from modin.pandas.io import to_pandas

   %%time
   pandas_subset = pandas_df.iloc[:100000]
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms
   Wall time: 293 ms

   %%time
   modin_subset = modin_df.iloc[:100000]
   for col in modin_subset.columns:
       modin_subset[col] = modin_subset[col] / modin_subset[col].sum()
   CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s
   Wall time: 20.9 s

   %%time
   pandas_subset = to_pandas(modin_df.iloc[:100000])
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 566 ms, sys: 279 ms, total: 845 ms
   Wall time: 731 ms

You could also rewrite this code a bit to get the same result with much less execution
time.

.. code-block:: ipython

   %%time
   modin_subset = modin_df.iloc[:100000]
   modin_subset = modin_subset / modin_subset.sum(axis=0)
   CPU times: user 531 ms, sys: 134 ms, total: 666 ms
   Wall time: 374 ms
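
The speed-up in the last snippet generalizes: when every dispatched operation carries a
fixed scheduling overhead, one large operation beats many small ones. A toy cost model
with made-up numbers (illustrative only, not Modin measurements):

.. code-block:: python

   # Toy cost model: a fixed per-operation dispatch overhead makes
   # many small operations slower than one large vectorized one.
   OVERHEAD = 0.05      # assumed fixed cost to dispatch one operation (s)
   PER_ELEMENT = 1e-7   # assumed cost per element processed (s)

   def cost(n_ops, elements_per_op):
       return n_ops * (OVERHEAD + elements_per_op * PER_ELEMENT)

   n_elements = 256 * 100_000
   per_column = cost(n_ops=256, elements_per_op=100_000)     # loop over columns
   whole_frame = cost(n_ops=1, elements_per_op=n_elements)   # one vectorized op

   print(f"{per_column:.2f}s vs {whole_frame:.2f}s")

With these assumed constants the column-by-column loop pays the dispatch overhead 256
times, while the whole-frame operation pays it once, which mirrors why the vectorized
rewrite above is faster.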

For more information, refer to `Modin's documentation`_ or dive into `Modin's tutorials`_
right away to start scaling pandas operations with Modin and the execution engine you
like.

.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
