@@ -374,5 +374,101 @@ datasets.
374
374
375
375
You see more dask examples at https://examples.dask.org.
376
376
377
+ Use Modin
378
+ ---------
379
+
380
+ Modin _ is a scalable dataframe library, which has a drop-in replacement API for pandas and
381
+ provides the ability to scale pandas workflows across nodes and CPUs available and
382
+ to work with larger than memory datasets. To start working with Modin you just need
383
+ to replace a single line of code, namely, the import statement.
384
+
385
+ .. ipython :: python
386
+
387
+ # import pandas as pd
388
+ import modin.pandas as pd
389
+
390
+ After you have changed the import statement, you can proceed using the well-known pandas API
391
+ to scale computation. Modin distributes computation across nodes and CPUs available utilizing
392
+ an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported
393
+ in Modin: Ray _, Dask _, `MPI through unidist `_, HDK _. The partitioning schema of a Modin DataFrame partitions it
394
+ along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
395
+ the number of rows. Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas
396
+ and perform a simple operation on the data.
397
+
398
+ .. ipython :: python
399
+
400
+ import pandas
401
+ import modin.pandas as pd
402
+ import numpy as np
403
+
404
+ array = np.random.randint(low = 0.1 , high = 1.0 , size = (2 ** 20 , 2 ** 8 ))
405
+ filename = " example.csv"
406
+ np.savetxt(filename, array, delimiter = " ," )
407
+
408
+ % time pandas_df = pandas.read_csv(filename, names = [f " col { i} " for i in range (2 ** 8 )])
409
+ CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
410
+ Wall time: 52.5 s
411
+ % time pandas_df = pandas_df.map(lambda x : x + 0.01 )
412
+ CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
413
+ Wall time: 56.5 s
414
+
415
+ % time modin_df = pd.read_csv(filename, names = [f " col { i} " for i in range (2 ** 8 )])
416
+ CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s
417
+ Wall time: 17.5 s
418
+ % time modin_df = modin_df.map(lambda x : x + 0.01 )
419
+ CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s
420
+ Wall time: 2.54 s
421
+
422
+ We can see that Modin has been able to perform the operations much faster than pandas due to distributing execution.
423
+ Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms.
424
+ It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation
425
+ in an efficient way. Also, for-loops is an antipattern for Modin since Modin has been designed to efficiently handle
426
+ heavy tasks, rather a small number of small ones. What you can do in such a case is to use pandas for the cases
427
+ where it is more beneficial than Modin.
428
+
429
+ .. ipython :: python
430
+
431
+ from modin.pandas.io import to_pandas
432
+
433
+ %% time
434
+ pandas_subset = pandas_df.iloc[:100000 ]
435
+ for col in pandas_subset.columns:
436
+ pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
437
+ CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms
438
+ Wall time: 293 ms
439
+
440
+ %% time
441
+ modin_subset = modin_df.iloc[:100000 ]
442
+ for col in modin_subset.columns:
443
+ modin_subset[col] = modin_subset[col] / modin_subset[col].sum()
444
+ CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s
445
+ Wall time: 20.9 s
446
+
447
+ %% time
448
+ pandas_subset = to_pandas(modin_df.iloc[:100000 ])
449
+ for col in pandas_subset.columns:
450
+ pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
451
+ CPU times: user 566 ms, sys: 279 ms, total: 845 ms
452
+ Wall time: 731 ms
453
+
454
+ You could also rewrite this code a bit to get the same result with much less execution time.
455
+
456
+ .. ipython :: python
457
+
458
+ %% time
459
+ modin_subset = modin_df.iloc[:100000 ]
460
+ modin_subset = modin_subset / modin_subset.sum(axis = 0 )
461
+ CPU times: user 531 ms, sys: 134 ms, total: 666 ms
462
+ Wall time: 374 ms
463
+
464
+ For more information refer to `Modin's documentation `_ or dip into `Modin's tutorials `_ right away
465
+ to start scaling pandas operations with Modin and an execution engine you like.
466
+
467
+ .. _Modin : https://github.com/modin-project/modin
468
+ .. _`Modin's documetation` : https://modin.readthedocs.io/en/latest
469
+ .. _`Modin's tutorials` : https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
470
+ .. _Ray : https://github.com/ray-project/ray
377
471
.. _Dask : https://dask.org
472
+ .. _`MPI through unidist` : https://github.com/modin-project/unidist
473
+ .. _HDK : https://github.com/intel-ai/hdk
378
474
.. _dask.dataframe : https://docs.dask.org/en/latest/dataframe.html
0 commit comments