DOC-#57585: Add "Use Modin" section on "Scaling to large datasets" page (#57586)
@@ -374,5 +374,33 @@ datasets.

You can see more Dask examples at https://examples.dask.org.

Use Modin
---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement for pandas,
providing the ability to scale pandas workflows across the available nodes and CPUs. It is also able
Review comments:

- "Small thing, but maybe you can add the part about execution engines here, since in the next paragraph you say"
- "I wasn't able to rephrase the first and the second paragraphs by following your comment. To me, the current state of the paragraphs is pretty concise."
- "To me, this intro paragraph aims to be a high-level description of Modin. With that in mind, getting into execution engines here seems undesirable."
to work with larger-than-memory datasets. To start working with Modin, you only need
to replace a single line of code, namely, the import statement.

.. code-block:: ipython

   # import pandas as pd
   import modin.pandas as pd

After you have changed the import statement, you can proceed using the well-known pandas API
to scale computation. Modin distributes computation across the available nodes and CPUs by
utilizing the execution engine it runs on. As of Modin 0.27.0, the following execution engines
are supported: Ray_, Dask_, `MPI through unidist`_, and HDK_. A Modin DataFrame is partitioned
along both columns and rows, which gives Modin flexibility and scalability in both the number
of columns and the number of rows.
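A minimal sketch of the "change one import, keep the rest" workflow described above. The fallback to plain pandas is an assumption added here so the snippet stays self-contained when Modin is not installed; both branches expose the same API for this example.

```python
try:
    # Drop-in replacement: distributes work across the configured engine.
    import modin.pandas as pd
except ImportError:
    # Same API, single-process fallback (assumption for illustration only).
    import pandas as pd

# Ordinary pandas-style code runs unchanged after the import swap.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
result = df.groupby("key")["value"].sum()
print(result.to_dict())  # {'a': 4, 'b': 6}
```

The point of the example is that nothing below the import line needs to change, which is the property the review discussion below contrasts with Dask and Koalas.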
Review comments on lines +393 to +395:

- "Not sure if it's just me, but the part about partitioning seems too technical and out of context for this short intro. Feel free to disagree if you think it's key for users to know this asap when considering Modin."
- "Since this page is not a getting started, I think it is fine to say some words here about partitioning, which is flexible along columns and rows."
- "I think being both columnar and row-wise is an important feature to mention (since some things only scale well in one axis)."
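On the execution engines mentioned above: Modin documents selecting an engine at runtime via its ``MODIN_ENGINE`` environment variable, which must be set before ``modin.pandas`` is first imported. A minimal sketch; ``"dask"`` is just one example value:

```python
import os

# Documented Modin configuration variable; must be set before the first
# `import modin.pandas`. Other documented values include "ray" and "unidist".
os.environ["MODIN_ENGINE"] = "dask"

# import modin.pandas as pd  # would now run on the Dask engine
```

Setting the variable costs nothing if Modin is absent, so it is safe to place in shared startup code.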

For more information, refer to `Modin's documentation`_ or `Modin's tutorials`_.

.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
Review comment:

- "Just my two cents. I would just simply remove both Dask and Modin here, and add a link to pandas/web/pandas/community/ecosystem.md (line 344 in a0784d2). A lot of the information here is duplicated between Dask and Modin, and with others too, such as Pandas API on Spark (Koalas). They all aim at drop-in replacement, they are all out-of-core on multiple nodes, partitioning, scalability."
Reply:

- "Or, I would at least deduplicate them here if we want to keep those sections. e.g.,"
Replies:

- "I would argue against the statement 'They all aim at drop-in replacement'. Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd, and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks that the Scaling to large datasets page should only contain pandas-related stuff, I can go ahead and remove the Modin- and Dask-related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?"
- "I'm not really a maintainer here and don't have a super strong opinion. I'll defer to the pandas maintainers."