DOC: Remove Dask and Modin sections in scale.rst in favor of linking to ecosystem docs. #57843

Merged
10 commits merged on Mar 15, 2024

yukikitayama
Contributor

environment.yml Outdated
@@ -62,7 +62,6 @@ dependencies:
# downstream packages
- dask-core<=2024.2.1
Member

Suggested change
- dask-core<=2024.2.1
- dask-core

Contributor Author

@mroeschke Thanks for your instructions. This reveals that I didn't understand version pinning, but I learned something.

@@ -49,7 +49,6 @@ xlsxwriter>=3.0.5
zstandard>=0.19.0
dask<=2024.2.1
Member

Suggested change
dask<=2024.2.1
dask

mroeschke added the Docs label on Mar 14, 2024
Member

datapythonista left a comment

Couple of details, but changes look good, thanks @yukikitayama for taking care of this.

@@ -217,190 +217,10 @@ require too sophisticated of operations. Some operations, like :meth:`pandas.Dat
much harder to do chunkwise. In these cases, you may be better switching to a
different library that implements these out-of-core algorithms for you.

.. _scale.other_libraries:
Member

Can you leave this label please, so we can link to this section if needed.

Contributor Author

I see we get an "undefined label" warning because I removed it. Makes sense. Thanks for letting me know.

Use Dask
--------
Use Other Libraries
-------------------

pandas is just one library offering a DataFrame API. Because of its popularity,
Member

This paragraph was probably OK when Dask was discussed in detail here, but I think it now does a very poor job of pointing to the ecosystem.

A bit of context: this is the documentation for scaling pandas (using pandas with data too big to fit in memory, or too big to process on a single computer). Besides what's explained earlier in this section, we want users to know that there is a set of libraries, such as PySpark, Dask and Modin, that implement an API almost identical to the pandas one but run on clusters, and that they can find more information on the ecosystem page.

Do you mind trying to rephrase this section in a way that is helpful for users to understand this @yukikitayama ?

Thank you very much for the work here.

Member

datapythonista left a comment

Thanks for the update and the work on this @yukikitayama.

I added a couple of minor suggestions, but it looks good to me.

.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
There are many other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame,
Member

Suggested change
There are many other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame,
There are other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame,

Contributor Author

Thank you for reviewing and giving me suggestions @datapythonista . I saw the unit test ASAN/UBSAN failed, but is it okay?

Member

Yes, our CI seems to be failing now, you can ignore that failure.

.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
There are many other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame,
but can give you the ability to scale your large dataset processing and analytics
Member

Suggested change
but can give you the ability to scale your large dataset processing and analytics
and can give you the ability to scale your large dataset processing and analytics

mroeschke added this to the 2.2.2 milestone on Mar 15, 2024
mroeschke merged commit d4ddc80 into pandas-dev:main on Mar 15, 2024
46 of 47 checks passed
lumberbot-app bot commented Mar 15, 2024

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
     git checkout 2.2.x
     git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
     git cherry-pick -x -m1 d4ddc805a03586f9ce0cc1cc541709419ae47c4a
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     git commit -am 'Backport PR #57843: DOC: Remove Dask and Modin sections in scale.rst in favor of linking to ecosystem docs.'
  4. Push to a named branch:
     git push YOURFORK 2.2.x:auto-backport-of-pr-57843-on-2.2.x
  5. Create a PR against branch 2.2.x. I would have named this PR:

"Backport PR #57843 on branch 2.2.x (DOC: Remove Dask and Modin sections in scale.rst in favor of linking to ecosystem docs.)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mroeschke
Member

Thanks @yukikitayama

(Backporting since the dask dependency changes were backported too)

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Mar 15, 2024
…scale.rst in favor of linking to ecosystem docs.
@yukikitayama
Contributor Author

Thank you for reviewing @mroeschke !

mroeschke added a commit that referenced this pull request Mar 15, 2024
…in favor of linking to ecosystem docs. (#57861)

Co-authored-by: Yuki Kitayama <[email protected]>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…to ecosystem docs. (pandas-dev#57843)

* remove Use Dask and Use Modin sections

* add a new section: Use Other Libraries and link to Out-of-core section in Ecosystem web page

* remove dask-expr

* remove version pinning from dask and dask-core

* put other libraries label back in

* update use other libraries description to have a better transfer to ecosystem page

* change minor sentences for suggestions

* remove unnecessary characters