Added pyarrow extra for instructions on silencing the DeprecationWarning #57284

Closed
wants to merge 3 commits into from
Conversation

jamesbraza

Changes requested in #54466 (comment)

@@ -70,6 +70,7 @@ aws = ['s3fs>=2022.11.0']
gcp = ['gcsfs>=2022.11.0', 'pandas-gbq>=0.19.0']
excel = ['odfpy>=1.4.1', 'openpyxl>=3.1.0', 'python-calamine>=0.1.7', 'pyxlsb>=1.0.10', 'xlrd>=2.0.1', 'xlsxwriter>=3.0.5']
parquet = ['pyarrow>=10.0.1']
pyarrow = ['pyarrow>=10.0.1']
feather = ['pyarrow>=10.0.1']

What's the benefit of having three optional dependency groups request exactly the same dependency?

Author

Yeah, the rationale goes like this:

  1. Pre-existing Pandas: has an Apache Parquet integration that happens to use pyarrow, and Feather format support that happens to use pyarrow. Both are distinct and opt-in, so they each get their own setuptools extra
  2. Pandas 2.2 adds a DeprecationWarning specifically and only about pyarrow for data processing
  3. So this PR adds an extra specific to Pandas 2.2's DeprecationWarning, and integrates the extra into the warning's message
    • As the DeprecationWarning is unrelated to Parquet or Feather, it gets its own extra

For further clarification, read #54466 (comment)

Does that make sense now?
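
To make the reviewer's observation concrete, here is a minimal sketch that groups extras by the requirements they pull in. The `EXTRAS` dict and the `group_extras_by_requirements` helper are hypothetical names used only for illustration; the requirement strings are copied from the hunk above. It shows that `parquet`, `pyarrow`, and `feather` are currently aliases for the same install, even though each is named after a different feature.

```python
from collections import defaultdict

# Subset of the optional-dependency groups shown in the diff hunk above.
EXTRAS = {
    "aws": ["s3fs>=2022.11.0"],
    "parquet": ["pyarrow>=10.0.1"],
    "pyarrow": ["pyarrow>=10.0.1"],
    "feather": ["pyarrow>=10.0.1"],
}


def group_extras_by_requirements(extras):
    """Map each distinct requirement set to the sorted extras that request it."""
    groups = defaultdict(list)
    for name, requirements in extras.items():
        groups[frozenset(requirements)].append(name)
    return {reqs: sorted(names) for reqs, names in groups.items()}


groups = group_extras_by_requirements(EXTRAS)
pyarrow_only = groups[frozenset(["pyarrow>=10.0.1"])]
print(pyarrow_only)  # ['feather', 'parquet', 'pyarrow']
```

The grouping only reflects what the extras install today; it says nothing about whether the Parquet or Feather extras will keep depending on pyarrow.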

Yes, more than before, thanks. But couldn't the Warning also say: "Please specify either pandas[parquet] or pandas[feather] during installation"?
This will be up to the taste of the maintainers here, I guess.

Author

I guess, in my mind, the question is whether the Parquet or Feather implementation has to use pyarrow. What stops them from moving away from pyarrow for their implementation?

And if we do move to suggesting one of them, which would we prefer? I think the DeprecationWarning should give a clear suggestion of which single extra to use.

The documentation on the pandas extras is pretty minimal but I interpret it the same as you @jamesbraza that feather means support for "Parquet, ORC, and feather reading / writing" and doesn't imply that a compatible version of pyarrow will always be installed in the future if there were a lighter alternative to pyarrow (I don't know much about these things other than people complaining in the feedback thread about the size of pyarrow).

One tradeoff with adding a new extra is that it does not exist in previous versions. Since pandas 2.2 only supports Python 3.9+, projects supporting 3.8 still might want to specify pandas for 3.8 and pandas[pyarrow] for 3.9+. An alternative would be relying on pandas[feather] for all versions. Maybe splitting by version is better anyway, since pyarrow is not doing anything for the 3.8 installation (if the project was not otherwise using the older pyarrow features) and adds a large download. I am not sure how pip resolves extras in all cases. I think pip ignores the extra while choosing the version and only warns if the extra is not defined for that version? It does warn, and still installs the package, if you request an extra that does not exist. I am not sure what happens if one dependency requires pandas[pyarrow] (only available in 2.2.1+) and another dependency requires pandas<2.2.

Author

@jamesbraza jamesbraza Feb 7, 2024

Thanks for the thoughts, I agree with everything you're thinking.

Since pandas 2.2 only supports Python 3.9+, projects supporting 3.8 still might want to specify pandas for 3.8

The DeprecationWarning was only added in Pandas 2.2, so I don't think Python 3.8 users (who will not be able to access 2.2 or 3.0) will need to resolve this warning? Unless I am missing something

Perhaps we can have the pyarrow extra use an environment marker (python_version) so that it only applies to Python 3.9+.

I don't think Python 3.8 users (who will not be able to access 2.2 or 3.0) will need to resolve this warning?

I was thinking about authors of distributions that depend on pandas. If it weren't for #57073 entertaining the idea of rolling back the requirement, I as a distribution author would just accept it and start putting pandas[pyarrow] in my distribution's dependency list, since installing pyarrow would become mandatory soon anyway. But if I still support 3.8, then my 3.8 users would start seeing an error from pip about there being no pyarrow extra.

Author

Ahh, I see what you mean now: it's for pandas installers who support Python >= 3.9 and Python < 3.9 simultaneously. One note: 3.8 users wouldn't see an error, just a warning:

WARNING: pandas 2.2.0 does not provide the extra 'pyarrow'

However, that's still not quite good enough. The comprehensive solution is to use environment markers like this:

# requirements.txt
pandas ; python_version < "3.9"
pandas[pyarrow]>=2.2.1 ; python_version >= "3.9"

I am finding this also relevant: https://stackoverflow.com/a/68147602

Does that make sense? If you like it, I can update the DeprecationWarning message to reflect that.
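
For anyone who wants to see what those two requirement lines resolve to on a given interpreter, here is a rough Python sketch of the same decision. `pandas_requirement` is a hypothetical helper, not part of pandas or pip, and it assumes the `pyarrow` extra is available from pandas 2.2.1 onward, as discussed above.

```python
import sys


def pandas_requirement(version_info=sys.version_info):
    """Return the pandas requirement string that applies to this interpreter."""
    if tuple(version_info[:2]) >= (3, 9):
        # Assumption: the pyarrow extra exists in pandas 2.2.1+.
        return "pandas[pyarrow]>=2.2.1"
    # Python 3.8 cannot install pandas 2.2+, so no extra is requested.
    return "pandas"


print(pandas_requirement((3, 8)))   # pandas
print(pandas_requirement((3, 12)))  # pandas[pyarrow]>=2.2.1
```

This only mirrors the marker logic for illustration. In a real project the markers in requirements.txt (or the package metadata) remain the better place for this split, because pip evaluates them at install time.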

@simonjayhawkins simonjayhawkins added the Dependencies (Required and optional dependencies) and Warnings (Warnings that appear or should be added to pandas) labels Feb 7, 2024
@simonjayhawkins simonjayhawkins added this to the 2.2.1 milestone Feb 7, 2024
@@ -212,6 +212,7 @@
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but {pa_msg}
To resolve this warning, please specify the pyarrow extra during installation: pandas[pyarrow]
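
For users who cannot change their install, a warnings filter is another way to silence this message. The sketch below assumes the warning text starts with the sentence shown in the hunk above (possibly preceded by whitespace) and that it is raised as a DeprecationWarning. `PYARROW_WARNING_PATTERN` and `silence_pyarrow_deprecation` are hypothetical names, and the demo simulates the warning with `warnings.warn` rather than importing pandas.

```python
import warnings

# Assumed start of the pandas 2.2.0 message, taken from the hunk above.
PYARROW_WARNING_PATTERN = r"\s*Pyarrow will become a required dependency of pandas"


def silence_pyarrow_deprecation():
    """Install a filter that ignores only the pyarrow DeprecationWarning."""
    warnings.filterwarnings(
        "ignore",
        message=PYARROW_WARNING_PATTERN,
        category=DeprecationWarning,
    )


# Demo: simulate the pandas warning next to an unrelated deprecation.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_pyarrow_deprecation()
    warnings.warn(
        "\nPyarrow will become a required dependency of pandas in the next major release",
        DeprecationWarning,
    )
    warnings.warn("some other deprecation", DeprecationWarning)

print(len(caught))  # 1: only the unrelated deprecation is recorded
```

If the warning is emitted while pandas is being imported, the filter has to be installed before the first `import pandas`. The filter matches on the message prefix, so other DeprecationWarnings still surface.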
Member

Let's hold on this PR for now.

There's a decently high likelihood that we end up reverting this warning altogether.
(I know it's super frustrating to have the warning. I'm hoping to reach a conclusion at the next dev meeting and then release a hopefully warning-free pandas 2.2.1.)

Author

Yeah, sounds good to me! I am open to whatever.

@mroeschke
Member

Sorry I missed this, but this was addressed in #57551. Thanks for the start here, but closing.

@mroeschke mroeschke closed this Feb 22, 2024
@jamesbraza
Author

Sounds good, @mroeschke. It would be nice to update the messages in the DeprecationWarning to mention this extra too; feel free to diff this PR with your PR to see what I had proposed.

@jamesbraza jamesbraza deleted the adding-pyarrow branch February 22, 2024 18:43
@mroeschke
Member

mroeschke commented Feb 22, 2024

The warning will be removed in 2.2.1 #57556

Labels
Dependencies Required and optional dependencies Warnings Warnings that appear or should be added to pandas
6 participants