Added `pyarrow` extra for instructions on silencing the `DeprecationWarning` #57284
@@ -70,6 +70,7 @@ aws = ['s3fs>=2022.11.0']
gcp = ['gcsfs>=2022.11.0', 'pandas-gbq>=0.19.0']
excel = ['odfpy>=1.4.1', 'openpyxl>=3.1.0', 'python-calamine>=0.1.7', 'pyxlsb>=1.0.10', 'xlrd>=2.0.1', 'xlsxwriter>=3.0.5']
parquet = ['pyarrow>=10.0.1']
pyarrow = ['pyarrow>=10.0.1']
feather = ['pyarrow>=10.0.1']
What's the benefit of having three optional dependency groups request exactly the same dependency?
Yeah, the rationale goes like this:
- Pre-existing Pandas has an Apache Parquet integration that happens to use `pyarrow`, and Feather format support that happens to use `pyarrow`. Both are distinct and opt-in, so they get their own `setuptools` extra
- Pandas 2.2 adds a `DeprecationWarning` specifically and only about `pyarrow` for data processing
- So this PR adds an extra specific to Pandas 2.2's `DeprecationWarning`, and integrates the extra into the warning's message
  - As the `DeprecationWarning` is unrelated to Parquet or Feather, it gets its own extra

For further clarification, read #54466 (comment)
Does that make sense now?
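To make the point about three extras sharing one dependency concrete, here is a minimal sketch. It copies the three entries from the diff into a plain dict; the helper `requirements_for` is hypothetical and only illustrates that the extras are separate names for separate features, even though they currently resolve to the same requirement.

```python
# Extras copied from the pyproject.toml hunk in this PR.
EXTRAS = {
    "parquet": ["pyarrow>=10.0.1"],
    "pyarrow": ["pyarrow>=10.0.1"],
    "feather": ["pyarrow>=10.0.1"],
}


def requirements_for(*extras: str) -> list[str]:
    """Return the de-duplicated requirements pulled in by the given extras.

    Hypothetical helper for illustration only; it is not how pip resolves extras.
    """
    seen: list[str] = []
    for extra in extras:
        for requirement in EXTRAS[extra]:
            if requirement not in seen:
                seen.append(requirement)
    return seen


# Today all three names collapse to a single requirement.
print(requirements_for("parquet", "pyarrow", "feather"))
```

If the Parquet or Feather integration ever moved to a different backend, only that entry in the mapping would change, while `pyarrow` would keep meaning "install pyarrow".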
Yes, more than before, thanks. But couldn't the Warning also say: "Please specify either pandas[parquet] or pandas[feather] during installation"?
This will be up to the taste of the maintainers here, I guess.
I guess, in my mind, does the Parquet or Feather implementation have to use `pyarrow`? What stops them from moving away from using `pyarrow` for their implementation?
And if we do move to suggesting one of them, which would we prefer? I think the `DeprecationWarning` should have a clear suggestion of which one extra to use
The documentation on the pandas extras is pretty minimal but I interpret it the same as you @jamesbraza that `feather` means support for "Parquet, ORC, and feather reading / writing" and doesn't imply that a compatible version of `pyarrow` will always be installed in the future if there were a lighter alternative to `pyarrow` (I don't know much about these things other than people complaining in the feedback thread about the size of `pyarrow`).
One tradeoff with adding a new extra is that it does not exist in previous versions. Since pandas 2.2 only supports Python 3.9+, projects supporting 3.8 still might want to specify `pandas` for 3.8 and `pandas[pyarrow]` for 3.9+. An alternative would be relying on `pandas[feather]` for all versions. Maybe splitting by version is better anyway, since `pyarrow` is not doing anything for the 3.8 installation (if the project was not otherwise using the older `pyarrow` features) and adds a large download. I am not sure how pip resolves extras in all cases. I think pip ignores the extra while choosing the version and just warns if the extra is not defined for that version? It does just warn if you request an extra that does not exist and still installs the package. I am not sure what happens if one dependency requires `pandas[pyarrow]` (only available in 2.2.1+) and another dependency requires `pandas<2.2`.
Thanks for the thoughts, I agree with everything you're thinking.
> Since pandas 2.2 only supports Python 3.9+, projects supporting 3.8 still might want to specify pandas for 3.8

The `DeprecationWarning` was only added in Pandas 2.2, so I don't think Python 3.8 users (who will not be able to access 2.2 or 3.0) will need to resolve this warning? Unless I am missing something.
Perhaps we can have the `pyarrow` extra use a platform specifier such that it only applies to 3.9+
> I don't think Python 3.8 users (who will not be able to access 2.2 or 3.0) will need to resolve this warning?

I was thinking about authors of distributions that depend on pandas. If it wasn't for #57073 entertaining the idea of rolling back the requirement, I as a distribution author would just accept it and start putting `pandas[pyarrow]` in my distribution's dependency list, since installing `pyarrow` would become mandatory soon anyway. But if I still support 3.8, then my 3.8 users would start seeing an error from pip about there being no `pyarrow` extra.
Ahh I see what you mean now, it's for `pandas` installers who support Python >=3.9 and Python <3.9 simultaneously. One comment is that 3.8 users wouldn't see an error, just a warning:
WARNING: pandas 2.2.0 does not provide the extra 'pyarrow'
However, that's still not quite good enough. The comprehensive solution is to use platform specifiers like this:
# requirements.txt
pandas ; python_version < "3.9"
pandas[pyarrow]>=2.2.1 ; python_version >= "3.9"
I am finding this also relevant: https://stackoverflow.com/a/68147602
Does that make sense? If you like it, I can update the `DeprecationWarning` message to reflect that.
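As a rough illustration of what the two `requirements.txt` lines above express, the sketch below pairs each line with a hand-written predicate that mirrors its `python_version` marker. This is not pip's marker evaluation (those markers are defined by PEP 508); `requirement_for` is a hypothetical helper that only shows that exactly one line applies per interpreter.

```python
import sys

# Each requirement line from the comment above, paired with a predicate
# mirroring its python_version marker. Hand-written for illustration.
REQUIREMENTS = [
    ("pandas", lambda version: version < (3, 9)),
    ("pandas[pyarrow]>=2.2.1", lambda version: version >= (3, 9)),
]


def requirement_for(python_version) -> str:
    """Return the single requirement line that applies to a Python version."""
    major_minor = tuple(python_version[:2])
    matches = [line for line, applies in REQUIREMENTS if applies(major_minor)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {matches!r}")
    return matches[0]


# Which line applies to the interpreter running this script.
print(requirement_for(sys.version_info))
```

Because the two markers are mutually exclusive, a 3.8 environment never requests the `pyarrow` extra, and so never triggers pip's "does not provide the extra" warning.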
@@ -212,6 +212,7 @@
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but {pa_msg}
To resolve this warning, please specify the pyarrow extra during installation: pandas[pyarrow]
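Separately from installing the extra, a user can filter the message with the standard `warnings` module. The sketch below is only an assumption-laden illustration: the regular expression is derived from the message text in the hunk above, the helper names are invented, and a stand-in warning is raised instead of importing pandas so the example stays self-contained.

```python
import warnings

# Pattern built from the warning text in the diff above. The leading \s* is
# an assumption that the real message may start with whitespace or a newline.
PYARROW_NOTICE = r"\s*Pyarrow will become a required dependency of pandas"


def silence_pyarrow_notice() -> None:
    """Install a filter that ignores the pyarrow DeprecationWarning."""
    warnings.filterwarnings(
        "ignore", message=PYARROW_NOTICE, category=DeprecationWarning
    )


def emits_warning(text: str) -> bool:
    """Raise a stand-in DeprecationWarning and report whether it got through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        silence_pyarrow_notice()  # inserted ahead of the "always" filter
        warnings.warn(text, DeprecationWarning)
    return len(caught) > 0


print(emits_warning("\nPyarrow will become a required dependency of pandas"))
print(emits_warning("some unrelated deprecation"))
```

In real use, `silence_pyarrow_notice()` would need to run before `import pandas`, since the warning is raised at import time.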
Let's hold on this PR for now.
There's a decently high likelihood that we end up reverting this warning altogether.
(I know it's super frustrating to have the warning. I'm hoping to reach a conclusion at the next dev meeting and then release a hopefully warning-free pandas 2.2.1)
Yeah, sounds good to me! I am open to whatever
Sorry I missed this, but this was addressed in #57551. Thanks for the start here but closing
Sounds good @mroeschke. It would be nice to update the messages in the
The warning will be removed in 2.2.1 #57556
Changes requested in #54466 (comment)