diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md
index 4d6e928ce68bd..d5737f6462bb4 100644
--- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md
+++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md
@@ -8,6 +8,10 @@
           [Patrick Hoefler](https://github.com/phofl)
 - Revision: 1
 
+# Note
+
+This PDEP is superseded by PDEP-15.
+
 ## Abstract
 
 This PDEP proposes that:
diff --git a/web/pandas/pdeps/0015-do-not-require-pyarrow.md b/web/pandas/pdeps/0015-do-not-require-pyarrow.md
new file mode 100644
index 0000000000000..8231eceaacad0
--- /dev/null
+++ b/web/pandas/pdeps/0015-do-not-require-pyarrow.md
@@ -0,0 +1,79 @@
+# PDEP-15: Do not require PyArrow as a required dependency (for pandas 3.0)
+
+- Created: 8 May 2024
+- Status: Under discussion
+- Discussion: [#58623](https://github.com/pandas-dev/pandas/pull/58623)
+              [#52711](https://github.com/pandas-dev/pandas/pull/52711)
+              [#52509](https://github.com/pandas-dev/pandas/issues/52509)
+              [#54466](https://github.com/pandas-dev/pandas/issues/54466)
+- Author: [Thomas Li](https://github.com/lithomas1)
+- Revision: 1
+
+## Abstract
+
+This PDEP supersedes PDEP-10, which stipulated that PyArrow should become a required dependency
+for pandas 3.0. After reviewing the responses posted
+on the feedback issue [#54466](https://github.com/pandas-dev/pandas/issues/54466), we've
+decided against moving forward with PDEP-10 for pandas 3.0.
+
+The primary reasons for rejecting PDEP-10 are twofold:
+
+1) Requiring pyarrow as a dependency can cause installation problems for a significant portion of users.
+
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments
+     such as AWS Lambda, due to its large size of around 40 MB for a compiled wheel
+     (which is larger than pandas' own wheels).
+     - This can also cause problems for downstream libraries that use pandas as a dependency
+       because, while pandas + pyarrow can potentially fit in an AWS Lambda environment, the combination of
+       pandas, pyarrow, and the downstream library may not.
+     - While it may be possible to work around this issue by using the AWS Lambda Layer from
+       the [AWS SDK for pandas](https://aws-sdk-pandas.readthedocs.io/en/stable/install.html#aws-lambda-layer),
+       the primary benefit of pyarrow strings is not enough to justify forcing users to make a disruptive change.
+
+   - Installation of pyarrow is not possible on some platforms. We provide support for some
+     less widely used platforms such as Alpine Linux, for which pyarrow does not provide wheels.
+     - While pyarrow has made great strides towards supporting most platforms that pandas is installable on
+       (e.g. the recent addition of pyodide support in pyarrow), we would still have to drop support for some
+       platforms like musllinux (the feature request is tracked [here](https://github.com/apache/arrow/issues/18036)) if pyarrow were to be required.
+
+   While installation issues are mentioned in the drawbacks section of PDEP-10, at the time that PDEP
+   was written, we underestimated the impact this would have on users and downstream developers.
+
+2) Many of the benefits presented in PDEP-10 can be realized for users who have pyarrow installed, without
+   forcing a pyarrow requirement on other users.
+
+   In PDEP-10, there are three primary benefits listed:
+
+   - First-class support for strings.
+
+     - PDEP-14 enables a new string dtype by default for pandas 3.0,
+       which will be backed by a pyarrow string dtype
+       (for users who have pyarrow installed) and will use a Python-object-based fallback for
+       users who don't have pyarrow installed. This allows all users to experience the usability
+       benefits of a string dtype by default, and users with pyarrow to experience the performance
+       benefits of a pyarrow-backed string array.
+
+   - Support for dtypes not present in pandas.
+     - There are some types in pyarrow that don't have a corresponding pandas/numpy dtype, for example
+       the nested pyarrow types (e.g. lists and structs) and decimal types.
+     - Users can already create arrays with these dtypes if they have pyarrow installed, but we cannot infer
+       those dtypes for arrays by default without forcing a pyarrow requirement on users,
+       as there is no Python/numpy equivalent for these dtypes.
+
+   - Interoperability.
+     - The Arrow C Data Interface would allow us to import/export pandas DataFrames to and from other libraries
+       that support Arrow in a zero-copy manner.
+
+     - While several libraries have adopted the Arrow C Data Interface (e.g. polars, xgboost, and duckdb), the main
+       beneficiaries of the Arrow C Data Interface are other dataframe libraries, as most downstream libraries tend to
+       already support using pandas dataframes as input.
+
+Although this PDEP recommends not adopting pyarrow as a required dependency in pandas 3.0, this does not mean that we are
+abandoning pyarrow support and integration in pandas. Adopting support for pyarrow arrays
+and data types in more of pandas will lead to greater interoperability with the
+ecosystem and better performance for users.
+Furthermore, many of the drawbacks, such as the large installation size of pyarrow and the lack of support for certain platforms,
+can be solved (as shown by the recent addition of pyarrow to the pyodide distributions), allowing us to potentially revisit this decision in the future.
+
+However, at this point, it is clear that we are not ready to require pyarrow
+as a dependency in pandas.