
ENH: Add io.nullable_backend=pyarrow support to read_excel #49965

Merged: 5 commits, Dec 2, 2022

Conversation

mroeschke
Member

@mroeschke mroeschke commented Nov 30, 2022

@mroeschke mroeschke added the IO Excel (read_excel, to_excel) and Arrow (pyarrow functionality) labels Nov 30, 2022
@mroeschke mroeschke added this to the 2.0 milestone Nov 30, 2022

Configuration option, ``io.nullable_backend``, to return pyarrow-backed dtypes from IO functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A new global configuration, ``io.nullable_backend`` can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in :func:`read_parquet`, :func:`read_orc` and :func:`read_csv` (with ``engine="pyarrow"``)
to return pyarrow-backed dtypes when set to ``"pyarrow"`` (:issue:`48957`).
The ``use_nullable_dtypes`` keyword argument has been expanded to :func:`read_csv` and :func:`read_excel` to enable automatic conversion to nullable dtypes (:issue:`36712`)
Member Author

cc @phofl this is the section where we can expand on use_nullable_dtypes. Let me know if I should expand on anything in this PR.

Member

@phofl phofl Nov 30, 2022

I was thinking about adding a really short example with only one or two columns for read_csv, to show a bit better how to use it.

Member Author

Should I expand on the example below (it's hidden in the diff) to show pd.option_context("io.nullable_backend", "pandas") as well?

Member

Ah sorry, missed that. Yes a pandas example would be great.

Another thing: Maybe make the functions bullet points, e.g.

  • read_csv
  • read_excel

So that they become a bit more prominent?

Member Author

Good idea with the bullet points. Also added an example using the pandas backend.

expected["i"].array._data.cast(pa.timestamp(unit="us"))
)
# pyarrow supports a null type, so don't have to default to Int64
expected["j"] = ArrowExtensionArray(pa.array([None, None]))
Member Author

@phofl I noticed that in _infer_types, result_mask.all() (all nulls, I think) defaults the dtype to Int64. Any specific reason why Int64 was chosen?

Member

@phofl phofl Nov 30, 2022

Yes, if you do:

df = pd.DataFrame([[np.nan, np.nan]])
df.convert_dtypes()

you get back Int64 for all columns. Not saying this is perfect, but it made sense to model it after the existing behavior.
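
The behavior described above can be checked with a minimal, self-contained sketch. The variable names are illustrative and not from the PR, and the result reflects what this comment reports rather than a guarantee for every pandas version:

```python
import numpy as np
import pandas as pd

# One row, two columns, every value missing.
df = pd.DataFrame([[np.nan, np.nan]])

# Before conversion, the all-NaN columns are plain float64.
original_dtypes = [str(dtype) for dtype in df.dtypes]

# convert_dtypes() picks a nullable dtype per column; per the comment above,
# all-missing columns are expected to come back as Int64.
converted_dtypes = [str(dtype) for dtype in df.convert_dtypes().dtypes]

print(original_dtypes)
print(converted_dtypes)
```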

Member Author

Ah okay makes sense why this is the case then

@phofl phofl merged commit 7e5a95c into pandas-dev:main Dec 2, 2022
Member

phofl commented Dec 2, 2022

thx @mroeschke for addressing the doc comments!

@mroeschke mroeschke deleted the enh/io/excel_pyarrow_nullable branch December 2, 2022 17:56