Notifications
ENH: Add io.nullable_backend=pyarrow support to read_excel #49965
Conversation
doc/source/whatsnew/v2.0.0.rst (outdated)
Configuration option, ``io.nullable_backend``, to return pyarrow-backed dtypes from IO functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A new global configuration, ``io.nullable_backend``, can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in :func:`read_parquet`, :func:`read_orc` and :func:`read_csv` (with ``engine="pyarrow"``)
to return pyarrow-backed dtypes when set to ``"pyarrow"`` (:issue:`48957`).
The ``use_nullable_dtypes`` keyword argument has been expanded to :func:`read_csv` and :func:`read_excel` to enable automatic conversion to nullable dtypes (:issue:`36712`).
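For context, the "nullable dtypes" mentioned above are pandas' masked extension dtypes (``Int64``, ``string``, and so on). A minimal sketch of why they matter for IO, using only the stable ``convert_dtypes`` API (the option discussed in this PR applies the same kind of conversion at read time):

```python
import io
import pandas as pd

# A small CSV with a missing integer value: by default the "a" column
# is upcast to float64, because NaN cannot live in an int64 column.
csv = io.StringIO("a,b\n1,x\n,y")
df = pd.read_csv(csv)
print(df["a"].dtype)  # float64

# convert_dtypes() switches to nullable extension dtypes instead,
# which is what use_nullable_dtypes=True does during the read itself.
converted = df.convert_dtypes()
print(converted["a"].dtype)  # Int64
```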
cc @phofl this is the section where we can expand on `use_nullable_dtypes`. Let me know if I should expand on anything in this PR.
I was thinking about adding a really short example with only one or two columns for `read_csv`, to show a bit better how to use it.
Should I expand on the example below (it's hidden in the diff) to show `pd.option_context("io.nullable_backend", "pandas")` as well?
Ah sorry, missed that. Yes a pandas example would be great.
Another thing: Maybe make the functions bullet points, e.g.
- read_csv
- read_excel
So that they become a bit more prominent?
Good idea with the bullet points. Also added an example with the pandas backend.
    expected["i"].array._data.cast(pa.timestamp(unit="us"))
)
# pyarrow supports a null type, so don't have to default to Int64
expected["j"] = ArrowExtensionArray(pa.array([None, None]))
@phofl I noticed that in `_infer_types`, `result_mask.all()` (all nulls, I think) would default the dtype to `Int64`. Any specific reason why `Int64` was chosen?
Yes, if you do:

```python
import numpy as np
import pandas as pd

# An all-NaN frame: convert_dtypes() defaults these columns to Int64.
df = pd.DataFrame([[np.nan, np.nan]])
df.convert_dtypes()
```

you get back Int64 for all columns. Not saying this is perfect, but it made sense to model it after the existing behavior.
Ah okay, makes sense why this is the case then.
thx @mroeschke for addressing the doc comments!
- closes #48957 (`io.nullable_type="pandas"|"pyarrow"` to control IO reader `use_nullable_dtype`)
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.