ENH: Add io.nullable_backend=pyarrow support to read_excel #49965

Merged: 5 commits, merged on Dec 2, 2022

Changes from all commits

doc/source/whatsnew/v2.0.0.rst (29 changes: 23 additions & 6 deletions)

@@ -28,13 +28,26 @@ The available extras, found in the :ref:`installation guide<install.dependencies
``[all, performance, computation, timezone, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql,
sql-other, html, xml, plot, output_formatting, clipboard, compression, test]`` (:issue:`39164`).

-.. _whatsnew_200.enhancements.io_readers_nullable_pyarrow:
+.. _whatsnew_200.enhancements.io_use_nullable_dtypes_and_nullable_backend:

Configuration option, ``io.nullable_backend``, to return pyarrow-backed dtypes from IO functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-A new global configuration, ``io.nullable_backend`` can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in :func:`read_parquet`, :func:`read_orc` and :func:`read_csv` (with ``engine="pyarrow"``)
-to return pyarrow-backed dtypes when set to ``"pyarrow"`` (:issue:`48957`).
+The ``use_nullable_dtypes`` keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (:issue:`36712`)
+
+* :func:`read_csv`
+* :func:`read_excel`
+
+Additionally a new global configuration, ``io.nullable_backend`` can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in the following functions
+to select the nullable dtypes implementation.
+
+* :func:`read_csv` (with ``engine="pyarrow"``)
+* :func:`read_excel`
+* :func:`read_parquet`
+* :func:`read_orc`
+
+By default, ``io.nullable_backend`` is set to ``"pandas"`` to return existing, numpy-backed nullable dtypes, but it can also
+be set to ``"pyarrow"`` to return pyarrow-backed, nullable :class:`ArrowDtype` (:issue:`48957`).

.. ipython:: python

@@ -43,10 +56,15 @@ to return pyarrow-backed dtypes when set to ``"pyarrow"`` (:issue:`48957`).
    1,2.5,True,a,,,,,
    3,4.5,False,b,6,7.5,True,a,
    """)
-    with pd.option_context("io.nullable_backend", "pyarrow"):
-        df = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
+    with pd.option_context("io.nullable_backend", "pandas"):
+        df = pd.read_csv(data, use_nullable_dtypes=True)
    df.dtypes
+
+    data.seek(0)
+    with pd.option_context("io.nullable_backend", "pyarrow"):
+        df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
+    df_pyarrow.dtypes

.. _whatsnew_200.enhancements.other:

Other enhancements
@@ -55,7 +73,6 @@ Other enhancements
- :meth:`.DataFrameGroupBy.quantile` and :meth:`.SeriesGroupBy.quantile` now preserve nullable dtypes instead of casting to numpy dtypes (:issue:`37493`)
- :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support an ``axis`` argument. If ``axis`` is set, the default behaviour of which axis to consider can be overwritten (:issue:`47819`)
- :func:`assert_frame_equal` now shows the first element where the DataFrames differ, analogously to ``pytest``'s output (:issue:`47910`)
-- Added new argument ``use_nullable_dtypes`` to :func:`read_csv` and :func:`read_excel` to enable automatic conversion to nullable dtypes (:issue:`36712`)
- Added ``index`` parameter to :meth:`DataFrame.to_dict` (:issue:`46398`)
- Added support for extension array dtypes in :func:`merge` (:issue:`44240`)
- Added metadata propagation for binary operators on :class:`DataFrame` (:issue:`28283`)
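
For reference, the ipython block above only exercises read_csv. Here is a minimal sketch of the same option pair applied to read_excel, the function this PR wires up; it assumes a pandas build that includes this change plus openpyxl and pyarrow installed, and "data.xlsx" is a placeholder path:

import pandas as pd

df = pd.DataFrame({"a": [1, None], "b": ["x", None]})
df.to_excel("data.xlsx", index=False)

# default backend: numpy-backed nullable dtypes (Int64, string)
with pd.option_context("io.nullable_backend", "pandas"):
    result = pd.read_excel("data.xlsx", use_nullable_dtypes=True)
print(result.dtypes)

# pyarrow backend: ArrowDtype columns (int64[pyarrow], string[pyarrow])
with pd.option_context("io.nullable_backend", "pyarrow"):
    result_pa = pd.read_excel("data.xlsx", use_nullable_dtypes=True)
print(result_pa.dtypes)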

pandas/io/parsers/base_parser.py (15 changes: 15 additions & 0 deletions)

@@ -26,6 +26,8 @@

import numpy as np

+from pandas._config.config import get_option
+
from pandas._libs import (
    lib,
    parsers,
@@ -39,6 +41,7 @@
    DtypeObj,
    Scalar,
)
+from pandas.compat._optional import import_optional_dependency
from pandas.errors import (
    ParserError,
    ParserWarning,
@@ -71,6 +74,7 @@
from pandas import StringDtype
from pandas.core import algorithms
from pandas.core.arrays import (
+    ArrowExtensionArray,
    BooleanArray,
    Categorical,
    ExtensionArray,
@@ -710,6 +714,7 @@ def _infer_types(
        use_nullable_dtypes: Literal[True] | Literal[False] = (
            self.use_nullable_dtypes and no_dtype_specified
        )
+        nullable_backend = get_option("io.nullable_backend")
        result: ArrayLike

        if try_num_bool and is_object_dtype(values.dtype):
@@ -767,6 +772,16 @@
            if inferred_type != "datetime":
                result = StringDtype().construct_array_type()._from_sequence(values)

+        if use_nullable_dtypes and nullable_backend == "pyarrow":
+            pa = import_optional_dependency("pyarrow")
+            if isinstance(result, np.ndarray):
+                result = ArrowExtensionArray(pa.array(result, from_pandas=True))
+            else:
+                # ExtensionArray
+                result = ArrowExtensionArray(
+                    pa.array(result.to_numpy(), from_pandas=True)
+                )

        return result, na_count

    def _cast_types(self, values: ArrayLike, cast_type: DtypeObj, column) -> ArrayLike:
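
The functional core of the base_parser.py change is the last hunk above: after type inference, the result is re-wrapped as a pyarrow-backed array when the option is set. A standalone sketch of that wrapping pattern, assuming pyarrow is installed (the variable names are illustrative, not the parser's internals):

import numpy as np
import pandas as pd
import pyarrow as pa

from pandas.arrays import ArrowExtensionArray

# numpy path: from_pandas=True makes pyarrow treat NaN as null while
# inferring the Arrow type from the numpy values
values = np.array([1.5, np.nan, 3.0])
arr = ArrowExtensionArray(pa.array(values, from_pandas=True))
print(arr.dtype)  # double[pyarrow]

# ExtensionArray path: round-trip through to_numpy(), as the diff does
masked = pd.array([1, None, 3], dtype="Int64")
arr_ea = ArrowExtensionArray(pa.array(masked.to_numpy(), from_pandas=True))
print(arr_ea.dtype)  # int64[pyarrow]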

pandas/tests/io/excel/test_readers.py (32 changes: 28 additions & 4 deletions)

@@ -536,7 +536,11 @@ def test_reader_dtype_str(self, read_ext, dtype, expected):
        actual = pd.read_excel(basename + read_ext, dtype=dtype)
        tm.assert_frame_equal(actual, expected)

-    def test_use_nullable_dtypes(self, read_ext):
+    @pytest.mark.parametrize(
+        "nullable_backend",
+        ["pandas", pytest.param("pyarrow", marks=td.skip_if_no("pyarrow"))],
+    )
+    def test_use_nullable_dtypes(self, read_ext, nullable_backend):
        # GH#36712
        if read_ext in (".xlsb", ".xls"):
            pytest.skip(f"No engine for filetype: '{read_ext}'")
@@ -557,10 +561,30 @@ def test_use_nullable_dtypes(self, read_ext):
        )
        with tm.ensure_clean(read_ext) as file_path:
            df.to_excel(file_path, "test", index=False)
-            result = pd.read_excel(
-                file_path, sheet_name="test", use_nullable_dtypes=True
-            )
+            with pd.option_context("io.nullable_backend", nullable_backend):
+                result = pd.read_excel(
+                    file_path, sheet_name="test", use_nullable_dtypes=True
+                )
+        if nullable_backend == "pyarrow":
+            import pyarrow as pa
+
+            from pandas.arrays import ArrowExtensionArray
+
+            expected = DataFrame(
+                {
+                    col: ArrowExtensionArray(pa.array(df[col], from_pandas=True))
+                    for col in df.columns
+                }
+            )
-        tm.assert_frame_equal(result, df)
+            # pyarrow by default infers timestamp resolution as us, not ns
+            expected["i"] = ArrowExtensionArray(
+                expected["i"].array._data.cast(pa.timestamp(unit="us"))
+            )
+            # pyarrow supports a null type, so don't have to default to Int64
+            expected["j"] = ArrowExtensionArray(pa.array([None, None]))

Review comment, from the PR author (Member):

@phofl I noticed that in _infer_types that result_mask.all() (all nulls I think) would default the dtype to Int64. Any specific reason why Int64 was chosen?

phofl (Member) replied on Nov 30, 2022:

Yes, if you do:

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan]])
df.convert_dtypes()

you get back Int64 for all columns. Not saying this is perfect, but it made sense to model it after the existing behavior.

The PR author replied:

Ah okay makes sense why this is the case then

+        else:
+            expected = df
+        tm.assert_frame_equal(result, expected)

    def test_use_nullabla_dtypes_and_dtype(self, read_ext):
        # GH#36712
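
The two expected-fixture adjustments in this test, and the Int64 question in the review thread, reflect observable behavior of pyarrow and of convert_dtypes. A small sketch of all three, assuming pyarrow is installed:

import datetime

import numpy as np
import pandas as pd
import pyarrow as pa

# pyarrow infers microsecond timestamps, while pandas datetime64 data is
# nanosecond-based, hence the test's explicit cast to pa.timestamp(unit="us")
print(pa.array([datetime.datetime(2019, 12, 31)]).type)  # timestamp[us]

# an all-null column can keep pyarrow's dedicated null type ...
print(pa.array([None, None]).type)  # null

# ... whereas the numpy-backed nullable path defaults all-NA columns to
# Int64, the convert_dtypes() behavior cited in the review discussion
print(pd.DataFrame([[np.nan, np.nan]]).convert_dtypes().dtypes)
# 0    Int64
# 1    Int64
# dtype: object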