ENH: Implement io.nullable_backend config for read_parquet #49039


Merged (10 commits, Oct 22, 2022)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.6.0.rst
@@ -33,6 +33,7 @@ Other enhancements
- :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support an ``axis`` argument. If ``axis`` is set, the default behaviour of which axis to consider can be overwritten (:issue:`47819`)
- :func:`assert_frame_equal` now shows the first element where the DataFrames differ, analogously to ``pytest``'s output (:issue:`47910`)
- Added new argument ``use_nullable_dtypes`` to :func:`read_csv` to enable automatic conversion to nullable dtypes (:issue:`36712`)
- Added new global configuration ``io.nullable_backend`` to allow ``use_nullable_dtypes=True`` to return pyarrow-backed dtypes when set to ``"pyarrow"`` in :func:`read_parquet` (:issue:`48957`)
Member

Should engine default to arrow if the option is set to arrow?

Member Author

I'd be a little hesitant to override a user specifying pd.read_*(..., engine="not-arrow") by having this option force arrow to be used instead.

Member

Agreed, but we could make engine=NoDefault and only override if not given.

But I agree that we should consider this carefully

Member Author

Fair. Another consideration is that not all IO methods will (immediately) have associated arrow parsers, e.g. read_html.

- Added ``index`` parameter to :meth:`DataFrame.to_dict` (:issue:`46398`)
- Added metadata propagation for binary operators on :class:`DataFrame` (:issue:`28283`)
- :class:`.CategoricalConversionWarning`, :class:`.InvalidComparison`, :class:`.InvalidVersion`, :class:`.LossySetitemError`, and :class:`.NoBufferPresent` are now exposed in ``pandas.errors`` (:issue:`27656`)
14 changes: 14 additions & 0 deletions pandas/core/config_init.py
@@ -731,6 +731,20 @@ def use_inf_as_na_cb(key) -> None:
validator=is_one_of_factory(["auto", "sqlalchemy"]),
)

io_nullable_backend_doc = """
: string
    The nullable dtype implementation to return when ``use_nullable_dtypes=True``.
    Available options: 'pandas', 'pyarrow'. The default is 'pandas'.
"""

with cf.config_prefix("io"):
    # Registered under the "io" prefix, so the full key is "io.nullable_backend"
    # and get_option("io.nullable_backend") resolves as used in pandas/io/parquet.py.
    cf.register_option(
        "nullable_backend",
        "pandas",
        io_nullable_backend_doc,
        validator=is_one_of_factory(["pandas", "pyarrow"]),
    )

# --------
# Plotting
# ---------
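The `is_one_of_factory` validator rejects unsupported values at `set_option` time rather than when the option is later read. A minimal pure-Python sketch of that pattern (a standalone illustration, not the pandas implementation):

```python
def is_one_of_factory(allowed):
    """Build a validator that raises ValueError for values outside `allowed`."""

    def validator(value):
        if value not in allowed:
            raise ValueError(f"Value must be one of {allowed!r}, got {value!r}")

    return validator


# Mirrors the registration above: only 'pandas' and 'pyarrow' are accepted.
validate_backend = is_one_of_factory(["pandas", "pyarrow"])
validate_backend("pyarrow")  # accepted silently

try:
    validate_backend("polars")
except ValueError as exc:
    print(exc)
```

Failing fast at set time means a typo in the backend name surfaces immediately instead of deep inside a later `read_parquet` call.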
55 changes: 38 additions & 17 deletions pandas/io/parquet.py
@@ -22,6 +22,7 @@
 from pandas import (
     DataFrame,
     MultiIndex,
+    arrays,
     get_option,
 )
 from pandas.core.shared_docs import _shared_docs
@@ -219,25 +220,27 @@ def read(
     ) -> DataFrame:
         kwargs["use_pandas_metadata"] = True
 
+        nullable_backend = get_option("io.nullable_backend")
         to_pandas_kwargs = {}
         if use_nullable_dtypes:
             import pandas as pd
 
-            mapping = {
-                self.api.int8(): pd.Int8Dtype(),
-                self.api.int16(): pd.Int16Dtype(),
-                self.api.int32(): pd.Int32Dtype(),
-                self.api.int64(): pd.Int64Dtype(),
-                self.api.uint8(): pd.UInt8Dtype(),
-                self.api.uint16(): pd.UInt16Dtype(),
-                self.api.uint32(): pd.UInt32Dtype(),
-                self.api.uint64(): pd.UInt64Dtype(),
-                self.api.bool_(): pd.BooleanDtype(),
-                self.api.string(): pd.StringDtype(),
-                self.api.float32(): pd.Float32Dtype(),
-                self.api.float64(): pd.Float64Dtype(),
-            }
-            to_pandas_kwargs["types_mapper"] = mapping.get
+            if nullable_backend == "pandas":
+                mapping = {
+                    self.api.int8(): pd.Int8Dtype(),
+                    self.api.int16(): pd.Int16Dtype(),
+                    self.api.int32(): pd.Int32Dtype(),
+                    self.api.int64(): pd.Int64Dtype(),
+                    self.api.uint8(): pd.UInt8Dtype(),
+                    self.api.uint16(): pd.UInt16Dtype(),
+                    self.api.uint32(): pd.UInt32Dtype(),
+                    self.api.uint64(): pd.UInt64Dtype(),
+                    self.api.bool_(): pd.BooleanDtype(),
+                    self.api.string(): pd.StringDtype(),
+                    self.api.float32(): pd.Float32Dtype(),
+                    self.api.float64(): pd.Float64Dtype(),
+                }
+                to_pandas_kwargs["types_mapper"] = mapping.get
         manager = get_option("mode.data_manager")
         if manager == "array":
             to_pandas_kwargs["split_blocks"] = True  # type: ignore[assignment]
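Passing `mapping.get` as pyarrow's `types_mapper` works because `dict.get` returns `None` for any Arrow type missing from the mapping, and `Table.to_pandas` treats `None` as "use the default conversion for this column". A small sketch of that contract, with string stand-ins for the real pyarrow type and pandas dtype objects:

```python
# Stand-ins for the real objects: keys would be pyarrow types such as
# self.api.int64(), values pandas nullable dtypes such as pd.Int64Dtype().
mapping = {
    "int64": "Int64",
    "bool": "boolean",
}

# pyarrow calls the mapper once per column type; None means "no override".
assert mapping.get("int64") == "Int64"   # mapped -> nullable pandas dtype
assert mapping.get("date32") is None     # unmapped -> pyarrow's default conversion
```

This is why only the dtypes listed in the mapping come back as nullable extension dtypes; everything else (dates, decimals, nested types) keeps pyarrow's default pandas conversion.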
@@ -249,9 +252,20 @@ def read(
             mode="rb",
         )
         try:
-            result = self.api.parquet.read_table(
+            pa_table = self.api.parquet.read_table(
                 path_or_handle, columns=columns, **kwargs
-            ).to_pandas(**to_pandas_kwargs)
+            )
+            if nullable_backend == "pandas":
+                result = pa_table.to_pandas(**to_pandas_kwargs)
+            elif nullable_backend == "pyarrow":
+                result = DataFrame(
+                    {
+                        col_name: arrays.ArrowExtensionArray(pa_col)
+                        for col_name, pa_col in zip(
+                            pa_table.column_names, pa_table.itercolumns()
+                        )
+                    }
+                )
             if manager == "array":
                 result = result._as_manager("array", copy=False)
             return result
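The pyarrow branch pairs `Table.column_names` with `Table.itercolumns()` to build a plain dict of columns, then hands each Arrow column to `ArrowExtensionArray`. The zip pattern itself can be exercised without pyarrow; here `FakeTable` is a hypothetical pure-Python stand-in exposing only the two accessors the branch relies on:

```python
class FakeTable:
    """Minimal stand-in for pyarrow.Table, for illustrating the zip pattern."""

    def __init__(self, columns):
        self._columns = columns  # name -> column data

    @property
    def column_names(self):
        return list(self._columns)

    def itercolumns(self):
        return iter(self._columns.values())


table = FakeTable({"a": [1, 2], "b": ["x", "y"]})
# Same pairing as in read(): each name is zipped with its column in order.
data = {name: col for name, col in zip(table.column_names, table.itercolumns())}
print(data)  # {'a': [1, 2], 'b': ['x', 'y']}
```

In the real code each value in the dict is wrapped in `arrays.ArrowExtensionArray` before the `DataFrame` is constructed, so no conversion to numpy happens and the Arrow buffers are kept as-is.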
@@ -492,6 +506,13 @@ def read_parquet(

    .. versionadded:: 1.2.0

    The nullable dtype implementation can be configured by setting the global
    ``io.nullable_backend`` configuration option to ``"pandas"`` to use
    numpy-backed nullable dtypes or ``"pyarrow"`` to use pyarrow-backed
    nullable dtypes (using ``pd.ArrowDtype``).

    .. versionadded:: 2.0.0

**kwargs
    Any additional kwargs are passed to the engine.

37 changes: 37 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -1021,6 +1021,43 @@ def test_read_parquet_manager(self, pa, using_array_manager):
        else:
            assert isinstance(result._mgr, pd.core.internals.BlockManager)

    def test_read_use_nullable_types_pyarrow_config(self, pa, df_full):
Member

This is failing in the min versions build

Member Author

Thanks, fixed

        import pyarrow

        df = df_full

        # additional supported types for pyarrow
        dti = pd.date_range("20130101", periods=3, tz="Europe/Brussels")
        dti = dti._with_freq(None)  # freq doesn't round-trip
        df["datetime_tz"] = dti
        df["bool_with_none"] = [True, None, True]

        pa_table = pyarrow.Table.from_pandas(df)
        expected = pd.DataFrame(
            {
                col_name: pd.arrays.ArrowExtensionArray(pa_column)
                for col_name, pa_column in zip(
                    pa_table.column_names, pa_table.itercolumns()
                )
            }
        )
        # pyarrow infers datetimes as us instead of ns
        expected["datetime"] = expected["datetime"].astype("timestamp[us][pyarrow]")
        expected["datetime_with_nat"] = expected["datetime_with_nat"].astype(
            "timestamp[us][pyarrow]"
        )
        expected["datetime_tz"] = expected["datetime_tz"].astype(
            pd.ArrowDtype(pyarrow.timestamp(unit="us", tz="Europe/Brussels"))
        )

        with pd.option_context("io.nullable_backend", "pyarrow"):
            check_round_trip(
                df,
                engine=pa,
                read_kwargs={"use_nullable_dtypes": True},
                expected=expected,
            )
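`check_round_trip` writes the frame with the given engine, reads it back with `read_kwargs`, and compares the result to `expected`. The general write/read/compare pattern can be sketched generically, here using JSON in place of parquet (the helper and its parameters are hypothetical illustrations, not the pandas test utility):

```python
import io
import json


def check_round_trip(obj, write, read, expected=None):
    """Serialize into an in-memory buffer, read it back, compare."""
    buf = io.BytesIO()
    write(obj, buf)
    buf.seek(0)
    result = read(buf)
    expected = obj if expected is None else expected
    assert result == expected, f"round trip changed data: {result!r}"
    return result


result = check_round_trip(
    {"a": [1, 2, 3], "b": ["x", "y", "z"]},
    write=lambda obj, buf: buf.write(json.dumps(obj).encode("utf-8")),
    read=lambda buf: json.loads(buf.read().decode("utf-8")),
)
```

The `expected` parameter matters for exactly the situation in the test above: the round trip is not expected to reproduce the input verbatim (dtypes change to pyarrow-backed ones), so the comparison target is constructed explicitly rather than reusing the input.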


class TestParquetFastParquet(Base):
    def test_basic(self, fp, df_full):