ENH: Implement io.nullable_backend config for read_parquet #49039

Merged on Oct 22, 2022 (10 commits)

mroeschke
Member

@mroeschke mroeschke commented Oct 11, 2022

@mroeschke mroeschke added the IO Data (IO issues that don't fit into a more specific label), IO Parquet (parquet, feather), and Arrow (pyarrow functionality) labels Oct 11, 2022
@mroeschke mroeschke added this to the 1.6 milestone Oct 11, 2022
@@ -33,6 +33,7 @@ Other enhancements
- :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support an ``axis`` argument. If ``axis`` is set, the default behaviour of which axis to consider can be overwritten (:issue:`47819`)
- :func:`assert_frame_equal` now shows the first element where the DataFrames differ, analogously to ``pytest``'s output (:issue:`47910`)
- Added new argument ``use_nullable_dtypes`` to :func:`read_csv` to enable automatic conversion to nullable dtypes (:issue:`36712`)
- Added new global configuration, ``io.nullable_backend`` to allow ``use_nullable_dtypes=True`` to return pyarrow-backed dtypes when set to ``"pyarrow"`` in :func:`read_parquet` (:issue:`48957`)
Member

Should engine default to arrow if the option is set to arrow?

Member Author

I'd be a little hesitant to let this option override a user who specifies pd.read_*(..., engine="not-arrow") and force arrow to be used instead.

Member

Agreed, but we could make engine=NoDefault and only override if not given.

But I agree that we should consider this carefully

Member Author

Fair. Another consideration is that not all IO methods will have associated arrow parsers (at least not immediately), e.g. read_html.
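
For readers following the thread, here is a minimal sketch of the "only override if not given" idea. It is not what this PR implements. It assumes a sentinel default for `engine` so that an explicit user choice can be told apart from an omitted argument. `OPTIONS`, `READERS_WITH_ARROW_ENGINE`, `resolve_engine`, and the `"python"` fallback are all hypothetical names made up for illustration, not pandas APIs.

```python
from enum import Enum


class _NoDefault(Enum):
    """Sentinel type: distinguishes 'argument not passed' from any real value."""

    no_default = "NO_DEFAULT"


no_default = _NoDefault.no_default

# Hypothetical stand-in for the global option discussed in this PR.
OPTIONS = {"io.nullable_backend": "pandas"}

# Hypothetical set of readers that have an arrow parser.
# read_html is deliberately absent, matching the concern raised above.
READERS_WITH_ARROW_ENGINE = {"read_csv", "read_parquet"}


def resolve_engine(reader, engine=no_default, fallback="python"):
    """Pick the engine for a reader without overriding an explicit choice."""
    if engine is not no_default:
        # The user passed engine=...; the global option never overrides it.
        return engine
    wants_arrow = OPTIONS["io.nullable_backend"] == "pyarrow"
    if wants_arrow and reader in READERS_WITH_ARROW_ENGINE:
        # Engine was omitted and an arrow parser exists for this reader.
        return "pyarrow"
    # Engine was omitted, and either the option is off or no arrow parser exists.
    return fallback


if __name__ == "__main__":
    OPTIONS["io.nullable_backend"] = "pyarrow"
    print(resolve_engine("read_csv"))              # option decides
    print(resolve_engine("read_csv", engine="c"))  # explicit choice wins
    print(resolve_engine("read_html"))             # no arrow parser, falls back
```

The sentinel matters because `None` or a string default cannot distinguish "the user asked for this engine" from "the user said nothing", which is exactly the case the option would be allowed to fill in.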

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
@@ -1021,6 +1021,43 @@ def test_read_parquet_manager(self, pa, using_array_manager):
        else:
            assert isinstance(result._mgr, pd.core.internals.BlockManager)

    def test_read_use_nullable_types_pyarrow_config(self, pa, df_full):
Member

This is failing in the min versions build

Member Author

Thanks, fixed

@mroeschke
Member Author

@phofl any follow-ups here?

@@ -1021,6 +1021,46 @@ def test_read_parquet_manager(self, pa, using_array_manager):
        else:
            assert isinstance(result._mgr, pd.core.internals.BlockManager)

    @pytest.mark.xfail(
        pa_version_under2p0, reason="Timezone conversion incorrect for pyarrow < 2.0"
Member

Can remove this

@phofl phofl merged commit 0dce285 into pandas-dev:main Oct 22, 2022
@phofl
Member

phofl commented Oct 22, 2022

thx @mroeschke

@mroeschke mroeschke deleted the enh/io/nullable_backend branch October 24, 2022 17:20
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022