ERR: Check that dtype_backend is valid #51871

Merged
phofl merged 21 commits into pandas-dev:main from dtype_backend_validation on Mar 14, 2023

Conversation

@phofl (Member) commented Mar 9, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This sits on top of the actual PR.
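For context, the validation this PR adds can be sketched roughly as follows. This is a minimal sketch, not the exact pandas implementation; the function name, the use of `None` in place of pandas' `no_default` sentinel, and the error message wording are assumptions.

```python
_VALID_BACKENDS = ("numpy_nullable", "pyarrow")


def check_dtype_backend(dtype_backend):
    # Sketch: None stands in for "no backend requested" (pandas uses a
    # no_default sentinel); any other value must be one of the two backends.
    if dtype_backend is not None and dtype_backend not in _VALID_BACKENDS:
        raise ValueError(
            f"dtype_backend {dtype_backend} is invalid, only 'numpy_nullable' "
            "and 'pyarrow' are allowed."
        )
```

With a check like this, typos such as `dtype_backend="nullable"` fail loudly at the API boundary instead of being silently ignored downstream.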

@phofl phofl added NA - MaskedArrays Related to pd.NA and nullable extension arrays Arrow pyarrow functionality labels Mar 9, 2023
@phofl phofl added this to the 2.0 milestone Mar 9, 2023
@datapythonista (Member) left a comment:

lgtm

Some of the documentation seems outdated, and I added a few ideas, but it looks good.

implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
Which dtype backend to use. If
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Member:

I think this needs to be changed, no?

Member Author:

Already changed on the other PR; rebased now.

):
use_nullable_dtypes = self.use_nullable_dtypes and col_dtype is None
use_dtype_backend = self.dtype_backend != "numpy" and col_dtype is None
Member:

Not sure if the previous variable name use_nullable_dtypes is still more appropriate here. No big deal, up to you.

Member Author:

As long as no_default is used for numpy, I like this more.

@@ -980,7 +980,7 @@ def convert_object_array(
----------
content: List[np.ndarray]
dtype: np.dtype or ExtensionDtype
use_nullable_dtypes: Controls if nullable dtypes are returned.
dtype_backend: Controls if nullable dtypes are returned.
Member:

Probably ok, but maybe we can update this description.

Member Author:

Updated

Comment on lines 949 to 955
if self.dtype_backend == "pyarrow":
return pa_table.to_pandas(types_mapper=ArrowDtype)
elif self.dtype_backend == "numpy_nullable":
from pandas.io._util import _arrow_dtype_mapping

mapping = _arrow_dtype_mapping()
return pa_table.to_pandas(types_mapper=mapping.get)
mapping = _arrow_dtype_mapping()
return pa_table.to_pandas(types_mapper=mapping.get)
Member:

No big deal, but I'd probably just set the value for the mapper in the condition, and have a single call at the end to pa_table.to_pandas(types_mapper=variable_set_depending_on_dtype_backend). To me it's clearer, but maybe just a personal preference.
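The suggested shape might look like the following. This is a hedged sketch: `select_types_mapper` is a hypothetical helper, and the stand-in arguments correspond to `pandas.ArrowDtype` and `_arrow_dtype_mapping()` in the quoted diff.

```python
def select_types_mapper(dtype_backend, arrow_dtype, nullable_mapping):
    """Pick the callable to pass as types_mapper, or None for the default."""
    if dtype_backend == "pyarrow":
        return arrow_dtype           # stand-in for pandas.ArrowDtype
    elif dtype_backend == "numpy_nullable":
        return nullable_mapping.get  # stand-in for _arrow_dtype_mapping().get
    return None


# Then a single conversion call at the end, instead of one per branch:
# pa_table.to_pandas(types_mapper=select_types_mapper(...))
```

Keeping one `to_pandas` call means the branches differ only in which mapper is chosen, which is the readability point the comment makes.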

Member Author:

Changed

mapping = _arrow_dtype_mapping()
to_pandas_kwargs["types_mapper"] = mapping.get
elif dtype_backend == "pyarrow":
to_pandas_kwargs["types_mapper"] = pd.ArrowDtype # type: ignore[assignment] # noqa
Member:

Probably an idea for a future refactoring more than for this PR, but feels like it could make sense to encapsulate most of the backend stuff in a class. Something like:

backend = _Backend('pyarrow')  # does the validation
backend.mapping
backend.is_nullable
backend.is_default
...

IMO would make code a bit more readable, and also it'd probably make it easy to change things like the default in the future.

Member Author:

Sounds like a good idea, but as a follow-up, as you said. I'll look into it once we're through here.

phofl added 2 commits March 13, 2023 18:33
…ation

# Conflicts:
#	doc/source/user_guide/io.rst
#	pandas/core/generic.py
#	pandas/core/tools/numeric.py
#	pandas/io/clipboards.py
#	pandas/io/excel/_base.py
#	pandas/io/feather_format.py
#	pandas/io/html.py
#	pandas/io/json/_json.py
#	pandas/io/orc.py
#	pandas/io/parquet.py
#	pandas/io/parsers/readers.py
#	pandas/io/spss.py
#	pandas/io/sql.py
#	pandas/io/xml.py
#	pandas/tests/io/parser/test_read_fwf.py
#	pandas/tests/io/test_clipboard.py
#	pandas/tests/io/test_sql.py
#	pandas/tests/io/xml/test_xml.py
@phofl phofl mentioned this pull request Mar 13, 2023
if self.dtype_backend == "pyarrow":
return pa_table.to_pandas(types_mapper=ArrowDtype)
mapping = ArrowDtype
elif self.dtype_backend == "numpy_nullable":
from pandas.io._util import _arrow_dtype_mapping

mapping = _arrow_dtype_mapping()
Member:

Does this need to be _arrow_dtype_mapping().get?

Member Author:

Yeah; not sure if we have a test that hits this, though. Will look in a follow-up.
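For reference on why `.get` matters here: `to_pandas` expects `types_mapper` to be a callable taking an Arrow type, while `_arrow_dtype_mapping()` returns a plain dict. A toy illustration (the mapping contents below are made up, not the real pandas mapping):

```python
mapping = {"int64": "Int64"}  # stand-in for _arrow_dtype_mapping()

# A dict itself is not callable, so it cannot serve as types_mapper...
assert not callable(mapping)

# ...but its bound .get method is, and it returns None for unmapped types,
# which signals "fall back to the default conversion".
assert callable(mapping.get)
assert mapping.get("int64") == "Int64"
assert mapping.get("float16") is None
```

So passing the mapping object itself would fail at conversion time, which is why a test hitting this branch would be worth having.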

@mroeschke (Member) left a comment:

Two small items, otherwise looks good.

@datapythonista (Member) left a comment:

lgtm, thanks @phofl, really nice.

Out of curiosity, was it discussed to make the numpy nullable backend the default for pandas 2? It feels mature enough, and a major version seems like the best time to make the change, so I assume we're leaving it for 3.0 at least?

@phofl (Member, Author) commented Mar 14, 2023

Quick thoughts:

  • Generally, nullables are still slower than NumPy itself (and need more memory)

Implementation-wise:

  • There is still no global option to opt in, similar to the pyarrow backend
  • NA vs NaN in Float64/Float32 is still open
  • A couple of other things

Those things have to be solved before we can start discussing it.

@datapythonista (Member):

Yep, makes sense. On second thought, if we think pyarrow will eventually be the default, I'm not sure it's worth changing to nullable first and then changing it again.

In any case, great work with the new interface to change the backend, thanks for taking care of it!

@phofl phofl merged commit dba0f66 into pandas-dev:main Mar 14, 2023
lumberbot-app bot commented Mar 14, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     git checkout 2.0.x
     git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
     git cherry-pick -x -m1 dba0f66785aff795f30afbf9135771d3c11b5a67
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     git commit -am 'Backport PR #51871: ERR: Check that dtype_backend is valid'
  4. Push to a named branch:
     git push YOURFORK 2.0.x:auto-backport-of-pr-51871-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #51871 on branch 2.0.x (ERR: Check that dtype_backend is valid)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@phofl (Member, Author) commented Mar 14, 2023

Yeah, that's something else to consider. I just wanted to point out that there is some work to do before we can start discussing these topics.

@phofl phofl deleted the dtype_backend_validation branch March 14, 2023 13:14
phofl added a commit to phofl/pandas that referenced this pull request Mar 14, 2023
phofl added a commit that referenced this pull request Mar 14, 2023
…valid) (#51964)

ERR: Check that dtype_backend is valid (#51871)