Remove use_nullable_dtypes and add dtype_backend keyword #51853


Merged: 12 commits, Mar 13, 2023

Conversation

phofl
Member

@phofl phofl commented Mar 8, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@mroeschke this is heavy, unfortunately. Should be ready for a first look at wording and similar; I expect tests to fail, since the http tests timed out locally, which made running them tricky. Will fix when CI is through.

I think we could theoretically split this into:

  • sql
  • arrow stuff (orc, parquet, feather)
  • to_numeric and convert_dtypes
  • everything else

Let me know what you prefer

@datapythonista
Member

Looks good in general. I do think it'd make things much easier to split this into multiple PRs, in particular for you: we could work on the different components in parallel, so you don't need to do all the work yourself. In any case, it's great that you already have this, and since the CI looks almost green here, I'm also OK with merging it all together; it makes reviewing more complex than doing it by parts, but fine with me.

Only two general questions:

  • Why was it decided to use no_default instead of "numpy" for the numpy dtypes? From the PR, this seems to make things more complex and difficult to understand than using dtype_backend='numpy' as the default.

  • Why are we using use_dtype_backend: bool in some places? Again, it seems like propagating dtype_backend and checking whether it differs from numpy would be clearer (if that's what it means to use the dtype backend, which is not very clear)

I guess I'm missing something, I don't see the advantage.

@phofl
Member Author

phofl commented Mar 9, 2023

Answers to your two questions:

We agreed that no one should set numpy explicitly as dtype_backend right now, so we don’t allow it. I’ll create a follow-up PR raising on invalid inputs after this is in.

I am using use_dtype_backend in situations where there is a second condition involved (like in read_csv, where a supplied dtype takes precedence over the dtype_backend) or where we’d have to check for the default many times; that made it less readable to me.
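The two conventions described here might be sketched like this (a hypothetical illustration, not the actual pandas internals; `read_example` and its simplified precedence rule are assumptions):

```python
no_default = object()  # stand-in for pandas' internal ``lib.no_default`` sentinel


def read_example(dtype=None, dtype_backend=no_default):
    # "numpy" is deliberately not an accepted value: numpy semantics are the
    # implicit default, signalled by leaving ``dtype_backend`` at the sentinel.
    if dtype_backend is not no_default and dtype_backend not in (
        "numpy_nullable",
        "pyarrow",
    ):
        raise ValueError(f"invalid dtype_backend: {dtype_backend!r}")

    # Fold the sentinel check plus a second condition (a supplied ``dtype``
    # takes precedence, simplified here to all-or-nothing) into one boolean,
    # which keeps the downstream code readable.
    use_dtype_backend = dtype_backend is not no_default and dtype is None
    return use_dtype_backend
```

With this shape, downstream code branches on a single `use_dtype_backend` flag instead of repeating the sentinel comparison.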

@datapythonista
Member

We agreed that no one should set numpy explicitly as dtype_backend right now

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

i am using use_dtype_backend in situations where there is a second condition involved (like in read_csv where a supplied dtype takes precedence over the dtype_backend) or when we’d have to check many times for the default, this made it less readable to me

Makes sense, thanks for clarifying, it wasn't obvious from the diff.

@mroeschke
Member

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

I think the rationale is to set up a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

phofl and others added 2 commits March 9, 2023 21:24
@@ -144,16 +143,15 @@ def _convert_arrays_to_dataframe(
     data,
     columns,
     coerce_float: bool = True,
-    use_nullable_dtypes: bool = False,
+    dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
Member

For these internal usages of dtype_backend, is there a particular reason why you're using "numpy" instead of no_default?

Member Author

We check dtype_backend == "numpy" quite often in the sql code, and this reads better than checking dtype_backend is/is not lib.no_default, which clutters the if conditions too much imo.
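The tradeoff might look like this in a toy sketch (hypothetical helper names; the real pandas/io/sql.py code differs):

```python
no_default = object()  # stand-in for pandas' lib.no_default sentinel


def read_sql_public(query, dtype_backend=no_default):
    # Translate the public sentinel into the plain string "numpy" once at the
    # boundary, so every internal check further down stays short and readable.
    backend = "numpy" if dtype_backend is no_default else dtype_backend
    return _convert_arrays(query, dtype_backend=backend)


def _convert_arrays(query, dtype_backend="numpy"):
    # ``dtype_backend == "numpy"`` reads more naturally in repeated
    # if-conditions than ``dtype_backend is lib.no_default`` would.
    if dtype_backend == "numpy":
        return ("numpy", query)
    return (dtype_backend, query)
```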

Member

Okay that's fine then. Can you make sure we have tests where our public facing function with dtype_backend="numpy" raises a ValueError?
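Such a test might look like the following sketch (an assumption about the follow-up validation: it raises ValueError mentioning "dtype_backend", which is what pandas 2.x does; requires pytest and pandas >= 2.0):

```python
from io import StringIO

import pandas as pd
import pytest


@pytest.mark.parametrize("backend", ["numpy", "pandas", ""])
def test_read_csv_rejects_invalid_dtype_backend(backend):
    # Public functions accept only "numpy_nullable" and "pyarrow";
    # numpy semantics are requested implicitly by omitting the keyword.
    with pytest.raises(ValueError, match="dtype_backend"):
        pd.read_csv(StringIO("a,b\n1,2\n"), dtype_backend=backend)
```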

Member Author

We don't yet. I wanted to add a check for all functions in a follow-up; it would have made this PR even larger.

Member

Sure. Let's make sure we have that check in place before we release the RC.

Member Author

I’ll start a PR that builds on top of this one.

@mroeschke mroeschke added NA - MaskedArrays Related to pd.NA and nullable extension arrays Arrow pyarrow functionality labels Mar 9, 2023
@mroeschke mroeschke added this to the 2.0 milestone Mar 9, 2023
@datapythonista
Member

I think the rationale is to set up a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

@mroeschke @phofl I'm still a bit confused, sorry. If we allow dtype_backend='numpy', then users can use it and be sure that a future change in the pandas default won't change their backend. I fail to see how removing the numpy backend option helps.

@mroeschke
Member

@jorisvandenbossche mind chiming in with your original reasoning for not having a dtype_backend="numpy" option? (we didn't document this sufficiently in our meeting notes)

@lithomas1
Member

lithomas1 commented Mar 10, 2023

How would this interact with the dtype keyword for functions like read_csv?

Should that override the dtype backend, or should the pyarrow equivalent be returned?
e.g.

>>> import pandas as pd
>>> pd.options.mode.nullable_dtypes = True
>>> pd.options.mode.dtype_backend = "pyarrow"
>>> from io import StringIO
>>> pd.read_csv(StringIO("a,b,c\n1,2,3"), dtype="int64")

@phofl
Member Author

phofl commented Mar 10, 2023

Explicit dtypes take precedence

@lithomas1
Member

lithomas1 commented Mar 10, 2023

My bad, I did not read the comment above.

@phofl
Member Author

phofl commented Mar 10, 2023

This is also the most logical behavior. A possible use case: users want numpy dtypes except for string columns, which should be arrow-backed, for example. There are probably more cases where mixing to a certain extent is beneficial.
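That precedence can be sketched as follows (assuming pandas >= 2.0; this variation uses the numpy_nullable backend to avoid a pyarrow dependency, and the column names and dtypes are illustrative):

```python
from io import StringIO

import pandas as pd

# An explicit dtype wins for its column; dtype_backend covers the rest.
df = pd.read_csv(
    StringIO("a,b\n1,x\n2,y"),
    dtype={"a": "float64"},          # explicit dtype takes precedence for "a"
    dtype_backend="numpy_nullable",  # applies to the remaining column "b"
)
```

Column "a" keeps the requested numpy float64, while "b" gets the nullable "string" dtype from the backend.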

@phofl
Member Author

phofl commented Mar 10, 2023

Getting back to the numpy backend: right now a numpy backend doesn’t make any sense; it’s the default, so nobody should have to request it explicitly. In a future where we potentially switch the default backend option (not sure if/when we get there), we would have to support the new default backend everywhere. That would also mean we need a global option to opt in, and that is where a numpy backend could make sense, depending on what exactly we want to do. At that point we can either get rid of the keyword or add a new option (probably the first). So right now it doesn’t help anyone, and it also doesn’t block anything in the future.

@mroeschke (Member) left a comment

The changes here LGTM

@jorisvandenbossche
Member

So right now it doesn’t help anyone and also doesn’t block anything in the future

Yes, that sums it up for me. There is really no reason anyone should explicitly pass dtype_backend="numpy" right now, so why would we allow it?
Not having it keeps things more flexible for whatever we want to do later (we can still add it if at some point that makes sense).

(one could also argue that the name "numpy" is slightly confusing/incorrect, since these are not all numpy dtypes (categorical, datetimetz, interval, etc.), although one could say they all use "numpy semantics")

implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
Which dtype backend to use. If
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Member

Is True also an option here?

Member

Good catch, this shouldn't accept True

Member Author

Thx, updated

msg += (
"Use dtype_backend='numpy_nullable' instead of use_nullable_dtype=True."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
Member

What would people think about not yet actually deprecating this keyword (for parquet (and feather?), i.e. where it already existed)?
That would give us a bit more time to see whether what we decide now with dtype_backend is actually the way to go.

Member

Hmm, we could still un-deprecate this later in a 2.x release. I think I would prefer being consistent now with dtype_backend rather than allowing both use_nullable_dtypes and dtype_backend.

Member Author

Yeah, agreed. I'd rather make this consistent as well.

pandas/io/orc.py Outdated
Comment on lines 68 to 66
use_nullable_dtypes : bool, default False
Whether or not to use nullable dtypes as default when reading data. If
set to True, nullable dtypes are used for all dtypes that have a nullable
implementation, even if no nulls are present.

.. note::

The nullable dtype implementation can be configured by calling
``pd.set_option("mode.dtype_backend", "pandas")`` to use
numpy-backed nullable dtypes or
``pd.set_option("mode.dtype_backend", "pyarrow")`` to use
pyarrow-backed nullable dtypes (using ``pd.ArrowDtype``).
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
arrays, nullable extension dtypes or pyarrow dtypes.
Member

Commenting here, but it's about the docstring addition in general:

  • I would keep the notion of "nullable dtypes are used for all dtypes that have a nullable implementation, even if no nulls are present" for the "numpy_nullable" case (and can mention that for pyarrow it is used for all dtypes)
  • I would leave out "extension", since the pyarrow ones are also based on extension dtypes. And some of the default numpy ones actually are as well, so "should have NumPy arrays" is not fully correct (but that's maybe too nitpicky for the docstring)
  • I think we should briefly mention everywhere that setting this is still experimental.

Member Author

Updated the docstrings

@phofl
Member Author

phofl commented Mar 13, 2023

Merging this to rebase the other PR. Happy to address comments regarding wording in a follow-up.

@phofl phofl merged commit ab76540 into pandas-dev:main Mar 13, 2023
@phofl phofl deleted the dtype_backend branch March 13, 2023 18:20
@lumberbot-app

lumberbot-app bot commented Mar 13, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     git checkout 2.0.x
     git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
     git cherry-pick -x -m1 ab76540ada429ffb6f10785f1623c47a997b5763
  3. You will likely have some merge/cherry-pick conflicts; fix them and commit:
     git commit -am 'Backport PR #51853: Remove use_nullable_dtypes and add dtype_backend keyword'
  4. Push to a named branch:
     git push YOURFORK 2.0.x:auto-backport-of-pr-51853-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #51853 on branch 2.0.x (Remove use_nullable_dtypes and add dtype_backend keyword)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.
