Remove use_nullable_dtypes and add dtype_backend keyword #51853


Merged: 12 commits, Mar 13, 2023

Conversation

phofl
Member

@phofl phofl commented Mar 8, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@mroeschke this is heavy, unfortunately. Should be ready for a first look at wording and similar; I expect tests to fail, since the http tests timed out locally, which made running them tricky. Will fix when CI is through.

I think we could theoretically split this into:

  • sql
  • arrow stuff (orc, parquet, feather)
  • to_numeric and convert_dtypes
  • everything else

Let me know what you prefer

@datapythonista
Member

Looks good in general. I do think it'd make things much easier to split this into multiple PRs, in particular for you: we could work on the different components in parallel, so you don't need to do all the work yourself. In any case, it's great that you already have this, and since the CI looks almost green here, I'm also OK with merging it all together; it makes reviewing more complex than doing it by parts, but fine with me.

Only two general questions:

  • Why was it decided to use no_default instead of "numpy" for the numpy dtypes? From the PR, this seems to make things more complex and difficult to understand than using dtype_backend='numpy' as the default.

  • Why are we using use_dtype_backend: bool in some places? Again, it seems like propagating dtype_backend and checking whether it differs from numpy would be clearer (if that's what it means to use the dtype backend, which is not very clear)

I guess I'm missing something, I don't see the advantage.

@phofl
Member Author

phofl commented Mar 9, 2023

Answers to your two questions:

We agreed that no one should set numpy explicitly as dtype_backend right now, so we don’t allow it. I’ll create a follow-up PR raising on invalid inputs after this is in.

I am using use_dtype_backend in situations where there is a second condition involved (like in read_csv, where a supplied dtype takes precedence over the dtype_backend) or where we’d have to check for the default many times; that made it less readable to me.
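The two conventions described here might be sketched like this (a hypothetical illustration, not the actual pandas internals; `read_example` and its simplified precedence rule are assumptions):

```python
no_default = object()  # stand-in for pandas' internal ``lib.no_default`` sentinel


def read_example(dtype=None, dtype_backend=no_default):
    # "numpy" is deliberately not an accepted value: numpy semantics are the
    # implicit default, signalled by leaving ``dtype_backend`` at the sentinel.
    if dtype_backend is not no_default and dtype_backend not in (
        "numpy_nullable",
        "pyarrow",
    ):
        raise ValueError(f"invalid dtype_backend: {dtype_backend!r}")

    # Fold the sentinel check plus a second condition (a supplied ``dtype``
    # takes precedence, simplified here to all-or-nothing) into one boolean,
    # which keeps the downstream code readable.
    use_dtype_backend = dtype_backend is not no_default and dtype is None
    return use_dtype_backend
```

With this shape, downstream code branches on a single `use_dtype_backend` flag instead of repeating the sentinel comparison.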

@datapythonista
Member

We agreed that no one should set numpy explicitly as dtype_backend right now

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

i am using use_dtype_backend in situations where there is a second condition involved (like in read_csv where a supplied dtype takes precedence over the dtype_backend) or when we’d have to check many times for the default, this made it less readable to me

Makes sense, thanks for clarifying, it wasn't obvious from the diff.

@mroeschke
Member

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

I think the rationale is to set up a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

phofl and others added 2 commits March 9, 2023 21:24
@@ -144,16 +143,15 @@ def _convert_arrays_to_dataframe(
     data,
     columns,
     coerce_float: bool = True,
-    use_nullable_dtypes: bool = False,
+    dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
Member

For these internal usages of dtype_backend, is there a particular reason why you're using "numpy" instead of no_default?

Member Author

We check dtype_backend == "numpy" quite often in the sql code, and this reads better than checking dtype_backend is/is not lib.no_default, which clutters the if conditions too much imo.
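The tradeoff might look like this in a toy sketch (hypothetical helper names; the real pandas/io/sql.py code differs):

```python
no_default = object()  # stand-in for pandas' lib.no_default sentinel


def read_sql_public(query, dtype_backend=no_default):
    # Translate the public sentinel into the plain string "numpy" once at the
    # boundary, so every internal check further down stays short and readable.
    backend = "numpy" if dtype_backend is no_default else dtype_backend
    return _convert_arrays(query, dtype_backend=backend)


def _convert_arrays(query, dtype_backend="numpy"):
    # ``dtype_backend == "numpy"`` reads more naturally in repeated
    # if-conditions than ``dtype_backend is lib.no_default`` would.
    if dtype_backend == "numpy":
        return ("numpy", query)
    return (dtype_backend, query)
```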

Member

Okay that's fine then. Can you make sure we have tests where our public facing function with dtype_backend="numpy" raises a ValueError?
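Such a test might look like the following sketch (an assumption about the follow-up validation: it raises ValueError mentioning "dtype_backend", which is what pandas 2.x does; requires pytest and pandas >= 2.0):

```python
from io import StringIO

import pandas as pd
import pytest


@pytest.mark.parametrize("backend", ["numpy", "pandas", ""])
def test_read_csv_rejects_invalid_dtype_backend(backend):
    # Public functions accept only "numpy_nullable" and "pyarrow";
    # numpy semantics are requested implicitly by omitting the keyword.
    with pytest.raises(ValueError, match="dtype_backend"):
        pd.read_csv(StringIO("a,b\n1,2\n"), dtype_backend=backend)
```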

Member Author

We don't yet. I wanted to add a check for all functions in a follow-up; it would have made this PR even larger.

Member

Sure. Let's make sure we have that check in place before we release the RC.

Member Author

I’ll start a PR that builds on top of this one.

@mroeschke mroeschke added NA - MaskedArrays Related to pd.NA and nullable extension arrays Arrow pyarrow functionality labels Mar 9, 2023
@mroeschke mroeschke added this to the 2.0 milestone Mar 9, 2023
@datapythonista
Member

I think the rationale is to set up a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

@mroeschke @phofl I'm still a bit confused, sorry. If we allow dtype_backend='numpy', then users can use it and be sure that a future change in the pandas default won't change their backend. I fail to see how removing the numpy backend option helps.

@mroeschke
Member

@jorisvandenbossche mind chiming in with your original reasoning for not having a dtype_backend="numpy" option? (we didn't document this sufficiently in our meeting notes)

@lithomas1
Member

lithomas1 commented Mar 10, 2023

How would this interact with the dtype keyword for functions like read_csv?

Should that override the dtype backend, or should the pyarrow equivalent be returned?
e.g.

>>> import pandas as pd
>>> pd.options.mode.nullable_dtypes = True
>>> pd.options.mode.dtype_backend = "pyarrow"
>>> from io import StringIO
>>> pd.read_csv(StringIO("a,b,c\n1,2,3"), dtype="int64")

@phofl
Member Author

phofl commented Mar 10, 2023

Explicit dtypes take precedence

@lithomas1
Member

lithomas1 commented Mar 10, 2023

My bad, I did not read the comment above.

@phofl
Member Author

phofl commented Mar 10, 2023

This is also the most logical behavior. A possible use case: users want numpy dtypes except for string columns, which should be arrow-backed, for example. There are probably more cases where mixing to a certain extent is beneficial.
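That precedence can be sketched as follows (assuming pandas >= 2.0; this variation uses the numpy_nullable backend to avoid a pyarrow dependency, and the column names and dtypes are illustrative):

```python
from io import StringIO

import pandas as pd

# An explicit dtype wins for its column; dtype_backend covers the rest.
df = pd.read_csv(
    StringIO("a,b\n1,x\n2,y"),
    dtype={"a": "float64"},          # explicit dtype takes precedence for "a"
    dtype_backend="numpy_nullable",  # applies to the remaining column "b"
)
```

Column "a" keeps the requested numpy float64, while "b" gets the nullable "string" dtype from the backend.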

@phofl
Member Author

phofl commented Mar 10, 2023

Getting back to the numpy backend: right now a numpy backend doesn’t make any sense; it’s the default, so nobody should have to request it explicitly. In a future where we potentially switch the default backend option (not sure if/when we get there), we would have to support the new default backend everywhere. That would also mean we need a global option to opt in, and that is where a numpy backend could make sense, depending on what exactly we want to do. At that point we can either get rid of the keyword or add a new option (probably the first). So right now it doesn’t help anyone, and it also doesn’t block anything in the future.

@mroeschke (Member) left a comment

The changes here LGTM

@jorisvandenbossche
Member

So right now it doesn’t help anyone and also doesn’t block anything in the future

Yes, that sums it up for me. There is really no reason anyone should explicitly pass dtype_backend="numpy" right now, so why would we allow it?
Not having it keeps things more flexible for whatever we want to do later (we can still add it if at some point that makes sense).

(one could also argue that the name "numpy" is slightly confusing/incorrect, since these are not all numpy dtypes (categorical, datetimetz, interval, etc.), although one could say they all use "numpy semantics")

implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
Which dtype backend to use. If
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Member

Is True also an option here?

Member

Good catch, this shouldn't accept True

Member Author

Thx, updated

msg += (
"Use dtype_backend='numpy_nullable' instead of use_nullable_dtype=True."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
Member

What would people think about not yet actually deprecating this keyword (for parquet (and feather?), i.e. where it already existed)?
That would give us a bit more time to see whether what we decide now with dtype_backend is actually the way to go.

Member

Hmm, we could still un-deprecate this later in a 2.x release. I think I would prefer being consistent now with dtype_backend rather than allowing both use_nullable_dtypes and dtype_backend.

Member Author

Yeah, agreed. I'd rather make this consistent as well.

pandas/io/orc.py Outdated
Comment on lines 68 to 66
use_nullable_dtypes : bool, default False
Whether or not to use nullable dtypes as default when reading data. If
set to True, nullable dtypes are used for all dtypes that have a nullable
implementation, even if no nulls are present.

.. note::

The nullable dtype implementation can be configured by calling
``pd.set_option("mode.dtype_backend", "pandas")`` to use
numpy-backed nullable dtypes or
``pd.set_option("mode.dtype_backend", "pyarrow")`` to use
pyarrow-backed nullable dtypes (using ``pd.ArrowDtype``).
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
arrays, nullable extension dtypes or pyarrow dtypes.
Member

Commenting here, but it's about the docstring addition in general:

  • I would keep the notion of "nullable dtypes are used for all dtypes that have a nullable implementation, even if no nulls are present" for the "numpy_nullable" case (and can mention that for pyarrow it is used for all dtypes)
  • I would leave out "extension", since the pyarrow ones are also based on extension dtypes. And some of the default numpy ones actually are as well, so "should have NumPy arrays" is not fully correct (but that's maybe too nitpicky for the docstring)
  • I think we should briefly mention everywhere that setting this is still experimental.

Member Author

Updated the docstrings

@phofl
Member Author

phofl commented Mar 13, 2023

Merging this to rebase the other PR. Happy to address comments regarding wording in a follow-up.

@phofl phofl merged commit ab76540 into pandas-dev:main Mar 13, 2023
@phofl phofl deleted the dtype_backend branch March 13, 2023 18:20
@lumberbot-app

lumberbot-app bot commented Mar 13, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     git checkout 2.0.x
     git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
     git cherry-pick -x -m1 ab76540ada429ffb6f10785f1623c47a997b5763
  3. You will likely have some merge/cherry-pick conflicts; fix them and commit:
     git commit -am 'Backport PR #51853: Remove use_nullable_dtypes and add dtype_backend keyword'
  4. Push to a named branch:
     git push YOURFORK 2.0.x:auto-backport-of-pr-51853-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #51853 on branch 2.0.x (Remove use_nullable_dtypes and add dtype_backend keyword)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.
