Remove use_nullable_dtypes and add dtype_backend keyword #51853
Conversation
Looks good in general. I do think it'd make things much easier to split this into multiple PRs, in particular for you, and we could work on the different components in parallel so you don't need to do all the work yourself. In any case, great that you already have this, and since it looks like the CI is almost green here, it's also ok to merge all together; it makes reviewing more complex than doing it by parts, but fine with me. Only two general questions:
I guess I'm missing something; I don't see the advantage.
Answers to your two questions: We agreed that no one should set numpy explicitly as dtype_backend right now, so we don't allow it. I'll create a follow-up PR that raises on invalid inputs after this is in. I am using `use_dtype_backend` in situations where a second condition is involved (like in read_csv, where a supplied dtype takes precedence over the dtype_backend) or where we'd otherwise have to check for the default many times, which made the code less readable to me.
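To make the precedence point concrete, here is a minimal sketch (not from the PR itself; the column names and CSV content are invented, assuming pandas with the new keyword available):

```python
from io import StringIO

import pandas as pd

data = StringIO("a,b\n1,x\n2,y\n")

# The backend applies to columns without an explicit dtype; the dtype
# requested for "a" takes precedence over dtype_backend, as described above.
df = pd.read_csv(data, dtype={"a": "float64"}, dtype_backend="numpy_nullable")
```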
I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use `"numpy"` explicitly?
Makes sense, thanks for clarifying; it wasn't obvious from the diff.
I think the rationale is to set up a future where we can more easily change the default dtype backend.
Co-authored-by: Matthew Roeschke <[email protected]>
@@ -144,16 +143,15 @@ def _convert_arrays_to_dataframe(
     data,
     columns,
     coerce_float: bool = True,
-    use_nullable_dtypes: bool = False,
+    dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
For these internal usages of `dtype_backend`, is there a particular reason why you're using `"numpy"` instead of `no_default`?
We are checking `dtype_backend == "numpy"` quite often in the sql code, and this reads better than checking `dtype_backend is / is not lib.no_default`, which clutters the if conditions too much imo.
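Purely as an illustration of the two styles being compared here (the helper names below are invented and not the actual sql internals):

```python
from pandas._libs import lib


def _convert_with_sentinel(dtype_backend=lib.no_default):
    # sentinel default: conditions read as identity checks against no_default
    if dtype_backend is not lib.no_default:
        ...  # nullable / pyarrow handling
    ...


def _convert_with_numpy_default(dtype_backend="numpy"):
    # explicit "numpy" default: conditions read as a simple string comparison
    if dtype_backend != "numpy":
        ...  # nullable / pyarrow handling
    ...
```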
Okay, that's fine then. Can you make sure we have tests where calling our public-facing functions with `dtype_backend="numpy"` raises a `ValueError`?
We don't yet. I wanted to add a check for all functions in a follow-up; that would have made this PR even larger.
Sure. Let's make sure we have that check in before we release the RC.
I'll start a PR that builds on top of this one.
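A rough sketch of what such a validation test could look like (the test name and file contents are placeholders; it assumes the public readers reject `dtype_backend="numpy"` with a `ValueError`, as agreed above):

```python
import pytest

import pandas as pd


def test_read_csv_rejects_numpy_dtype_backend(tmp_path):
    path = tmp_path / "data.csv"
    path.write_text("a\n1\n2\n")
    # "numpy" should not be accepted by the public keyword
    with pytest.raises(ValueError):
        pd.read_csv(path, dtype_backend="numpy")
```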
@mroeschke @phofl I'm still a bit confused, sorry. If we allow a `dtype_backend` keyword, why not also allow `"numpy"` as an explicit value?
@jorisvandenbossche mind chiming in with your original reason for not having a `"numpy"` option?
How would this interact with the `dtype` keyword? Should that overwrite the dtype backend, or should the pyarrow equivalent be returned?
Explicit dtypes take precedence.
My bad, did not read the comment above.
This is also the most logical thing. A possible use case: users want numpy dtypes except for string columns, which should be arrow-backed, for example. There are probably more cases where mixing to a certain extent is beneficial.
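A minimal sketch of that use case (invented data; assumes pyarrow is installed): keep the default numpy-backed dtypes overall, but request a pyarrow-backed string column via the explicit `dtype`, which takes precedence as discussed above.

```python
from io import StringIO

import pandas as pd

data = StringIO("id,name\n1,alice\n2,bob\n")

# default backend for everything else, pyarrow-backed strings for "name"
df = pd.read_csv(data, dtype={"name": "string[pyarrow]"})
```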
Getting back to the numpy backend: right now a numpy backend doesn't make any sense, since it's the default; nobody should have to request it explicitly. In a future where we potentially switch the default backend (not sure if/when we get there), we would have to support the new default backend everywhere. That would also mean we would need a global option to opt in, and this is where a numpy backend could make sense, depending on what exactly we want to do. At that point we can either get rid of the keyword or add a new option (probably the first). So right now it doesn't help anyone and also doesn't block anything in the future.
The changes here LGTM
Yes, that sums it up for me. There is really no reason that anyone should now explicitly do `dtype_backend="numpy"`. (One could also argue that the name "numpy" is slightly confusing/incorrect, since they are not all numpy dtypes (categorical, datetimetz, interval, etc.), although they all use "numpy semantics", one could say.)
doc/source/user_guide/io.rst
implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
    Which dtype backend to use. If
    set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Is True also an option here?
Good catch, this shouldn't accept True
Thx, updated
msg += (
    "Use dtype_backend='numpy_nullable' instead of use_nullable_dtype=True."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
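For context, the migration this warning points users to would look roughly like the following for read_parquet, one of the readers where use_nullable_dtypes already existed (the file path is a placeholder):

```python
import pandas as pd

# old spelling, now deprecated:
# df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)

# new spelling:
df = pd.read_parquet("data.parquet", dtype_backend="numpy_nullable")
```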
What would people think about not yet actually deprecating this keyword (for parquet, and feather?, i.e. where it already existed)?
That gives us a bit more time to see if what we decide now with dtype_backend is actually the way to go.
Hmm, we could still un-deprecate this later in a 2.x release. I think I would rather prefer being consistent now with `dtype_backend` than allowing both `use_nullable_dtypes` and `dtype_backend`.
Yeah, agreed, I'd rather make this consistent as well.
pandas/io/orc.py
use_nullable_dtypes : bool, default False
    Whether or not to use nullable dtypes as default when reading data. If
    set to True, nullable dtypes are used for all dtypes that have a nullable
    implementation, even if no nulls are present.

    .. note::

        The nullable dtype implementation can be configured by calling
        ``pd.set_option("mode.dtype_backend", "pandas")`` to use
        numpy-backed nullable dtypes or
        ``pd.set_option("mode.dtype_backend", "pyarrow")`` to use
        pyarrow-backed nullable dtypes (using ``pd.ArrowDtype``).
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
    Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
    arrays, nullable extension dtypes or pyarrow dtypes.
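A hedged usage sketch for the new keyword in read_orc ("data.orc" is a placeholder path; pyarrow is required):

```python
import pandas as pd

# default: numpy-backed DataFrame
df_default = pd.read_orc("data.orc")

# nullable extension dtypes (e.g. Int64, boolean, string)
df_nullable = pd.read_orc("data.orc", dtype_backend="numpy_nullable")

# pyarrow-backed dtypes via pd.ArrowDtype
df_arrow = pd.read_orc("data.orc", dtype_backend="pyarrow")
```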
Commenting here, but it's about the docstring addition in general:
- I would keep the notion of "nullable dtypes are used for all dtypes that have a nullable implementation, even if no nulls are present" for the "numpy_nullable" case (and can mention that for pyarrow it is used for all dtypes); a small illustration follows below.
- I would leave out the "extension", since the pyarrow ones are also based on extension dtypes, and some of the default numpy ones actually are as well, so the "should have NumPy arrays" is not fully correct (but that's maybe too nitpicky for the docstring).
- I think we should mention everywhere briefly that setting this is still experimental.
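A small illustration of the wording point above, using DataFrame.convert_dtypes, which takes the same keyword (the printed dtype names in the comments are approximate):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# "numpy_nullable": the nullable implementation is used even though no nulls are present
print(df.convert_dtypes(dtype_backend="numpy_nullable").dtypes)  # e.g. a: Int64, b: string

# "pyarrow": every column gets an Arrow-backed dtype
print(df.convert_dtypes(dtype_backend="pyarrow").dtypes)  # e.g. a: int64[pyarrow], b: string[pyarrow]
```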
Updated the docstrings
Merging this to rebase the other PR. Happy to address comments regarding wording in a follow-up.
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the `Still Needs Manual Backport` label once the PR gets merged. If these instructions are inaccurate, feel free to suggest an improvement.
@mroeschke this is heavy unfortunately. It should be ready for a first look at wording and similar; I expect tests to fail since the http tests timed out locally, which made running tests tricky. Will fix when CI is through.
I think we could theoretically split this into separate PRs; let me know what you prefer.