-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ERR: Check that dtype_backend is valid #51871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There is some documentation that seems to be outdated, and added some ideas, but looks good.
doc/source/user_guide/io.rst
Outdated
implementation, even if no nulls are present. | ||
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames. | ||
Which dtype backend to use. If | ||
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be changed, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed already on the other pr, rebased now
pandas/_libs/parsers.pyx
Outdated
): | ||
use_nullable_dtypes = self.use_nullable_dtypes and col_dtype is None | ||
use_dtype_backend = self.dtype_backend != "numpy" and col_dtype is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the previous variable name use_nullable_dtypes
is still more appropriate here. No big deal, up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as no_default is used for numpy I like this more
@@ -980,7 +980,7 @@ def convert_object_array( | |||
---------- | |||
content: List[np.ndarray] | |||
dtype: np.dtype or ExtensionDtype | |||
use_nullable_dtypes: Controls if nullable dtypes are returned. | |||
dtype_backend: Controls if nullable dtypes are returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably ok, but maybe we can update this description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
pandas/io/json/_json.py
Outdated
if self.dtype_backend == "pyarrow": | ||
return pa_table.to_pandas(types_mapper=ArrowDtype) | ||
elif self.dtype_backend == "numpy_nullable": | ||
from pandas.io._util import _arrow_dtype_mapping | ||
|
||
mapping = _arrow_dtype_mapping() | ||
return pa_table.to_pandas(types_mapper=mapping.get) | ||
mapping = _arrow_dtype_mapping() | ||
return pa_table.to_pandas(types_mapper=mapping.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big deal, but I'd probably just set the value for the mapper in the condition, and have a single call at the end to pa_table.to_pandas(types_mapper=variable_set_depending_on_dtype_backend)
. To me it's clearer, but maybe just a personal preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
pandas/io/parquet.py
Outdated
mapping = _arrow_dtype_mapping() | ||
to_pandas_kwargs["types_mapper"] = mapping.get | ||
elif dtype_backend == "pyarrow": | ||
to_pandas_kwargs["types_mapper"] = pd.ArrowDtype # type: ignore[assignment] # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably an idea for a future refactoring more than for this PR, but feels like it could make sense to encapsulate most of the backend stuff in a class. Something like:
backend = _Backend('pyarrow') # does the validation
backend.mapping
backend.is_nullable
backend.is_default
...
IMO would make code a bit more readable, and also it'd probably make it easy to change things like the default in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea, but as a follow up as you said. Will look into it when we are through here
…ation # Conflicts: # doc/source/user_guide/io.rst # pandas/core/generic.py # pandas/core/tools/numeric.py # pandas/io/clipboards.py # pandas/io/excel/_base.py # pandas/io/feather_format.py # pandas/io/html.py # pandas/io/json/_json.py # pandas/io/orc.py # pandas/io/parquet.py # pandas/io/parsers/readers.py # pandas/io/spss.py # pandas/io/sql.py # pandas/io/xml.py # pandas/tests/io/parser/test_read_fwf.py # pandas/tests/io/test_clipboard.py # pandas/tests/io/test_sql.py # pandas/tests/io/xml/test_xml.py
pandas/io/json/_json.py
Outdated
if self.dtype_backend == "pyarrow": | ||
return pa_table.to_pandas(types_mapper=ArrowDtype) | ||
mapping = ArrowDtype | ||
elif self.dtype_backend == "numpy_nullable": | ||
from pandas.io._util import _arrow_dtype_mapping | ||
|
||
mapping = _arrow_dtype_mapping() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be _arrow_dtype_mapping().get
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not sure if we have a test that hits this though, will look in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small items otherwise looks good
Co-authored-by: Matthew Roeschke <[email protected]>
This reverts commit 2186e5b.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks @phofl, really nice.
Out of curiosity, was it discussed to make the numpy nullable backend the default for pandas 2? Feels like it should be mature enough, and a major version seems the best for making the change, so I assume we're leaving it for 3.0 at least?
…ation # Conflicts: # pandas/tests/io/test_orc.py
Quick thoughts:
Implementation wise:
Those things have to be solved before we can start discussing it |
Yep, makes sense. On a second thought, if we think pyarrow will eventually be the default, not sure if it's worth to change to nullable first, and then change it again. In any case, great work with the new interface to change the backend, thanks for taking care of it! |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Yeah that's something else to consider. I just wanted to express that there is some work to do before we can start discussing these topics |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.this sits on top of the actual pr