ERR: Check that dtype_backend is valid #51871

Merged
phofl merged 21 commits into pandas-dev:main from dtype_backend_validation on Mar 14, 2023

Conversation

@phofl (Member) commented Mar 9, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This sits on top of the actual PR.
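For context, the validation this PR adds can be sketched roughly as follows. This is a minimal sketch, not the exact pandas implementation; the function name, the use of `None` in place of pandas' `no_default` sentinel, and the error message wording are assumptions.

```python
_VALID_BACKENDS = ("numpy_nullable", "pyarrow")


def check_dtype_backend(dtype_backend):
    # Sketch: None stands in for "no backend requested" (pandas uses a
    # no_default sentinel); any other value must be one of the two backends.
    if dtype_backend is not None and dtype_backend not in _VALID_BACKENDS:
        raise ValueError(
            f"dtype_backend {dtype_backend} is invalid, only 'numpy_nullable' "
            "and 'pyarrow' are allowed."
        )
```

With a check like this, typos such as `dtype_backend="nullable"` fail loudly at the API boundary instead of being silently ignored downstream.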

@phofl phofl added NA - MaskedArrays Related to pd.NA and nullable extension arrays Arrow pyarrow functionality labels Mar 9, 2023
@phofl phofl added this to the 2.0 milestone Mar 9, 2023
@datapythonista (Member) left a comment:

lgtm

Some of the documentation seems outdated, and I added a few ideas, but it looks good.

implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
Which dtype backend to use. If
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Member:

I think this needs to be changed, no?

Member Author:

Already changed on the other PR; rebased now.

):
use_nullable_dtypes = self.use_nullable_dtypes and col_dtype is None
use_dtype_backend = self.dtype_backend != "numpy" and col_dtype is None
Member:

Not sure if the previous variable name use_nullable_dtypes is still more appropriate here. No big deal, up to you.

Member Author:

As long as no_default is used for numpy, I like this more.

@@ -980,7 +980,7 @@ def convert_object_array(
----------
content: List[np.ndarray]
dtype: np.dtype or ExtensionDtype
use_nullable_dtypes: Controls if nullable dtypes are returned.
dtype_backend: Controls if nullable dtypes are returned.
Member:

Probably ok, but maybe we can update this description.

Member Author:

Updated

Comment on lines 949 to 955
if self.dtype_backend == "pyarrow":
return pa_table.to_pandas(types_mapper=ArrowDtype)
elif self.dtype_backend == "numpy_nullable":
from pandas.io._util import _arrow_dtype_mapping

mapping = _arrow_dtype_mapping()
return pa_table.to_pandas(types_mapper=mapping.get)
mapping = _arrow_dtype_mapping()
return pa_table.to_pandas(types_mapper=mapping.get)
Member:

No big deal, but I'd probably just set the value for the mapper in the condition, and have a single call at the end to pa_table.to_pandas(types_mapper=variable_set_depending_on_dtype_backend). To me it's clearer, but maybe just a personal preference.
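The suggested shape might look like the following. This is a hedged sketch: `select_types_mapper` is a hypothetical helper, and the stand-in arguments correspond to `pandas.ArrowDtype` and `_arrow_dtype_mapping()` in the quoted diff.

```python
def select_types_mapper(dtype_backend, arrow_dtype, nullable_mapping):
    """Pick the callable to pass as types_mapper, or None for the default."""
    if dtype_backend == "pyarrow":
        return arrow_dtype           # stand-in for pandas.ArrowDtype
    elif dtype_backend == "numpy_nullable":
        return nullable_mapping.get  # stand-in for _arrow_dtype_mapping().get
    return None


# Then a single conversion call at the end, instead of one per branch:
# pa_table.to_pandas(types_mapper=select_types_mapper(...))
```

Keeping one `to_pandas` call means the branches differ only in which mapper is chosen, which is the readability point the comment makes.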

Member Author:

Changed

mapping = _arrow_dtype_mapping()
to_pandas_kwargs["types_mapper"] = mapping.get
elif dtype_backend == "pyarrow":
to_pandas_kwargs["types_mapper"] = pd.ArrowDtype # type: ignore[assignment] # noqa
Member:

Probably an idea for a future refactoring more than for this PR, but feels like it could make sense to encapsulate most of the backend stuff in a class. Something like:

backend = _Backend('pyarrow')  # does the validation
backend.mapping
backend.is_nullable
backend.is_default
...

IMO would make code a bit more readable, and also it'd probably make it easy to change things like the default in the future.

Member Author:

Sounds like a good idea, but as a follow-up, as you said. I'll look into it once we're through here.

phofl added 2 commits March 13, 2023 18:33
…ation

# Conflicts:
#	doc/source/user_guide/io.rst
#	pandas/core/generic.py
#	pandas/core/tools/numeric.py
#	pandas/io/clipboards.py
#	pandas/io/excel/_base.py
#	pandas/io/feather_format.py
#	pandas/io/html.py
#	pandas/io/json/_json.py
#	pandas/io/orc.py
#	pandas/io/parquet.py
#	pandas/io/parsers/readers.py
#	pandas/io/spss.py
#	pandas/io/sql.py
#	pandas/io/xml.py
#	pandas/tests/io/parser/test_read_fwf.py
#	pandas/tests/io/test_clipboard.py
#	pandas/tests/io/test_sql.py
#	pandas/tests/io/xml/test_xml.py
@phofl phofl mentioned this pull request Mar 13, 2023
if self.dtype_backend == "pyarrow":
return pa_table.to_pandas(types_mapper=ArrowDtype)
mapping = ArrowDtype
elif self.dtype_backend == "numpy_nullable":
from pandas.io._util import _arrow_dtype_mapping

mapping = _arrow_dtype_mapping()
Member:

Does this need to be _arrow_dtype_mapping().get?

Member Author:

Yeah; not sure if we have a test that hits this, though. Will look in a follow-up.
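For reference on why `.get` matters here: `to_pandas` expects `types_mapper` to be a callable taking an Arrow type, while `_arrow_dtype_mapping()` returns a plain dict. A toy illustration (the mapping contents below are made up, not the real pandas mapping):

```python
mapping = {"int64": "Int64"}  # stand-in for _arrow_dtype_mapping()

# A dict itself is not callable, so it cannot serve as types_mapper...
assert not callable(mapping)

# ...but its bound .get method is, and it returns None for unmapped types,
# which signals "fall back to the default conversion".
assert callable(mapping.get)
assert mapping.get("int64") == "Int64"
assert mapping.get("float16") is None
```

So passing the mapping object itself would fail at conversion time, which is why a test hitting this branch would be worth having.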

@mroeschke (Member) left a comment:

Two small items, otherwise looks good.

@datapythonista (Member) left a comment:

lgtm, thanks @phofl, really nice.

Out of curiosity, was it discussed to make the numpy nullable backend the default for pandas 2? It feels mature enough, and a major version seems like the best time to make the change, so I assume we're leaving it for 3.0 at least?

@phofl (Member, Author) commented Mar 14, 2023

Quick thoughts:

  • Generally, nullables are still slower than NumPy itself (and need more memory)

Implementation-wise:

  • There is still no global option to opt in, similar to the pyarrow backend
  • NA vs NaN in Float64/Float32 is still open
  • A couple of other things

Those things have to be solved before we can start discussing it.

@datapythonista (Member):

Yep, makes sense. On second thought, if we think pyarrow will eventually be the default, I'm not sure it's worth changing to nullable first and then changing it again.

In any case, great work with the new interface to change the backend, thanks for taking care of it!

@phofl phofl merged commit dba0f66 into pandas-dev:main Mar 14, 2023
lumberbot-app bot commented Mar 14, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     git checkout 2.0.x
     git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
     git cherry-pick -x -m1 dba0f66785aff795f30afbf9135771d3c11b5a67
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     git commit -am 'Backport PR #51871: ERR: Check that dtype_backend is valid'
  4. Push to a named branch:
     git push YOURFORK 2.0.x:auto-backport-of-pr-51871-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #51871 on branch 2.0.x (ERR: Check that dtype_backend is valid)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@phofl (Member, Author) commented Mar 14, 2023

Yeah, that's something else to consider. I just wanted to point out that there is some work to do before we can start discussing these topics.

@phofl phofl deleted the dtype_backend_validation branch March 14, 2023 13:14
phofl added a commit to phofl/pandas that referenced this pull request Mar 14, 2023
phofl added a commit that referenced this pull request Mar 14, 2023
…valid) (#51964)

ERR: Check that dtype_backend is valid (#51871)