Skip to content

String dtype: still return nullable NA-variant in object inference (maybe_converts_object) if requested #59487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 12, 2024

The lib.maybe_converts_object() function (which is used for object dtype inference in various places) has a convert_to_nullable_dtype=Flase/True keyword to control whether to return dtypes using NA or not. Currently we were giving the future infer_string option priority. But also when the future dtype is enabled, we should still return an NA-variant of the dtype if explicitly asked for. So this PR switches the order of those two checks to let the convert_to_nullable_dtype keyword take precedence over the option.

This came up in SQL tests, but actually this also impacts the pd.array(..) behaviour. Currently on main, pd.array(["some", "strings"]) will infer the nullable StringDtype(), consistent with the other dtypes that have such a nullable variant (i.e. returning our Int64 instead of np.int64 for integers, etc).
This is tested behaviour, but we were still xfailing those tests when future.infer_string was enabled. With the change in this PR, also in the future pd.array(..) will infer the NA-variant.

(sidenote, we should update this simple convert_to_nullable_dtype True/False flag to something that allows passing whether we want a numpy_nullable or pyarrow array, but that's for a separate issue)

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Aug 12, 2024
Comment on lines -2702 to 2712
if using_string_dtype() and is_string_array(objects, skipna=True):
if convert_to_nullable_dtype and is_string_array(objects, skipna=True):
from pandas.core.arrays.string_ import StringDtype

dtype = StringDtype(na_value=np.nan)
dtype = StringDtype()
return dtype.construct_array_type()._from_sequence(objects, dtype=dtype)

elif convert_to_nullable_dtype and is_string_array(objects, skipna=True):
elif using_string_dtype() and is_string_array(objects, skipna=True):
from pandas.core.arrays.string_ import StringDtype

dtype = StringDtype()
dtype = StringDtype(na_value=np.nan)
return dtype.construct_array_type()._from_sequence(objects, dtype=dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially this diff is just switching the order of the if/elif blocks, to first check if we want a nullable dtype and only then check if we want the future default string dtype.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines 672 to 673
value = sanitize_array(value, index=None)
value = ensure_wrapped_if_datetimelike(value)
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially the default input sanitation/inference we do for the Series constructor (and other places), so switched to use that here instead of pd.array(..) (which doesn't use the default dtypes for all types)

(this mostly impacts the name of the type in the error message)

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 12, 2024 13:31
@jorisvandenbossche jorisvandenbossche changed the title String dtype: maybe_converts_object give precedence to nullable dtype String dtype: still return nullable NA-variant in object inference (maybe_converts_object) if requested Aug 12, 2024
@WillAyd
Copy link
Member

WillAyd commented Aug 12, 2024

(sidenote, we should update this simple convert_to_nullable_dtype True/False flag to something that allows passing whether we want a numpy_nullable or pyarrow array, but that's for a separate issue)

I think generally should just stick with dtype_backend throughout right? Not sure where we go from one to the other but that seems like a lossy move

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
@mroeschke mroeschke merged commit 851639d into pandas-dev:main Aug 21, 2024
47 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-object-conversion branch August 21, 2024 19:33
lithomas1 pushed a commit to lithomas1/pandas that referenced this pull request Sep 8, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
(cherry picked from commit 851639d)
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
jorisvandenbossche added a commit that referenced this pull request Oct 9, 2024
…maybe_converts_object`) if requested (#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants