String dtype: still return nullable NA-variant in object inference (`maybe_converts_object`) if requested #59487

jorisvandenbossche · 2024-08-12T08:45:01Z

The lib.maybe_convert_objects() function (which is used for object dtype inference in various places) has a convert_to_nullable_dtype=False/True keyword to control whether to return dtypes using NA or not. Until now, the future infer_string option was given priority over that keyword. But even when the future string dtype is enabled, we should still return the NA-variant of the dtype if it is explicitly asked for. So this PR switches the order of those two checks so that the convert_to_nullable_dtype keyword takes precedence over the option.

This came up in the SQL tests, but it also affects the behaviour of pd.array(..). Currently on main, pd.array(["some", "strings"]) infers the nullable StringDtype(), consistent with the other dtypes that have such a nullable variant (e.g. returning our Int64 instead of np.int64 for integers).
This is tested behaviour, but we were still xfailing those tests when future.infer_string was enabled. With the change in this PR, pd.array(..) will keep inferring the NA-variant when the future option is enabled as well.
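As a rough illustration of the pd.array(..) inference described above, the snippet below builds arrays from plain Python lists and inspects the inferred dtypes. It is only a sketch using the public pd.array and pd.StringDtype API; the variable names are made up for the example, and which string variant (NA or NaN) you get depends on the pandas version and on the future.infer_string option.

```python
import pandas as pd

# Integers with a missing value: pd.array infers the nullable Int64 dtype
# rather than a NumPy integer or float dtype.
ints = pd.array([1, 2, None])
print(ints.dtype)

# A list of strings: pd.array infers a string extension dtype.
strings = pd.array(["some", "strings"])
print(strings.dtype)

# Both the NA-variant and the NaN-variant are instances of pd.StringDtype,
# so this check holds regardless of which variant was inferred.
print(isinstance(strings.dtype, pd.StringDtype))
```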

(sidenote, we should update this simple convert_to_nullable_dtype True/False flag to something that allows passing whether we want a numpy_nullable or pyarrow array, but that's for a separate issue)

jorisvandenbossche · 2024-08-12T11:18:00Z

pandas/_libs/lib.pyx

-        if using_string_dtype() and is_string_array(objects, skipna=True):
+        if convert_to_nullable_dtype and is_string_array(objects, skipna=True):
            from pandas.core.arrays.string_ import StringDtype

-            dtype = StringDtype(na_value=np.nan)
+            dtype = StringDtype()
            return dtype.construct_array_type()._from_sequence(objects, dtype=dtype)

-        elif convert_to_nullable_dtype and is_string_array(objects, skipna=True):
+        elif using_string_dtype() and is_string_array(objects, skipna=True):
            from pandas.core.arrays.string_ import StringDtype

-            dtype = StringDtype()
+            dtype = StringDtype(na_value=np.nan)
            return dtype.construct_array_type()._from_sequence(objects, dtype=dtype)


Essentially this diff is just switching the order of the if/elif blocks, to first check if we want a nullable dtype and only then check if we want the future default string dtype.
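To make the effect of the reordering concrete, here is a toy model of the branch order after this change. It is a pure-Python sketch, not the actual Cython code: the function pick_string_dtype and its parameters are hypothetical stand-ins for the convert_to_nullable_dtype keyword, the using_string_dtype() check, and the is_string_array(...) result, and it returns a description of the dtype instead of constructing one.

```python
def pick_string_dtype(
    convert_to_nullable_dtype: bool,
    infer_string: bool,
    all_strings: bool,
) -> str:
    """Hypothetical model of the if/elif order after this PR."""
    # The explicit keyword is checked first, so it wins over the option.
    if convert_to_nullable_dtype and all_strings:
        return "StringDtype()"
    # Only then does the future.infer_string option apply.
    elif infer_string and all_strings:
        return "StringDtype(na_value=np.nan)"
    # Otherwise the data stays object dtype.
    return "object"


# Both requested: the nullable NA-variant takes precedence.
print(pick_string_dtype(True, True, True))
# Only the future option enabled: the NaN-variant is used.
print(pick_string_dtype(False, True, True))
# Neither applies: no string dtype is inferred.
print(pick_string_dtype(False, False, True))
```

Before the change, the two branches were in the opposite order, so the first call would have returned the NaN-variant even though a nullable dtype was requested.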

WillAyd

lgtm

jorisvandenbossche · 2024-08-12T13:24:15Z

pandas/core/arrays/datetimelike.py

+        value = sanitize_array(value, index=None)
+        value = ensure_wrapped_if_datetimelike(value)


This is essentially the default input sanitization/inference we do for the Series constructor (and other places), so I switched to using that here instead of pd.array(..) (which doesn't use the default dtypes for all types).

(this mostly impacts the name of the type in the error message)
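The difference between the default inference and pd.array(..) can be seen with plain integers. This is a small sketch using only the public constructors (sanitize_array itself is internal and not called here); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# The Series constructor uses the default inference: a NumPy int64 dtype.
default_dtype = pd.Series(values).dtype
print(default_dtype)

# pd.array prefers the nullable extension dtype for the same input.
nullable_dtype = pd.array(values).dtype
print(nullable_dtype)
```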

WillAyd · 2024-08-12T13:56:58Z

> (sidenote, we should update this simple convert_to_nullable_dtype True/False flag to something that allows passing whether we want a numpy_nullable or pyarrow array, but that's for a separate issue)

I think we should generally just stick with dtype_backend throughout, right? Not sure where we go from one to the other, but that seems like a lossy move.

WillAyd

lgtm


mroeschke · 2024-08-21T19:31:46Z

Thanks @jorisvandenbossche

…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype
* update datetimelike input validation
* update tests and remove xfails
* explicitly test pd.array() behaviour (remove xfail)
* fixup allow_2d
* undo changes related to datetimelike input validation
* fix test for str on current main

Co-authored-by: Matthew Roeschke

(cherry picked from commit 851639d)


String dtype: maybe_converts_object give precedence to nullable dtype

03d4943

jorisvandenbossche added the Strings String extension data type and string data label Aug 12, 2024

jorisvandenbossche commented Aug 12, 2024


WillAyd approved these changes Aug 12, 2024


jorisvandenbossche added 2 commits August 12, 2024 15:07

update datetimelike input validation

c005778

update tests and remove xfails

0bee1ac

jorisvandenbossche commented Aug 12, 2024


explicitly test pd.array() behaviour (remove xfail)

0057158

jorisvandenbossche marked this pull request as ready for review August 12, 2024 13:31

jorisvandenbossche requested a review from WillAyd August 12, 2024 13:31

jorisvandenbossche changed the title ~~String dtype: maybe_converts_object give precedence to nullable dtype~~ String dtype: still return nullable NA-variant in object inference (maybe_converts_object) if requested Aug 12, 2024

WillAyd approved these changes Aug 12, 2024


jorisvandenbossche mentioned this pull request Aug 12, 2024

TST (string dtype): clean up construction of expected string arrays #59481

Merged

jorisvandenbossche added 7 commits August 12, 2024 22:47

Merge remote-tracking branch 'upstream/main' into string-dtype-object-conversion

c56c5f5

fixup allow_2d

8af4cda

Merge remote-tracking branch 'upstream/main' into string-dtype-object-conversion

7b73c04

Merge remote-tracking branch 'upstream/main' into string-dtype-object-conversion

aa89c32

undo changes related to datetimelike input validation

8b92517

fix test for str on current main

548b501

Merge remote-tracking branch 'upstream/main' into string-dtype-object-conversion

5fbde9f

jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024

Merge branch 'main' into string-dtype-object-conversion

5cb76d0

mroeschke approved these changes Aug 21, 2024


mroeschke merged commit 851639d into pandas-dev:main Aug 21, 2024
47 checks passed

jorisvandenbossche deleted the string-dtype-object-conversion branch August 21, 2024 19:33

jorisvandenbossche mentioned this pull request Sep 20, 2024

Initial Backport of string changes for 2.3 release #59513

Merged

jorisvandenbossche added the backported label Oct 10, 2024

WillAyd mentioned this pull request Nov 14, 2024

[backport 2.3.x] String dtype: enable in SQL IO + resolve all xfails (#60255) #60315

Merged
