TST (string dtype): clean up construction of expected string arrays #59481

jorisvandenbossche · 2024-08-11T13:29:08Z

(~~depends on #59470~~, merged and rebased on top of it)

We have various places in the tests where we very explicitly construct the "raw" arrays for the different variants of the string dtype using their class constructors.

I think we can simplify this by using our normal pd.Series/pd.array constructors and specifying the exact dtype (which then relies on StringDtype.construct_array_type()._from_sequence(..) to construct the underlying string array). This should give tests equivalent to what they currently cover (and the constructors themselves are tested explicitly elsewhere).
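As a rough illustration of the idea (not code taken from the PR), the sketch below builds an expected string array through the public constructors with an explicit dtype, and compares it with what the array class's `_from_sequence` produces. The sample values and the choice of the "python" storage are assumptions made so the snippet runs without pyarrow.

```python
import pandas as pd

# Illustrative sample data (an assumption, not taken from the tests).
values = ["a", "b", None]

# Spell out the exact dtype and go through the public constructors.
# "python" storage is used here only to avoid requiring pyarrow.
dtype = pd.StringDtype("python")
expected_arr = pd.array(values, dtype=dtype)
expected_ser = pd.Series(values, dtype=dtype)

# Roughly what the constructors end up delegating to for this dtype.
arr_cls = dtype.construct_array_type()
via_from_sequence = arr_cls._from_sequence(values, dtype=dtype)

# Both routes should produce the same string array.
assert expected_arr.equals(via_from_sequence)
assert expected_ser.array.equals(via_from_sequence)
```

The point of the pattern is that a test only has to name the dtype it expects; which array class backs it is left to the dtype itself.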

Also removes some xfails in those tests by fixing the logic of the expected string dtype (i.e. don't change the expected dtype when future.infer_string is enabled, see comment below at #59481 (review))

xref #54792

jorisvandenbossche · 2024-08-11T13:32:03Z

pandas/tests/io/json/test_pandas.py

+        if dtype_backend == "pyarrow":
+            pa = pytest.importorskip("pyarrow")
+            string_dtype = pd.ArrowDtype(pa.string())
+        else:
+            string_dtype = pd.StringDtype(string_storage)


This is actually different logic from the lines removed just above: it no longer uses using_infer_string, because when the user explicitly passes dtype_backend="pyarrow" or dtype_backend="numpy_nullable", they should always get the NA-variants of the string dtype, regardless of whether the future (NaN-based) string dtype is enabled.
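To make the user-facing side of this concrete, here is a small sketch using `pd.read_csv` rather than the JSON reader from the diff (a substitution made for brevity). It assumes pandas 2.0 or later, where the `dtype_backend` keyword exists; the CSV content and column names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up CSV content for illustration.
csv_data = "a,e\n1,x\n2,y\n"

# Explicitly opt in to a dtype backend when reading.
df = pd.read_csv(StringIO(csv_data), dtype_backend="numpy_nullable")

# The string column is expected to come back as a StringDtype; following the
# comment above, it should be the NA-variant even if future.infer_string is on.
string_col_dtype = df["e"].dtype
print(string_col_dtype, type(string_col_dtype).__name__)

# The integer column uses the nullable integer dtype.
print(df["a"].dtype)
```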

This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__ should handle this and return a pd.ArrowDtype for the pyarrow backend.

Kind of an in between spot since pd.ArrowDtype is still a separate concept from pd.StringDtype, but I think in the future will be easier to refactor if we

> This is probably useful in quite a few places in testing right?

I haven't yet encountered many places in the code base itself that use that logic, IIRC (although #59487 is actually an example of where we maybe should do that, and the fact that we currently don't use ArrowDtype there is a bit of a bug / missing piece).

> maybe the StringDtype __new__ should handle this and return a pd.ArrowDtype for the pyarrow backend.

Given those are still separate concepts, as you mention (with different arrays that also behave differently in various places), I personally think it would make the current situation more confusing in the short term if some invocation of StringDtype(..) returned ArrowDtype(string) (it would also require a new keyword, because StringDtype("pyarrow") already exists). In the longer term we should see what the outcome of PDEP-13 on logical types will be.
Especially for testing here, I think it is also good to be explicit about which dtype class is the expected one, for readability and understandability of the tests.

Fair points. If we start seeing this in the core library, I think a common function to generate it would make more sense.
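For reference, a common function along the lines discussed in this thread might look like the sketch below. The name `expected_string_dtype` and its signature are hypothetical; nothing like it was added in this PR. It simply wraps the branch from the diff above.

```python
import pandas as pd


def expected_string_dtype(dtype_backend, string_storage="python"):
    """Hypothetical helper: the string dtype expected for a given dtype_backend.

    Mirrors the branch from the diff: ArrowDtype(string) for the "pyarrow"
    backend, otherwise the NA-variant StringDtype with the given storage.
    """
    if dtype_backend == "pyarrow":
        # Imported lazily so the helper is usable without pyarrow installed.
        import pyarrow as pa

        return pd.ArrowDtype(pa.string())
    return pd.StringDtype(string_storage)


# Usage with the numpy_nullable backend, which does not need pyarrow.
dtype = expected_string_dtype("numpy_nullable", "python")
expected = pd.Series(["a", None], dtype=dtype)
```

Keeping the two dtype classes visible in the helper's body preserves the explicitness argued for above, while removing the repeated if/else from individual tests.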

WillAyd · 2024-08-12T15:36:50Z

This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__ should handle this and return a pd.ArrowDtype for the pyarrow backend.

Kind of an in between spot since pd.ArrowDtype is still a separate concept from pd.StringDtype, but I think in the future will be easier to refactor if we

jorisvandenbossche · 2024-08-14T07:20:13Z

Sorry, I missed that with the last merge of main there were actually relevant failures (maybe some interaction with one of my other merged PRs). Quick fixup -> #59509

jorisvandenbossche added the Strings (String extension data type and string data) label Aug 11, 2024

jorisvandenbossche commented Aug 11, 2024

WillAyd requested changes Aug 12, 2024

jorisvandenbossche added 2 commits August 12, 2024 19:59

TST (string dtype): clean up construction of expected string arrays

883151a

typing in test_sql

c8f3b9f

jorisvandenbossche force-pushed the string-dtype-tests-io-dtype-backend-conversion branch from 6fc6b7f to c8f3b9f Compare August 12, 2024 17:59

jorisvandenbossche marked this pull request as ready for review August 12, 2024 18:12

jorisvandenbossche added 2 commits August 12, 2024 22:44

Merge remote-tracking branch 'upstream/main' into string-dtype-tests-io-dtype-backend-conversion

aed90a6

Merge remote-tracking branch 'upstream/main' into string-dtype-tests-io-dtype-backend-conversion

56e7a16

WillAyd approved these changes Aug 13, 2024

jorisvandenbossche merged commit e2ed477 into pandas-dev:main Aug 14, 2024
46 of 47 checks passed

jorisvandenbossche deleted the string-dtype-tests-io-dtype-backend-conversion branch August 14, 2024 07:09

jorisvandenbossche mentioned this pull request Aug 14, 2024

TST (string dtype): fix IO dtype_backend tests for storage of str dtype of columns' Index #59509

Merged

jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024

WillAyd pushed or added commits to WillAyd/pandas that referenced this pull request, each titled "TST (string dtype): clean up construction of expected string arrays (pandas-dev#59481)":

Aug 22, 2024: 5bc5a61, 692743d, c94c4f9, 107bb19, 15a92f5, 7d74bdb, 605fb2e

Aug 27, 2024: ee701c2, 07dc9a2

Sep 20, 2024: e61d3cf, 7748cb1

jorisvandenbossche added or pushed commits to WillAyd/pandas that referenced this pull request, each with the same title:

Oct 2, 2024: 865754a, d241c0f, f77fd44, 7f13e76

Oct 3, 2024: 5df9dfd, 12bc177

Oct 7, 2024: fa14a19, 036e9da

jorisvandenbossche added and pushed commits that referenced this pull request Oct 9, 2024, each titled "TST (string dtype): clean up construction of expected string arrays (#59481)":

86ce6c7, 9f3526f

jorisvandenbossche added the backported label Oct 10, 2024

simonjayhawkins mentioned this pull request Nov 18, 2024

[backport 2.3.x] TST (string dtype): clean-up assorted xfails (#60345) #60349

Merged
