Skip to content

TST (string dtype): clean up construction of expected string arrays #59481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 11, 2024

(depends on #59470, merged and rebased on top of it)

We have various places in the tests where we very explicitly construct the "raw" arrays for the different variants of the string dtype using their class constructors.

I think we can simplify this by using our normal pd.Series/pd.array constructors with specifying the exact dtype (which will then rely on StringDtype.construct_array_type()._from_sequence(..) to construct the underling string array). This should give equivalent tests for what it is testing (and the constructors itself are tested elsewhere explicitly).

Also removes some xfails in those tests by fixing the logic of the expected string dtype (i.e. don't change the expected dtype when future.infer_string is enabled, see comment below at #59481 (review))

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Aug 11, 2024
Comment on lines +2164 to +2168
if dtype_backend == "pyarrow":
pa = pytest.importorskip("pyarrow")
string_dtype = pd.ArrowDtype(pa.string())
else:
string_dtype = pd.StringDtype(string_storage)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a different logic than the lines that are removed just above, i.e. it no longer uses using_infer_string, because when the user explicitly passes dtype_backend="pyarrow"|"numpy_nullable", the user should actually always get the NA-variants of the string dtype (regardless of whether the future (NaN-based) string dtype is enabled or not).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__ should handle this and return a pd.ArrowDtype for the pyarrow backend.

Kind of an in between spot since pd.ArrowDtype is still a separate concept from pd.StringDtype, but I think in the future will be easier to refactor if we

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably useful in quite a few places in testing right?

I haven't yet encountered many places in the code base itself that uses that logic, IIRC (although #59487 is actually an example of where we maybe should do that, and the fact that we currently don't use ArrowDtype there is a bit of a bug/missing piece ..)

maybe the StringDtype new should handle this and return a pd.ArrowDtype for the pyarrow backend.

Given those are still separate concepts, as you mention (with also different arrays with different behaviour in various places), I personally think it would make the current situation on the short term rather more confusing if some invocation of StringDtype(..) would return ArrowDtype(string) (it would also require a new keyword because StringDtype("pyarrow") already exists). And on the longer term we should see what the outcome will be for PDEP-13 on logical types.
Especially for testing here, I think it is also good to be explicit about which dtype class is the expected one, for readability / understand-ability of the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points. If we start seeing in the core library, I think a common function to generate this would make more sense

Comment on lines +2164 to +2168
if dtype_backend == "pyarrow":
pa = pytest.importorskip("pyarrow")
string_dtype = pd.ArrowDtype(pa.string())
else:
string_dtype = pd.StringDtype(string_storage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__ should handle this and return a pd.ArrowDtype for the pyarrow backend.

Kind of an in between spot since pd.ArrowDtype is still a separate concept from pd.StringDtype, but I think in the future will be easier to refactor if we

@jorisvandenbossche jorisvandenbossche force-pushed the string-dtype-tests-io-dtype-backend-conversion branch from 6fc6b7f to c8f3b9f Compare August 12, 2024 17:59
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 12, 2024 18:12
@jorisvandenbossche jorisvandenbossche merged commit e2ed477 into pandas-dev:main Aug 14, 2024
46 of 47 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-tests-io-dtype-backend-conversion branch August 14, 2024 07:09
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Aug 14, 2024

Sorry, I missed that with the last merge of main there were actually relevant failures (maybe some interaction with one of my other merged PRs). Quick fixup -> #59509

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants