-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TST (string dtype): clean up construction of expected string arrays #59481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST (string dtype): clean up construction of expected string arrays #59481
Conversation
if dtype_backend == "pyarrow": | ||
pa = pytest.importorskip("pyarrow") | ||
string_dtype = pd.ArrowDtype(pa.string()) | ||
else: | ||
string_dtype = pd.StringDtype(string_storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a different logic than the lines that are removed just above, i.e. it no longer uses using_infer_string
, because when the user explicitly passes dtype_backend="pyarrow"|"numpy_nullable"
, the user should actually always get the NA-variants of the string dtype (regardless of whether the future (NaN-based) string dtype is enabled or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__
should handle this and return a pd.ArrowDtype for the pyarrow backend.
Kind of an in between spot since pd.ArrowDtype
is still a separate concept from pd.StringDtype
, but I think in the future will be easier to refactor if we
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably useful in quite a few places in testing right?
I haven't yet encountered many places in the code base itself that uses that logic, IIRC (although #59487 is actually an example of where we maybe should do that, and the fact that we currently don't use ArrowDtype there is a bit of a bug/missing piece ..)
maybe the StringDtype new should handle this and return a pd.ArrowDtype for the pyarrow backend.
Given those are still separate concepts, as you mention (with also different arrays with different behaviour in various places), I personally think it would make the current situation on the short term rather more confusing if some invocation of StringDtype(..)
would return ArrowDtype(string)
(it would also require a new keyword because StringDtype("pyarrow")
already exists). And on the longer term we should see what the outcome will be for PDEP-13 on logical types.
Especially for testing here, I think it is also good to be explicit about which dtype class is the expected one, for readability / understand-ability of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair points. If we start seeing in the core library, I think a common function to generate this would make more sense
if dtype_backend == "pyarrow": | ||
pa = pytest.importorskip("pyarrow") | ||
string_dtype = pd.ArrowDtype(pa.string()) | ||
else: | ||
string_dtype = pd.StringDtype(string_storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably useful in quite a few places in testing right? Maybe even in the core codebase? I wonder if this shouldn't be a helper function instead, or maybe the StringDtype __new__
should handle this and return a pd.ArrowDtype for the pyarrow backend.
Kind of an in between spot since pd.ArrowDtype
is still a separate concept from pd.StringDtype
, but I think in the future will be easier to refactor if we
6fc6b7f
to
c8f3b9f
Compare
…io-dtype-backend-conversion
…io-dtype-backend-conversion
Sorry, I missed that with the last merge of main there were actually relevant failures (maybe some interaction with one of my other merged PRs). Quick fixup -> #59509 |
(depends on #59470, merged and rebased on top of it)We have various places in the tests where we very explicitly construct the "raw" arrays for the different variants of the string dtype using their class constructors.
I think we can simplify this by using our normal
pd.Series
/pd.array
constructors with specifying the exact dtype (which will then rely onStringDtype.construct_array_type()._from_sequence(..)
to construct the underling string array). This should give equivalent tests for what it is testing (and the constructors itself are tested elsewhere explicitly).Also removes some xfails in those tests by fixing the logic of the expected string dtype (i.e. don't change the expected dtype when
future.infer_string
is enabled, see comment below at #59481 (review))xref #54792