
String dtype: implement _get_common_dtype #59682

Merged

Conversation

jorisvandenbossche
Member

This PR implements StringDtype._get_common_dtype to control the behaviour in concat etc.

This fixes two things:

  • When concatenating with a numpy string-like dtype (S/U), we can preserve our StringDtype instead of converting to object dtype (as happens now). In general we don't store such numpy dtypes in pandas objects, but in certain corner cases we pass numpy dtypes directly to find_common_type.
  • When concatenating different instances of StringDtype. In practice this shouldn't happen often, but it can certainly occur. In this case, I decided to give precedence to pyarrow storage and to pd.NA: 1) if pyarrow is installed, it is the default storage and therefore the logical pick here as well, and 2) pd.NA is opt-in, so if you have a mix, assume the user explicitly opted in and propagate that preference.
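
The precedence rule for mixed StringDtype instances can be illustrated with a small standalone sketch. It does not use pandas: `StringDtypeSketch` and `common_string_dtype` are hypothetical names invented for this example, and the strings "NA" and "nan" merely stand in for pd.NA and np.nan. It is a rough model of the rule described above, not the code merged in this PR.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StringDtypeSketch:
    """Hypothetical stand-in for pandas' StringDtype (not the real class)."""

    storage: str   # "python" or "pyarrow"
    na_value: str  # "NA" stands in for pd.NA, "nan" for np.nan


def common_string_dtype(dtypes: Iterable[object]) -> Optional[StringDtypeSketch]:
    """Resolve mixed string dtypes, preferring pyarrow storage and the NA sentinel."""
    storages = set()
    na_values = set()
    for dtype in dtypes:
        if not isinstance(dtype, StringDtypeSketch):
            return None  # no common string dtype; the caller would fall back
        storages.add(dtype.storage)
        na_values.add(dtype.na_value)
    if not storages:
        return None  # nothing to resolve
    # With a mix of storages, pyarrow wins; with a mix of sentinels, NA wins.
    storage = "pyarrow" if "pyarrow" in storages else next(iter(storages))
    na_value = "NA" if "NA" in na_values else next(iter(na_values))
    return StringDtypeSketch(storage, na_value)


mixed = [StringDtypeSketch("python", "nan"), StringDtypeSketch("pyarrow", "NA")]
print(common_string_dtype(mixed))
# prints: StringDtypeSketch(storage='pyarrow', na_value='NA')
```

Note that storage and missing-value sentinel are resolved independently in this sketch, so mixing python/"NA" with pyarrow/"nan" also yields pyarrow/"NA".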

xref #54792

@jorisvandenbossche added the Strings (String extension data type and string data) label Sep 2, 2024
@jorisvandenbossche added this to the 2.3 milestone Sep 2, 2024
    if isinstance(dtype, StringDtype):
        storages.add(dtype.storage)
        na_values.add(dtype.na_value)
    elif isinstance(dtype, np.dtype) and dtype.kind == "U":
Member

Would this handle the new numpy string type?

Member Author

Expanded the kind check so that it also supports numpy 2.0 strings.
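
The updated check is not shown in this conversation, so the following is only a guess at its general shape. It assumes that fixed-width unicode dtypes report kind "U" and that the variable-width string dtype added in NumPy 2.0 reports kind "T". The helper `is_numpy_string_like` and the `SimpleNamespace` stand-ins are hypothetical, used so that the sketch runs without numpy installed.

```python
from types import SimpleNamespace

# Assumption: "U" is the kind of fixed-width unicode dtypes, and "T" is the
# kind reported by the variable-width string dtype added in NumPy 2.0.
NUMPY_STRING_KINDS = ("U", "T")


def is_numpy_string_like(dtype: object) -> bool:
    """Return True if the object's `kind` attribute marks a numpy string dtype."""
    return getattr(dtype, "kind", None) in NUMPY_STRING_KINDS


# Stand-ins that mimic only the `kind` attribute of numpy dtype objects.
fixed_width = SimpleNamespace(kind="U")
variable_width = SimpleNamespace(kind="T")
integer = SimpleNamespace(kind="i")

print([is_numpy_string_like(d) for d in (fixed_width, variable_width, integer)])
# prints: [True, True, False]
```

In the snippet under review, the condition additionally requires `isinstance(dtype, np.dtype)`; the duck-typed lookup here is a simplification for illustration.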

@mroeschke merged commit 4a16b44 into pandas-dev:main Sep 6, 2024
47 checks passed
@mroeschke
Member

Thanks @jorisvandenbossche

@jorisvandenbossche deleted the string-dtype-common-type branch September 6, 2024 17:39
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
* String dtype: implement _get_common_dtype

* add specific tests

* try fix typing

* try fix typing

* suppress typing error

* support numpy 2.0 string

* fix typo
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2024
Labels: backported, Strings (String extension data type and string data)