BUG: series.astype(str) vs Index.astype(str) inconsistency #45326

Closed
jbrockmendel opened this issue Jan 12, 2022 · 2 comments · Fixed by #49658
Labels: Astype, Bug, Dtype Conversions

@jbrockmendel
Member

jbrockmendel commented Jan 12, 2022

ser = pd.Series([b'foo'], dtype=object)
idx = pd.Index(ser)

In [3]: ser.astype(str)[0]
Out[3]: "b'foo'"

In [4]: idx.astype(str)[0]
Out[4]: 'foo'

Series goes through astype_nansafe and from there through ensure_string_array. Index just calls the ndarray's .astype method. The Index behavior is specifically tested in test_astype_str_from_bytes (xref #38607).
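To make the difference concrete, here is a minimal plain-Python sketch of what the two results look like at the level of a single element. It is an illustration only, not pandas' or numpy's internal code: it assumes the Series-style result is equivalent to calling str() on the bytes object and the Index-style result is equivalent to an ASCII decode.

```python
# Plain-Python sketch; not pandas' internal code.
raw = b"foo"

# Series-like result: str() of a bytes object returns its repr.
as_repr = str(raw)

# Index-like result: the bytes are decoded as ASCII text.
as_decoded = raw.decode("ascii")

print(repr(as_repr))     # "b'foo'"
print(repr(as_decoded))  # 'foo'
```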

Index.astype can raise on non-ascii bytestrs:

val = 'あ'
bval = val.encode("UTF-8")

ser2 = pd.Series([bval], dtype=object)
idx2 = pd.Index(ser2)

In [21]: ser2.astype(str)[0]
Out[21]: "b'\\xe3\\x81\\x82'"

In [22]: idx2.astype(str)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
[...]
TypeError: Cannot cast Index to dtype <U0

These behaviors should match. I don't really have a preference between them.
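The non-ASCII failure can be reproduced without pandas. The sketch below assumes, as above, that the Series path behaves like str() and the Index path behaves like a strict ASCII decode; the variable names are only for illustration.

```python
val = "あ"
bval = val.encode("UTF-8")  # three bytes, all >= 0x80

# str() never decodes, so it cannot fail; it returns the escaped repr.
as_repr = str(bval)

# A strict ASCII decode rejects any byte >= 0x80.
try:
    bval.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(as_repr)
print(ascii_error)
```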

@jbrockmendel added the Bug, Needs Triage, and Astype labels on Jan 12, 2022
@mroeschke added the Dtype Conversions label and removed the Needs Triage label on Jan 14, 2022
@jbrockmendel
Member Author

@jorisvandenbossche thoughts on preferred API?

@jorisvandenbossche
Member

> Index.astype can raise on non-ascii bytestrs:

If we keep the decoding-like behaviour, I think that example should also work? We currently rely on numpy, which seems to accept only ASCII in astype (even though you are casting to its unicode dtype). It does seem to have a dedicated decoding function that can handle more than just ASCII:

>>> idx2._values
array([b'\xe3\x81\x82'], dtype=object)
>>> idx2._values.astype(str)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

>>> idx2._values.astype(np.bytes_)
array([b'\xe3\x81\x82'], dtype='|S3')
>>> idx2._values.astype(np.bytes_).astype(str)  # same error
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

>>> np.char.decode(idx2._values.astype(np.bytes_), "UTF-8")
array(['あ'], dtype='<U1')
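For comparison, a decoding-like conversion with an explicit encoding could look roughly like the following element-wise sketch. The helper name `decode_or_str` and its `encoding` parameter are hypothetical and not part of pandas or numpy; this only illustrates the idea of decoding bytes and falling back to str() for everything else.

```python
def decode_or_str(values, encoding="utf-8"):
    """Hypothetical helper: decode bytes with `encoding`, str() everything else."""
    return [
        v.decode(encoding) if isinstance(v, (bytes, bytearray)) else str(v)
        for v in values
    ]


mixed = [b"foo", "あ".encode("UTF-8"), 1.5]
result = decode_or_str(mixed)
print(result)
```

Whether UTF-8 would be the right default, or whether the encoding should be required, is a design question this sketch does not settle.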

@mroeschke modified the milestones: 2.0, 3.0 on Feb 8, 2023