BUG: series.astype(str) vs Index.astype(str) inconsistency #45326

jbrockmendel · 2022-01-12T03:14:41Z

ser = pd.Series([b'foo'], dtype=object)
idx = pd.Index(ser)

In [3]: ser.astype(str)[0]
Out[3]: "b'foo'"

In [4]: idx.astype(str)[0]
Out[4]: 'foo'

Series goes through astype_nansafe and from there through ensure_string_array. Index just calls the ndarray's .astype method. The Index behavior is specifically tested in test_astype_str_from_bytes (xref #38607).
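The difference between the two paths can be sketched in plain Python. This is only an illustration of the two semantics: the helper names `series_like` and `index_like` are hypothetical and are not pandas' internal code paths.

```python
# Plain-Python sketch of the two semantics; both helpers are hypothetical
# stand-ins, not the functions pandas actually calls.

def series_like(value: bytes) -> str:
    # str() on a bytes object returns its repr, including the b prefix and quotes.
    return str(value)


def index_like(value: bytes) -> str:
    # A strict ASCII decode, which is what the ndarray cast appears to do.
    return value.decode("ascii")


print(series_like(b"foo"))
print(index_like(b"foo"))

# Non-ASCII bytes survive the repr-style path but fail the ASCII decode.
non_ascii = "あ".encode("utf-8")
print(series_like(non_ascii))
try:
    index_like(non_ascii)
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```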

Index.astype can raise on non-ascii bytestrs:

val = 'あ'
bval = val.encode("UTF-8")

ser2 = pd.Series([bval], dtype=object)
idx2 = pd.Index(ser2)

In [21]: ser2.astype(str)[0]
Out[21]: "b'\\xe3\\x81\\x82'"

In [22]: idx2.astype(str)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
[...]
TypeError: Cannot cast Index to dtype <U0

These behaviors should match. I don't really have a preference between them.

jbrockmendel · 2022-09-23T17:59:57Z

@jorisvandenbossche thoughts on preferred API?

jorisvandenbossche · 2022-12-09T10:23:50Z

> Index.astype can raise on non-ascii bytestrs:

In case we would keep the decoding-like behaviour, I think that example should also work? We currently rely on numpy, which seems to only accept ascii in astype (although you are astyping to their unicode dtype). It seems they have a specific decoding function that can do more than just ascii:

>>> idx2._values
array([b'\xe3\x81\x82'], dtype=object)
>>> idx2._values.astype(str)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

>>> idx2._values.astype(np.bytes_)
array([b'\xe3\x81\x82'], dtype='|S3')
>>> idx2._values.astype(np.bytes_).astype(str)  # same error
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

>>> np.char.decode(idx2._values.astype(np.bytes_), "UTF-8")
array(['あ'], dtype='<U1')
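If the decoding-like behaviour were kept, the steps above could be wrapped in a small helper. This is a minimal sketch, assuming UTF-8 as the default encoding; `decode_bytes_array` is a hypothetical name, not an existing pandas or NumPy function.

```python
import numpy as np


def decode_bytes_array(values, encoding="utf-8"):
    """Hypothetical helper: decode an array of bytes objects with an explicit encoding."""
    # Cast the object array to NumPy's fixed-width bytes dtype first.
    as_bytes = np.asarray(values).astype(np.bytes_)
    # np.char.decode applies bytes.decode element-wise with the given encoding.
    return np.char.decode(as_bytes, encoding)


values = np.array(["あ".encode("utf-8")], dtype=object)
decoded = decode_bytes_array(values)
print(decoded)
```

Which encoding to assume is itself an API question: defaulting to UTF-8 is a choice made for this sketch, not something the thread settles.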

jbrockmendel added the Bug, Needs Triage, and Astype labels Jan 12, 2022

mroeschke added the Dtype Conversions label and removed the Needs Triage label Jan 14, 2022

jbrockmendel added this to the 2.0 milestone Oct 17, 2022

jbrockmendel mentioned this issue Nov 11, 2022

API: Series[bytes].astype(str) behavior #49658

mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023

mroeschke closed this as completed in #49658 Feb 15, 2023
