API: Series[bytes].astype(str) behavior #49658

jbrockmendel · 2022-11-11T22:43:02Z

closes BUG: series.astype(str) vs Index.astype(str) inconsistency #45326
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I'm not married to this particular API design, just easier to demonstrate the relevant issues in a PR rather than in #45326.

~~Note: this adds the possibility of raising in some cases that currently do not raise e.g. pd.Series([b'\xa7']).astype(str)~~

mroeschke · 2022-11-16T21:39:08Z

I would slightly prefer the Series behavior in #45326 (comment) as it matches the Python str(bytes) behavior. In an ideal world I think the decoding behavior would be better if there was a str.decode()-like method.
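
For reference, the two candidate semantics can be shown with plain Python objects, no pandas involved. This is only a minimal sketch; the variable names are illustrative and the "Series"/"Index" labels in the comments refer to the behaviors as described in this thread.

```python
# Plain-Python illustration of the two behaviors under discussion.
raw = b"foo"

# "Series behavior": str() on a bytes object falls back to its repr.
stringified = str(raw)

# "Index behavior": decode the bytes as text (UTF-8 is Python's default).
decoded = raw.decode()

print(repr(stringified))  # "b'foo'"
print(repr(decoded))      # 'foo'
```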

jbrockmendel · 2022-11-16T23:24:19Z

I’m fine either way as long as we’re consistent. Preferences @jorisvandenbossche @jreback @phofl?

jreback · 2022-11-17T00:36:55Z

i like series behavior as well
so +1 here

phofl · 2022-11-17T10:06:09Z

Sounds good to me too

jbrockmendel · 2022-11-17T16:58:22Z

FWIW AFAICT the arrow dtypes do a .decode

jbrockmendel · 2022-11-17T22:01:48Z

updated+green

mroeschke · 2022-11-23T19:22:46Z

> FWIW AFAICT the arrow dtypes do a .decode

@jorisvandenbossche are you familiar why pa.array([bytes]).cast(pa.string) performs a decode instead of matching the std lib behavior of str(bytes_obj)?

In [2]: import pyarrow as pa

In [3]: pa.array([b"1"]).cast(pa.string())
Out[3]:
<pyarrow.lib.StringArray object at 0x7f7cfcd78fa0>
[
  "1"
]

jbrockmendel · 2022-11-30T02:31:20Z

any other thoughts here?

mroeschke · 2022-11-30T02:33:17Z

As of now I'm partial to matching the Python behavior but I could be convinced of the decode behavior given that pyarrow does it

jbrockmendel · 2022-12-07T00:27:55Z

If there aren't any objections i plan to merge this at the end of the week

jorisvandenbossche · 2022-12-09T09:44:40Z

Sorry for the very slow response here. Personally, I would prefer the Index behaviour.

While the Series behaviour has consistency with str(bytes) for stdlib python, for me the question is also: do you ever want this behaviour of converting a column of bytes to "b'..'" strings? What's the use case for that?

In contrast, being able to convert a column of bytes to the equivalent strings (assuming default UTF-8 encoding) is a useful operation.

> FWIW AFAICT the arrow dtypes do a .decode

> @jorisvandenbossche are you familiar why pa.array([bytes]).cast(pa.string) performs a decode instead of matching the std lib behavior of str(bytes_obj)?

That's not really an option for Arrow, because the "b'..'" form is basically Python's default fallback of using the repr for string formatting. That is very Python specific, so using "b'..'" doesn't really make sense for Arrow; other languages don't have the concept of such a b-prefix for strings.
So I think the only other option for Arrow would be to raise an error (and provide another specific kernel to do this conversion). But then the question is why Arrow would do that rather than default to converting the bytes to strings (which under the hood basically just validates that the bytes are valid UTF-8, since strings are also just bytes).
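
The "strings are also just bytes" point can be sketched in plain Python: a text value and its UTF-8 encoding round-trip exactly, so a bytes-to-string cast amounts to a validity check. `is_valid_utf8` below is a hypothetical helper written for illustration, not Arrow's actual kernel.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` can be interpreted as UTF-8 text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "héllo"
payload = text.encode("utf-8")

# The same bytes, reinterpreted as text, give back the original string.
round_tripped = payload.decode("utf-8")

print(is_valid_utf8(payload))   # True
print(is_valid_utf8(b"\xa7"))   # False: a lone 0xa7 byte is not valid UTF-8
```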

jorisvandenbossche · 2022-12-09T10:10:13Z

One difference though with Arrow is that Arrow has a binary dtype, specifically for bytes values. While in pandas we use object dtype for this, and so you can also have a mix of bytes and other objects. And if you have a mix of all kinds of objects, I agree that just calling str(..) on each element for converting generic objects to string dtype also makes sense (i.e. is the most generic option).

But assume for a moment that we did have a bytes dtype. Would you then still prefer the "b'..'" behaviour instead of decoding when casting binary to string?

NumPy actually also has both dtypes, and so they also do decoding and not str-repr:

>>> np.array([b"ab"])
array([b'ab'], dtype='|S2')
>>> np.array([b"ab"]).astype(str)
array(['ab'], dtype='<U2')

Of course, at the moment we don't have a binary dtype in pandas, and people are stuck using object dtype for bytes. So the question could then also be: do we special-case bytes as long as we don't have a binary dtype (as is currently done in Index), or do we keep object-dtype casting generic?

jbrockmendel · 2022-12-09T19:19:27Z

@jorisvandenbossche thanks for a thorough response. would a fair summary short-term be that you prefer the behavior in the first commit?

mroeschke · 2022-12-13T21:03:42Z

Maybe a 3rd option: What if we disallow Series[bytes].astype(str) and vice versa? This functionality is available and clearer IMO with str.encode/decode

In [9]: ser = pd.Series([b'foo'], dtype=object)
   ...: idx = pd.Index(ser)

In [10]: idx.str.decode("utf-8")
Out[10]: Index(['foo'], dtype='object')

In [11]: ser.str.decode("utf-8")
Out[11]:
0    foo
dtype: object

In [12]: ser = pd.Series(['foo'], dtype=object)
    ...: idx = pd.Index(ser)

In [13]: idx.str.encode("utf-8")
Out[13]: Index([b'foo'], dtype='object')

In [14]: ser.str.encode("utf-8")
Out[14]:
0    b'foo'
dtype: object

jbrockmendel · 2022-12-13T22:34:51Z

That works for all-bytes cases, what about mixed?

mroeschke · 2022-12-13T22:59:19Z

Oh good point. Non-bytes would go to NaN which I'm guessing we don't want to mirror in the astype behavior so nevermind

In [18]: ser = pd.Series(['foo', 1], dtype=object)
    ...: idx = pd.Index(ser)

In [19]: ser.str.encode("utf-8")
Out[19]:
0    b'foo'
1       NaN
dtype: object
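
To make the mixed case concrete, here is a pure-Python sketch of an element-wise cast that falls back to str() for non-bytes values instead of producing NaN. `cast_to_str` and its parameters are hypothetical names used only for illustration; this is not the pandas implementation.

```python
def cast_to_str(values, decode_bytes=True, encoding="utf-8"):
    """Hypothetical element-wise cast to str (illustration only)."""
    out = []
    for v in values:
        if decode_bytes and isinstance(v, bytes):
            # Special-case bytes: decode rather than stringify the repr.
            out.append(v.decode(encoding))
        else:
            # Generic fallback for every other object.
            out.append(str(v))
    return out


mixed = [b"foo", "bar", 1]
decoded_result = cast_to_str(mixed)
generic_result = cast_to_str(mixed, decode_bytes=False)

print(decoded_result)  # ['foo', 'bar', '1']
print(generic_result)  # ["b'foo'", 'bar', '1']
```

Either policy handles the non-bytes elements; they differ only in what happens to the bytes entries.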

jreback · 2022-12-29T22:31:25Z

> Series[bytes].astype(str)

I actually think we should raise on this rather than converting to string. -1 on the automatic decoding.

jbrockmendel · 2023-01-01T18:24:35Z

> I actually think we should raise on this rather than converting to string

what would you tell users to do instead?

jreback · 2023-01-01T18:31:55Z

.bytes.decode() (which doesn't exist but is the right thing to do)

jbrockmendel · 2023-01-01T18:36:18Z

Same as a few comments up: that works in an all-bytes case, but not in a mixed case

jreback · 2023-01-01T18:49:30Z

hmm that's a good point
ok then agree with your current soln

we can consider warning on this conversion i think

mroeschke · 2023-02-08T21:15:56Z

I still slightly prefer the Series behavior although stringifying a bytes might be odd just for stdlib consistency. For decoding I imagine str.decode should be preferred over astype? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.decode.html

jbrockmendel · 2023-02-08T22:03:27Z

> For decoding I imagine str.decode should be preferred over astype?

Yes.

> I still slightly prefer the Series behavior although stringifying a bytes might be odd just for stdlib consistency

Does it help that doing .decode is consistent with numpy's behavior? np.array([b'foo'], dtype=object).astype(str) decodes.

On today's call Joris made the point that "b'foo'" is basically never what the user wants, so it is worth special-casing. I find this reasonable. The flip side is that this means .astype(str) can raise UnicodeDecodeError.
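
The trade-off can be seen with a single byte string in plain Python (a sketch only, using the `b'\xa7'` value from the PR description): stringifying never fails, while decoding can raise.

```python
bad = b"\xa7"  # a lone 0xa7 byte is not valid UTF-8

# repr-style stringification always succeeds.
as_repr = str(bad)
print(as_repr)  # b'\xa7'

# Strict decoding raises on invalid input.
raised = False
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    raised = True
print(raised)  # True

# A lenient error handler substitutes U+FFFD instead of raising.
lenient = bad.decode("utf-8", errors="replace")
print(lenient == "\ufffd")  # True
```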

No one else has expressed a preference, and I am adamant about "I don't care which just choose one." Do either of you feel strongly about this? If we can't come to a consensus, would you object to a coin toss?

mroeschke · 2023-02-08T22:10:48Z

I don't feel strongly enough about my opinion to block the decode option from being chosen.

jbrockmendel · 2023-02-15T15:51:08Z

Can we get this in for the rc?

mroeschke · 2023-02-15T18:08:30Z

Thanks @jbrockmendel

API: Series[bytes].astype(str) behavior

f10b95d

jbrockmendel added the Astype label Nov 14, 2022

mroeschke reviewed Nov 16, 2022

mroeschke reviewed Nov 16, 2022

Merge branch 'main' into api-astype-3

3f963dc

choose Series behavior

f95e712

Merge branch 'main' into api-astype-3

f532d0c

Merge branch 'main' into api-astype-3

b0ed554

MarcoGorelli self-requested a review December 13, 2022 19:28

jbrockmendel added 5 commits December 18, 2022 13:13

Merge branch 'main' into api-astype-3

d632436

Merge branch 'main' into api-astype-3

fd763d3

Merge branch 'main' into api-astype-3

d00a8bd

Merge branch 'main' into api-astype-3

b3673d1

Merge branch 'main' into api-astype-3

56e0577

jbrockmendel added this to the 2.0 milestone Jan 17, 2023

jbrockmendel added 2 commits February 10, 2023 15:39

Merge branch 'main' into api-astype-3

60e553d

use .decode

e5f6860

mroeschke approved these changes Feb 15, 2023

mroeschke merged commit 0197312 into pandas-dev:main Feb 15, 2023

jbrockmendel deleted the api-astype-3 branch February 15, 2023 18:12
