Skip to content

API: Series[bytes].astype(str) behavior #49658

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 15, 2023

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Nov 11, 2022

I'm not married to this particular API design, just easier to demonstrate the relevant issues in a PR rather than in #45326.

Note: this adds the possibility of raising in some cases that currently do not raise e.g. pd.Series([b'\xa7']).astype(str)

@mroeschke
Copy link
Member

I would slightly prefer the Series behavior in #45326 (comment) as it matches the Python str(bytes) behavior. In an ideal world I think the decoding behavior would be better if there was a str.decode() like method

@jbrockmendel
Copy link
Member Author

I’m fine either way as long as we’re consistent. Preferences @jorisvandenbossche @jreback @phofl?

@jreback
Copy link
Contributor

jreback commented Nov 17, 2022

i like series behavior as well
so +1 here

@phofl
Copy link
Member

phofl commented Nov 17, 2022

Sounds good to me too

@jbrockmendel
Copy link
Member Author

FWIW AFAICT the arrow dtypes do a .decode

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Nov 17, 2022

updated+green

@mroeschke
Copy link
Member

FWIW AFAICT the arrow dtypes do a .decode

@jorisvandenbossche are you familiar why pa.array([bytes]).cast(pa.string) performs a decode instead of matching the std lib behavior of str(bytes_obj)?

In [2]: import pyarrow as pa

In [3]: pa.array([b"1"]).cast(pa.string())
Out[3]:
<pyarrow.lib.StringArray object at 0x7f7cfcd78fa0>
[
  "1"
]

@jbrockmendel
Copy link
Member Author

any other thoughts here?

@mroeschke
Copy link
Member

As of now I'm partial to matching the Python behavior but I could be convinced of the decode behavior given that pyarrow does it

@jbrockmendel
Copy link
Member Author

If there aren't any objections i plan to merge this at the end of the week

@jorisvandenbossche
Copy link
Member

Sorry for the very slow response here. Personally, I would prefer the Index behaviour ..

While the Series behaviour has consistency with str(bytes) for stdlib python, for me the question is also: do you ever want this behaviour of converting a column of bytes to "b'..''" strings? What's the use case for that?

While being able to convert a column of bytes to the equivalent strings (assuming default UTF8 encoding) is a useful operation.

FWIW AFAICT the arrow dtypes do a .decode

@jorisvandenbossche are you familiar why pa.array([bytes]).cast(pa.string) performs a decode instead of matching the std lib behavior of str(bytes_obj)?

That's not really an option for Arrow, because this is basically Python's default fallback for string formatting to use the "repr"? But that is very Python specific, and so using "b'..'" doesn't really make sense for Arrow. Other languages don't have the concept of such a b-prefix for strings.
So I think for Arrow the only option would be to raise an error otherwise (and provide another specific kernel to do this conversion). But then the question is also why would arrow do that, and not default to converting the bytes to strings? (which under the hood basically just validates that the bytes is valid UTF-8, since strings are also just bytes)

@jorisvandenbossche
Copy link
Member

One difference though with Arrow is that Arrow has a binary dtype, specifically for bytes values. While in pandas we use object dtype for this, and so you can also have a mix of bytes and other objects. And if you have a mix of all kinds of objects, I agree that just calling str(..) on each element for converting generic objects to string dtype also makes sense (i.e. is the most generic option).

But assume for a moment that we did have a bytes dtype. Would you then still prefer the "b'..'" behaviour instead of decoding wen casting binary to string?

NumPy actually also has both dtypes, and so they also do decoding and not str-repr:

>>> np.array([b"ab"])
array([b'ab'], dtype='|S2')
>>> np.array([b"ab"]).astype(str)
array(['ab'], dtype='<U2')

Of course, at the moment we don't have a binary dtype in pandas, and people are stuck on using object dtype for bytes. So the question could then also be: do we special case bytes as long as we don't have a binary dtype (as is currently done in Index), or do keep object-dtype casting generic?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche thanks for a thorough response. would a fair summary short-term be that you prefer the behavior in the first commit?

@MarcoGorelli MarcoGorelli self-requested a review December 13, 2022 19:28
@mroeschke
Copy link
Member

Maybe a 3rd option: What if we disallow Series[bytes].astype(str) and vice versa? This functionality is available and clearer IMO with str.encode/decode

In [9]: ser = pd.Series([b'foo'], dtype=object)
   ...: idx = pd.Index(ser)

In [10]: idx.str.decode("utf-8")
Out[10]: Index(['foo'], dtype='object')

In [11]: ser.str.decode("utf-8")
Out[11]:
0    foo
dtype: object

In [12]: ser = pd.Series(['foo'], dtype=object)
    ...: idx = pd.Index(ser)

In [13]: idx.str.encode("utf-8")
Out[13]: Index([b'foo'], dtype='object')

In [14]: ser.str.encode("utf-8")
Out[14]:
0    b'foo'
dtype: object

@jbrockmendel
Copy link
Member Author

That works for all-bytes cases, what about mixed?

@mroeschke
Copy link
Member

Oh good point. Non-bytes would go to NaN which I'm guessing we don't want to mirror in the astype behavior so nevermind

In [18]: ser = pd.Series(['foo', 1], dtype=object)
    ...: idx = pd.Index(ser)

In [19]: ser.str.encode("utf-8")
Out[19]:
0    b'foo'
1       NaN
dtype: object

@jreback
Copy link
Contributor

jreback commented Dec 29, 2022

Series[bytes].astype(str)

I actually think we should raise on this rather than converting to string. -1 on the automatic decoding.

@jbrockmendel
Copy link
Member Author

I actually think we should raise on this rather than converting to string

what would you tell users to do instead?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2023

.bytes.decode() (which doesn't exist but is the right thing to do)

@jbrockmendel
Copy link
Member Author

Same as a few comments up: that works in an all-bytes case, but not in a mixed case

@jreback
Copy link
Contributor

jreback commented Jan 1, 2023

hmm that's a good point
ok then agree with your current soln

we can consider warning on this conversion i think

@jbrockmendel jbrockmendel added this to the 2.0 milestone Jan 17, 2023
@mroeschke
Copy link
Member

I still slightly prefer the Series behavior although stringifying a bytes might be odd just for stdlib consistency. For decoding I imagine str.decode should be preferred over astype? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.decode.html

@jbrockmendel
Copy link
Member Author

For decoding I imagine str.decode should be preferred over astype?

Yes.

I still slightly prefer the Series behavior although stringifying a bytes might be odd just for stdlib consistency

Does it help that doing .decode is consistent with numpy's behavior? np.array([b'foo'], dtype=object).astype(str) decodes.

On today's call Joris made the point that "b'foo'" is basically never what the user wants, so it is worth special-casing. I find this reasonable. The flip side is that this means .astype(str) can raise UnicodeDecodeError.

No one else has expressed a preference, and I am adamant about "I don't care which just choose one." Do either of you feel strongly about this? If we can't come to a consensus, would you object to a coin toss?

@mroeschke
Copy link
Member

I don't feel strongly enough about my opinion to block the decode option from being chosen.

@jbrockmendel
Copy link
Member Author

Can we get this in for the rc?

@mroeschke mroeschke merged commit 0197312 into pandas-dev:main Feb 15, 2023
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the api-astype-3 branch February 15, 2023 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: series.astype(str) vs Index.astype(str) inconsistency
5 participants