-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Series[bytes].astype(str) behavior #49658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I would slightly prefer the Series behavior in #45326 (comment) as it matches the Python |
I’m fine either way as long as we’re consistent. Preferences @jorisvandenbossche @jreback @phofl? |
i like series behavior as well |
Sounds good to me too |
FWIW AFAICT the arrow dtypes do a .decode |
updated+green |
@jorisvandenbossche are you familiar why
|
any other thoughts here? |
As of now I'm partial to matching the Python behavior but I could be convinced of the |
If there aren't any objections i plan to merge this at the end of the week |
Sorry for the very slow response here. Personally, I would prefer the Index behaviour .. While the Series behaviour has consistency with While being able to convert a column of bytes to the equivalent strings (assuming default UTF8 encoding) is a useful operation.
That's not really an option for Arrow, because this is basically Python's default fallback for string formatting to use the "repr"? But that is very Python specific, and so using |
One difference though with Arrow is that Arrow has a binary dtype, specifically for bytes values. While in pandas we use object dtype for this, and so you can also have a mix of bytes and other objects. And if you have a mix of all kinds of objects, I agree that just calling But assume for a moment that we did have a bytes dtype. Would you then still prefer the NumPy actually also has both dtypes, and so they also do decoding and not str-repr: >>> np.array([b"ab"])
array([b'ab'], dtype='|S2')
>>> np.array([b"ab"]).astype(str)
array(['ab'], dtype='<U2') Of course, at the moment we don't have a binary dtype in pandas, and people are stuck on using object dtype for bytes. So the question could then also be: do we special case bytes as long as we don't have a binary dtype (as is currently done in Index), or do keep object-dtype casting generic? |
@jorisvandenbossche thanks for a thorough response. would a fair summary short-term be that you prefer the behavior in the first commit? |
Maybe a 3rd option: What if we disallow
|
That works for all-bytes cases, what about mixed? |
Oh good point. Non-bytes would go to
|
I actually think we should raise on this rather than converting to string. -1 on the automatic decoding. |
what would you tell users to do instead? |
.bytes.decode() (which doesn't exist but is the right thing to do) |
Same as a few comments up: that works in an all-bytes case, but not in a mixed case |
hmm that's a good point we can consider warning on this conversion i think |
I still slightly prefer the Series behavior although stringifying a bytes might be odd just for stdlib consistency. For decoding I imagine |
Yes.
Does it help that doing .decode is consistent with numpy's behavior? On today's call Joris made the point that No one else has expressed a preference, and I am adamant about "I don't care which just choose one." Do either of you feel strongly about this? If we can't come to a consensus, would you object to a coin toss? |
I don't feel strongly enough about my opinion to block the decode option from being chosen. |
Can we get this in for the rc? |
Thanks @jbrockmendel |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I'm not married to this particular API design, just easier to demonstrate the relevant issues in a PR rather than in #45326.
Note: this adds the possibility of raising in some cases that currently do not raise e.g.pd.Series([b'\xa7']).astype(str)