to_csv and bytes on Python 3. #9712


Open
jseabold opened this issue Mar 23, 2015 · 12 comments
Labels
Bug IO CSV read_csv, to_csv Unicode Unicode strings

Comments

@jseabold
Contributor

Is this desired behavior and something I need to work around, or a bug? Notice that the bytes type marker is written to disk, so you can't round-trip the data. This works fine in Python 2 with unicode, AFAICT.

In [1]: pd.__version__
Out[1]: '0.15.2-252-g0d35dd4'

In [2]: pd.DataFrame.from_dict({'a': ['a', 'b', 'c']}).a.str.encode("utf-8").to_csv("tmp.csv")

In [3]: !cat tmp.csv
0,b'a'
1,b'b'
2,b'c'
@TomAugspurger
Contributor

I'd say this is not intended, but I haven't worked on this part of the code. Since it's being written to a file anyway, (Python 3) bytes written to CSV should come out identical to (Python 3) str.

@dsm054
Contributor

dsm054 commented Mar 23, 2015

FWIW I think that's actually the output I'd expect in 3.

@TomAugspurger
Contributor

I guess I would expect behavior similar to

with open('tmp.txt', 'wb') as f:
    f.write('abc'.encode('utf-8'))

which doesn't have the b prefix.

The caveat here is that you have to explicitly open the file in wb mode since you're writing bytes. That can't work for DataFrames (I don't think) since you could have a mix of bytes and strs across columns. Do we support wb mode in to_csv? I get an error when we try to open the file handle.
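A rough sketch of the workaround implied here: decode any all-bytes columns to str first, so the whole frame is text and the default text-mode write works even with mixed columns. The column names and utf-8 are just illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"s": ["a", "b"], "b": [b"x", b"y"]})

# Decode only the columns that hold bytes, leaving str columns alone.
for col in df.columns:
    if df[col].map(type).eq(bytes).all():
        df[col] = df[col].str.decode("utf-8")

df.to_csv("tmp.csv", index=False)
```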

@jreback
Contributor

jreback commented Mar 23, 2015

I think you just need to pass the encoding argument when writing it (otherwise it defaults to ascii on py2 and utf-8 on py3). This is from py2

In [2]: from pandas.compat import u

In [3]: df = DataFrame({u('c/\u03c3'): [1, 2, 3]})

In [4]: df
Out[4]: 
   c/?
0    1
1    2
2    3

In [5]: df.to_csv('tmp.csv',mode='w',encoding='UTF_8')

In [6]: !cat tmp.csv
,c/Ï
0,1
1,2
2,3

@jreback jreback added Usage Question Unicode Unicode strings IO CSV read_csv, to_csv labels Mar 23, 2015
@zhuoqiang

It'd be better if pandas had a configurable parameter in to_csv() so that people could control how bytes are rendered in the CSV file. Otherwise we have to manually convert bytes to strings before output:

df['Column'] =df['Column'].astype(str)
df.to_csv('output.csv')

@jzwinck
Contributor

jzwinck commented Jul 6, 2016

I have this problem also. Here's a trivial example that I think most regular users would expect to work differently:

>>> import pandas as pd
>>> import sys
>>> pd.Series([b'x',b'y']).to_csv(sys.stdout)
0,b'x'
1,b'y'

>>> pd.__version__
'0.18.1'

That is, the CSV is created with Python-specific b prefixes, which other programs don't know what to do with. CSV is not just a Python data interchange format; it's what a ton of people use to dump their data into other systems, and the above should "just work" the same as it does in Python 2:

0,x
1,y

@jzwinck
Contributor

jzwinck commented Jul 6, 2016

@zhuoqiang What I think you meant is you have to do this:

df['Column'] = df['Column'].str.decode('ascii') # or utf-8 etc.

Simply doing astype(str) doesn't help--the to_csv() output still contains b'...' wrappers.
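A minimal sketch of that decode-first workaround: turning the bytes into str before writing produces clean CSV (assuming utf-8 bytes).

```python
import io
import pandas as pd

s = pd.Series([b"x", b"y"])
buf = io.StringIO()
s.str.decode("utf-8").to_csv(buf, header=False)
print(buf.getvalue())
# 0,x
# 1,y
```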

@goodboy

goodboy commented Jul 26, 2016

I totally agree with @jzwinck.
How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

When you use pd.read_csv() with an array-protocol string dtype, round-tripping gets messed up:

import pandas as pd
fname = './blah.csv'
pd.Series([b'x',b'y']).to_csv(fname)
>>> pd.read_csv(fname, dtype='S5')
     0     b'x'
0  b'1'  b"b'y'"

Using dtype=str or dtype='S' does work as expected, however?

>>> pd.read_csv(fname, dtype='S')
   0  b'x'
0  1  b'y'

I actually find ^ unexpected too, since it seems to be interpreting the values as Python string literals automatically?

If a user chooses to load CSV data as bytes it should be specified explicitly just like it works when you write out unicode and not inferred from python's encoding specific markup:

>>> pd.Series(['x', 'y']).to_csv(fname)
>>> pd.read_csv(fname)
   0  x
0  1  y
>>> pd.read_csv(fname, dtype='S10')
   0  b'x'
0  1  b'y'
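A sketch of the explicit opt-in suggested above: read the CSV as plain text, then encode the column to bytes deliberately rather than relying on a bytes dtype to trigger literal parsing. The utf-8 choice here is just an assumption.

```python
import io
import pandas as pd

csv_text = "0,x\n1,y\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)
df[1] = df[1].str.encode("utf-8")  # explicit: this column is bytes now
print(df[1].tolist())
# [b'x', b'y']
```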

@TomAugspurger
Contributor

TomAugspurger commented Jul 27, 2016

How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

I think everyone agrees that writing out the b prefixes is a bug :) My question is whether we should either

  1. attempt to decode all the bytes to text in to_csv before writing, using the provided encoding, or
  2. raise an error, directing the user to perform the decoding before attempting to_csv.
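Option 1 could be sketched as a small wrapper (hypothetical helper, not pandas API): decode bytes cells with the target encoding before delegating to to_csv, and let a UnicodeDecodeError from bad bytes surface to the user, in the spirit of option 2.

```python
import pandas as pd

def to_csv_decoding(df, path, encoding="utf-8", **kwargs):
    """Decode bytes cells with `encoding`, then delegate to to_csv."""
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    df.to_csv(path, encoding=encoding, **kwargs)

to_csv_decoding(pd.DataFrame({"a": [b"x", b"y"]}), "out.csv", index=False)
```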

@goodboy

goodboy commented Jul 27, 2016

@TomAugspurger My vote's for 1.

Since the encoding kwarg determines the file's encoding, any mismatching text-like data should be appropriately encoded before writing.

I'm getting worried though (especially being new to py3) because apparently even print does this?

>>> print(b'doggy')
b'doggy'

So maybe @dsm054 was right?

@jzwinck
Contributor

jzwinck commented Aug 1, 2016

@TomAugspurger: I prefer your number 1: just decode, because that's what most users would want.

@tgoodlet: It doesn't matter what print does. Print is sort of a hybrid between being "pretty" and showing you what you'd need to reconstruct the variable. CSV writing is somewhat orthogonal.

@sidhant007

sidhant007 commented May 29, 2020

Proposal to fix this issue:

We introduce a new parameter to .to_csv, namely bytes_encoding, which decides the encoding scheme used to decode the bytes. This gives the user the flexibility to write to a file opened with one encoding while the bytes to be decoded use a different one (for example, writing the file in UTF-16 when the data holds ASCII bytes).

If bytes_encoding is not given, we use the encoding argument provided to .to_csv to decode the bytes.

Do note that after the bytes are decoded using the bytes_encoding scheme, they WILL be transcoded to the encoding of the path/file object before being written to the file. If this transcoding results in an error, we should report that.
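The proposed semantics could be sketched like this (bytes_encoding here is the proposed, not released, parameter, modeled as a standalone wrapper): bytes cells are decoded with bytes_encoding, then everything is written in the file's encoding, so transcoding errors surface at write time.

```python
import pandas as pd

def to_csv_proposed(df, path, encoding="utf-8", bytes_encoding=None, **kwargs):
    bytes_encoding = bytes_encoding or encoding  # fall back to file encoding
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].map(
            lambda v: v.decode(bytes_encoding) if isinstance(v, bytes) else v
        )
    df.to_csv(path, encoding=encoding, **kwargs)  # transcodes to `encoding`

# ASCII bytes in the data, UTF-16 on disk:
to_csv_proposed(pd.DataFrame({"a": [b"x"]}), "tmp16.csv",
                encoding="utf-16", bytes_encoding="ascii", index=False)
```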

sidhant007 added a commit to sidhant007/pandas that referenced this issue Jun 26, 2020
Add a new optional parameter named bytes_encoding to allow a specific
encoding scheme to be used to decode the bytes.
sidhant007 added a commit to sidhant007/pandas that referenced this issue Jun 29, 2020
10 participants