to_csv and bytes on Python 3. #9712


Open
jseabold opened this issue Mar 23, 2015 · 12 comments
Labels
Bug IO CSV read_csv, to_csv Unicode Unicode strings

Comments

@jseabold
Contributor

Is this desired behavior and something I need to work around, or a bug? Notice that the bytes type marker is written to disk, so you can't round-trip the data. This works fine in Python 2 with unicode, AFAICT.

In [1]: pd.__version__
Out[1]: '0.15.2-252-g0d35dd4'

In [2]: pd.DataFrame.from_dict({'a': ['a', 'b', 'c']}).a.str.encode("utf-8").to_csv("tmp.csv")

In [3]: !cat tmp.csv
0,b'a'
1,b'b'
2,b'c'
@TomAugspurger
Contributor

I'd say this is not intended, but I haven't worked on this part of the code. Since it's being written to a file anyway, (Python 3) bytes written to CSV should come out identical to (Python 3) str.

@dsm054
Contributor

dsm054 commented Mar 23, 2015

FWIW I think that's actually the output I'd expect in 3.

@TomAugspurger
Contributor

I guess I would expect behavior similar to

with open('tmp.txt', 'wb') as f:
    f.write('abc'.encode('utf-8'))

which doesn't have the b prefix.

The caveat here is that you have to explicitly open the file in wb mode since you're writing bytes. That can't work for DataFrames (I don't think) since you could have a mix of bytes and strs across columns. Do we support wb mode in to_csv? I get an error when we try to open the file handle.
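A rough sketch of the workaround implied here: decode any all-bytes columns to str first, so the whole frame is text and the default text-mode write works even with mixed columns. The column names and utf-8 are just illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"s": ["a", "b"], "b": [b"x", b"y"]})

# Decode only the columns that hold bytes, leaving str columns alone.
for col in df.columns:
    if df[col].map(type).eq(bytes).all():
        df[col] = df[col].str.decode("utf-8")

df.to_csv("tmp.csv", index=False)
```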

@jreback
Contributor

jreback commented Mar 23, 2015

I think you just need to pass the encoding argument when writing it (otherwise it defaults to ascii on py2 and utf-8 on py3). This is from py2

In [2]: from pandas.compat import u

In [3]: df = DataFrame({u('c/\u03c3'): [1, 2, 3]})

In [4]: df
Out[4]: 
   c/?
0    1
1    2
2    3

In [5]: df.to_csv('tmp.csv',mode='w',encoding='UTF_8')

In [6]: !cat tmp.csv
,c/Ï
0,1
1,2
2,3

@jreback jreback added Usage Question Unicode Unicode strings IO CSV read_csv, to_csv labels Mar 23, 2015
@zhuoqiang

It'd be better if pandas had a configurable parameter in to_csv() so that people could control how bytes are rendered in the CSV file. Otherwise we have to manually convert bytes to strings before output:

df['Column'] =df['Column'].astype(str)
df.to_csv('output.csv')

@jzwinck
Contributor

jzwinck commented Jul 6, 2016

I have this problem also. Here's a trivial example that I think most regular users would expect to work differently:

>>> import pandas as pd
>>> import sys
>>> pd.Series([b'x',b'y']).to_csv(sys.stdout)
0,b'x'
1,b'y'

>>> pd.__version__
'0.18.1'

That is, the CSV is created with Python-specific b prefixes, which other programs don't know what to do with. CSV is not just a Python data interchange format; it's what a ton of people use to dump their data into other systems, and the above should "just work" the same as it does in Python 2:

0,x
1,y

@jzwinck
Contributor

jzwinck commented Jul 6, 2016

@zhuoqiang What I think you meant is you have to do this:

df['Column'] = df['Column'].str.decode('ascii') # or utf-8 etc.

Simply doing astype(str) doesn't help--the to_csv() output still contains b'...' wrappers.
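A minimal sketch of that decode-first workaround: turning the bytes into str before writing produces clean CSV (assuming utf-8 bytes).

```python
import io
import pandas as pd

s = pd.Series([b"x", b"y"])
buf = io.StringIO()
s.str.decode("utf-8").to_csv(buf, header=False)
print(buf.getvalue())
# 0,x
# 1,y
```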

@goodboy

goodboy commented Jul 26, 2016

I totally agree with @jzwinck.
How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

When you use pd.read_csv() with an array-protocol string dtype, round-tripping gets messed up:

import pandas as pd
fname = './blah.csv'
pd.Series([b'x',b'y']).to_csv(fname)
>>> pd.read_csv(fname, dtype='S5')
     0     b'x'
0  b'1'  b"b'y'"

Using dtype=str or dtype='S' does work as expected, however?

>>> pd.read_csv(fname, dtype='S')
   0  b'x'
0  1  b'y'

I actually find ^ unexpected too, since it seems to be interpreting the values as Python string literals automatically?

If a user chooses to load CSV data as bytes it should be specified explicitly just like it works when you write out unicode and not inferred from python's encoding specific markup:

>>> pd.Series(['x', 'y']).to_csv(fname)
>>> pd.read_csv(fname)
   0  x
0  1  y
>>> pd.read_csv(fname, dtype='S10')
   0  b'x'
0  1  b'y'
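A sketch of the explicit opt-in suggested above: read the CSV as plain text, then encode the column to bytes deliberately rather than relying on a bytes dtype to trigger literal parsing. The utf-8 choice here is just an assumption.

```python
import io
import pandas as pd

csv_text = "0,x\n1,y\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)
df[1] = df[1].str.encode("utf-8")  # explicit: this column is bytes now
print(df[1].tolist())
# [b'x', b'y']
```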

@TomAugspurger
Contributor

TomAugspurger commented Jul 27, 2016

How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

I think everyone agrees that writing out the b prefixes is a bug :) My question is whether we should either

  1. attempt to decode all the bytes to text in to_csv before writing, using the provided encoding, or
  2. raise an error, directing the user to perform the decoding before attempting to_csv.
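Option 1 could be sketched as a small wrapper (hypothetical helper, not pandas API): decode bytes cells with the target encoding before delegating to to_csv, and let a UnicodeDecodeError from bad bytes surface to the user, in the spirit of option 2.

```python
import pandas as pd

def to_csv_decoding(df, path, encoding="utf-8", **kwargs):
    """Decode bytes cells with `encoding`, then delegate to to_csv."""
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    df.to_csv(path, encoding=encoding, **kwargs)

to_csv_decoding(pd.DataFrame({"a": [b"x", b"y"]}), "out.csv", index=False)
```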

@goodboy

goodboy commented Jul 27, 2016

@TomAugspurger My vote's for 1.

Since the encoding kwarg determines the file's encoding, any mismatching text-like data should be appropriately encoded before writing.

I'm getting worried though (especially being new to py3) because apparently even print does this?

>>> print(b'doggy')
b'doggy'

So maybe @dsm054 was right?

@jzwinck
Contributor

jzwinck commented Aug 1, 2016

@TomAugspurger: I prefer your number 1: just decode, because that's what most users would want.

@tgoodlet: It doesn't matter what print does. Print is sort of a hybrid between being "pretty" and showing you what you'd need to reconstruct the variable. CSV writing is somewhat orthogonal.

@sidhant007

sidhant007 commented May 29, 2020

Proposal to fix this issue:

We introduce a new parameter to .to_csv, namely bytes_encoding, which decides the encoding scheme used to decode the bytes. This gives the user the flexibility to write to a file opened with one encoding while the bytes to be decoded use a different one (for example, writing the file in UTF-16 when the data holds ASCII bytes).

If bytes_encoding is not given, we use the encoding argument provided to .to_csv to decode the bytes.

Do note that after the bytes are decoded using the bytes_encoding scheme, they WILL be transcoded to the encoding of the path/file object before being written to the file. If this transcoding results in an error, we should report that.
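The proposed semantics could be sketched like this (bytes_encoding here is the proposed, not released, parameter, modeled as a standalone wrapper): bytes cells are decoded with bytes_encoding, then everything is written in the file's encoding, so transcoding errors surface at write time.

```python
import pandas as pd

def to_csv_proposed(df, path, encoding="utf-8", bytes_encoding=None, **kwargs):
    bytes_encoding = bytes_encoding or encoding  # fall back to file encoding
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].map(
            lambda v: v.decode(bytes_encoding) if isinstance(v, bytes) else v
        )
    df.to_csv(path, encoding=encoding, **kwargs)  # transcodes to `encoding`

# ASCII bytes in the data, UTF-16 on disk:
to_csv_proposed(pd.DataFrame({"a": [b"x"]}), "tmp16.csv",
                encoding="utf-16", bytes_encoding="ascii", index=False)
```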

sidhant007 added a commit to sidhant007/pandas that referenced this issue Jun 26, 2020
Add a new optional parameter named bytes_encoding to allow a specific
encoding scheme to be used to decode the bytes.
sidhant007 added a commit to sidhant007/pandas that referenced this issue Jun 29, 2020
10 participants