df.to_csv() ignores encoding when given a file object or any other filelike object. #23854

eode · 2018-11-22T02:44:59Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

# !! NOTE
# This example uses `io.BytesIO`, however this also applies to file buffers that are
# returned by `io.open` (the `open` function) when opened in binary mode.
buf = io.BytesIO('a, b, 🐟\n1, 2, 3\n4, 5, 6'.encode('utf-8'))
df = pd.read_csv(buf)   # reads in fine using default encoding (utf-8)
buf = io.BytesIO()

df.to_csv(buf, encoding='utf-8')  # this should work, but doesn't.
# TypeError: a bytes-like object is required, not 'str'

buf = io.StringIO()
df.to_csv(buf, encoding='utf-8')  # this 'works', but should fail.  Data is passed in without encoding.
buf.getvalue()
# ',a, b, 🐟\n0,1,2,3\n1,4,5,6\n'

Problem description

Currently, the 'encoding' parameter is accepted and doesn't do anything when dealing with an in-memory object. This is deceptive, and can introduce encoding flaws.

I presume that pandas just sets the encoding on the file it opens. In the case of receiving an already-open filelike object, pandas should encode the string and attempt to write the bytes into the file. If it fails, that's a valid and appropriate failure, and that failure should be raised.

However, in the interest of backwards compatibility, if it fails, it should probably try to write the unencoded string into the file, and perhaps display a warning.
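
The behavior proposed above can be sketched with the standard library alone. This is a minimal illustration, not pandas code: `write_text` is a hypothetical helper name, and the warning text is made up for the example. It tries the encoded bytes first and falls back to the unencoded string, with a warning, only when the handle rejects bytes.

```python
import io
import warnings


def write_text(handle, text, encoding="utf-8"):
    """Hypothetical helper (not a pandas API): write text to an open handle.

    Encoded bytes are tried first. If the handle rejects bytes, as a
    text-mode handle does, the unencoded string is written and a warning
    is issued so the ignored encoding does not go unnoticed.
    """
    try:
        handle.write(text.encode(encoding))
    except TypeError:
        warnings.warn(
            "handle does not accept bytes; writing unencoded text, "
            "so 'encoding' has no effect",
            UserWarning,
            stacklevel=2,
        )
        handle.write(text)


csv_text = ",a, b, 🐟\n0,1,2,3\n1,4,5,6\n"

# Binary handle: receives the UTF-8 encoded bytes.
binary_buf = io.BytesIO()
write_text(binary_buf, csv_text)

# Text handle: bytes are rejected, so the string is written with a warning.
text_buf = io.StringIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    write_text(text_buf, csv_text)
```

With this sketch, `binary_buf` ends up holding the bytes listed under "Expected Output", while `text_buf` keeps today's behavior and records one warning.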

I'm on Pandas 0.23.4.
https://pandas-docs.github.io/pandas-docs-travis/

Expected Output

# a buffer that contains the following:
b',a, b, \xf0\x9f\x90\x9f\n0,1,2,3\n1,4,5,6\n'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.3-041903-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-11-22T04:39:24Z

The documentation is very clear on this:

"A string representing the encoding to use in the OUTPUT FILE, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3."

We never mention support for buffers in general, so I disagree that this is deceptive. That being said, an attempt to enhance support of encoding for non-file objects would be welcomed.

eode · 2018-11-25T20:40:49Z

From the documentation you linked:

path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a string.

Thus, a file object should suffice. This issue is an issue with handling of filelike objects, not an issue specifically with BytesIO. The same behavior occurs when using (for example) a file object. The bug is that Pandas expects the file object itself to handle the encoding, and no encoding is actually used by Pandas, even though the documentation indicates path_or_buf and says file path or object. In the case of a file object (whether that be io.FileIO or io.BytesIO, or perhaps an io.BufferedWriter which you get on open(f...) in many cases), Pandas simply does no encoding.
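
The point about binary-mode file objects can be reproduced without pandas. The sketch below uses only the standard library; the file name `out.csv` and the sample row are made up for the example. A handle opened with `"wb"` rejects `str`, and wrapping it in `io.TextIOWrapper` is one way for the caller to apply an encoding.

```python
import io
import os
import tempfile

# Throwaway location for the example file (name is arbitrary).
path = os.path.join(tempfile.mkdtemp(), "out.csv")

error = None
with open(path, "wb") as raw:
    # A binary-mode handle rejects str, which is the TypeError from the report.
    try:
        raw.write("a,b,🐟\n")
    except TypeError as exc:
        error = str(exc)

    # Wrapping the handle makes the encoding explicit on the caller's side.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
    text.write("a,b,🐟\n")
    text.flush()
    text.detach()  # leave `raw` open so the with-block can close it

with open(path, "rb") as f:
    data = f.read()
```

Calling `detach()` after `flush()` keeps the wrapper from closing the underlying handle when the wrapper is garbage collected.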

By "deceptive" I don't mean "pandas is trying to deceive us", I mean "the documentation and docstrings state something that isn't valid, and at the very least, isn't clear."

However, my bug report was similarly unclear. I'll fix it now by updating the title (and description if necessary).

eode · 2018-11-25T20:46:40Z

There. It now reflects the fact that this occurs with any filelike object that handles bytes.

gfyoung · 2018-11-25T21:06:45Z

@eode : That's fair. I think as a start, we can clarify the documentation regarding this detail.

eode · 2018-11-29T17:02:36Z

Great! While I think a code change that can handle buffers/file objects that are open in 'bytes' or 'binary' mode would be ideal, writing into them using the given or default encoding, even a documentation change that indicates that buffers in 'bytes' mode aren't accepted would at least be clear.

I checked out your code internally -- I think the simplest thing would be to do something like this:

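from codecs import iterencode

class _WriteEncodingWrapper:
    """Wrap a bytes-mode file-like object so that it accepts str writes."""
    def __init__(self, bytes_filelike, encoding='utf-8'):
        self.bytes_filelike = bytes_filelike
        self.encoding = encoding

    def __getattr__(self, item):
        # Delegate everything else (flush, close, ...) to the wrapped object.
        return getattr(self.bytes_filelike, item)

    def write(self, string):
        self.bytes_filelike.write(string.encode(self.encoding))

    def writelines(self, lines):
        # iterencode requires the target encoding as its second argument.
        encoded_lines = iterencode(lines, self.encoding)
        self.bytes_filelike.writelines(encoded_lines)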

..and then, if the attempt fails with the TypeError("a bytes-like object is required, not 'str'"), then use the _WriteEncodingWrapper. ..but, just because that's the simplest thing to do in the short term doesn't make it the simplest thing to do in the long term, or the 'right' thing to do. It would, however, work -- and be compatible with existing behaviors.

I haven't tried this on Python2, there may be some slight differences there. This is just a thought in case the issue will be fixed in code.
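
The "try, then wrap on TypeError" idea can be sketched as follows. This is an illustration under stated assumptions, not pandas internals: `write_with_fallback` and `write_rows` are hypothetical names, and `write_rows` merely stands in for whatever code writes the CSV text. Instead of a hand-written wrapper, the sketch uses `codecs.getwriter`, the standard library's stream writer that encodes on `write` and `writelines`.

```python
import codecs
import io


def write_with_fallback(write_rows, handle, encoding="utf-8"):
    """Hypothetical helper: call write_rows(handle), retrying with an
    encoding wrapper if the handle turns out to accept only bytes."""
    try:
        write_rows(handle)
    except TypeError:
        # codecs.getwriter returns a StreamWriter class; an instance
        # encodes each str before passing it to the wrapped stream.
        write_rows(codecs.getwriter(encoding)(handle))


def write_rows(handle):
    # Stand-in for the code that emits CSV text as str.
    handle.write(",a, b, 🐟\n")
    handle.writelines(["0,1,2,3\n", "1,4,5,6\n"])


binary_buf = io.BytesIO()
write_with_fallback(write_rows, binary_buf)

text_buf = io.StringIO()
write_with_fallback(write_rows, text_buf)
```

The retry is only safe because the very first `write` is what fails on a bytes handle, so nothing has been written yet. A handle that fails partway through would need more careful treatment.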

gfyoung · 2018-11-29T19:13:11Z

> This is just a thought in case the issue will be fixed in code.

Fixing in code is generally the way we do things. 😉

You are more than welcome to submit a PR with your changes!

eode · 2018-12-05T00:57:53Z

Well, another way is to say "foo is just not an accepted use case", which is.. ..y'know. ..kinda a fix. :-)

gfyoung · 2018-12-05T01:02:25Z

> foo is just not an accepted use case

Agreed. That could be a first step by updating the docs to reflect that.

That being said, a change that actually enhances to_csv with this functionality would be a good long-term fix.

elembie · 2019-10-21T10:04:32Z

Hey guys - do you know if there was ever action taken on this? @eode did you find a workaround? I'm facing this issue when trying to stream the output from pandas to Azure blob store, which requires a byte-type stream, not text.

Liam3851 · 2019-12-10T19:05:56Z

It looks like this is the same issue as #9712 and #13068, though I think the treatment here is simpler. Our firm just stumbled on this due to the Python 2 EOL. I should note that the behavior with buffers worked as expected under Python 2, so I don't believe "buffers are not an accepted use case" is really correct.

gfyoung · 2019-12-10T19:19:44Z

> Should note that the behavior with buffers worked as expected under Python 2 so I don't believe "buffers are not an accepted use case" is really correct.

If it's not documented, then we are not necessarily required to support it.

Technicality aside, that does not mean I don't believe we should support it. This would be a good thing to support, and it is still open to contributions!

Enteee · 2020-05-13T19:39:44Z

Hi folks, I wrote an article on my blog titled "Support Binary File Objects with pandas.DataFrame.to_csv". At the end of the article I added a monkey patch that I think can also be used as a workaround for this problem. Hope this helps until this is resolved in pandas.

gfyoung added the IO Data and Enhancement labels Nov 22, 2018

eode changed the title ~~df.to_csv() ignores encoding when given a buffer object~~ df.to_csv() ignores encoding when given a file object or any other filelike object. Nov 25, 2018

gfyoung added the Docs label Nov 25, 2018

jbrockmendel added the IO CSV label and removed the IO Data label Dec 1, 2019

jbrockmendel removed the Enhancement label Dec 18, 2019

sidhant007 mentioned this issue May 29, 2020

to_csv and bytes on Python 3. #9712

Open

twoertwein mentioned this issue Jul 8, 2020

support binary file handles in to_csv #35129

Merged

5 tasks

jreback added this to the 1.2 milestone Aug 3, 2020

jreback closed this as completed in #35129 Aug 7, 2020
