df.to_csv() ignores encoding when given a file object or any other filelike object. #23854

Closed
eode opened this issue Nov 22, 2018 · 12 comments · Fixed by #35129
Labels: Docs, IO CSV (read_csv, to_csv)

@eode

eode commented Nov 22, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

# !! NOTE
# This example uses `io.BytesIO`, however this also applies to file buffers that are
# returned by `io.open` (the `open` function) when opened in binary mode.
buf = io.BytesIO('a, b, 🐟\n1, 2, 3\n4, 5, 6'.encode('utf-8'))
df = pd.read_csv(buf)   # reads in fine using default encoding (utf-8)
buf = io.BytesIO()

df.to_csv(buf, encoding='utf-8')  # this should work, but doesn't.
# TypeError: a bytes-like object is required, not 'str'

buf = io.StringIO()
df.to_csv(buf, encoding='utf-8')  # this 'works', but should fail.  Data is passed in without encoding.
buf.getvalue()
# ',a, b, 🐟\n0,1,2,3\n1,4,5,6\n'

Problem description

Currently, the 'encoding' parameter is accepted and doesn't do anything when dealing with an in-memory object. This is deceptive, and can introduce encoding flaws.

I presume that pandas just sets the encoding on the file it opens. In the case of receiving an already-open filelike object, pandas should encode the string and attempt to write the bytes into the file. If it fails, that's a valid and appropriate failure, and that failure should be raised.

However, in the interest of backwards compatibility, if it fails, it should probably try to write the unencoded string into the file, and perhaps display a warning.
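
One way to get the "encode the string and write the bytes" behavior from the caller's side is to wrap the binary buffer in `io.TextIOWrapper`. The following is a minimal sketch, not pandas' implementation: `write_rows_encoded` is a hypothetical helper, and the standard-library `csv.writer` stands in for `df.to_csv`, on the assumption that `to_csv` would write text into the wrapper in the same way.

```python
import csv
import io


def write_rows_encoded(rows, binary_buf, encoding="utf-8"):
    """Hypothetical helper: write rows as CSV into a binary buffer, encoding on the way in."""
    # newline="" disables newline translation so the csv module controls line endings.
    wrapper = io.TextIOWrapper(binary_buf, encoding=encoding, newline="")
    csv.writer(wrapper, lineterminator="\n").writerows(rows)
    wrapper.flush()
    # detach() releases binary_buf so that discarding the wrapper does not close it.
    wrapper.detach()


buf = io.BytesIO()
write_rows_encoded([["a", "b", "🐟"], [1, 2, 3], [4, 5, 6]], buf)
print(buf.getvalue())
# b'a,b,\xf0\x9f\x90\x9f\n1,2,3\n4,5,6\n'
```

The `flush()` before `detach()` matters: the wrapper buffers text internally, and the bytes only reach `binary_buf` once it is flushed.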

I'm on Pandas 0.23.4.
https://pandas-docs.github.io/pandas-docs-travis/

Expected Output

# a buffer that contains the following:
b',a, b, \xf0\x9f\x90\x9f\n0,1,2,3\n1,4,5,6\n'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.3-041903-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung added the IO Data (IO issues that don't fit into a more specific label) and Enhancement labels Nov 22, 2018
@gfyoung
Member

gfyoung commented Nov 22, 2018

The documentation is very clear on this:

"A string representing the encoding to use in the OUTPUT FILE, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3."

We never mention support for buffers in general, so I disagree that this is deceptive. That being said, an attempt to enhance support of encoding for non-file objects would be welcomed.

@eode
Author

eode commented Nov 25, 2018

From the documentation you linked:

path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a string.

Thus, a file object should suffice. This issue is an issue with handling of filelike objects, not an issue specifically with BytesIO. The same behavior occurs when using (for example) a file object. The bug is that Pandas expects the file object itself to handle the encoding, and no encoding is actually used by Pandas, even though the documentation indicates path_or_buf and says file path or object. In the case of a file object (whether that be io.FileIO or io.BytesIO, or perhaps an io.BufferedWriter which you get on open(f...) in many cases), Pandas simply does no encoding.
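
The distinction between handle types can be seen without pandas at all. This standard-library sketch (variable names are illustrative only) shows that a binary handle rejects `str`, a text handle applies whatever encoding it was opened with, and an in-memory text buffer performs no encoding:

```python
import io

text = "a,b,🐟\n"

# A binary-mode handle accepts only bytes, so writing str raises TypeError.
binary_buf = io.BytesIO()
try:
    binary_buf.write(text)
    binary_error = None
except TypeError as exc:
    binary_error = exc

# A text-mode handle encodes with the encoding it was opened with,
# regardless of what the code writing into it was told.
raw = io.BytesIO()
text_handle = io.TextIOWrapper(raw, encoding="utf-16-le", newline="")
text_handle.write(text)
text_handle.flush()
utf16_bytes = raw.getvalue()

# An in-memory text buffer never encodes; it just stores the str.
string_buf = io.StringIO()
string_buf.write(text)

print(type(binary_error).__name__)               # TypeError
print(utf16_bytes == text.encode("utf-16-le"))   # True
print(string_buf.getvalue() == text)             # True
```

So any writer that only ever calls `write()` with `str` leaves the encoding decision entirely to the handle it was given.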

By "deceptive" I don't mean "pandas is trying to deceive us", I mean "the documentation and docstrings state something that isn't valid, and at the very least, isn't clear."

However, my bug report was similarly unclear. I'll fix it now by updating the title (and description if necessary).

@eode changed the title from "df.to_csv() ignores encoding when given a buffer object" to "df.to_csv() ignores encoding when given a file object or any other filelike object." Nov 25, 2018
@eode
Author

eode commented Nov 25, 2018

There. It now reflects the fact that this occurs with any filelike object that handles bytes.

@gfyoung added the Docs label Nov 25, 2018
@gfyoung
Member

gfyoung commented Nov 25, 2018

@eode : That's fair. I think as a start, we can clarify the documentation regarding this detail.

@eode
Author

eode commented Nov 29, 2018

Great! While I think a code change that can handle buffers/file objects that are open in 'bytes' or 'binary' mode would be ideal, writing into them using the given or default encoding, even a documentation change that indicates that buffers in 'bytes' mode aren't accepted would at least be clear.

I checked out your code internally -- I think the simplest thing would be to do something like this:

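import codecs

class _WriteEncodingWrapper:
    """Wrap a binary file-like object so that str writes are encoded first."""

    def __init__(self, bytes_filelike, encoding='utf-8'):
        self.bytes_filelike = bytes_filelike
        self.encoding = encoding

    def __getattr__(self, item):
        # Delegate everything not defined here (close, flush, ...) to the wrapped object.
        return getattr(self.bytes_filelike, item)

    def write(self, string):
        return self.bytes_filelike.write(string.encode(self.encoding))

    def writelines(self, lines):
        # codecs.iterencode requires the target encoding as its second argument.
        encoded_lines = codecs.iterencode(lines, self.encoding)
        self.bytes_filelike.writelines(encoded_lines)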

..and then, if the attempt fails with the TypeError("a bytes-like object is required, not 'str'"), then use the _WriteEncodingWrapper. ..but, just because that's the simplest thing to do in the short term doesn't make it the simplest thing to do in the long term, or the 'right' thing to do. It would, however, work -- and be compatible with existing behaviors.

I haven't tried this on Python2, there may be some slight differences there. This is just a thought in case the issue will be fixed in code.
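
The "try first, wrap on TypeError" fallback could look roughly like the following. This is a sketch under assumptions, not pandas code: `EncodingWriter` is a cut-down, hypothetical stand-in for the wrapper idea, and `write_text` is a hypothetical helper that represents whatever internal code performs the write.

```python
import io


class EncodingWriter:
    """Hypothetical minimal wrapper: encodes str before writing to a binary handle."""

    def __init__(self, bytes_filelike, encoding="utf-8"):
        self.bytes_filelike = bytes_filelike
        self.encoding = encoding

    def write(self, string):
        return self.bytes_filelike.write(string.encode(self.encoding))


def write_text(handle, text, encoding="utf-8"):
    """Try a plain text write; on TypeError, retry through the encoding wrapper."""
    try:
        handle.write(text)
    except TypeError:
        # Binary handles reject str before writing anything, so retrying is safe here.
        EncodingWriter(handle, encoding).write(text)


binary_buf = io.BytesIO()
write_text(binary_buf, ",a,b,🐟\n0,1,2,3\n")

text_buf = io.StringIO()
write_text(text_buf, ",a,b,🐟\n0,1,2,3\n")

print(binary_buf.getvalue())  # encoded bytes
print(text_buf.getvalue())    # unchanged str, as before
```

Text-mode handles keep their existing behavior because the first `write()` succeeds and the wrapper is never used. The retry is only safe if the failed write left the handle untouched, which holds for `io.BytesIO` but is an assumption for arbitrary file-like objects.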

@gfyoung
Member

gfyoung commented Nov 29, 2018

This is just a thought in case the issue will be fixed in code.

Fixing in code is generally the way we do things. 😉

You are more than welcome to submit a PR with your changes!

@eode
Author

eode commented Dec 5, 2018

Well, another way is to say "foo is just not an accepted use case", which is.. ..y'know. ..kinda a fix. :-)

@gfyoung
Member

gfyoung commented Dec 5, 2018

foo is just not an accepted use case

Agreed. That could be a first step by updating the docs to reflect that.

That being said, actually enhancing to_csv with this functionality would be a good long-term fix.

@elembie

elembie commented Oct 21, 2019

Hey guys - do you know if any action was ever taken on this? @eode did you find a workaround? I'm facing this issue when trying to stream the output from pandas to Azure blob store, which requires a byte-type stream, not text.
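
For data that fits in memory, one possible workaround relies on the documented behavior quoted earlier in the thread: with no path or buffer, `to_csv` returns the CSV as a string, which can then be encoded by hand. In this sketch `to_byte_stream` is a hypothetical helper, and a literal string stands in for the return value of `df.to_csv()`:

```python
import io


def to_byte_stream(csv_text, encoding="utf-8"):
    """Hypothetical helper: encode rendered CSV text and return a readable binary stream."""
    return io.BytesIO(csv_text.encode(encoding))


# Stand-in for the string returned by df.to_csv() when no path or buffer is passed.
csv_text = ",a, b, 🐟\n0,1,2,3\n1,4,5,6\n"

stream = to_byte_stream(csv_text)
payload = stream.read()
print(payload)
# b',a, b, \xf0\x9f\x90\x9f\n0,1,2,3\n1,4,5,6\n'
```

This holds the whole CSV in memory twice (once as `str`, once as bytes), so it is not true streaming and may not suit very large frames.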

@jbrockmendel added the IO CSV (read_csv, to_csv) label and removed the IO Data (IO issues that don't fit into a more specific label) label Dec 1, 2019
@Liam3851
Contributor

It looks like this is the same issue as #9712 and #13068, though I think the treatment here is simpler. Our firm just stumbled on this due to the python 2 EOL. Should note that the behavior with buffers worked as expected under Python 2 so I don't believe "buffers are not an accepted use case" is really correct.

@gfyoung
Copy link
Member

gfyoung commented Dec 10, 2019

Should note that the behavior with buffers worked as expected under Python 2 so I don't believe "buffers are not an accepted use case" is really correct.

If it's not documented, then we are not necessarily required to support it.

Technicality aside, that does not mean I don't believe we should support it. This would be a good thing to support, and it is still open to contributions!

@Enteee

Enteee commented May 13, 2020

Hi folks, I wrote an article on my blog on how to "Support Binary File Objects with pandas.DataFrame.to_csv". At the end of the article I added a monkey patch that I think can also be used as a workaround for this problem. Hope this helps until this is resolved in pandas.

7 participants