df.to_csv() ignores encoding when given a file object or any other filelike object. #23854
Comments
The documentation is very clear on this: "A string representing the encoding to use in the OUTPUT FILE, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3." We never mention support for buffers in general, so I disagree that this is deceptive. That being said, an attempt to enhance support of
From the documentation you linked:
Thus, a file object should suffice. This is an issue with the handling of filelike objects, not an issue specifically with
By "deceptive" I don't mean "pandas is trying to deceive us"; I mean "the documentation and docstrings state something that isn't valid, and at the very least, isn't clear." However, my bug report was similarly unclear. I'll fix it now by updating the title (and description if necessary).
There. It now reflects the fact that this occurs with any filelike object that handles bytes.
@eode : That's fair. I think as a start, we can clarify the documentation regarding this detail.
Great! Ideally, a code change would handle buffers/file objects that are open in 'bytes' or 'binary' mode by writing into them using the given or default encoding. Failing that, even a documentation change stating that buffers in 'bytes' mode aren't accepted would at least be clear. I checked out your code internally -- I think the simplest thing would be to do something like this:

    from codecs import iterencode

    class _WriteEncodingWrapper:
        def __init__(self, bytes_filelike, encoding='utf-8'):
            self.bytes_filelike = bytes_filelike
            self.encoding = encoding

        def __getattr__(self, item):
            # Delegate everything else (flush, close, ...) to the wrapped object.
            return getattr(self.bytes_filelike, item)

        def write(self, string):
            self.bytes_filelike.write(string.encode(self.encoding))

        def writelines(self, lines):
            # iterencode needs the target encoding as its second argument.
            encoded_lines = iterencode(lines, self.encoding)
            self.bytes_filelike.writelines(encoded_lines)

..and then, if the attempt fails with the
I haven't tried this on Python 2; there may be some slight differences there. This is just a thought in case the issue will be fixed in code.
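For comparison, the standard library's `io.TextIOWrapper` already does this kind of str-to-bytes adaptation. The following is a minimal sketch using only the standard library; the sample text and the choice of UTF-16 are illustrative assumptions, and nothing in this thread confirms whether a given pandas version accepts such a wrapper in `to_csv`.

```python
import io

# A bytes-mode target, standing in for any binary file-like object.
raw = io.BytesIO()

# TextIOWrapper encodes str writes with the given encoding before handing
# them to the underlying binary stream. newline="" disables newline translation.
text = io.TextIOWrapper(raw, encoding="utf-16-le", newline="")

text.write("name,city\n")
text.writelines(["Zoë,Köln\n"])
text.flush()    # push buffered text through to the binary stream

encoded = raw.getvalue()
text.detach()   # release the wrapper without closing `raw`
```

Calling `detach()` matters when the caller still owns the binary stream: closing or discarding the wrapper without it would close the stream as well.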
Fixing in code is generally the way we do things. 😉 You are more than welcome to submit a PR with your changes!
Well, another way is to say "foo is just not an accepted use case", which is.. ..y'know. ..kinda a fix. :-)
Agreed. That could be a first step by updating the docs to reflect that. That being said, a fix to actually enhance
Hey guys - do you know if any action was ever taken on this? @eode did you find a workaround? I'm facing this issue when trying to stream the output from pandas to Azure blob store, which requires a byte stream, not text.
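One workaround for byte-only sinks like the one described above is to let pandas render the CSV to a string and encode it yourself. The sketch below relies on `DataFrame.to_csv()` returning the CSV text as a `str` when no path or buffer is given. The helper name `write_csv_bytes` is hypothetical, and it holds the whole CSV in memory, so it suits moderately sized frames only.

```python
def write_csv_bytes(frame, byte_stream, encoding="utf-8", **to_csv_kwargs):
    """Write `frame` as CSV into a bytes-mode stream (hypothetical helper).

    `frame` is expected to be a pandas DataFrame: calling to_csv() without a
    path or buffer returns the CSV text as a str.
    """
    csv_text = frame.to_csv(**to_csv_kwargs)
    payload = csv_text.encode(encoding)
    byte_stream.write(payload)
    return len(payload)
```

With a real frame this would be called as `write_csv_bytes(df, stream, encoding="utf-8", index=False)`, where `stream` is whatever binary object the upload client accepts.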
It looks like this is the same issue as #9712 and #13068, though I think the treatment here is simpler. Our firm just stumbled on this due to the Python 2 EOL. Should note that the behavior with buffers worked as expected under Python 2, so I don't believe "buffers are not an accepted use case" is really correct.
If it's not documented, then we are not necessarily required to support it. Technicality aside, that does not mean I don't believe we should support it. This would be a good thing to support, and it is still open to contributions!
Hi folks, I wrote an article on my blog on how to Support Binary File Objects with
Code Sample, a copy-pastable example if possible
Problem description
Currently, the 'encoding' parameter is accepted but has no effect when dealing with an in-memory object. This is deceptive, and can introduce encoding flaws.
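The underlying split can be seen without pandas at all. As a small standard-library illustration (not the reporter's original sample, which is missing above): a text buffer stores `str` and never applies an encoding, while a bytes buffer rejects `str` outright, so an `encoding` argument only matters if the writer encodes before writing.

```python
import io

row = "id,name\n1,Zoë\n"

# A text-mode buffer stores str as-is; no encoding is ever applied.
text_buf = io.StringIO()
text_buf.write(row)
stored = text_buf.getvalue()

# A bytes-mode buffer refuses str, so the caller must encode explicitly.
bytes_buf = io.BytesIO()
try:
    bytes_buf.write(row)
    rejected = False
except TypeError:
    rejected = True

bytes_buf.write(row.encode("utf-8"))
```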
I presume that pandas just sets the encoding on the file it opens. In the case of receiving an already-open filelike object, pandas should encode the string and attempt to write the bytes into the file. If it fails, that's a valid and appropriate failure, and that failure should be raised.
However, in the interest of backwards compatibility, if it fails, it should probably try to write the unencoded string into the file, and perhaps display a warning.
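The behaviour proposed in the two paragraphs above could look roughly like this. It is a sketch of the suggestion only, not pandas code: `write_with_encoding` is a hypothetical name, and treating `TypeError` as the signal that the handle wants text is an assumption.

```python
import io
import warnings


def write_with_encoding(handle, text, encoding="utf-8"):
    """Try to write encoded bytes; fall back to raw str with a warning."""
    try:
        handle.write(text.encode(encoding))
    except TypeError:
        # Text-mode handles (e.g. io.StringIO) reject bytes. Keep the old
        # behaviour for backwards compatibility, but tell the caller.
        warnings.warn(
            "handle does not accept bytes; 'encoding' was ignored",
            UnicodeWarning,
            stacklevel=2,
        )
        handle.write(text)


binary_target = io.BytesIO()
write_with_encoding(binary_target, "é\n", encoding="latin-1")

text_target = io.StringIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    write_with_encoding(text_target, "é\n", encoding="latin-1")
```

The binary target ends up holding Latin-1 bytes, while the text target receives the unencoded string and one warning is recorded, which matches the backwards-compatible fallback described above.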
I'm on Pandas 0.23.4.
https://pandas-docs.github.io/pandas-docs-travis/
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.3-041903-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.0
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None