to_csv failing with encoding='utf-16' #21118

Closed
lgonzalezsa opened this issue May 18, 2018 · 4 comments · Fixed by #21300
Labels
IO CSV (read_csv, to_csv); Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@lgonzalezsa

Code Sample:

df.to_csv('test.gz', sep='~', header=False, index=False, compression='gzip', line_terminator='\r\n', encoding='utf-16', na_rep='')

/opt/anaconda/lib/python3.6/encodings/ascii.py in decode(self, input, final)
24 class IncrementalDecoder(codecs.IncrementalDecoder):
25 def decode(self, input, final=False):
---> 26 return codecs.ascii_decode(input, self.errors)[0]
27
28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Problem description

First of all, a big thank you for supporting pandas; it makes my life easier and more fun.
In the previous version, 0.22, we were able to call to_csv with encoding='utf-16' to handle Japanese and Chinese content, among others, properly. We need the UTF-16 encoding for later steps, such as uploading data to MSSQL Server in bulk mode.

Is there a workaround that would let me keep using UTF-16?

Any other suggestions are welcome.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.114-42-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: POSIX
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented May 18, 2018

Can you please post a reproducible example? Tried locally with the below on master and it worked fine:

import io
import pandas as pd

# Read a tiny UTF-16 payload, then write it back out as UTF-16.
buf = io.BytesIO(b'\xff\x34')
df = pd.read_csv(buf, encoding='utf16')

outbuf = io.StringIO()
df.to_csv(outbuf, encoding='utf-16')

@WillAyd added the IO CSV (read_csv, to_csv) and Needs Info (clarification about behavior needed to assess issue) labels May 18, 2018
@lgonzalezsa
Author

Note that my code is not failing because of the data content.
Here is an example using the sklearn datasets:

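import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Combine the features and the target into a single DataFrame.
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])

# Fails on 0.23.0 when compression and encoding='utf-16' are combined.
data1.to_csv('test.gz',
             sep='~',
             header=False, index=False,
             compression='gzip',
             line_terminator='\r\n',
             encoding='utf-16',
             na_rep='')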

If I pass the compression parameter I get the UnicodeDecodeError; without it, the call runs properly.
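Since the failure only appears when compression and encoding are combined, one possible workaround is to let pandas produce the CSV as a plain string and do the UTF-16 encoding and gzip compression outside of to_csv. The sketch below is an illustration under that assumption, not a fix in pandas itself; the helper name write_gzip_utf16 is hypothetical and uses only the standard library.

```python
import gzip


def write_gzip_utf16(csv_text, path):
    """Encode CSV text as UTF-16 (with a BOM) and write it gzip-compressed.

    Hypothetical helper: `csv_text` is expected to be the string returned
    by DataFrame.to_csv() when it is called without a path.
    """
    payload = csv_text.encode("utf-16")  # Python prepends the BOM for "utf-16"
    with gzip.open(path, "wb") as fh:
        fh.write(payload)
```

With the DataFrame from the example above, this could be called as write_gzip_utf16(data1.to_csv(sep='~', header=False, index=False, na_rep=''), 'test.gz'). The line terminator would still have to be passed to to_csv under whichever keyword name the installed pandas version accepts.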

@WillAyd added the Regression (functionality that used to work in a prior pandas version) label May 19, 2018
@WillAyd added this to the 0.23.1 milestone May 19, 2018
@WillAyd removed the Needs Info (clarification about behavior needed to assess issue) label May 20, 2018
@minggli
Contributor

minggli commented Jun 4, 2018

I tested your iris dataset example; this problem should go away with this PR.

@scheung38

How do we know whether the generated CSV is UTF-8 or UTF-16?
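One rough way to check is to look at the first bytes of the file for a byte order mark (BOM). Python's "utf-16" codec writes a BOM when encoding, so a file produced that way should start with one. The sketch below uses only the standard codecs module; the function name sniff_bom is hypothetical. A file without a BOM (plain UTF-8, for example) cannot be identified this way, so the function returns None in that case.

```python
import codecs


def sniff_bom(raw):
    """Guess an encoding from a leading byte order mark, or return None.

    Hypothetical helper: `raw` is the first few bytes of the file.
    UTF-32 is checked before UTF-16 because the UTF-32 little-endian
    BOM begins with the same two bytes as the UTF-16 little-endian BOM.
    """
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None
```

For a gzip-compressed file such as test.gz, the bytes would need to be read through gzip.open(path, 'rb') first, since the BOM sits in the decompressed content.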
