to_csv does not always handle line_terminator correctly #17365

kevinsa5 · 2017-08-28T22:34:35Z

Code Sample, a copy-pastable example if possible

def hex_print(content):
    print ' '.join(['{0:02x}'.format(ord(i)) for i in content])
    print ' '.join(['{:>2}'.format(repr(i).replace("'", '')) for i in content])
    print ' '

import pandas as pd
import tempfile

filename = tempfile.NamedTemporaryFile(delete = False).name

df = pd.DataFrame({'x':[1]})
for sep in ['\n', '\r', '\r\n', 'F']:
    print 'with separator: {} ~~~~~~~~~~~~~~~~~~~~~~~~'.format(repr(sep))
    df.to_csv(filename, line_terminator = sep)        
    with open(filename, 'rb') as f:
        content = f.read()
    print 'file method:'
    hex_print(content)
    print 'string method:'
    hex_print(df.to_csv(line_terminator = sep))

Problem description

It seems that to_csv does not always handle the line_terminator argument correctly. The above code prints the hex-encoded CSV data produced by several different calls to to_csv. In particular, passing \n in fact produces \r\n, and \r\n becomes \r\r\n. Note also that this only happens when writing to a file, not when returning the CSV data directly as a string.

However, this seems to be OS-dependent as well -- I have reproduced it on several machines running Windows 7, Python 2.7, and various versions of pandas (including 0.20.1), but on a Linux VM, it works as expected.

Output of above code:

with separator: '\n' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 0a 30 2c 31 0d 0a
 ,  x \r \n  0  ,  1 \r \n

string method:
2c 78 0a 30 2c 31 0a
 ,  x \n  0  ,  1 \n

with separator: '\r' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 30 2c 31 0d
 ,  x \r  0  ,  1 \r

string method:
2c 78 0d 30 2c 31 0d
 ,  x \r  0  ,  1 \r

with separator: '\r\n' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 0d 0a 30 2c 31 0d 0d 0a
 ,  x \r \r \n  0  ,  1 \r \r \n

string method:
2c 78 0d 0a 30 2c 31 0d 0a
 ,  x \r \n  0  ,  1 \r \n

with separator: 'F' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 46 30 2c 31 46
 ,  x  F  0  ,  1  F

string method:
2c 78 46 30 2c 31 46
 ,  x  F  0  ,  1  F

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.1
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.5
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.7
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2017-08-28T23:01:46Z

@kevinsa5 : Thanks for reporting this! The fact that you only have this issue on Windows I think is telling about the issue and could indicate that it is not a pandas issue. Here is a link that might potentially clear this up for you:

https://stackoverflow.com/questions/7013034/does-windows-carriage-return-r-n-consist-of-two-characters-or-one-character

If you can find a version of pandas where this doesn't occur, then that's a regression, but as it stands, I don't believe it is a problem on our end.

kevinsa5 · 2017-08-28T23:21:45Z

@gfyoung : Thanks for the quick reply! I am familiar with how Windows uses \r\n as its newline character combo, while other OSs use only \n. However, I do not believe that is the issue here.

If you take a close look at the script output I posted above, for some values of line_terminator, to_csv will produce different output depending on if you're sending to a file or not. From this alone, I believe it to be a bug in pandas, regardless of OS linesep convention -- I would not expect the destination of the CSV data to affect the data itself.

I should mention: I am immensely grateful for all the work the pandas devs have put into this library. Thank you very much for your continued effort.

kevinsa5 · 2017-08-28T23:45:06Z

@gfyoung After further reading, I think you may be correct. Is it true that on Windows, if you write the character '\n' to a file, the OS may actually insert '\r\n' into the file? Does this depend on how the file is opened?

Maybe I've been spoilt by Linux, but it seems unacceptable to me for an OS to silently change bytes that you send to disk.
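On the question of whether this depends on how the file is opened: a minimal sketch, assuming Python 3 (the report above is on Python 2.7), where `open()` accepts a `newline` argument. Passing `newline="\r\n"` reproduces on any OS the translation that text mode applies by default on Windows, while `newline=""` writes the characters unchanged. The helper name `written_bytes` is made up for this illustration.

```python
import os
import tempfile


def written_bytes(text, newline):
    """Write text in text mode with the given newline setting; return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# newline="\r\n" translates every "\n" written, as Windows text mode does by default
print(written_bytes("x\n", "\r\n"))
print(written_bytes("x\r\n", "\r\n"))
# newline="" disables translation, so the text reaches the disk as given
print(written_bytes("x\r\n", ""))
```

The second call shows how a line terminator that already contains \r ends up as \r\r\n: only the \n is translated, and the existing \r is left in place.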

gfyoung · 2017-08-28T23:56:16Z

> Does this depend on how the file is opened?

Not entirely sure to be honest. However, as someone who has worked on a Windows computer from a Linux-based repository, I can say for certain that I have seen these carriage returns sneak into diffs just merely from cloning the repository.

> Maybe I've been spoilt by Linux, but it seems unacceptable to me for an OS to silently change bytes that you send to disk.

Having worked in both the Linux and Windows worlds, I'd say you are perfectly entitled to expect your bytes not to be "corrupted" like this. There's a reason why Linux is generally preferred for developers 😄

chris-b1 · 2017-08-29T00:22:45Z

https://stackoverflow.com/questions/3191528/csv-in-python-adding-an-extra-carriage-return

On Python 2, it looks like we should be opening the file in mode 'b' when passing it to csv. A PR to fix this is welcome.
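For readers on Python 3, the equivalent of binary mode for the standard `csv` module is a text file opened with `newline=""`. The sketch below is an illustration using only the standard library, not the pandas code path: it wraps an in-memory buffer so the newline handling can be compared on any OS. The helper `csv_bytes` and the sample rows are hypothetical.

```python
import csv
import io


def csv_bytes(rows, lineterminator, newline):
    """Write rows with csv.writer through a text layer using `newline`; return the raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    writer = csv.writer(text, lineterminator=lineterminator)
    writer.writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # release `raw` so the wrapper does not close it
    return data


rows = [["", "x"], ["0", "1"]]  # same shape as the one-row frame in the report

# Text layer that translates "\n" (Windows-style): the terminator's "\r" is doubled
print(csv_bytes(rows, "\r\n", "\r\n"))
# Text layer with translation disabled: the terminator is written as given
print(csv_bytes(rows, "\r\n", ""))
```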

kevinsa5 · 2017-08-29T00:33:34Z

Indeed changing the above code to df.to_csv(filename, line_terminator = sep, mode='wb') does solve the issue on my end. Whether a PR should be as simple as changing the default value or not, I'm not qualified to say:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L1448

Thank you both. Pandas is a fantastic piece of software.

jreback · 2017-08-29T01:57:33Z

Actually, if you look, the other highly starred answer in the SO post might work better,

e.g. passing line_terminator to the csv.writer itself.

kevinsa5 · 2017-08-29T02:01:09Z

Note that the argument to to_csv does get passed through to the csv.writer:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1586

jreback · 2017-08-29T12:40:22Z

@kevinsa5 if you want, submit a PR with the change (opening the file in binary mode on Windows when line_terminator is specified) and see if it passes tests.

ingmars · 2018-02-15T08:52:49Z

Hi everyone, I'm experiencing the same problem on Pandas 0.20.3 on Windows 7. However, mode='wb' might be a dangerous fix, as it then crashes when an encoding is set, such as encoding='utf-8', saying: "ValueError: binary mode doesn't take an encoding argument".

It would be nice if there was a workaround making both line_terminator and encoding work at the same time.
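One possible workaround, sketched under assumptions: the original report notes that the CSV returned as a string has the correct line terminators, so the string can be encoded by hand and written in binary mode, which bypasses newline translation. This assumes Python 3 string semantics. `write_csv_text` is a hypothetical helper, and the literal below stands in for the value returned by `df.to_csv(line_terminator='\r\n')`.

```python
import io
import os
import tempfile


def write_csv_text(path, csv_text, encoding="utf-8"):
    """Encode already-rendered CSV text and write the bytes in binary mode."""
    with io.open(path, "wb") as f:
        f.write(csv_text.encode(encoding))


# Stand-in for the string returned by df.to_csv(line_terminator='\r\n')
csv_text = u",x\r\n0,caf\u00e9\r\n"

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    write_csv_text(path, csv_text)
    with io.open(path, "rb") as f:
        result = f.read()
finally:
    os.remove(path)

print(repr(result))
```

Because the encoding is applied before the bytes reach the file object, the line terminator and the encoding no longer interact.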

redx177 · 2018-04-18T15:55:08Z

I am experiencing the same issue.

With line_terminator='\r\n', mode='w', I am getting the line endings \r\r\n

When I am using line_terminator='\r\n', mode='wb', I am getting the error:

File "C:[...]\site-packages\pandas\io\common.py", line 332, in _get_handle
f = open(path, mode, errors='replace')
ValueError: binary mode doesn't take an errors argument

And the same as @ingmars when I try to set an encoding with line_terminator='\r\n', mode='wb', encoding='utf-8'.

In the end I settled on not specifying any line_terminator. I'm not happy with that, but I will fix the line endings with other tools after the file has been written.
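A rough sketch of such an after-the-fact fix, assuming the only damage is the doubled carriage return described above. Both function names are made up for illustration. Note the limitation: a quoted field that legitimately contains \r\r\n would be altered too.

```python
def collapse_doubled_cr(data):
    """Replace each CR CR LF byte sequence with CR LF."""
    return data.replace(b"\r\r\n", b"\r\n")


def fix_file_in_place(path):
    """Read the file as raw bytes, collapse doubled CRs, and write it back."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(collapse_doubled_cr(data))


print(collapse_doubled_cr(b",x\r\r\n0,1\r\r\n"))
```

Both the read and the write use binary mode so that no further newline translation happens during the repair.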

Everything with Win10, Python 3.5, Pandas 0.19.2

gfyoung · 2018-06-03T22:23:55Z

Per #20353 (comment) from @jreback , let's move the discussion to #20353.

gfyoung added the IO CSV (read_csv, to_csv) and Output-Formatting (__repr__ of pandas objects, to_string) labels Aug 28, 2017

chris-b1 added 2/3 Compat labels Aug 29, 2017

chris-b1 added this to the Next Major Release milestone Aug 29, 2017

jreback changed the title ~~to_csv does not always handle line_terminator correctly~~ to_csv does not always handle line_terminator correctly Aug 29, 2017

jreback added the Windows Windows OS label Aug 29, 2017

deflatSOCO mentioned this issue May 27, 2018

DataFrame.to_csv not using correct line terminator value #20353

Closed

gfyoung closed this as completed Jun 3, 2018
