newline defaults for read_csv and write_csv are not consistent #10018

Open
mheilman opened this issue Apr 29, 2015 · 13 comments
Labels
API - Consistency (Internal Consistency of API/Behavior), Docs, IO CSV (read_csv, to_csv)

Comments

@mheilman

It's possible to write out a CSV file using the default settings (i.e., line_terminator='\n' in DataFrame.to_csv) that can't be read back in using the default settings (i.e., lineterminator=None).

This problem arises when there's a text column that doesn't get quotes around it (e.g., if it has no commas) but has a carriage return. read_csv by default thinks that either '\n' or '\r' can be line terminators, so extra rows appear. See example below.

In [12]: pd.DataFrame({"text": ["foo\rbar"]}).to_csv("foo.csv", index=False)

In [13]: pd.read_csv("foo.csv")
Out[13]: 
  text
0  foo
1  bar

It might be best to have the default line terminator for read_csv be '\n'. That might reduce usability (e.g., it's nice to be able to load CSVs from UNIX or Windows using the defaults), so maybe it'd be good just to add something to the documentation. Or, maybe a warning could be printed out or exception raised if a CSV file appears to use multiple types of line endings, which shouldn't happen.

(IIRC this can also cause the default parsing engine to make python segfault in some cases.)
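For contrast, a stdlib-only sketch: Python's own `csv` writer uses `'\r\n'` as its default line terminator, so under its default `QUOTE_MINIMAL` quoting a field containing a bare `'\r'` gets quoted on write and round-trips cleanly:

```python
import csv
import io

# The stdlib csv writer's default lineterminator is '\r\n', so under
# QUOTE_MINIMAL any field containing '\r' or '\n' is quoted on write.
buf = io.StringIO()
csv.writer(buf).writerow(["foo\rbar"])
print(repr(buf.getvalue()))  # '"foo\rbar"\r\n'

# The reader restores the embedded carriage return intact:
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows)  # [['foo\rbar']]
```

pandas overrides that default line terminator, which is why its output differs (see the later comments in this thread).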

@jreback
Contributor

jreback commented Apr 29, 2015

this seems like a very particular special case to me. Normally having \r INSIDE of a string is very odd to say the least.

@jreback jreback added the IO CSV read_csv, to_csv label Apr 29, 2015
@mheilman
Author

Yeah, it's definitely a rare issue, but I thought it worth documenting in case somebody else runs into it or thinks of a good solution that doesn't impair usability for more typical cases.

@jreback
Contributor

jreback commented Apr 29, 2015

ok, you can add a small note in io.rst / csv section that shows the issues w.r.t. '\r' (try to be as general as possible though).

@corr723

corr723 commented May 21, 2015

Wouldn't it be wise for to_csv to emit quotes whenever the data contains either \r or \n? By default Python 3's open will convert \r to \n ("universal newline"), so omitting quotes around \r seems a bit dangerous.
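The universal-newline hazard can be seen with the stdlib alone (a sketch, not pandas code): a lone `'\r'` written without quotes is silently translated to `'\n'` by a default text-mode read, turning one row into two.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.csv")
    # newline='' disables translation on write, so the bare '\r' is kept:
    with open(path, "w", newline="") as f:
        f.write("text\nfoo\rbar\n")
    # Default text-mode read uses universal newlines (newline=None):
    # the lone '\r' becomes '\n', so the single data row splits in two.
    with open(path) as f:
        print(f.read().splitlines())  # ['text', 'foo', 'bar']
```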

@ameasure

I agree, this points to a bug in to_csv. to_csv should quote fields containing special characters in order to conform with the CSV standard. From RFC 4180, which documents the CSV standard (https://tools.ietf.org/html/rfc4180):

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

@corr724

corr724 commented Nov 12, 2015

#10911 reminds me of this. Not sure why this is tagged Docs, as it's a genuine bug in my opinion.

@amytildazhang

I ran into this bug today: using Python 3, to_csv did not put quotes around a string with \r in it.

@mattayes
Contributor

mattayes commented Oct 28, 2017

I ran into this issue today as well. While I was able to resolve the issue by explicitly setting the lineterminator in read_csv() to \n, that's a bit cumbersome.

I agree with others that this is almost certainly a bug in to_csv(). Pulling a bit more from the CSV standard @ameasure shared (emphasis added):

The ABNF grammar appears as follows:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

If I'm interpreting it correctly, carriage returns should definitely be escaped.
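The grammar's rule can be condensed into a small helper (a hypothetical illustration, not pandas code): a field is "escaped" when it contains a comma, a double quote, or a CR/LF, and embedded double quotes are doubled.

```python
def rfc4180_field(value: str) -> str:
    """Quote a field per RFC 4180: escape when it contains a comma,
    double quote, CR, or LF; double any embedded double quotes."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

print(rfc4180_field("foo\rbar"))  # "foo\rbar" (quoted, so \r is preserved)
print(rfc4180_field('say "hi"'))  # "say ""hi"""
print(rfc4180_field("plain"))     # plain (no quoting needed)
```

Under this reading, a bare carriage return inside a field always forces quoting, which the default to_csv output does not do.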

@weizhongg

I ran into this bug today as well. We were analyzing online reviews, and many reviews span multiple lines but do not use commas. I support the amendment of to_csv.

@jreback
Contributor

jreback commented Nov 3, 2017

well a pull request from the community would be a way to resolve this.

@asishm
Contributor

asishm commented Oct 8, 2018

the csv library defaults do quote the carriage returns with quoting=0 (or csv.QUOTE_MINIMAL)

The issue (as pointed out above) is that the to_csv method sets the line_terminator to \n by default [1]. Instead, could it not be set to None in both the to_csv method and the underlying CSVFormatter object [2] as defaults, letting the csv module choose the line_terminator value?

A final change would be to not initialize the writer with line_terminator=None. A possible filter when creating the writer_kwargs dict? [3]

I'm not sure how this would work in Unix environments where the default line terminator is \n (as I don't have access to one to test)
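The mechanism can be sketched with the stdlib alone (a sketch, not pandas internals): `QUOTE_MINIMAL` quotes a field only when it contains the delimiter, the quote character, or a character of the writer's `lineterminator`, so passing `lineterminator='\n'` makes a bare `'\r'` invisible to the quoting check.

```python
import csv
import io

# With lineterminator='\n', QUOTE_MINIMAL only treats '\n' as special,
# so a bare '\r' inside a field is written unquoted:
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["foo\rbar"])
print(repr(buf.getvalue()))  # 'foo\rbar\n' -- no quotes

# With the writer's default lineterminator ('\r\n'), '\r' triggers quoting:
buf2 = io.StringIO()
csv.writer(buf2).writerow(["foo\rbar"])
print(repr(buf2.getvalue()))  # '"foo\rbar"\r\n'
```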

@hl6

hl6 commented Aug 18, 2021

quoting=csv.QUOTE_MINIMAL doesn't work.

pd.to_csv (v1.2.1) failed to put quotes around string values ending with a \r.

It had to be csv.QUOTE_ALL.
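A stdlib sketch of why QUOTE_ALL works here where QUOTE_MINIMAL does not: when the writer's line terminator is `'\n'` (as pandas passes it), QUOTE_MINIMAL does not treat `'\r'` as special, but QUOTE_ALL quotes every field unconditionally, so the carriage return survives.

```python
import csv
import io

# QUOTE_ALL quotes every field regardless of content, so the embedded
# '\r' is protected even with lineterminator='\n':
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(
    ["two\rowt", "x"]
)
print(repr(buf.getvalue()))  # '"two\rowt","x"\n'
```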

@mattf

mattf commented Jan 14, 2023

i ran into this issue thanks to https://twitter.com/ComputerBookNew/status/1156186489660665856 containing multiple carriage returns

i'm on a linux system w/ pandas 1.5.2

Python 3.9.15 (main, Nov 17 2022, 00:00:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.5.2'

In [3]: df = pd.DataFrame({'a': [1, 2, 3], 'b':['one', 'two\rowt', 'three']})

In [4]: df
Out[4]: 
   a         b
0  1       one
1  2  two\rowt
2  3     three

In [5]: df.to_csv('busted.csv', index=False)

In [6]: pd.read_csv('busted.csv')
Out[6]: 
     a      b
0    1    one
1    2    two
2  owt    NaN
3    3  three

In [7]: pd.read_csv('busted.csv', lineterminator='\n')
Out[7]: 
   a         b
0  1       one
1  2  two\rowt
2  3     three

In [8]: !hexdump -C busted.csv
00000000  61 2c 62 0a 31 2c 6f 6e  65 0a 32 2c 74 77 6f 0d  |a,b.1,one.2,two.|
00000010  6f 77 74 0a 33 2c 74 68  72 65 65 0a              |owt.3,three.|
0000001c

In [9]: df.to_csv('busted.csv', index=False, quoting=1)

In [10]: !hexdump -C busted.csv
00000000  22 61 22 2c 22 62 22 0a  22 31 22 2c 22 6f 6e 65  |"a","b"."1","one|
00000010  22 0a 22 32 22 2c 22 74  77 6f 0d 6f 77 74 22 0a  |"."2","two.owt".|
00000020  22 33 22 2c 22 74 68 72  65 65 22 0a              |"3","three".|
0000002c

In [11]: pd.read_csv('busted.csv')
Out[11]: 
   a         b
0  1       one
1  2  two\rowt
2  3     three

my solution was to switch to parquet and ask myself why i thought csv was ok in the first place
