BUG: read_csv does not parse csv files with windows line terminator correctly #3501

Closed
bmacauley opened this issue May 1, 2013 · 16 comments
Labels: Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)

@bmacauley

I am using pandas 0.11 in Python 2.7.3.

When I read CSV data with a Windows line terminator (\r\n), pandas creates extra rows in the dataframe, i.e. it does not recognise the line terminator.

test1.csv

Id,Description,Field1,Field2^M
1,"test data 1 ^M\n",2,3^M
2,"test data 2 ^M\n",3,4^M

In [80]: in_csv = 'test1.csv'

In [81]: cols = ['Id', 'Description', 'Field1', 'Field2']

In [82]: reader = pd.read_csv(in_csv, usecols=cols, dtype=object)

In [83]: print reader
  Id       Description Field1 Field2
0 \r               NaN    NaN    NaN
1  1  test data 1 \r\n      2      3
2 \r               NaN    NaN    NaN
3  2  test data 2 \r\n      3      4
4 \r               NaN    NaN    NaN

In [86]: reader.to_csv(out_csv, index=False, encoding='utf-8')

Id,Description,Field1,Field2
^M,,,
1,test data 1 ^M\n,2,3
^M,,,
2,test data 2 ^M\n,3,4
^M,,,

I cannot use the lineterminator option:

lineterminator : string (length 1), default None

reader = pd.read_csv(in_csv, usecols=cols, dtype=object, lineterminator='\r\n')

ValueError: Only length-1 line terminators supported

@cpcloud
Member

cpcloud commented May 1, 2013

Related: #3453

@wesm
Member

wesm commented May 25, 2013

Working fine for me:

In [3]: read_csv(StringIO('Id,Description,Field1,Field2\r\n1,"test data 1 \r\n",2,3\r\n2,"test data 2 \r\n",3,4\r\n'))
Out[3]: 
   Id       Description  Field1  Field2
0   1  test data 1 \r\n       2       3
1   2  test data 2 \r\n       3       4

Could you show us the exact binary contents of test1.csv? i.e.

open('test1.csv', 'rb').read()

@wesm
Member

wesm commented Jun 2, 2013

Not going to make it into 0.11.1. Will investigate pending a failing test case.

@jreback
Contributor

jreback commented Sep 20, 2013

Please reopen if you can create a test case.

@jreback jreback closed this as completed Sep 20, 2013
@cpcloud
Member

cpcloud commented Sep 18, 2014

@bmacauley FWIW this SO post is very helpful:

http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return

Your data source was probably written with mode='w' (the default); in Python 2 that will change \r\n to \r\r\n, so you need to write your data with mode='wb'.

If you're in Python 3, you need to open with mode='w', newline=''.

This is a pretty sad state of affairs, but it isn't a pandas problem; it's a problem with the csv module and Python newline handling across versions.
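A minimal Python 3 sketch of the newline='' advice above (the temp file and its contents are just for illustration):

```python
import csv
import os
import tempfile

# In Python 3, open CSV files for writing with newline='' so the csv
# module's own '\r\n' row terminators reach the file untranslated;
# with plain mode='w', the text layer rewrites '\n' as os.linesep,
# which on Windows turns '\r\n' into '\r\r\n'.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Description"])
    writer.writerow([1, "test data 1"])

raw = open(path, "rb").read()
assert b"\r\r\n" not in raw      # no doubled carriage returns
assert raw.count(b"\r\n") == 2   # exactly one CRLF per row
```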

@lo-co

lo-co commented Feb 20, 2015

I am not sure the previous comment by @cpcloud was particularly helpful, given that you did not specify how the file was written. I too have run into this problem recently, reading a file that was written not in Python but by an external program. I could go back and alter the file-writing code, but I still have to be able to process data written to older files that retain the old EOL characters.

I found that if you specify lineterminator='\n', you should get the results you are looking for. Given the easy solution, I believe this is likely not something that requires attention in pandas. Hope this helps anyone running into this problem.

Cheers, Matt

@szeitlin

I'm having this problem too, looking at census data. The file type is 'ASCII English text, with CRLF line terminators'.

I tried the suggestions above; they didn't work.

Following Nick Coghlan's post here, I was able to view something: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

filecontent = open('Gaz_counties_national.txt', encoding='ASCII', errors="surrogateescape").read()

The file looks like this after that:

"USPS\tGEOID\tANSICODE\tNAME\tPOP10\tHU10\tALAND\tAWATER\tALAND_SQMI\tAWATER_SQMI\tINTPTLAT\tINTPTLONG \nAL\t01001\t00161526\tAutauga County\t54571\t22135\t1539582278\t25775735\t 594.436\t 9.952\t 32.536382\t -86.644490 \nAL\t01003\t00161527\tBaldwin County\t182265\t104061\t4117521611\t1133190229\t 1589.784\t 437.527\t 30.659218\t -87.746067 \nAL\t01005\t00161528\tBarbour County\t27457\t11829\t2291818968\t50864716\t 884.876\t 19.639\t 31.870670\t -85.405456 \nAL\t01007\t00161529\tBibb County\t22915\t8981\t1612480789\t9289057\t 622.582\t 3.587\t 33.015893\t -87.127148 \nAL\t01009\t00161530\tBlount County\t57322\t23887\t1669961855\t15157440\t 644.776\t 5.852\t 33.977448\t -86.567246 \nAL\t01011\t00161531\tBullock County\t10914\t4493\t1613056905\t6056528\t 622.805\t 2.338\t 32.101759\t -85.717261 \nAL\t01013\t00161532\tButler County\t20947\t9964\t2011976894\t2726814\t 776.829\t 1.053\t 31.751667\t -86.681969 \nAL\t01015\t00161533\tCalhoun County\t118572\t53289\t1569189995\t16624267\t 605.868\t 6.419\t 33.771706\t -85.822513 \nAL\t01017\t00161534\tChambers County\t34215\t17004\t1545009282\t17048142\t 596.531\t 6.582\t 32.917943\t -85.391812 \nAL\t01019\t00161535\tCherokee County\t25989\t16267\t1434075952\t119858898\t 553.700\t 46.278\t 34.069515\t -85.654242 \nAL\t01021\t00161536\tChilton County\t43643\t19278\t1794483586\t20586383\t 692.854\t 7.948\t 32.854059\t -86.726627

If I then try to read that filecontent object with pandas:

pd.read_csv(filecontent, sep='\t', lineterminator = '\r\n')

I get

UnicodeEncodeError: 'utf-8' codec can't encode character: surrogates not allowed

I tried following the suggestions in the comments above and got 'UnsupportedOperation: not readable'.

I can't put '\r\n' in as the lineterminator argument to pandas.read_csv, because of 'ValueError: Only length-1 line terminators supported'.

I'm in Python 3.4, pandas version 0.15.1. I would love to know what symbol should go here to get it to work, or whether updating to a newer version would help.
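One likely source of the 'not readable' error above: read_csv expects a path or a file-like object, not the file's contents as a plain string. A sketch with made-up tab-separated data, wrapping the in-memory text in io.StringIO:

```python
import io
import pandas as pd

# read_csv treats a plain string argument as a file path; contents
# already in memory need to be wrapped in a file-like object first.
text = "USPS\tGEOID\nAL\t01001\nAL\t01003\n"
df = pd.read_csv(io.StringIO(text), sep="\t", dtype=str)
assert list(df.columns) == ["USPS", "GEOID"]
assert df["GEOID"].tolist() == ["01001", "01003"]  # dtype=str keeps leading zeros
```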

@cbare

cbare commented Feb 11, 2016

I'm seeing the same error as @szeitlin on Python 3.5, Win2012. Same cause:

pd.read_csv(path, lineterminator = os.linesep)

where os.linesep == '\r\n'

@cpcloud
Member

cpcloud commented Feb 11, 2016

@szeitlin Can you post a link to the file?

@cbare

cbare commented Feb 11, 2016

I don't think the actual .csv file matters. It's just the handling of the lineterminator parameter. Here's a repro:

PS C:\Users\Administrator> notepad foo.csv

...create simple .csv file in notepad... I am totally clueless about Windows.

PS C:\Users\Administrator> ipython
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.0.3 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import os

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.17.1'

In [4]: m = pd.read_csv("foo.csv")

In [5]: m
Out[5]:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

In [6]: m = pd.read_csv("foo.csv", lineterminator=os.linesep)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-63603000fef9> in <module>()
----> 1 m = pd.read_csv("foo.csv", lineterminator=os.linesep)

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequo
te, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfo
oter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace
, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_de
fault_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, n
rows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blan
k_lines)
    496                     skip_blank_lines=skip_blank_lines)
    497
--> 498         return _read(filepath_or_buffer, kwds)
    499
    500     parser_f.__name__ = name

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    273
    274     # Create the parser.
--> 275     parser = TextFileReader(filepath_or_buffer, **kwds)
    276
    277     if (nrows is not None) and (chunksize is not None):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    588             self.options['has_index_names'] = kwds['has_index_names']
    589
--> 590         self._make_engine(self.engine)
    591
    592     def _get_options_with_defaults(self, engine):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    729     def _make_engine(self, engine='c'):
    730         if engine == 'c':
--> 731             self._engine = CParserWrapper(self.f, **self.options)
    732         else:
    733             if engine == 'python':

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1101         kwds['allow_leading_cols'] = self.index_col is not False
   1102
-> 1103         self._reader = _parser.TextReader(src, **kwds)
   1104
   1105         # XXX

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3407)()

ValueError: Only length-1 line terminators supported

Looking at foo.csv in a hex editor, it does indeed have \r\n line endings, as you'd expect, but I don't think the parser ever looks at the file.

For what it's worth, the csv module from the standard library has some confusing opinions about what the lineterminator parameter should mean and how it interacts with line endings in text-mode files. Maybe I'm just misunderstanding how this is supposed to work.
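The traceback above supports that reading: the ValueError is raised while TextReader is constructed, before any bytes are read. A quick sketch against an in-memory buffer (no file involved at all), assuming the C engine still enforces the length-1 rule:

```python
import io
import pandas as pd

# The two-character terminator is rejected during parser construction,
# so the same error appears even when there is nothing to read from disk.
buf = io.StringIO("A,B,C\r\n1,2,3\r\n")
try:
    pd.read_csv(buf, lineterminator="\r\n")
    raised = False
except ValueError:
    raised = True
assert raised  # "Only length-1 line terminators supported"
```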

@szeitlin

Sorry, I'm swamped with work right now, but if I can find the file, and if I ever managed to get it working, I'll post both. I think I ended up solving the problem in a somewhat unintuitive way, but it was a while ago...

@yasink18

yasink18 commented May 5, 2020

Hello,

Change engine = 'python-fwf' and it will work.

@DavidEnergie

DavidEnergie commented Mar 10, 2021

Hello,

change engine = 'python-fwf'
it will work

Thanks for the pointer. This solved the same CR/LF issue for me, though with this engine (as with read_fwf('file.txt')), long lines break into new columns after a certain length (for me, around 10 geocoordinate strings (x.x,y.y,z.z)).
I didn't manage to resolve this, but I switched back to engine='c' and used lineterminator = '\n' instead of lineterminator = '\r\n', since the latter switches the engine to 'python'. Leaving out the '\r' doesn't seem to have any negative effect in my example.

The solution to @szeitlin's issue and mine was right in front of my eyes, as @lo-co indicated above.

pd.read_csv(path_to_file_txt, sep='\t', lineterminator = '\n', engine='c', keep_default_na=False)

This original issue is around 8 years old. But I still wanted to add this, as the previous comments here helped me track it down.

@OrangeDog

\r\n isn't even just the Windows line ending; it is the RFC 4180 standard CSV line ending.

It should at least be possible to set lineterminator='\r\n', and it should really be the default too.

@rossbush

Greetings. While trying to use pandas to read CSV files with \r\n in AWS Lambdas running on Linux, we are encountering [ERROR] ValueError: Only length-1 line terminators supported. Currently a fallback to array.split("\r\n") is being used, but handling all CSV operations with pandas would be ideal. Any idea whether this will be addressed in upcoming releases, or what the official workaround is?

@rtrad89

rtrad89 commented Nov 24, 2022

Oddly, I face this when reading a CSV, but not when saving it! This prevented me from unifying the arguments for all my CSV operations.
