BUG: read_csv does not parse csv files with windows line terminator correctly #3501

Closed
bmacauley opened this issue May 1, 2013 · 16 comments
Labels: Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)

@bmacauley

I am using pandas 0.11 in Python 2.7.3.

When I read CSV data with a Windows line terminator (\r\n), pandas creates extra rows in the dataframe, i.e. it does not recognise the line terminator.

test1.csv

Id,Description,Field1,Field2^M
1,"test data 1 ^M\n",2,3^M
2,"test data 2 ^M\n",3,4^M

In [80]: in_csv = 'test1.csv'

In [81]: cols = ['Id', 'Description', 'Field1', 'Field2']

In [82]: reader = pd.read_csv(in_csv, usecols=cols, dtype=object)

In [83]: print reader
  Id       Description Field1 Field2
0 \r               NaN    NaN    NaN
1  1  test data 1 \r\n      2      3
2 \r               NaN    NaN    NaN
3  2  test data 2 \r\n      3      4
4 \r               NaN    NaN    NaN

In [86]: reader.to_csv(out_csv, index=False, encoding='utf-8')

Id,Description,Field1,Field2
^M,,,
1,test data 1 ^M\n,2,3
^M,,,
2,test data 2 ^M\n,3,4
^M,,,

I cannot use the lineterminator option:

lineterminator : string (length 1), default None

reader = pd.read_csv(in_csv, usecols=cols, dtype=object, lineterminator='\r\n')

ValueError: Only length-1 line terminators supported

@cpcloud
Member

cpcloud commented May 1, 2013

Related: #3453

@wesm
Member

wesm commented May 25, 2013

Working fine for me:

In [3]: read_csv(StringIO('Id,Description,Field1,Field2\r\n1,"test data 1 \r\n",2,3\r\n2,"test data 2 \r\n",3,4\r\n'))
Out[3]: 
   Id       Description  Field1  Field2
0   1  test data 1 \r\n       2       3
1   2  test data 2 \r\n       3       4

Could you show us the exact binary contents of test1.csv? i.e.

open('test1.csv', 'rb').read()

@wesm
Member

wesm commented Jun 2, 2013

Not going to make it into 0.11.1. Will investigate pending a failing test case.

@jreback
Contributor

jreback commented Sep 20, 2013

Please reopen if you can create a test case.

@jreback jreback closed this as completed Sep 20, 2013
@cpcloud
Member

cpcloud commented Sep 18, 2014

@bmacauley FWIW this SO post is very helpful:

http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return

Your data source was probably written with mode='w' (the default); in Python 2 that will change \r\n to \r\r\n, so you need to write your data with mode='wb'.

If you're in Python 3, you need to open with mode='w', newline=''.

This is a pretty sad state of affairs, but it isn't a pandas problem; it's a problem with the csv module and Python newline handling across versions.
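A minimal Python 3 sketch of the newline='' advice above (the temp file and its contents are just for illustration):

```python
import csv
import os
import tempfile

# In Python 3, open CSV files for writing with newline='' so the csv
# module's own '\r\n' row terminators reach the file untranslated;
# with plain mode='w', the text layer rewrites '\n' as os.linesep,
# which on Windows turns '\r\n' into '\r\r\n'.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Description"])
    writer.writerow([1, "test data 1"])

raw = open(path, "rb").read()
assert b"\r\r\n" not in raw      # no doubled carriage returns
assert raw.count(b"\r\n") == 2   # exactly one CRLF per row
```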

@lo-co

lo-co commented Feb 20, 2015

I am not sure the previous comment by @cpcloud was particularly helpful, given that you did not specify how the file was written. I too have run into this problem recently, reading a file that was written not in Python but by an external program. I could go back and alter the file-writing code, but I still have to be able to process data written to older files that retain the old EOL characters.

I found that if you specify lineterminator='\n', you should get the results you are looking for. Given the easy solution, I believe this is likely not something that requires attention in pandas. Hope this helps anyone running into this problem.

Cheers, Matt

@szeitlin

I'm having this problem too, looking at census data. The file type is 'ASCII English text, with CRLF line terminators'.

I tried the suggestions above; they didn't work.

Following Nick Coghlan's post here, I was able to view something: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

filecontent = open('Gaz_counties_national.txt', encoding='ASCII', errors="surrogateescape").read()

The file looks like this after that:

"USPS\tGEOID\tANSICODE\tNAME\tPOP10\tHU10\tALAND\tAWATER\tALAND_SQMI\tAWATER_SQMI\tINTPTLAT\tINTPTLONG \nAL\t01001\t00161526\tAutauga County\t54571\t22135\t1539582278\t25775735\t 594.436\t 9.952\t 32.536382\t -86.644490 \nAL\t01003\t00161527\tBaldwin County\t182265\t104061\t4117521611\t1133190229\t 1589.784\t 437.527\t 30.659218\t -87.746067 \nAL\t01005\t00161528\tBarbour County\t27457\t11829\t2291818968\t50864716\t 884.876\t 19.639\t 31.870670\t -85.405456 \nAL\t01007\t00161529\tBibb County\t22915\t8981\t1612480789\t9289057\t 622.582\t 3.587\t 33.015893\t -87.127148 \nAL\t01009\t00161530\tBlount County\t57322\t23887\t1669961855\t15157440\t 644.776\t 5.852\t 33.977448\t -86.567246 \nAL\t01011\t00161531\tBullock County\t10914\t4493\t1613056905\t6056528\t 622.805\t 2.338\t 32.101759\t -85.717261 \nAL\t01013\t00161532\tButler County\t20947\t9964\t2011976894\t2726814\t 776.829\t 1.053\t 31.751667\t -86.681969 \nAL\t01015\t00161533\tCalhoun County\t118572\t53289\t1569189995\t16624267\t 605.868\t 6.419\t 33.771706\t -85.822513 \nAL\t01017\t00161534\tChambers County\t34215\t17004\t1545009282\t17048142\t 596.531\t 6.582\t 32.917943\t -85.391812 \nAL\t01019\t00161535\tCherokee County\t25989\t16267\t1434075952\t119858898\t 553.700\t 46.278\t 34.069515\t -85.654242 \nAL\t01021\t00161536\tChilton County\t43643\t19278\t1794483586\t20586383\t 692.854\t 7.948\t 32.854059\t -86.726627

If I then try to read that filecontent object with pandas:

pd.read_csv(filecontent, sep='\t', lineterminator = '\r\n')

I get

UnicodeEncodeError: 'utf-8' codec can't encode character: surrogates not allowed

I tried following the suggestions in the comments above and got 'UnsupportedOperation: not readable'.

I can't put '\r\n' in as the lineterminator argument to pandas.read_csv, because of 'ValueError: Only length-1 line terminators supported'.

I'm in Python 3.4, pandas version 0.15.1. I would love to know what symbol should go here to get it to work, or whether updating to a newer version would help.
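One likely source of the 'not readable' error above: read_csv expects a path or a file-like object, not the file's contents as a plain string. A sketch with made-up tab-separated data, wrapping the in-memory text in io.StringIO:

```python
import io
import pandas as pd

# read_csv treats a plain string argument as a file path; contents
# already in memory need to be wrapped in a file-like object first.
text = "USPS\tGEOID\nAL\t01001\nAL\t01003\n"
df = pd.read_csv(io.StringIO(text), sep="\t", dtype=str)
assert list(df.columns) == ["USPS", "GEOID"]
assert df["GEOID"].tolist() == ["01001", "01003"]  # dtype=str keeps leading zeros
```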

@cbare

cbare commented Feb 11, 2016

I'm seeing the same error as @szeitlin on Python 3.5, Win2012. Same cause:

pd.read_csv(path, lineterminator = os.linesep)

where os.linesep == '\r\n'

@cpcloud
Member

cpcloud commented Feb 11, 2016

@szeitlin Can you post a link to the file?

@cbare

cbare commented Feb 11, 2016

I don't think the actual .csv file matters. It's just the handling of the lineterminator parameter. Here's a repro:

PS C:\Users\Administrator> notepad foo.csv

...create simple .csv file in notepad... I am totally clueless about Windows.

PS C:\Users\Administrator> ipython
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.0.3 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import os

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.17.1'

In [4]: m = pd.read_csv("foo.csv")

In [5]: m
Out[5]:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

In [6]: m = pd.read_csv("foo.csv", lineterminator=os.linesep)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-63603000fef9> in <module>()
----> 1 m = pd.read_csv("foo.csv", lineterminator=os.linesep)

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequo
te, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfo
oter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace
, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_de
fault_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, n
rows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blan
k_lines)
    496                     skip_blank_lines=skip_blank_lines)
    497
--> 498         return _read(filepath_or_buffer, kwds)
    499
    500     parser_f.__name__ = name

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    273
    274     # Create the parser.
--> 275     parser = TextFileReader(filepath_or_buffer, **kwds)
    276
    277     if (nrows is not None) and (chunksize is not None):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    588             self.options['has_index_names'] = kwds['has_index_names']
    589
--> 590         self._make_engine(self.engine)
    591
    592     def _get_options_with_defaults(self, engine):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    729     def _make_engine(self, engine='c'):
    730         if engine == 'c':
--> 731             self._engine = CParserWrapper(self.f, **self.options)
    732         else:
    733             if engine == 'python':

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1101         kwds['allow_leading_cols'] = self.index_col is not False
   1102
-> 1103         self._reader = _parser.TextReader(src, **kwds)
   1104
   1105         # XXX

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3407)()

ValueError: Only length-1 line terminators supported

Looking at foo.csv in a hex editor, it does indeed have \r\n line endings, as you'd expect, but I don't think the parser ever looks at the file.

For what it's worth, the csv module from the standard library has some confusing opinions about what the lineterminator parameter should mean and how it interacts with line endings in text-mode files. Maybe I'm just misunderstanding how this is supposed to work.
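The traceback above supports that reading: the ValueError is raised while TextReader is constructed, before any bytes are read. A quick sketch against an in-memory buffer (no file involved at all), assuming the C engine still enforces the length-1 rule:

```python
import io
import pandas as pd

# The two-character terminator is rejected during parser construction,
# so the same error appears even when there is nothing to read from disk.
buf = io.StringIO("A,B,C\r\n1,2,3\r\n")
try:
    pd.read_csv(buf, lineterminator="\r\n")
    raised = False
except ValueError:
    raised = True
assert raised  # "Only length-1 line terminators supported"
```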

@szeitlin

Sorry, I'm swamped with work right now, but if I can find the file, and if I ever managed to get it working, I'll post both. I think I ended up solving the problem in a somewhat unintuitive way, but it was a while ago...

@yasink18

yasink18 commented May 5, 2020

Hello,

Change engine = 'python-fwf' and it will work.

@DavidEnergie

DavidEnergie commented Mar 10, 2021

Hello,

change engine = 'python-fwf'
it will work

Thanks for the pointer. This solved the same CR/LF issue for me, though with this engine (as with read_fwf('file.txt')), long lines break into new columns after a certain length (for me, around 10 geocoordinate strings (x.x,y.y,z.z)).
I didn't manage to resolve this, but I switched back to engine='c' and used lineterminator = '\n' instead of lineterminator = '\r\n', since the latter switches the engine to 'python'. Leaving out the '\r' doesn't seem to have any negative effect in my example.

The solution to @szeitlin's issue and mine was right in front of my eyes, as @lo-co indicated above.

pd.read_csv(path_to_file_txt, sep='\t', lineterminator = '\n', engine='c', keep_default_na=False)

This original issue is around 8 years old. But I still wanted to add this, as the previous comments here helped me track it down.

@OrangeDog

\r\n isn't even just the Windows line ending; it is the RFC 4180 standard CSV line ending.

It should at least be possible to set lineterminator='\r\n', and it should really be the default too.

@rossbush

Greetings. While trying to use pandas to read CSV files with \r\n in AWS Lambdas running on Linux, we are encountering [ERROR] ValueError: Only length-1 line terminators supported. Currently a fallback to array.split("\r\n") is being used, but handling all CSV operations with pandas would be ideal. Any idea whether this will be addressed in upcoming releases, or what the official workaround is?

@rtrad89

rtrad89 commented Nov 24, 2022

Oddly, I face this when reading a CSV, but not when saving it! This prevented me from unifying the arguments for all my CSV operations.
