BUG: read_csv does not parse csv files with windows line terminator correctly #3501
Comments
Related: #3453 |
Working fine for me:
Could you show us the exact binary contents of test1.csv? i.e.
|
Not going to make it into 0.11.1. Will investigate pending a failing test case. |
Please reopen if you can create a test case. |
@bmacauley FWIW this SO post is very helpful: http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return. Your data source was probably written with the stdlib csv module through a text-mode file handle; if you're in Python 3 you need to open the file with newline=''. This is a pretty sad state of affairs, but it isn't a pandas problem, it's a problem with the csv module. |
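The failure mode that SO post describes can be sketched as follows (a minimal sketch; the file path and data are invented for illustration):

```python
import csv
import os
import tempfile

# csv.writer terminates rows with '\r\n' by itself. If the file were opened
# in text mode without newline='' on Windows, each '\n' would be translated
# to '\r\n' again, yielding '\r\r\n' at every row end -- the extra carriage
# returns that later confuse readers.
path = os.path.join(tempfile.mkdtemp(), "out.csv")

# The recommended Python 3 idiom: newline='' leaves line-ending control
# entirely to the csv module.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Description"])
    writer.writerow([1, "test"])

with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'Id,Description\r\n1,test\r\n'
```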
I am not sure the previous comment by @cpcloud is particularly helpful, given that you did not specify how the file was written. I too have run into this problem recently, when reading a file that was written not by Python but by an external program. I could go back and alter the file-writing code, but I still have to be able to process data written to older files that retain the EOL constant. I found that it works if you specify the … Cheers, Matt |
I'm having this problem too, looking at census data. The file type is 'ASCII English text, with CRLF line terminators'. I tried the suggestions above, but they didn't work. Following Nick Coghlan's post here, I was able to view something: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
The file looks like this after that:
If I then try to read that filecontent object with pandas:
I get
I tried following the suggestions in the comments above and got 'UnsupportedOperation: not readable'. I can't pass '\r\n' as the lineterminator argument to pandas.read_csv, because of 'ValueError: Only length-1 line terminators supported'. I'm on Python 3.4, pandas 0.15.1. I would love to know what symbol should go here to get it to work, or whether updating to a newer version would help. |
I'm seeing the same error as @szeitlin on Python 3.5, Win2012. Same cause: pd.read_csv(path, lineterminator = os.linesep) where os.linesep == '\r\n' |
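The two reports above can be reproduced without any file at all (a sketch; note that simply omitting lineterminator lets the C parser handle CRLF data on its own):

```python
import io
import pandas as pd

data = "A,B,C\r\n1,2,3\r\n4,5,6\r\n"

# With no lineterminator argument, the C parser recognises CRLF natively.
df = pd.read_csv(io.StringIO(data))
print(df.shape)  # (2, 3)

# Passing the two-character Windows os.linesep ('\r\n') raises, as reported.
try:
    pd.read_csv(io.StringIO(data), lineterminator="\r\n")
    raised = False
except ValueError:
    raised = True
```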
@szeitlin Can you post a link to the file? |
I don't think the actual .csv file matters; it's just the handling of the lineterminator parameter. Here's a repro (I am totally clueless about Windows): PS C:\Users\Administrator> notepad foo.csv ...create a simple .csv file in Notepad... PS C:\Users\Administrator> ipython
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 4.0.3 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import os
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.17.1'
In [4]: m = pd.read_csv("foo.csv")
In [5]: m
Out[5]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: m = pd.read_csv("foo.csv", lineterminator=os.linesep)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-63603000fef9> in <module>()
----> 1 m = pd.read_csv("foo.csv", lineterminator=os.linesep)
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
273
274 # Create the parser.
--> 275 parser = TextFileReader(filepath_or_buffer, **kwds)
276
277 if (nrows is not None) and (chunksize is not None):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
588 self.options['has_index_names'] = kwds['has_index_names']
589
--> 590 self._make_engine(self.engine)
591
592 def _get_options_with_defaults(self, engine):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
729 def _make_engine(self, engine='c'):
730 if engine == 'c':
--> 731 self._engine = CParserWrapper(self.f, **self.options)
732 else:
733 if engine == 'python':
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
1101 kwds['allow_leading_cols'] = self.index_col is not False
1102
-> 1103 self._reader = _parser.TextReader(src, **kwds)
1104
1105 # XXX
pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3407)()
ValueError: Only length-1 line terminators supported
Looking at foo.csv in a hex editor, it does indeed have \r\n line endings, as you'd expect, but I don't think the parser ever looks at the file. For what it's worth, the csv module from the standard library has some confusing opinions about what the lineterminator parameter should mean and how it interacts with line endings in text-mode files. Maybe I'm just misunderstanding how this is supposed to work. |
sorry, I'm swamped with work right now, but if I can find the file and if I ever managed to get it working, I'll post both. I think I may have ended up solving the problem in a somewhat unintuitive way, but it was a while ago... |
Hello, change engine = 'python-fwf' |
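For reference, switching away from the C engine does sidestep the limitation, because the Python engine tokenises with the stdlib csv reader, which recognises CRLF on its own (a sketch with made-up data; note that the documented engine values for read_csv are 'c' and 'python'):

```python
import io
import pandas as pd

data = "A,B\r\n1,2\r\n3,4\r\n"

# No lineterminator argument is needed: the python engine's underlying
# csv reader treats '\r\n' as an ordinary record terminator.
df = pd.read_csv(io.StringIO(data), engine="python")
print(df.shape)  # (2, 2)
```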
Thanks for the pointer. This engine solved the same CR/LF issue for me. The solution to the issue @szeitlin and I hit was right in front of my eyes, as @lo-co indicated above.
This original issue is around 8 years old, but I still wanted to add this, as the previous comments here helped me track it down. |
It should at least be possible to set lineterminator='\r\n'. |
Greetings. While trying to use pandas to read CSV files with \r\n line endings in AWS Lambdas running on Linux, we are encountering [ERROR] ValueError: Only length-1 line terminators supported. Currently a fallback to array.split("\r\n") is being used, but using pandas for all CSV operations would be ideal. Any idea whether this will be addressed in upcoming releases, or what the official workaround is? |
Oddly, I face this when reading a CSV, but not when saving one! This prevented me from unifying the parameters across all my CSV operations. |
I am using pandas 0.11 in Python 2.7.3.
When I read CSV data with a Windows line terminator (\r\n), pandas creates extra rows in the dataframe, i.e. it does not recognise the line terminator.
test1.csv
Id,Description,Field1,Field2^M
1,"test data 1 ^M\n",2,3^M
2,"test data 2 ^M\n",3,4^M
In [80]: in_csv = 'test1.csv'
In [81]: cols = ['Id', 'Description', 'Field1', 'Field2']
In [82]: reader = pd.read_csv(in_csv, usecols=cols, dtype=object)
In [83]: print reader
Id Description Field1 Field2
0 \r NaN NaN NaN
1 1 test data 1 \r\n 2 3
2 \r NaN NaN NaN
3 2 test data 2 \r\n 3 4
4 \r NaN NaN NaN
In [86]: reader.to_csv(out_csv, index=False, encoding='utf-8')
Id,Description,Field1,Field2
^M,,,
1,test data 1 ^M\n,2,3
^M,,,
2,test data 2 ^M\n,3,4
^M,,,
I cannot use the lineterminator option:
lineterminator : string (length 1), default None
reader = pd.read_csv(in_csv, usecols=cols, dtype=object, lineterminator='\r\n')
ValueError: Only length-1 line terminators supported
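Since lineterminator cannot be '\r\n', one workaround is to normalise the text before handing it to pandas (a hedged sketch; the sample string only approximates the dump above, and the cleanup rules are assumptions about where the stray carriage returns sit):

```python
import io
import pandas as pd

# Approximation of the reported file: CRLF row endings, carriage returns
# embedded in quoted fields, and stray '\r' at the start of some lines.
raw = ('Id,Description,Field1,Field2\r\n'
       '\r1,"test data 1 \r\n",2,3\r\n'
       '\r2,"test data 2 \r\n",3,4\r\n')

# Collapse CRLF to LF, then drop stray carriage returns at line edges
# (carriage returns inside quoted fields become plain newlines here).
cleaned = raw.replace("\r\n", "\n")
cleaned = "\n".join(line.strip("\r") for line in cleaned.split("\n"))

df = pd.read_csv(io.StringIO(cleaned),
                 usecols=["Id", "Description", "Field1", "Field2"],
                 dtype=object)
print(df.shape)  # (2, 4)
```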