
Ignore EOF character in read_csv #7340


Closed
wcbeard opened this issue Jun 4, 2014 · 11 comments

Labels: Bug, IO CSV (read_csv, to_csv)

Comments

@wcbeard
Contributor

wcbeard commented Jun 4, 2014

From a Stack Overflow question: I'm working on a Mac, trying to read a CSV generated on Windows that ends with a '\x1a' EOF character, and pd.read_csv creates a bogus row at the end, ['\x1a', NaN, NaN, ...]:

In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
In [4]: df = pd.read_csv(s)
In [5]: df
Out[5]:
      a    b
0     1    2
1     3    4
2  \x1a  NaN

Now I'm manually checking for that character and a bunch of NaNs in the last row, and dropping it. Would it be worth adding an option to not create this last row (or to ignore the EOF automatically)? In Python I'm currently doing:

def strip_eof(df):
    r"Drop the last row if it begins with '\x1a' and ends with NaNs."
    lastrow = df.iloc[-1]
    if lastrow.iloc[0] == '\x1a' and lastrow.iloc[1:].isnull().all():
        return df.drop([lastrow.name], axis=0)
    return df
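
A quick usage sketch for the helper above (hypothetical, just reusing the StringIO repro from earlier):

# Hypothetical usage of strip_eof, reusing the repro from above.
s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
df = strip_eof(pd.read_csv(s))   # back to the two real rows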
@jreback
Contributor

jreback commented Jun 4, 2014

this looks very similar to #5500, yes?

@jreback jreback changed the title Ignore EOF character in read_csv Jun 4, 2014
@jreback jreback added the CSV label Jun 4, 2014
@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

Ah, didn't see that one. Yes, it looks similar, though the other one seems to be about EOF characters in the wrong place rather than at the end of the file. I can't tell whether the solution to that issue would fix this problem as well, though.

@jreback jreback added the Bug label Jun 4, 2014
@jreback jreback added this to the 0.15.0 milestone Jun 4, 2014
@jreback
Contributor

jreback commented Jun 4, 2014

ok will mark it as a bug. I wonder if the EOF is the SAME EOF as used in 'pandas/src/parser/tokenizer.c'

this might be a windows thing. How did you generate the file?

@hayd
Contributor

hayd commented Jun 4, 2014

I've seen this too (not on Windows).

@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

I assume someone generated it on a Windows machine (I get it from a Samba share), but other than that I couldn't say. I just tried reading the same file from my Windows VM, and it also added the bogus row at the end. The character might be from some old Windows standard...?
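
(For context, an aside not from the thread: \x1a is the SUB/Ctrl-Z character that CP/M and MS-DOS used as an end-of-file marker, and some Windows tools still append it to text files. A minimal sketch to fabricate such a file locally, with 'ctrlz.csv' as a hypothetical name:)

# Write a CSV with the DOS-style trailing Ctrl-Z byte to reproduce the report.
with open('ctrlz.csv', 'wb') as f:
    f.write(b'a,b\r\n1,2\r\n3,4\r\n\x1a')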

@jreback
Contributor

jreback commented Jun 4, 2014

hmm.... maybe give a shot at searching old issues; I seem to recall this somewhere....

otherwise would certainly appreciate a PR for this, sounds like a bug to me

@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

I don't know C, so I was trying to fix it in Python after the parser does the work (in pandas.io.parsers.TextFileReader.read). I'm realizing after some failed tests, however, that the bug messes up the dtype information as well. For example, adding "\x1a" to the end of

data = """A,B,C
1,2,a
4,5,b
"""

changes the dtypes of the series from [int, int, object] to [object, float, object], because of the string in the first column and the NaNs after it.

I'm not familiar with any way to reclaim the original dtypes after dropping the bad row; from my limited understanding, it seems the proper way to do it would be at the C level in the parser.
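
(A sketch demonstrating the dtype damage described above, written against modern Python 3 pandas, hence io.StringIO rather than the Py2 StringIO used earlier:)

import pandas as pd
from io import StringIO

data = "A,B,C\n1,2,a\n4,5,b\n"
print(pd.read_csv(StringIO(data)).dtypes)           # A int64, B int64, C object
print(pd.read_csv(StringIO(data + "\x1a")).dtypes)  # A object, B float64, C object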

@jreback
Contributor

jreback commented Jun 4, 2014

you can use convert_objects() to re-infer the dtypes

yah this needs to be fixed at the C level
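
(A sketch of that re-inference step; convert_objects has since been removed from pandas, so this uses per-column pd.to_numeric as a stand-in and assumes the bogus row has already been dropped:)

import pandas as pd
from io import StringIO

raw = "A,B,C\n1,2,a\n4,5,b\n\x1a"
df = pd.read_csv(StringIO(raw)).iloc[:-1]  # drop the bogus '\x1a' row
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])   # re-infer numeric dtypes
    except (ValueError, TypeError):
        pass                               # leave non-numeric columns as object
print(df.dtypes)  # A int64, B float64, C object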

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jcull

jcull commented Mar 27, 2015

One way to get around this is to filter the input and give read_csv a StringIO object:

import itertools
import cStringIO
import pandas as pd

with open('csv.csv', 'rb') as myfile:
    # Keep only lines whose first character is not the EOF marker.
    myfh = cStringIO.StringIO(''.join(itertools.ifilter(lambda x: x[0] != '\x1a', myfile)))

df = pd.read_csv(myfh)

This allows read_csv to properly interpret the data types of all the columns so you don't have to convert them later, which is especially helpful if you have a converter or datetime columns. It will probably increase your memory usage, though, because of the extra StringIO object.
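
(The snippet above is Python 2 — cStringIO, itertools.ifilter; a sketch of the same idea in Python 3, keeping 'csv.csv' as the placeholder path:)

import io
import pandas as pd

# Same filtering idea: drop any line that starts with the EOF marker.
with open('csv.csv', 'r', newline='') as myfile:
    myfh = io.StringIO(''.join(line for line in myfile if not line.startswith('\x1a')))

df = pd.read_csv(myfh)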

@gfyoung
Member

gfyoung commented Aug 2, 2016

The issue with "skipping" an EOF on the last row is that you would have to either look ahead to check if you were on the last row OR check the entire last row all over again once you finished reading. In the average case, that's going to hurt performance.

The workaround for this would be to set skipfooter:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
>>> df = read_csv(StringIO(data), engine='python', skipfooter=1)
>>> df
   a  b
0  1  2
1  3  4
>>> df.dtypes
a    int64
b    int64
dtype: object

In light of this, and the fact that "fixing it" is probably more detrimental than beneficial, I would recommend that this issue be closed.

@gfyoung
Member

gfyoung commented Nov 4, 2017

No activity on this issue for about a year, and I had recommended it be closed then (because the practical fix is just to use skipfooter). Rereading the conversation, I agree with my earlier self.

@gfyoung gfyoung closed this as completed Nov 4, 2017
@gfyoung gfyoung modified the milestones: Next Major Release, No action Nov 4, 2017