
Ignore EOF character in read_csv #7340


Closed
wcbeard opened this issue Jun 4, 2014 · 11 comments

Labels: Bug, IO CSV (read_csv, to_csv)

Comments

@wcbeard
Contributor

wcbeard commented Jun 4, 2014

From a Stack Overflow question: I'm working on a Mac, trying to read a CSV generated on Windows that ends with a '\x1a' EOF character, and pd.read_csv creates a bogus row at the end, ['\x1a', NaN, NaN, ...]:

In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
In [4]: df = pd.read_csv(s)
In [5]: df
Out[5]:
      a    b
0     1    2
1     3    4
2  \x1a  NaN

Now I'm manually checking for that character and a bunch of NaNs in the last row, and dropping it. Would it be worth adding an option to not create this last row (or to ignore the EOF automatically)? In Python I'm currently doing:

def strip_eof(df):
    r"Drop the last row if it begins with '\x1a' and ends with NaNs."
    lastrow = df.iloc[-1]
    if lastrow.iloc[0] == '\x1a' and lastrow.iloc[1:].isnull().all():
        return df.drop([lastrow.name], axis=0)
    return df
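
A quick usage sketch for the helper above (hypothetical, just reusing the StringIO repro from earlier):

# Hypothetical usage of strip_eof, reusing the repro from above.
s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
df = strip_eof(pd.read_csv(s))   # back to the two real rows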
@jreback
Contributor

jreback commented Jun 4, 2014

this looks very similar to #5500, yes?

@jreback jreback changed the title Ignore EOF character in read_csv Jun 4, 2014
@jreback jreback added the CSV label Jun 4, 2014
@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

Ah, didn't see that one. Yes, it looks similar, though the other one seems to be about EOF characters in the wrong place rather than at the end of the file. I can't tell whether the solution to that issue would fix this problem as well, though.

@jreback jreback added the Bug label Jun 4, 2014
@jreback jreback added this to the 0.15.0 milestone Jun 4, 2014
@jreback
Contributor

jreback commented Jun 4, 2014

ok will mark it as a bug. I wonder if the EOF is the SAME EOF as used in 'pandas/src/parser/tokenizer.c'

this might be a windows thing. How did you generate the file?

@hayd
Contributor

hayd commented Jun 4, 2014

I've seen this too (not on Windows).

@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

I assume someone generated it on a Windows machine (I get it from a Samba share), but other than that I couldn't say. I just tried reading the same file from my Windows VM, and it also added the bogus row at the end. The character might be from some old Windows standard...?
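
(For context, an aside not from the thread: \x1a is the SUB/Ctrl-Z character that CP/M and MS-DOS used as an end-of-file marker, and some Windows tools still append it to text files. A minimal sketch to fabricate such a file locally, with 'ctrlz.csv' as a hypothetical name:)

# Write a CSV with the DOS-style trailing Ctrl-Z byte to reproduce the report.
with open('ctrlz.csv', 'wb') as f:
    f.write(b'a,b\r\n1,2\r\n3,4\r\n\x1a')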

@jreback
Contributor

jreback commented Jun 4, 2014

hmm.... maybe give a shot at searching old issues; I seem to recall this somewhere....

otherwise would certainly appreciate a PR for this, sounds like a bug to me

@wcbeard
Contributor Author

wcbeard commented Jun 4, 2014

I don't know C, so I was trying to fix it in Python after the parser does the work (in pandas.io.parsers.TextFileReader.read). I'm realizing after some failed tests, however, that the bug messes up the dtype information as well. For example, adding "\x1a" to the end of

data = """A,B,C
1,2,a
4,5,b
"""

changes the dtypes of the series from [int, int, object] to [object, float, object], because of the string in the first column and the NaNs after it.

I'm not familiar with any way to reclaim the original dtypes after dropping the bad row; from my limited understanding, it seems the proper way to do it would be at the C level in the parser.
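
(A sketch demonstrating the dtype damage described above, written against modern Python 3 pandas, hence io.StringIO rather than the Py2 StringIO used earlier:)

import pandas as pd
from io import StringIO

data = "A,B,C\n1,2,a\n4,5,b\n"
print(pd.read_csv(StringIO(data)).dtypes)           # A int64, B int64, C object
print(pd.read_csv(StringIO(data + "\x1a")).dtypes)  # A object, B float64, C object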

@jreback
Contributor

jreback commented Jun 4, 2014

you can use convert_objects() to re-infer the dtypes

yah this needs to be fixed at the C level
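
(A sketch of that re-inference step; convert_objects has since been removed from pandas, so this uses per-column pd.to_numeric as a stand-in and assumes the bogus row has already been dropped:)

import pandas as pd
from io import StringIO

raw = "A,B,C\n1,2,a\n4,5,b\n\x1a"
df = pd.read_csv(StringIO(raw)).iloc[:-1]  # drop the bogus '\x1a' row
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])   # re-infer numeric dtypes
    except (ValueError, TypeError):
        pass                               # leave non-numeric columns as object
print(df.dtypes)  # A int64, B float64, C object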

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jcull

jcull commented Mar 27, 2015

One way to get around this is to filter the input and give read_csv a StringIO object:

import itertools
import cStringIO
import pandas as pd

with open('csv.csv', 'rb') as myfile:
    # Keep only lines whose first character is not the EOF marker.
    myfh = cStringIO.StringIO(''.join(itertools.ifilter(lambda x: x[0] != '\x1a', myfile)))

df = pd.read_csv(myfh)

This allows read_csv to properly interpret the data types of all the columns so you don't have to convert them later, which is especially helpful if you have a converter or datetime columns. It will probably increase your memory usage, though, because of the extra StringIO object.
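
(The snippet above is Python 2 — cStringIO, itertools.ifilter; a sketch of the same idea in Python 3, keeping 'csv.csv' as the placeholder path:)

import io
import pandas as pd

# Same filtering idea: drop any line that starts with the EOF marker.
with open('csv.csv', 'r', newline='') as myfile:
    myfh = io.StringIO(''.join(line for line in myfile if not line.startswith('\x1a')))

df = pd.read_csv(myfh)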

@gfyoung
Member

gfyoung commented Aug 2, 2016

The issue with "skipping" an EOF on the last row is that you would have to either look ahead to check if you were on the last row OR check the entire last row all over again once you finished reading. In the average case, that's going to hurt performance.

The workaround for this would be to set skipfooter:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
>>> df = read_csv(StringIO(data), engine='python', skipfooter=1)
>>> df
   a  b
0  1  2
1  3  4
>>> df.dtypes
a    int64
b    int64
dtype: object

In light of this, and the fact that "fixing it" is probably more detrimental than beneficial, I would recommend that this issue be closed.

@gfyoung
Member

gfyoung commented Nov 4, 2017

No activity on this issue for about a year, and I had recommended it be closed then (because the practical fix is just to use skipfooter). Rereading the conversation, I agree with my earlier self.

@gfyoung gfyoung closed this as completed Nov 4, 2017
@gfyoung gfyoung modified the milestones: Next Major Release, No action Nov 4, 2017