Ignore EOF character in read_csv #7340
Comments
This looks very similar to #5500, yes?
Ah, didn't see that one. Yes, it looks similar, though the other one seems to refer to EOFs that are in the wrong place rather than at the end of the file. I can't tell whether the solution to that issue would fix this problem as well, though.
OK, will mark it as a bug. I wonder if the EOF is the same EOF as used in 'pandas/src/parser/tokenizer.c'; this might be a Windows thing. How did you generate the file?
I've seen this too (not on Windows).
I assume someone generated it from a Windows machine (I get it from a samba share), but other than that couldn't say. I just tried reading the same file from my Windows VM, and it also added the bogus row at the end. The character might be from some old Windows standard...?
Hmm... maybe give a shot at searching old issues; I seem to recall this somewhere. Otherwise I would certainly appreciate a PR for this; it sounds like a bug to me.
I don't know C, so I was trying to fix it in python after the parser does the work (in
changes the dtypes of the series' from I'm not familiar with any way to reclaim the original dtypes after dropping the bad row. From my limited understanding, it seems the proper way to do it would be at the C level in the parser.
You can use convert_objects() to re-infer the dtypes. Yeah, this needs to be fixed at the C level.
One way to get around this is filtering the input and giving read_csv a StringIO object:
This will allow read_csv to properly interpret the data types of all the columns so you don't have to convert them later, which is especially helpful if you have a converter or datetime columns. The extra StringIO object will probably increase your memory usage.
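The snippet that originally accompanied this comment did not survive, so here is a minimal sketch of the filtering idea rather than the commenter's own code. The helper names `strip_trailing_eof` and `read_csv_without_eof` are hypothetical, and the sketch assumes the only stray characters are `'\x1a'` bytes at the very end of the file.

```python
import io

EOF_CHAR = "\x1a"  # Ctrl-Z, the legacy DOS end-of-file marker


def strip_trailing_eof(text):
    """Return a StringIO holding `text` with any trailing Ctrl-Z removed.

    Hypothetical helper: only characters at the very end are stripped,
    so a Ctrl-Z in the middle of the data is left untouched.
    """
    return io.StringIO(text.rstrip(EOF_CHAR))


def read_csv_without_eof(path, **kwargs):
    """Read `path` with read_csv after dropping a trailing Ctrl-Z.

    Hypothetical wrapper; extra keyword arguments go to read_csv as-is.
    """
    import pandas as pd  # imported lazily so the helper above needs only the stdlib

    # newline="" keeps the original line endings; the whole file is held
    # in memory twice for a moment (raw text plus the StringIO copy).
    with open(path, newline="") as fh:
        return pd.read_csv(strip_trailing_eof(fh.read()), **kwargs)
```

Because the marker is removed before parsing, the default C engine never sees the bogus last row and can infer numeric dtypes as usual.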
The issue with "skipping" an The workaround for this would be to set
>>> from pandas import read_csv
>>> from io import StringIO
>>> data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
>>> df = read_csv(StringIO(data), engine='python', skipfooter=1)
>>> df
   a  b
0  1  2
1  3  4
>>> df.dtypes
a    int64
b    int64
dtype: object
In light of this, and the fact that "fixing it" is probably more detrimental than beneficial, I would recommend that this issue be closed.
No activity on this issue for about a year, and I had recommended it be closed then (because the proper fix is just
From a stackoverflow question: I'm working on a Mac, trying to read a csv generated on Windows that ends with a '\x1a' EOF character, and pd.read_csv creates a new row at the end, ['\x1a', NaN, NaN, ...]
Now I'm manually checking for that character and a bunch of NaN's in the last row, and dropping it. Would it be worth adding an option to not create this last row (or ignore the EOF automatically)? In python I'm currently doing