read_csv treats \x00 as EOL instead of null value #14012

spillz · 2016-08-16T17:35:10Z

Not sure if this is a bug, but it took me a long time to figure out what was going on in a much bigger datafile than the sample one below.

Code Sample, a copy-pastable example if possible

import pandas
import StringIO

data='''var1,var2,var3
1,2,0
2,\x00,0
3,4,0
4,5,0
'''

print pandas.read_csv(StringIO.StringIO(data))

Expected Output

A table with 4 rows instead of 5, or an error.

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

gfyoung · 2016-08-17T02:56:44Z

I don't see an error on master:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data="""var1,var2,var3
1,2,0
2,\x00,0
3,4,0
4,5,0
"""
>>> df = read_csv(StringIO(data))
>>> df
   var1  var2  var3
0   1.0   2.0   0.0
1   2.0   NaN   NaN
2   NaN   0.0   NaN
3   3.0   4.0   0.0
4   4.0   5.0   0.0

spillz · 2016-08-17T03:07:05Z

It should be:

   var1  var2  var3
0   1.0   2.0   0.0
1   2.0   NaN   0.0
2   3.0   4.0   0.0
3   4.0   5.0   0.0

gfyoung · 2016-08-17T03:12:00Z

@spillz : Sorry, I meant to write more to clarify my comment. In the meantime, could you add the expected output to your original issue?

Fixes a bug in the C parser in which the NULL character ('\x00') was interpreted as a true line terminator, escape character, or comment character, because the parser used it internally to indicate that the user had not specified these values. As a result, data containing this character was parsed incorrectly. It should be parsed as NULL. Closes gh-14012.
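
For readers following along, here is a rough sketch of the kind of sentinel collision described above. It is a toy pure-Python splitter, not pandas' C tokenizer. The function `split_rows` and its `nul_means_unset` flag are hypothetical names made up for this illustration. The assumption is simply that an "unset" line terminator stored as `'\x00'` gets compared against every input character.

```python
# Toy sketch only: NOT pandas' C tokenizer. All names here are hypothetical.
NUL = "\x00"


def split_rows(text, lineterminator=None, nul_means_unset=False):
    """Split CSV-like text into rows of string fields.

    nul_means_unset=True mimics the reported behavior: an unspecified
    terminator is stored as NUL, so a NUL in the data ends the row.
    """
    if nul_means_unset and lineterminator is None:
        lineterminator = NUL  # sentinel collides with real data

    rows, row, field = [], [], []
    for ch in text:
        if ch == ",":
            row.append("".join(field))
            field = []
        elif ch == "\n" or (lineterminator is not None and ch == lineterminator):
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        elif ch == NUL:
            continue  # treat NUL as "no value": the field stays empty
        else:
            field.append(ch)
    if row or field:  # flush a final row that lacks a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


data = "var1,var2,var3\n1,2,0\n2,\x00,0\n3,4,0\n4,5,0\n"
buggy = split_rows(data, nul_means_unset=True)
fixed = split_rows(data)
print(len(buggy) - 1, len(fixed) - 1)  # data rows, excluding the header
```

With the sentinel enabled, the row `2,\x00,0` is split into `['2', '']` and `['', '0']`, which matches the extra row seen in the output earlier in this thread. With "unset" represented as `None`, the same row stays whole as `['2', '', '0']`.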

gfyoung · 2016-08-17T04:15:52Z

@spillz , @jreback : Actually, my PR speaks for itself here in terms of "expanding" on my comment above. In short, this is a bug.

gfyoung mentioned this issue Aug 17, 2016

BUG: Parse NULL char as null value #14019

Closed

jreback added Bug IO CSV read_csv, to_csv labels Aug 17, 2016

jreback added this to the 0.19.0 milestone Aug 17, 2016

jreback closed this as completed in cb43b6c Aug 17, 2016
