read_csv treats \x00 as EOL instead of null value #14012

Closed
spillz opened this issue Aug 16, 2016 · 4 comments
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@spillz

spillz commented Aug 16, 2016

Not sure if this is a bug, but it took me a long time to figure out what was going on in a much bigger datafile than the sample one below.

Code Sample, a copy-pastable example if possible

import pandas
import StringIO

data='''var1,var2,var3
1,2,0
2,\x00,0
3,4,0
4,5,0
'''

print pandas.read_csv(StringIO.StringIO(data))

Expected Output

A table with 4 rows instead of 5, or an error.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@gfyoung
Member

gfyoung commented Aug 17, 2016

I don't see an error on master:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data="""var1,var2,var3
1,2,0
2,\x00,0
3,4,0
4,5,0
"""
>>> df = read_csv(StringIO(data))
>>> df
   var1  var2  var3
0   1.0   2.0   0.0
1   2.0   NaN   NaN
2   NaN   0.0   NaN
3   3.0   4.0   0.0
4   4.0   5.0   0.0

@spillz
Author

spillz commented Aug 17, 2016

It should be:

   var1  var2  var3
0   1.0   2.0   0.0
1   2.0   NaN   0.0
2   3.0   4.0   0.0
3   4.0   5.0   0.0
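
For readers stuck on an affected version, one possible workaround is to remove NUL characters before the text reaches the parser. The sketch below is only an illustration under that assumption; `strip_nul_chars` is a hypothetical helper, not part of pandas, and it uses only the standard library.

```python
import io


def strip_nul_chars(text):
    """Return a file-like object over `text` with every NUL character removed.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Removing the NUL leaves an empty field, e.g. "2,\x00,0" becomes "2,,0".
    return io.StringIO(text.replace("\x00", ""))


data = "var1,var2,var3\n1,2,0\n2,\x00,0\n3,4,0\n4,5,0\n"
cleaned = strip_nul_chars(data).getvalue()
print(cleaned.splitlines())
```

Passing `strip_nul_chars(data)` to `pandas.read_csv` should then give four data rows, with the emptied `var2` field in the second row read as a missing value.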

@gfyoung
Member

gfyoung commented Aug 17, 2016

@spillz : Sorry, I meant to write more to clarify my comment. In the meantime, could you add the expected output to your original issue?

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 17, 2016
Fixes bug in C parser in which the NULL character
('\x00') was being interpreted as a true line terminator,
escape character, or comment character because it was used
to indicate that a user had not specified these values. As
a result, if the data contains this value, it was being
incorrectly parsed. It should be parsed as NULL.

Closes gh-14012.
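
The sentinel collision described in the commit message can be illustrated with a toy tokenizer. This is a rough Python sketch, not the pandas C parser: the names `tokenize`, `tokenize_buggy`, and `tokenize_fixed` are hypothetical, and using `None` for "not specified" is just one way to avoid the collision. The sketch assumes the input ends with a newline.

```python
NUL = "\x00"


def tokenize(text, is_terminator):
    """Split comma-separated text into rows of string fields (toy version)."""
    rows, row, field = [], [], []
    for ch in text:
        if is_terminator(ch):
            # End of line: close the current field and the current row.
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        elif ch == ",":
            row.append("".join(field))
            field = []
        elif ch == NUL:
            # A NUL that is not a terminator is dropped, leaving the field empty.
            continue
        else:
            field.append(ch)
    return rows


def tokenize_buggy(text, lineterminator=NUL):
    # NUL doubles as "not specified", so a NUL in the data looks like
    # a user-supplied line terminator and splits the row.
    return tokenize(text, lambda ch: ch == "\n" or ch == lineterminator)


def tokenize_fixed(text, lineterminator=None):
    # None marks "not specified"; NUL ends a line only if explicitly requested.
    return tokenize(
        text,
        lambda ch: ch == "\n"
        or (lineterminator is not None and ch == lineterminator),
    )


data = "var1,var2,var3\n1,2,0\n2,\x00,0\n3,4,0\n4,5,0\n"
print(tokenize_buggy(data))
print(tokenize_fixed(data))
```

In the buggy variant the row containing the NUL is split into `['2', '']` and `['', '0']`, which mirrors the extra row in the output reported above; the fixed variant keeps it as a single row with an empty middle field.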
@gfyoung
Member

gfyoung commented Aug 17, 2016

@spillz , @jreback : Actually, my PR speaks for itself here in terms of "expanding" on my comment above. In short, this is a bug.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Aug 17, 2016
@jreback jreback added this to the 0.19.0 milestone Aug 17, 2016
3 participants