data after null character dropped in read_csv #19886

Open
smsaladi opened this issue Feb 24, 2018 · 6 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@smsaladi
Contributor

smsaladi commented Feb 24, 2018

Code Sample

In [1]: from pandas.compat import StringIO
   ...: from pandas import read_csv
   ...: data = 'x,y\ntest,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[1]:
      x    y
0  test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

In [2]: data = 'x,y\n\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[2]:
     x    y
0  NaN  Reg
1  NaN  Reg
2    I  Swp
3    I  Swp

In [4]: data = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[4]:
      x    y
0  test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

Problem description

Within a single field, data before a NUL character is kept, but everything after it, up to the delimiter, is dropped. This result seems counterintuitive to me. I'd expect the field to either

  1. always become NaN, with the rest of the data dropped, or
  2. contain a string with the data and the NUL characters (dropping the NULs would probably also be fine), becoming NaN only if the field consisted exclusively of NULs (see the expected output section below).

Might be related to #2741 or its fix.

Expected Output

# last case
In [X]: data = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[X]:
      x    y
0  test\x00test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

# another case
In [X]: data = 'x,y\n\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[X]:
      x    y
0  \x00test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

Output of pd.show_versions()

In [9]: import pandas as pd
...: pd.show_versions()
...:

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.1.2
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.5.0a1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

You're certainly welcome to look into this - as discussed in #2741, there might still be some parts of the parser that use NUL to manage state. Though in practice it may be easier to pre-process your data, stripping out the NULs before passing it to pandas.
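Something along these lines, for example (just a rough sketch; the filename here is made up):

# read the raw text, drop the NUL characters, then hand the cleaned text to pandas
from io import StringIO
import pandas as pd

with open("data_with_nuls.csv", encoding="utf-8") as fh:
    cleaned = fh.read().replace("\x00", "")

df = pd.read_csv(StringIO(cleaned), engine="c")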

@WillAyd
Member

WillAyd commented Mar 2, 2018

A little out of my area but isn't this just how strings work in C? For instance, if I compile the below program

#include <stdio.h>

int main () {
  char myarr[8] = "foo\0test";   /* embedded NUL after "foo" */
  printf("%s\n", myarr);         /* %s stops at the first NUL */
}

and execute it, only foo gets printed, skipping all of the characters after the NUL.

@chris-b1
Contributor

chris-b1 commented Mar 2, 2018

Yes & no. You are correct that a null byte '\0' is used to mark the end of a "C string," which is really just a char array with that convention.

But I believe the parser is working with a sized buffer (i.e. it knows that myarr is length 8 in your example), so it wouldn't necessarily need to stop on the null byte. That said, I'm not familiar with the low-level workings of the parser, so I could be wrong on this.
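To illustrate the distinction in Python terms (just an analogy using ctypes, not the parser's actual code):

import ctypes

data = b"foo\x00test"                        # same bytes as myarr above
buf = ctypes.create_string_buffer(data, len(data))

# "C string" read: the length is found with strlen(), so it stops at the NUL
print(ctypes.string_at(buf))                 # b'foo'

# sized-buffer read: the length is already known, so nothing stops at the NUL
print(ctypes.string_at(buf, len(data)))      # b'foo\x00test'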

@WillAyd
Member

WillAyd commented Mar 2, 2018

Hmm, OK. Well, from what I can tell with VERBOSITY set, I think the tokenizer interprets this correctly. Here's a small excerpt from the last example provided, where I believe Iter 8 is actually pushing the NUL byte without triggering a field end:

tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3

My guess is that the issue is the parser using PyUnicode_FromString instead of PyUnicode_FromStringAndSize with the length of the field, which should have been detected in some form or another by the tokenizer:

name = PyUnicode_FromString(word)

Again, I haven't dug this deep into the C side of things before, so I could be way off, but I figured I'd share in case it helps anyone else looking at it.
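A quick way to see the difference between the two calls from Python is to go through ctypes (purely an illustration of the two C-API functions, not how the parser invokes them):

import ctypes

api = ctypes.pythonapi
api.PyUnicode_FromString.restype = ctypes.py_object
api.PyUnicode_FromString.argtypes = [ctypes.c_char_p]
api.PyUnicode_FromStringAndSize.restype = ctypes.py_object
api.PyUnicode_FromStringAndSize.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]

word = b"test\x00test"

# length inferred with strlen(): stops at the first NUL byte
api.PyUnicode_FromString(word)                    # 'test'

# explicit length: the embedded NUL and everything after it are kept
api.PyUnicode_FromStringAndSize(word, len(word))  # 'test\x00test'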

FWIW here's the full verbose output of parsing the third example provided in the original post

_tokenize_helper: Asked to tokenize 2 rows, datapos=0, datalen=0
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 38, status=0
datalen: 38
_tokenize_helper: Trying to process 38 bytes, datalen=38, datapos= 0


make_stream_space: nbytes = 38.  grow_buffer(self->stream...)
safe_realloc: buffer = 0x7fbbff62c2e0, size = 64, result = 0x7fbbff63a090
safe_realloc: buffer = 0x7fbbff63a090, size = 128, result = 0x7fbbff63a090
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 0, self->stream_cap=128, status=0
safe_realloc: buffer = 0x7fbbff6279f0, size = 48, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63dec0, size = 384, result = 0x7fbbff63dec0
make_stream_space: grow_buffer(self->self->words, 0, 48, 38, 0)
make_stream_space: cap != self->words_cap, nbytes = 38, self->words_cap=48
safe_realloc: buffer = 0x7fbbff60a100, size = 384, result = 0x7fbbff63f5f0
safe_realloc: buffer = 0x7fbbff632b20, size = 48, result = 0x7fbbff62c2e0
safe_realloc: buffer = 0x7fbbff62c2e0, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63e040
safe_realloc: buffer = 0x7fbbff63e040, size = 384, result = 0x7fbbff63f770
make_stream_space: grow_buffer(self->line_start, 1, 48, 38, 0)
make_stream_space: cap != self->lines_cap, nbytes = 38
safe_realloc: buffer = 0x7fbbff632b40, size = 384, result = 0x7fbbff63f8f0
x,y
test
tokenize_bytes - Iter: 0 Char: 0x78 Line 1 field_count 0, state 0
PUSH_CHAR: Pushing x, slen= 0, stream_cap=128, stream_len=0
tokenize_bytes - Iter: 1 Char: 0x2c Line 1 field_count 0, state 3
push_char: self->stream[2] = 0, stream_cap=128
end_field: Char diff: 0
end_field: Saw word x at: 0. Total: 1
tokenize_bytes - Iter: 2 Char: 0x79 Line 1 field_count 1, state 1
PUSH_CHAR: Pushing y, slen= 2, stream_cap=128, stream_len=2
tokenize_bytes - Iter: 3 Char: 0xa Line 1 field_count 1, state 3
push_char: self->stream[4] = 0, stream_cap=128
end_field: Char diff: 2
end_field: Saw word y at: 2. Total: 2
end_line: Line end, nfields: 2
end_line: lines: 0
end_line: ex_fields: -1
end_line: new line start: 2
end_line: Finished line, at 1
tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3
tokenize_bytes - Iter: 14 Char: 0x52 Line 2 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 14, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 15 Char: 0x65 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 15, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 16 Char: 0x67 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 16, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 17 Char: 0xa Line 2 field_count 1, state 3
push_char: self->stream[18] = 0, stream_cap=128
end_field: Char diff: 14
end_field: Saw word Reg at: 14. Total: 4
end_line: Line end, nfields: 2
end_line: lines: 1
end_line: ex_fields: 2
end_line: new line start: 4
end_line: Finished line, at 2
_TOKEN_CLEANUP: datapos: 18, datalen: 38
leaving tokenize_helper
_tokenize_helper: Asked to tokenize 262143 rows, datapos=18, datalen=38
_tokenize_helper: Trying to process 20 bytes, datalen=38, datapos= 18


make_stream_space: nbytes = 20.  grow_buffer(self->stream...)
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 18, self->stream_cap=128, status=0
make_stream_space: grow_buffer(self->self->words, 4, 48, 20, 0)
make_stream_space: grow_buffer(self->line_start, 3, 48, 20, 0)

tokenize_bytes - Iter: 18 Char: 0x0 Line 3 field_count 0, state 0
PUSH_CHAR: Pushing , slen= 18, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 19 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 19, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 20 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 20, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 21 Char: 0x2c Line 3 field_count 0, state 3
push_char: self->stream[22] = 0, stream_cap=128
end_field: Char diff: 18
end_field: Saw word  at: 18. Total: 5
tokenize_bytes - Iter: 22 Char: 0x52 Line 3 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 22, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 23 Char: 0x65 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 23, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 24 Char: 0x67 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 24, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 25 Char: 0xa Line 3 field_count 1, state 3
push_char: self->stream[26] = 0, stream_cap=128
end_field: Char diff: 22
end_field: Saw word Reg at: 22. Total: 6
end_line: Line end, nfields: 2
end_line: lines: 2
end_line: ex_fields: 2
end_line: new line start: 6
end_line: Finished line, at 3
tokenize_bytes - Iter: 26 Char: 0x49 Line 4 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 26, stream_cap=128, stream_len=26
tokenize_bytes - Iter: 27 Char: 0x2c Line 4 field_count 0, state 3
push_char: self->stream[28] = 0, stream_cap=128
end_field: Char diff: 26
end_field: Saw word I at: 26. Total: 7
tokenize_bytes - Iter: 28 Char: 0x53 Line 4 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 28, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 29 Char: 0x77 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 29, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 30 Char: 0x70 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 30, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 31 Char: 0xa Line 4 field_count 1, state 3
push_char: self->stream[32] = 0, stream_cap=128
end_field: Char diff: 28
end_field: Saw word Swp at: 28. Total: 8
end_line: Line end, nfields: 2
end_line: lines: 3
end_line: ex_fields: 2
end_line: new line start: 8
end_line: Finished line, at 4
tokenize_bytes - Iter: 32 Char: 0x49 Line 5 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 32, stream_cap=128, stream_len=32
tokenize_bytes - Iter: 33 Char: 0x2c Line 5 field_count 0, state 3
push_char: self->stream[34] = 0, stream_cap=128
end_field: Char diff: 32
end_field: Saw word I at: 32. Total: 9
tokenize_bytes - Iter: 34 Char: 0x53 Line 5 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 34, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 35 Char: 0x77 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 35, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 36 Char: 0x70 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 36, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 37 Char: 0xa Line 5 field_count 1, state 3
push_char: self->stream[38] = 0, stream_cap=128
end_field: Char diff: 34
end_field: Saw word Swp at: 34. Total: 10
end_line: Line end, nfields: 2
end_line: lines: 4
end_line: ex_fields: 2
end_line: new line start: 10
end_line: Finished line, at 5
_TOKEN_CLEANUP: datapos: 38, datalen: 38
Finished tokenizing input
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 0, status=1
datalen: 0
handling eof, datalen: 0, pstate: 0
leaving tokenize_helper
parser_consume_rows: Deleting 8 words, 32 chars
parser_trim_buffers: new_cap < self->words_cap
safe_realloc: buffer = 0x7fbbff63dec0, size = 24, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63f5f0, size = 24, result = 0x7fbbff63f5f0
parser_trim_buffers: new_cap = 9, stream_cap = 128, lines_cap = 48
parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc
safe_realloc: buffer = 0x7fbbff63a090, size = 9, result = 0x7fbbff63a090
parser_trim_buffers: new_cap < self->lines_cap
safe_realloc: buffer = 0x7fbbff63f770, size = 16, result = 0x7fbbff63f770
safe_realloc: buffer = 0x7fbbff63f8f0, size = 16, result = 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x7fbbff63a090
free_if_not_null 0x7fbbff63dec0
free_if_not_null 0x7fbbff63f5f0
free_if_not_null 0x7fbbff63f770
free_if_not_null 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0

@WillAyd
Member

WillAyd commented Mar 2, 2018

Trying to confirm my suspicion, I modified the linked block above to look as follows:

for i in range(field_count):
    word = self.parser.words[start + i]
    if start + i == self.parser.words_len:  # Handle last item
        word_len = self.parser.datalen - self.parser.word_starts[start + i] - 1
    else:
        word_len = self.parser.word_starts[start + i + 1]  - self.parser.word_starts[start + i] - 1

    if path == CSTRING:
        name = PyBytes_FromString(word)
    elif path == UTF8:
        name = PyUnicode_FromStringAndSize(word, word_len)
    elif path == ENCODED:
        name = PyUnicode_Decode(word, strlen(word),
                                self.c_encoding, errors)

I noticed that this code is actually only in a block that parses the header, but if I injected null bytes into the header, it would read the entire field:

In [7]: data = '\x00x,\x00y\n\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
In [7]: df = pd.read_csv(StringIO(data), engine='c')
In [7]: df.columns[0]
Out[7]: '\x00x'
In [6]: df.columns[1]
Out[6]: '\x00y'

I'll dig a little further into the parsing of the body of the data, but I'm pretty sure this approach could fix the issue. I'll submit a PR if it gets that far.

@changhiskhan
Contributor

The easier solution would be to just add a parsing option and have the tokenizer swallow the NUL bytes.
To preserve all of the NUL characters you'd have to do a bit of surgery on COLITER_NEXT and also modify some of the khash code in the parser. That's easily 10x the effort of the first option.
What's more important here, preserving the data after the NULs or preserving the NULs themselves?
