data after null character dropped in read_csv #19886

Open
smsaladi opened this issue Feb 24, 2018 · 6 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@smsaladi
Contributor

smsaladi commented Feb 24, 2018

Code Sample

In [1]: from pandas.compat import StringIO
   ...: from pandas import read_csv
   ...: data = 'x,y\ntest,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[1]:
      x    y
0  test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

In [2]: data = 'x,y\n\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[2]:
     x    y
0  NaN  Reg
1  NaN  Reg
2    I  Swp
3    I  Swp

In [4]: data = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[4]:
      x    y
0  test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

Problem description

Within a single field, data before a NUL character is kept, but everything after it, up to the delimiter, is dropped. This result seems counterintuitive to me. I'd expect the field to either

  1. always become NaN, with the rest of the data dropped, or
  2. contain a string with the data and the NUL characters (dropping the NULs would probably also be fine), becoming NaN only if the field consisted exclusively of NULs (see the expected output section below).

Might be related to #2741 or its fix.

Expected Output

# last case
In [X]: data = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[X]:
      x    y
0  test\x00test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

# another case
In [X]: data = 'x,y\n\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
   ...: read_csv(StringIO(data), engine='c')
   ...:
Out[X]:
      x    y
0  \x00test  Reg
1   NaN  Reg
2     I  Swp
3     I  Swp

Output of pd.show_versions()

In [9]: import pandas as pd
...: pd.show_versions()
...:

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.1.2
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.5.0a1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

You're certainly welcome to look into this - as discussed in #2741, there might still be some parts of the parser that use NUL to manage state. Though in practice it may be easier to pre-process your data, stripping out the NULs before passing it to pandas.
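Something along these lines, for example (just a rough sketch; the filename here is made up):

# read the raw text, drop the NUL characters, then hand the cleaned text to pandas
from io import StringIO
import pandas as pd

with open("data_with_nuls.csv", encoding="utf-8") as fh:
    cleaned = fh.read().replace("\x00", "")

df = pd.read_csv(StringIO(cleaned), engine="c")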

@WillAyd
Member

WillAyd commented Mar 2, 2018

A little out of my area but isn't this just how strings work in C? For instance, if I compile the below program

#include <stdio.h>

int main () {
  char myarr[8] = "foo\0test";   /* embedded NUL after "foo" */
  printf("%s\n", myarr);         /* %s stops at the first NUL */
}

and execute it, only foo gets printed, skipping all of the characters after the NUL.

@chris-b1
Contributor

chris-b1 commented Mar 2, 2018

Yes & no. You are correct that a null byte '\0' is used to mark the end of a "C string," which is really just a char array with that convention.

But I believe the parser is working with a sized buffer (i.e. it knows that myarr is length 8 in your example), so it wouldn't necessarily need to stop on the null byte. That said, I'm not familiar with the low-level workings of the parser, so I could be wrong on this.
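To illustrate the distinction in Python terms (just an analogy using ctypes, not the parser's actual code):

import ctypes

data = b"foo\x00test"                        # same bytes as myarr above
buf = ctypes.create_string_buffer(data, len(data))

# "C string" read: the length is found with strlen(), so it stops at the NUL
print(ctypes.string_at(buf))                 # b'foo'

# sized-buffer read: the length is already known, so nothing stops at the NUL
print(ctypes.string_at(buf, len(data)))      # b'foo\x00test'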

@WillAyd
Member

WillAyd commented Mar 2, 2018

Hmm, OK. Well, from what I can tell with VERBOSITY set, I think the tokenizer interprets this correctly. Here's a small excerpt from the last example provided, where I believe Iter 8 is actually pushing the NUL byte without triggering a field end:

tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3

My guess is that the issue is the parser using PyUnicode_FromString instead of PyUnicode_FromStringAndSize with the length of the field, which should have been detected in some form or another by the tokenizer:

name = PyUnicode_FromString(word)

Again, I haven't dug this deep into the C side of things before, so I could be way off, but I figured I'd share in case it helps anyone else looking at it.
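A quick way to see the difference between the two calls from Python is to go through ctypes (purely an illustration of the two C-API functions, not how the parser invokes them):

import ctypes

api = ctypes.pythonapi
api.PyUnicode_FromString.restype = ctypes.py_object
api.PyUnicode_FromString.argtypes = [ctypes.c_char_p]
api.PyUnicode_FromStringAndSize.restype = ctypes.py_object
api.PyUnicode_FromStringAndSize.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]

word = b"test\x00test"

# length inferred with strlen(): stops at the first NUL byte
api.PyUnicode_FromString(word)                    # 'test'

# explicit length: the embedded NUL and everything after it are kept
api.PyUnicode_FromStringAndSize(word, len(word))  # 'test\x00test'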

FWIW here's the full verbose output of parsing the third example provided in the original post

_tokenize_helper: Asked to tokenize 2 rows, datapos=0, datalen=0
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 38, status=0
datalen: 38
_tokenize_helper: Trying to process 38 bytes, datalen=38, datapos= 0


make_stream_space: nbytes = 38.  grow_buffer(self->stream...)
safe_realloc: buffer = 0x7fbbff62c2e0, size = 64, result = 0x7fbbff63a090
safe_realloc: buffer = 0x7fbbff63a090, size = 128, result = 0x7fbbff63a090
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 0, self->stream_cap=128, status=0
safe_realloc: buffer = 0x7fbbff6279f0, size = 48, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63dec0, size = 384, result = 0x7fbbff63dec0
make_stream_space: grow_buffer(self->self->words, 0, 48, 38, 0)
make_stream_space: cap != self->words_cap, nbytes = 38, self->words_cap=48
safe_realloc: buffer = 0x7fbbff60a100, size = 384, result = 0x7fbbff63f5f0
safe_realloc: buffer = 0x7fbbff632b20, size = 48, result = 0x7fbbff62c2e0
safe_realloc: buffer = 0x7fbbff62c2e0, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63e040
safe_realloc: buffer = 0x7fbbff63e040, size = 384, result = 0x7fbbff63f770
make_stream_space: grow_buffer(self->line_start, 1, 48, 38, 0)
make_stream_space: cap != self->lines_cap, nbytes = 38
safe_realloc: buffer = 0x7fbbff632b40, size = 384, result = 0x7fbbff63f8f0
x,y
test
tokenize_bytes - Iter: 0 Char: 0x78 Line 1 field_count 0, state 0
PUSH_CHAR: Pushing x, slen= 0, stream_cap=128, stream_len=0
tokenize_bytes - Iter: 1 Char: 0x2c Line 1 field_count 0, state 3
push_char: self->stream[2] = 0, stream_cap=128
end_field: Char diff: 0
end_field: Saw word x at: 0. Total: 1
tokenize_bytes - Iter: 2 Char: 0x79 Line 1 field_count 1, state 1
PUSH_CHAR: Pushing y, slen= 2, stream_cap=128, stream_len=2
tokenize_bytes - Iter: 3 Char: 0xa Line 1 field_count 1, state 3
push_char: self->stream[4] = 0, stream_cap=128
end_field: Char diff: 2
end_field: Saw word y at: 2. Total: 2
end_line: Line end, nfields: 2
end_line: lines: 0
end_line: ex_fields: -1
end_line: new line start: 2
end_line: Finished line, at 1
tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3
tokenize_bytes - Iter: 14 Char: 0x52 Line 2 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 14, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 15 Char: 0x65 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 15, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 16 Char: 0x67 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 16, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 17 Char: 0xa Line 2 field_count 1, state 3
push_char: self->stream[18] = 0, stream_cap=128
end_field: Char diff: 14
end_field: Saw word Reg at: 14. Total: 4
end_line: Line end, nfields: 2
end_line: lines: 1
end_line: ex_fields: 2
end_line: new line start: 4
end_line: Finished line, at 2
_TOKEN_CLEANUP: datapos: 18, datalen: 38
leaving tokenize_helper
_tokenize_helper: Asked to tokenize 262143 rows, datapos=18, datalen=38
_tokenize_helper: Trying to process 20 bytes, datalen=38, datapos= 18


make_stream_space: nbytes = 20.  grow_buffer(self->stream...)
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 18, self->stream_cap=128, status=0
make_stream_space: grow_buffer(self->self->words, 4, 48, 20, 0)
make_stream_space: grow_buffer(self->line_start, 3, 48, 20, 0)

tokenize_bytes - Iter: 18 Char: 0x0 Line 3 field_count 0, state 0
PUSH_CHAR: Pushing , slen= 18, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 19 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 19, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 20 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 20, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 21 Char: 0x2c Line 3 field_count 0, state 3
push_char: self->stream[22] = 0, stream_cap=128
end_field: Char diff: 18
end_field: Saw word  at: 18. Total: 5
tokenize_bytes - Iter: 22 Char: 0x52 Line 3 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 22, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 23 Char: 0x65 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 23, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 24 Char: 0x67 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 24, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 25 Char: 0xa Line 3 field_count 1, state 3
push_char: self->stream[26] = 0, stream_cap=128
end_field: Char diff: 22
end_field: Saw word Reg at: 22. Total: 6
end_line: Line end, nfields: 2
end_line: lines: 2
end_line: ex_fields: 2
end_line: new line start: 6
end_line: Finished line, at 3
tokenize_bytes - Iter: 26 Char: 0x49 Line 4 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 26, stream_cap=128, stream_len=26
tokenize_bytes - Iter: 27 Char: 0x2c Line 4 field_count 0, state 3
push_char: self->stream[28] = 0, stream_cap=128
end_field: Char diff: 26
end_field: Saw word I at: 26. Total: 7
tokenize_bytes - Iter: 28 Char: 0x53 Line 4 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 28, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 29 Char: 0x77 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 29, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 30 Char: 0x70 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 30, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 31 Char: 0xa Line 4 field_count 1, state 3
push_char: self->stream[32] = 0, stream_cap=128
end_field: Char diff: 28
end_field: Saw word Swp at: 28. Total: 8
end_line: Line end, nfields: 2
end_line: lines: 3
end_line: ex_fields: 2
end_line: new line start: 8
end_line: Finished line, at 4
tokenize_bytes - Iter: 32 Char: 0x49 Line 5 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 32, stream_cap=128, stream_len=32
tokenize_bytes - Iter: 33 Char: 0x2c Line 5 field_count 0, state 3
push_char: self->stream[34] = 0, stream_cap=128
end_field: Char diff: 32
end_field: Saw word I at: 32. Total: 9
tokenize_bytes - Iter: 34 Char: 0x53 Line 5 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 34, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 35 Char: 0x77 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 35, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 36 Char: 0x70 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 36, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 37 Char: 0xa Line 5 field_count 1, state 3
push_char: self->stream[38] = 0, stream_cap=128
end_field: Char diff: 34
end_field: Saw word Swp at: 34. Total: 10
end_line: Line end, nfields: 2
end_line: lines: 4
end_line: ex_fields: 2
end_line: new line start: 10
end_line: Finished line, at 5
_TOKEN_CLEANUP: datapos: 38, datalen: 38
Finished tokenizing input
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 0, status=1
datalen: 0
handling eof, datalen: 0, pstate: 0
leaving tokenize_helper
parser_consume_rows: Deleting 8 words, 32 chars
parser_trim_buffers: new_cap < self->words_cap
safe_realloc: buffer = 0x7fbbff63dec0, size = 24, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63f5f0, size = 24, result = 0x7fbbff63f5f0
parser_trim_buffers: new_cap = 9, stream_cap = 128, lines_cap = 48
parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc
safe_realloc: buffer = 0x7fbbff63a090, size = 9, result = 0x7fbbff63a090
parser_trim_buffers: new_cap < self->lines_cap
safe_realloc: buffer = 0x7fbbff63f770, size = 16, result = 0x7fbbff63f770
safe_realloc: buffer = 0x7fbbff63f8f0, size = 16, result = 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x7fbbff63a090
free_if_not_null 0x7fbbff63dec0
free_if_not_null 0x7fbbff63f5f0
free_if_not_null 0x7fbbff63f770
free_if_not_null 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0

@WillAyd
Member

WillAyd commented Mar 2, 2018

Trying to confirm my suspicion, I modified the linked block above to look as follows:

for i in range(field_count):
    word = self.parser.words[start + i]
    if start + i == self.parser.words_len:  # Handle last item
        word_len = self.parser.datalen - self.parser.word_starts[start + i] - 1
    else:
        word_len = self.parser.word_starts[start + i + 1]  - self.parser.word_starts[start + i] - 1

    if path == CSTRING:
        name = PyBytes_FromString(word)
    elif path == UTF8:
        name = PyUnicode_FromStringAndSize(word, word_len)
    elif path == ENCODED:
        name = PyUnicode_Decode(word, strlen(word),
                                self.c_encoding, errors)

I noticed that this code is actually only in a block that parses the header, but if I injected null bytes into the header, it would read the entire field:

In [7]: data = '\x00x,\x00y\n\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
In [7]: df = pd.read_csv(StringIO(data), engine='c')
In [7]: df.columns[0]
Out[7]: '\x00x'
In [6]: df.columns[1]
Out[6]: '\x00y'

I'll dig a little further into the parsing of the body of the data, but I'm pretty sure this approach could fix the issue. I'll submit a PR if it gets that far.

@changhiskhan
Contributor

The easier solution would be to just add a parsing option and have the tokenizer swallow the NUL bytes.
To preserve all of the NUL characters you'd have to do a bit of surgery on COLITER_NEXT and also modify some of the khash code in the parser. That's easily 10x the effort of the first option.
What's more important here, preserving the data after the NULs or preserving the NULs themselves?
