data after null character dropped in read_csv
#19886
Comments
You're certainly welcome to look into this - as discussed in #2741, there might still be some parts of the parser that use null-terminated C strings.
A little out of my area, but isn't this just how strings work in C? For instance, if I compile the program below and execute it, only "foo" is printed, because printf stops at the embedded null byte:

#include <stdio.h>

int main(void) {
    char myarr[8] = "foo\0test";
    printf("%s\n", myarr);
    return 0;
}
Yes & no. You are correct that a null byte terminates a C string. But I believe the parser is working with a sized buffer (i.e. it knows how many bytes it has read), so it shouldn't need to rely on null termination.
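To illustrate that distinction in Python (an illustration only, not the parser's actual code): a sized buffer keeps every byte it read, while strlen-style C-string handling stops at the first null byte.

buf = b"foo\x00test"              # 8 bytes handed over together with their length
print(len(buf))                   # 8  -- a sized buffer keeps the embedded NUL
print(buf.split(b"\x00", 1)[0])   # b'foo'  -- what strlen-style code would see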
Hmm OK. Well, from what I can tell with VERBOSITY set, I think the tokenizer interprets this correctly. Here's a small excerpt from the last example provided, where I believe tokenize_bytes handles the null byte as intended:

tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3
My guess is the issue is in the parser, at pandas/pandas/_libs/parsers.pyx line 768 (commit a7a7f8c).
Again, I haven't dug this deep into the C side of things before, so I could be way off, but I figured I'd share in case it helps anyone else looking at it. FWIW, here's the full verbose output of parsing the third example provided in the original post:

_tokenize_helper: Asked to tokenize 2 rows, datapos=0, datalen=0
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 38, status=0
datalen: 38
_tokenize_helper: Trying to process 38 bytes, datalen=38, datapos= 0
make_stream_space: nbytes = 38. grow_buffer(self->stream...)
safe_realloc: buffer = 0x7fbbff62c2e0, size = 64, result = 0x7fbbff63a090
safe_realloc: buffer = 0x7fbbff63a090, size = 128, result = 0x7fbbff63a090
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 0, self->stream_cap=128, status=0
safe_realloc: buffer = 0x7fbbff6279f0, size = 48, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63dec0, size = 384, result = 0x7fbbff63dec0
make_stream_space: grow_buffer(self->self->words, 0, 48, 38, 0)
make_stream_space: cap != self->words_cap, nbytes = 38, self->words_cap=48
safe_realloc: buffer = 0x7fbbff60a100, size = 384, result = 0x7fbbff63f5f0
safe_realloc: buffer = 0x7fbbff632b20, size = 48, result = 0x7fbbff62c2e0
safe_realloc: buffer = 0x7fbbff62c2e0, size = 96, result = 0x7fbbff63a110
safe_realloc: buffer = 0x7fbbff63a110, size = 192, result = 0x7fbbff63e040
safe_realloc: buffer = 0x7fbbff63e040, size = 384, result = 0x7fbbff63f770
make_stream_space: grow_buffer(self->line_start, 1, 48, 38, 0)
make_stream_space: cap != self->lines_cap, nbytes = 38
safe_realloc: buffer = 0x7fbbff632b40, size = 384, result = 0x7fbbff63f8f0
x,y
test
tokenize_bytes - Iter: 0 Char: 0x78 Line 1 field_count 0, state 0
PUSH_CHAR: Pushing x, slen= 0, stream_cap=128, stream_len=0
tokenize_bytes - Iter: 1 Char: 0x2c Line 1 field_count 0, state 3
push_char: self->stream[2] = 0, stream_cap=128
end_field: Char diff: 0
end_field: Saw word x at: 0. Total: 1
tokenize_bytes - Iter: 2 Char: 0x79 Line 1 field_count 1, state 1
PUSH_CHAR: Pushing y, slen= 2, stream_cap=128, stream_len=2
tokenize_bytes - Iter: 3 Char: 0xa Line 1 field_count 1, state 3
push_char: self->stream[4] = 0, stream_cap=128
end_field: Char diff: 2
end_field: Saw word y at: 2. Total: 2
end_line: Line end, nfields: 2
end_line: lines: 0
end_line: ex_fields: -1
end_line: new line start: 2
end_line: Finished line, at 1
tokenize_bytes - Iter: 4 Char: 0x74 Line 2 field_count 0, state 0
PUSH_CHAR: Pushing t, slen= 4, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 5 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 5, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 6 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 6, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 7 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 7, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 8 Char: 0x0 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 8, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 9 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 9, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 10 Char: 0x65 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing e, slen= 10, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 11 Char: 0x73 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing s, slen= 11, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 12 Char: 0x74 Line 2 field_count 0, state 3
PUSH_CHAR: Pushing t, slen= 12, stream_cap=128, stream_len=4
tokenize_bytes - Iter: 13 Char: 0x2c Line 2 field_count 0, state 3
push_char: self->stream[14] = 0, stream_cap=128
end_field: Char diff: 4
end_field: Saw word test at: 4. Total: 3
tokenize_bytes - Iter: 14 Char: 0x52 Line 2 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 14, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 15 Char: 0x65 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 15, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 16 Char: 0x67 Line 2 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 16, stream_cap=128, stream_len=14
tokenize_bytes - Iter: 17 Char: 0xa Line 2 field_count 1, state 3
push_char: self->stream[18] = 0, stream_cap=128
end_field: Char diff: 14
end_field: Saw word Reg at: 14. Total: 4
end_line: Line end, nfields: 2
end_line: lines: 1
end_line: ex_fields: 2
end_line: new line start: 4
end_line: Finished line, at 2
_TOKEN_CLEANUP: datapos: 18, datalen: 38
leaving tokenize_helper
_tokenize_helper: Asked to tokenize 262143 rows, datapos=18, datalen=38
_tokenize_helper: Trying to process 20 bytes, datalen=38, datapos= 18
make_stream_space: nbytes = 20. grow_buffer(self->stream...)
make_stream_space: self->stream=0x7fbbff63a090, self->stream_len = 18, self->stream_cap=128, status=0
make_stream_space: grow_buffer(self->self->words, 4, 48, 20, 0)
make_stream_space: grow_buffer(self->line_start, 3, 48, 20, 0)
tokenize_bytes - Iter: 18 Char: 0x0 Line 3 field_count 0, state 0
PUSH_CHAR: Pushing , slen= 18, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 19 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 19, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 20 Char: 0x0 Line 3 field_count 0, state 3
PUSH_CHAR: Pushing , slen= 20, stream_cap=128, stream_len=18
tokenize_bytes - Iter: 21 Char: 0x2c Line 3 field_count 0, state 3
push_char: self->stream[22] = 0, stream_cap=128
end_field: Char diff: 18
end_field: Saw word at: 18. Total: 5
tokenize_bytes - Iter: 22 Char: 0x52 Line 3 field_count 1, state 1
PUSH_CHAR: Pushing R, slen= 22, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 23 Char: 0x65 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing e, slen= 23, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 24 Char: 0x67 Line 3 field_count 1, state 3
PUSH_CHAR: Pushing g, slen= 24, stream_cap=128, stream_len=22
tokenize_bytes - Iter: 25 Char: 0xa Line 3 field_count 1, state 3
push_char: self->stream[26] = 0, stream_cap=128
end_field: Char diff: 22
end_field: Saw word Reg at: 22. Total: 6
end_line: Line end, nfields: 2
end_line: lines: 2
end_line: ex_fields: 2
end_line: new line start: 6
end_line: Finished line, at 3
tokenize_bytes - Iter: 26 Char: 0x49 Line 4 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 26, stream_cap=128, stream_len=26
tokenize_bytes - Iter: 27 Char: 0x2c Line 4 field_count 0, state 3
push_char: self->stream[28] = 0, stream_cap=128
end_field: Char diff: 26
end_field: Saw word I at: 26. Total: 7
tokenize_bytes - Iter: 28 Char: 0x53 Line 4 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 28, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 29 Char: 0x77 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 29, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 30 Char: 0x70 Line 4 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 30, stream_cap=128, stream_len=28
tokenize_bytes - Iter: 31 Char: 0xa Line 4 field_count 1, state 3
push_char: self->stream[32] = 0, stream_cap=128
end_field: Char diff: 28
end_field: Saw word Swp at: 28. Total: 8
end_line: Line end, nfields: 2
end_line: lines: 3
end_line: ex_fields: 2
end_line: new line start: 8
end_line: Finished line, at 4
tokenize_bytes - Iter: 32 Char: 0x49 Line 5 field_count 0, state 0
PUSH_CHAR: Pushing I, slen= 32, stream_cap=128, stream_len=32
tokenize_bytes - Iter: 33 Char: 0x2c Line 5 field_count 0, state 3
push_char: self->stream[34] = 0, stream_cap=128
end_field: Char diff: 32
end_field: Saw word I at: 32. Total: 9
tokenize_bytes - Iter: 34 Char: 0x53 Line 5 field_count 1, state 1
PUSH_CHAR: Pushing S, slen= 34, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 35 Char: 0x77 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing w, slen= 35, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 36 Char: 0x70 Line 5 field_count 1, state 3
PUSH_CHAR: Pushing p, slen= 36, stream_cap=128, stream_len=34
tokenize_bytes - Iter: 37 Char: 0xa Line 5 field_count 1, state 3
push_char: self->stream[38] = 0, stream_cap=128
end_field: Char diff: 34
end_field: Saw word Swp at: 34. Total: 10
end_line: Line end, nfields: 2
end_line: lines: 4
end_line: ex_fields: 2
end_line: new line start: 10
end_line: Finished line, at 5
_TOKEN_CLEANUP: datapos: 38, datalen: 38
Finished tokenizing input
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 0, status=1
datalen: 0
handling eof, datalen: 0, pstate: 0
leaving tokenize_helper
parser_consume_rows: Deleting 8 words, 32 chars
parser_trim_buffers: new_cap < self->words_cap
safe_realloc: buffer = 0x7fbbff63dec0, size = 24, result = 0x7fbbff63dec0
safe_realloc: buffer = 0x7fbbff63f5f0, size = 24, result = 0x7fbbff63f5f0
parser_trim_buffers: new_cap = 9, stream_cap = 128, lines_cap = 48
parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc
safe_realloc: buffer = 0x7fbbff63a090, size = 9, result = 0x7fbbff63a090
parser_trim_buffers: new_cap < self->lines_cap
safe_realloc: buffer = 0x7fbbff63f770, size = 16, result = 0x7fbbff63f770
safe_realloc: buffer = 0x7fbbff63f8f0, size = 16, result = 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x7fbbff63a090
free_if_not_null 0x7fbbff63dec0
free_if_not_null 0x7fbbff63f5f0
free_if_not_null 0x7fbbff63f770
free_if_not_null 0x7fbbff63f8f0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
free_if_not_null 0x0
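For reference, the 38 bytes tokenized above can be read back from the logged characters; here is a minimal reproduction sketch, assuming that reconstruction faithfully matches the third example from the original post:

import pandas as pd
from io import StringIO

# Reconstructed from the tokenizer log above: five lines, 38 bytes in total,
# with NUL bytes embedded in the first field of rows 2 and 3.
data = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
df = pd.read_csv(StringIO(data))
print(df)  # on affected versions the first value shows 'test' rather than 'test\x00test'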
Trying to confirm my suspicion, I modified the linked block above to look as follows:

for i in range(field_count):
    word = self.parser.words[start + i]

    if start + i == self.parser.words_len:  # Handle last item
        word_len = self.parser.datalen - self.parser.word_starts[start + i] - 1
    else:
        word_len = self.parser.word_starts[start + i + 1] - self.parser.word_starts[start + i] - 1

    if path == CSTRING:
        name = PyBytes_FromString(word)
    elif path == UTF8:
        name = PyUnicode_FromStringAndSize(word, word_len)
    elif path == ENCODED:
        name = PyUnicode_Decode(word, strlen(word),
                                self.c_encoding, errors)

I noticed the code above is actually only in the block that parses the header, but if I injected null bytes into the header it would read the entire field:

In [7]: data = '\x00x,\x00y\n\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
In [7]: df = pd.read_csv(StringIO(data), engine='c')
In [7]: df.columns[0]
Out[7]: '\x00x'
In [6]: df.columns[1]
Out[6]: '\x00y'

I'll dig a little further into the parsing of the body of the data, but I'm pretty sure this could fix the issue. Will submit a PR if I get that far.
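A small Python sketch of the idea behind that change (illustrative only; the stream contents and offsets below are made up, not pandas internals): compute each word's length from the recorded start offsets instead of relying on NUL-terminated C-string semantics.

# Hypothetical character stream in which words are NUL-terminated, plus the
# recorded start offset of each word.
stream = b"test\x00test\x00Reg\x00"
word_starts = [0, 10]

sized_len = word_starts[1] - word_starts[0] - 1                       # 9: keeps the embedded NUL
strlen_len = stream.index(b"\x00", word_starts[0]) - word_starts[0]   # 4: stops at the first NUL

print(stream[word_starts[0]:word_starts[0] + sized_len])    # b'test\x00test'
print(stream[word_starts[0]:word_starts[0] + strlen_len])   # b'test'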
The easier solution would be to just add a parsing option and have the tokenizer swallow the NUL bytes.
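In the meantime, a user-side workaround along those lines (a sketch only; this is not an existing read_csv option) is to strip the NUL bytes before handing the text to the parser:

import pandas as pd
from io import StringIO

raw = 'x,y\ntest\x00test,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
cleaned = raw.replace('\x00', '')     # swallow the NUL bytes up front
df = pd.read_csv(StringIO(cleaned))
print(df)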
Code Sample
Problem description
Within a single field, data before a NUL character is kept, but everything after it, up to the delimiter, is dropped. This result seems counterintuitive to me. I'd either expect the field to keep the NUL characters (or having them dropped would probably be fine too), and NaN only if the field consisted exclusively of NUL characters (see the expected output section below). Might be related to #2741 or the related fix.
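A minimal sketch of the behaviour described above (a hypothetical single-field example, not the original code sample):

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('x,y\nfoo\x00bar,1\n'))
print(df['x'][0])   # observed on affected versions: 'foo' -- everything after the NUL, up to the delimiter, is dropped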
Expected Output
Output of pd.show_versions()
In [9]: import pandas as pd
...: pd.show_versions()
...:
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: 3.1.2
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.5.0a1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None