SAS7BDAT parser: Faster string parsing #47404
Changes from 2 commits
```diff
@@ -426,8 +426,10 @@ cdef class Parser:
                 jb += 1
             elif column_types[j] == column_type_string:
                 # string
-                string_chunk[js, current_row] = np.array(source[start:(
-                    start + lngt)]).tobytes().rstrip(b"\x00 ")
+                # Skip trailing whitespace
+                while lngt > 0 and source[start+lngt-1] in b"\x00 ":
+                    lngt -= 1
+                string_chunk[js, current_row] = (&source[start])[:lngt]
                 js += 1

         self.current_row_on_page_index += 1
```

**Review thread on the new `# Skip trailing whitespace` loop:**

> Can you add some comments here (e.g. "this is like …, but that is slower, so we are doing xyz")? Also, do we have ASVs for this case? And can you add a whatsnew note?

> I've added a whatsnew entry for all 3 PRs here, is that ok?

**Review thread on `while lngt > 0 and source[start+lngt-1] in b"\x00 ":`**

> `source[start+lngt-1]` is a `uint8_t`, isn't it? Am I wrong to think this looks weird?

> Read this as: check whether the last byte is a null byte or whitespace.

> Yeah, I get what it means, I just suspect there's an implicit cast going on in here.

> What kind of cast? Actually, I think it doesn't matter, because both characters are on the positive side of the signed `char` range.

> Cython will compile this to a `switch`:
>
> ```c
> switch ((*((__pyx_t_6pandas_2io_3sas_4_sas_uint8_t const *) ( /* dim=0 */ (__pyx_v_source.data + __pyx_t_17 * __pyx_v_source.strides[0]) )))) {
>     case '\x00':
>     case ' ':
>         __pyx_t_8 = 1;
>         break;
>     default:
>         __pyx_t_8 = 0;
>         break;
> }
> ```
>
> with `typedef unsigned char __pyx_t_6pandas_2io_3sas_4_sas_uint8_t`.
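To make the equivalence discussed above concrete, here is a pure-Python sketch of the trimming logic. The real code is Cython operating on a `uint8_t` memoryview; this illustration uses a plain `bytes` object (the function name `trim_trailing` is hypothetical, not from the PR), and checks that the manual loop matches what `.rstrip(b"\x00 ")` produced before:

```python
def trim_trailing(source: bytes, start: int, lngt: int) -> bytes:
    # Mirror the PR's while-loop: drop trailing NUL bytes and spaces.
    # In Python 3, source[i] is an int, and `int in bytes` tests membership
    # in the byte values of b"\x00 " (i.e. 0 and 32).
    while lngt > 0 and source[start + lngt - 1] in b"\x00 ":
        lngt -= 1
    return source[start:start + lngt]

raw = b"ABC \x00 \x00"
# The manual trim and the old rstrip-based approach agree:
assert trim_trailing(raw, 0, len(raw)) == raw.rstrip(b"\x00 ") == b"ABC"
```

The manual loop avoids creating an intermediate `bytes` object just to strip it, which is the point of the change.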
**Review thread on the removed `np.array(...).tobytes().rstrip(...)` line:**

> Is the perf bottleneck in the `np.array` call?

> Looks like both the `.tobytes` and the `.rstrip` are non-optimized. Could Cython optimize it if the bytes object were declared as such?

> Let me check. IIRC `tobytes` will never be optimized and it's very slow. The call to `np.array` is completely useless, so I suggest removing it in any case.

> Here are some benchmarks on 2 files, times in ms for the minimum duration over 10 reads of each file:
>
> - `np.array()`: 6.6 / 12.7 (0.87x / 0.85x)
> - `cdef bytes`: 6.5 / 12.5
> - `rfind` plus manual slicing instead of `rstrip`: 6.4 / 11.5

> The reason for the slowness is that calls to `bytes.xyz()` are not optimized by Cython.

> Thanks for looking into this. Is it worth opening an issue in Cython about optimizing this?

> Will do!

> cython/cython#4884
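The benchmarks quoted above came from reading whole SAS files. A rough Python-level micro-benchmark of just the two trimming strategies can be sketched as follows (hypothetical toy data, not the PR's ASV suite; absolute numbers will differ from the compiled Cython code, and `old_way`/`new_way` are illustrative names):

```python
import timeit
import numpy as np

# Toy field: payload padded with spaces and NUL bytes, as in SAS7BDAT strings.
source = np.frombuffer(b"hello world" + b" " * 5 + b"\x00" * 4, dtype=np.uint8)
start, lngt = 0, len(source)

def old_way():
    # Pre-PR approach: extra np.array copy, then tobytes + rstrip.
    return np.array(source[start:start + lngt]).tobytes().rstrip(b"\x00 ")

def new_way():
    # PR approach, approximated in pure Python: trim the length first,
    # then slice once. (The Cython version slices the memoryview directly.)
    buf = source.tobytes()
    n = lngt
    while n > 0 and buf[start + n - 1] in b"\x00 ":
        n -= 1
    return buf[start:start + n]

assert old_way() == new_way() == b"hello world"
print("old:", timeit.timeit(old_way, number=100_000))
print("new:", timeit.timeit(new_way, number=100_000))
```

This only approximates the effect; the real win in the PR comes from Cython compiling the byte comparison to a C `switch` instead of calling unoptimized `bytes` methods.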