-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
SAS7BDAT parser: Speed up RLE/RDC decompression #47405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jbrockmendel mind reviewing this as well? |
@jonashaag thanks for your patience; im just coming off of a semi-vacation, starting to dig into the ping backlog now. |
|
||
from pandas import read_sas | ||
|
||
ROOT = Path(__file__).parents[3] / "pandas" / "tests" / "io" / "sas" / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is a style choice orthogonal to the rest of the PR? no real problem with it, but in general best to minimize these to make it easier to focus on the important bits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ASV file would’ve been very confusing if I left the old code because my additions can’t use the old code and then we’d end up with two almost identical but different versions.
# cython: profile=False | ||
# cython: boundscheck=False, initializedcheck=False | ||
# cython: language_level=3, initializedcheck=False | ||
# cython: warn.undeclared=True, warn.maybe_uninitialized=True, warn.unused=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arent these the defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, see
- https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L182
- https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L184
- https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L226
- https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L228
- https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L229
I set language_level
because I was getting warnings from Cython, and removed profile
because it defaults to False.
pandas/io/sas/sas.pyx
Outdated
|
||
|
||
cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0: | ||
assert offset < buf.length, f"Out of bounds read" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do these assertions get expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negligible, Cython compiles it to just one check + jump (no Python involved). That said, it’s not free, but leaving them out sacrifices robustness and security.
pandas/io/sas/sas.pyx
Outdated
size_t length | ||
|
||
|
||
cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 0 instead of -1 (which we use elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because size_t
is unsigned. Could also use a signed type here but it doesn’t feel right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But actually I guess it makes sense to use some other value because null bytes are pretty common in SAS files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using a sentinel mixed into the return value you may be better of passing a separate error
argument and checking if that is set or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this suggestion entirely. This is using the recommended Cython error signalling machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea would be to have a signature that looks like this:
cdef inline uint8_t buf_get(Buffer buf, size_t offset, int *error):
Then within the function do something like:
if something_bad_happened:
*error = 1
At least in C. I'm not as familiar with Cython semantics to know how that works. The caller passes error as an argument by address (&error
) and then check after the call if it was set to 1 or not
This is just a generic approach; if it matters or not in your current design comes back to whether or not a sentinel can safely be reserved or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Cython has its own error signaling that doesn't use error output variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. I see the ? making the difference here to help disambiguate an error from a valid return value
Which usage would you need to replace? |
cc @WillAyd |
Essentially the call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really in love with 'custom buffer stuff' as this can cause a lot of mental overhead for code readers; but i get the perf is worth it and its not that crazy to understand. prob worth adding add comments around L20 about why and what is happening.
can you also add a whatsnew note |
@jbrockmendel mind to review this? thanks! :) |
where is the ndarray creation that is so expensive? i dont have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem. |
int64_t[:] column_types | ||
int64_t[:] lengths | ||
int64_t[:] offsets | ||
uint8_t[:, :] byte_chunk | ||
object[:, :] string_chunk | ||
|
||
source = np.frombuffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel here
if <Py_ssize_t>len(result) != <Py_ssize_t>result_length: | ||
raise ValueError(f"RLE: {len(result)} != {result_length}") | ||
|
||
return np.asarray(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel here
if <Py_ssize_t>len(outbuff) != <Py_ssize_t>result_length: | ||
raise ValueError(f"RDC: {len(outbuff)} != {result_length}\n") | ||
|
||
return np.asarray(outbuff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel here
fine by me |
@mroeschke FYI the What's New for 1.5 already include this PR and #47403, but we haven't merged so far. |
Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in #47403 (comment) and I think these would be more suitable for 1.6/2.0 |
@mroeschke can we please merge this together with #47403 and #47656 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a whatsnew note for 1.6.0.rst?
It’s in the other Pr |
Any particular order these PRs should be reviewed/merged? I haven't been in the loop with these PR much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatnew), I think it might be easier to review |
Feel free to merge in any order. I can fix any conflicts. Making separate what’s new will require a conflict resolution on each PR after each merge |
Code changes are independent, just the what’s new is in one PR to avoid conflicts |
Thanks @jonashaag |
* Speed up RLE/RDC decompression * Update tests * ssize_t -> size_t * Update sas.pyx * Don't use null byte as except value * Nit * Simplify condition * Review feedback * Docstring -> comment * Revert "Simplify condition" This reverts commit 263aea6. * Lint * Speed up some Cython `except` * Typo
Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.
Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.
Also adds a bunch of
assert
statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.Alternatives considered:
Fast NumPy array creation: Didn't find a way to do it.
Using Python's
bytearray
: Much slower.Using
array.array
: Much slower. Cython has a fast path but it is incompatible with PyPy.closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.