Speed up Python apply_mask 20x by using int.from_bytes/to_bytes #1034
Conversation
This speeds up the Python version of `utils.apply_mask` about 20 times, using `int.from_bytes` so that the XOR is done in a single Python operation -- in other words, the loop over the bytes is in C rather than in Python.

Note that it is a trade-off, as it uses more memory: this version allocates roughly `len(data)` bytes for each of the intermediate values (e.g., `data_int`, `mask_repeated`, `mask_int`, the XOR result), whereas I believe the original version only allocates for the return value. Still, most websocket packets aren't huge, and I believe the massive speed gain here makes it worth it. (And people who use the `speedups.c` version won't be affected.)

Obviously the `speedups.c` version is still significantly faster again, but this change makes the library more usable in environments where it's not feasible to use the C extension.

```
Data Size    ForLoop    IntXor    Speedups
------------------------------------------
1KB          78.6us     3.79us    151ns
1MB          79.7ms     4.38ms    55.4us
```

I got these timings by using commands like the following (with the function call adjusted, and 1024 replaced with 1024*1024 as needed):

```
python3 -m timeit \
    -s 'from websockets.utils import apply_mask' \
    -s 'data=b"x"*1024; mask=b"abcd"' \
    'apply_mask(data, mask)'
```

This idea came from Will McGugan's blog post "Speeding up Websockets 60X": https://www.willmcgugan.com/blog/tech/post/speeding-up-websockets-60x/

That post contains an even faster (about 50% faster) way to solve it using a pre-calculated XOR lookup table, but that pre-allocates a 64K-entry table at import time, which didn't seem ideal. Still, that is how aiohttp does it, so maybe it's worth considering: https://github.com/aio-libs/aiohttp/blob/6ec33c5d841c8e845c27ebdd9384bbf72651cbb8/aiohttp/http_websocket.py#L115-L140

The `int.from_bytes` approach is also the approach used by the websocket-client library: https://github.com/websocket-client/websocket-client/blob/5f32b3c0cfb836c016ad2a5f6caeff2978a6a16f/websocket/_abnf.py#L46-L50
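For reference, a minimal sketch of the `int.from_bytes`/`to_bytes` technique described above (a sketch, not necessarily the exact code in the PR; the signature follows `utils.apply_mask`):

```python
def apply_mask(data: bytes, mask: bytes) -> bytes:
    """Mask data with a 4-byte WebSocket mask (sketch of the big-int approach)."""
    if len(mask) != 4:
        raise ValueError("mask must contain 4 bytes")
    # Repeat the 4-byte mask to cover the whole payload.
    mask_repeated = mask * (len(data) // 4) + mask[: len(data) % 4]
    # A single big-int XOR: the per-byte loop happens in C inside CPython.
    data_int = int.from_bytes(data, "big")
    mask_int = int.from_bytes(mask_repeated, "big")
    return (data_int ^ mask_int).to_bytes(len(data), "big")
```

And a sketch of the lookup-table alternative mentioned above, modeled on aiohttp's approach (the `_XOR_TABLE` name and the copying, non-in-place signature are illustrative choices here, not code from this PR):

```python
# ~64K built once at import time: row b maps every byte a to a ^ b.
_XOR_TABLE = [bytes(a ^ b for a in range(256)) for b in range(256)]

def apply_mask_table(data: bytes, mask: bytes) -> bytes:
    result = bytearray(data)
    for i in range(4):
        # Every 4th byte shares the same mask byte, so those bytes can be
        # translated in bulk through the matching 256-byte table row.
        result[i::4] = result[i::4].translate(_XOR_TABLE[mask[i]])
    return bytes(result)
```

Both produce identical output; the table version trades a 64K import-time allocation for the roughly 50% further speedup the blog post reports.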
Bummer that the blog post didn't include websockets' C implementation in the benchmark. Last I checked, three years ago, it was 30 times faster than wsaccel (on x86 or amd64; on ARM it must be the same). I'm mentioning this to be clear that I care about performance; I invented the approach with the highest performance in this space.

Unlike aiohttp, websockets puts a strong emphasis on simplicity. I often refrain from doing something complicated when aiohttp happily does it. This explains why we landed on different implementations for the pure Python version — which is probably never used, as the C extension is always available in practice. I stuck with the trivial implementation because it's obvious and readable.

My question here is — does this matter? From a practical perspective, now that websockets ships wheels for Linux (including ARM, so Raspberry Pis etc. are covered), Mac, and Windows, does anyone actually want this on HP/UX, VMS, Solaris, etc.? (And do I feel like making websockets more complex than it should be for them?)
Reading the blog post again, I considered option 4 back then (after seeing it in another library) but decided to build a C extension instead.
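(For context, the "trivial implementation" mentioned above is a byte-by-byte XOR; a minimal sketch of that style, assuming the common `itertools.cycle` idiom rather than the exact code in `utils.py`:)

```python
import itertools

def apply_mask(data: bytes, mask: bytes) -> bytes:
    # XOR each payload byte with the mask byte at the same offset mod 4.
    # The generator iterates once per byte in Python -- the "ForLoop"
    # column in the benchmark above.
    return bytes(b ^ m for b, m in zip(data, itertools.cycle(mask)))
```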
Fair push-back. And impressive work on `speedups.c`! However, I'm working on a use case where we can't use C extensions, and we probably need to vendor the library as well. We could vendor it and patch it, but that gets messy. We've actually ended up going with websocket-client because we only need the client side, it's about 2/3 the size, and without C extensions it does the masking 20x faster.

In terms of simplicity, a one-liner is definitely slightly nicer, but 4 straightforward lines is not exactly terribly complex. I didn't love the lookup-table approach either, given the 64K table allocated at import time.
Upon further thought, this is very local and unlikely to cause trouble in the future, so the cost of adopting a more efficient implementation is really small. I'll review the more efficient implementations and use one of them. I find it somewhat weird to create a gigantic integer value, but hey, if that gets the job done, why not.

Server-specific code is a rather small part of websockets. The difference in size between websockets and websocket-client is because websocket-client doesn't support the permessage-deflate extension. I think you should consider the effect of compression in your choice :-)
OK, I think you made the right choice, I'll merge your version (with cosmetic changes).
Thanks for the pull request!
Great, thank you! We haven't started implementing the Python side of this yet, so we may well change our minds on websocket-client yet. :-)