
Speed up Python apply_mask 20x by using int.from_bytes/to_bytes #1034


Merged
merged 3 commits into python-websockets:main from benhoyt:faster-apply_mask
Aug 19, 2021

Conversation

benhoyt
Contributor

@benhoyt benhoyt commented Aug 19, 2021

This speeds up the Python version of utils.apply_mask about 20 times, using int.from_bytes so that the XOR is done in a single Python operation -- in other words, the loop over the bytes is in C rather than in Python.

Note that it is a trade-off as it uses more memory: this version allocates roughly len(data) bytes for each of the intermediate values (e.g., data_int, mask_repeated, mask_int, the XOR result); whereas I believe the original version only allocates for the return value.

Still, most websocket packets aren't huge, and I believe the massive speed gain here makes it worth it. (And people that use the speedups.c version won't be affected.)

Obviously the speedups.c version is still significantly faster, but this change makes the library more usable in environments where it's not feasible to use the C extension.
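The technique can be sketched as follows (a minimal version built from the intermediate names mentioned above; the merged code may differ cosmetically):

```python
def apply_mask(data: bytes, mask: bytes) -> bytes:
    """XOR data with the 4-byte mask, repeated as needed."""
    if len(mask) != 4:
        raise ValueError("mask must contain 4 bytes")
    # Repeat the mask to cover len(data) bytes, then XOR as big integers,
    # so the per-byte loop runs in C rather than in Python.
    data_int = int.from_bytes(data, "big")
    mask_repeated = mask * (len(data) // 4) + mask[: len(data) % 4]
    mask_int = int.from_bytes(mask_repeated, "big")
    return (data_int ^ mask_int).to_bytes(len(data), "big")
```

Note that `to_bytes(len(data), "big")` handles the empty-payload case too, since a zero-length conversion yields `b""`.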

Data Size  ForLoop  IntXor  Speedups
------------------------------------
1KB         78.6us  3.79us     151ns
1MB         79.7ms  4.38ms    55.4us
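For reference, the ForLoop column above is the original byte-at-a-time pure-Python version; a minimal sketch of that style of implementation (the actual original may differ slightly):

```python
# XOR each byte with the corresponding mask byte, cycling through
# the 4-byte mask; the loop runs in the Python interpreter.
def apply_mask_forloop(data: bytes, mask: bytes) -> bytes:
    return bytes(b ^ mask[i % 4] for i, b in enumerate(data))
```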

I got these timings by using commands like the following (with the function call adjusted, and 1024 replaced with 1024*1024 as needed).

python3 -m timeit \
  -s 'from websockets.utils import apply_mask' \
  -s 'data=b"x"*1024; mask=b"abcd"' \
  'apply_mask(data, mask)'

This idea came from Will McGugan's blog post "Speeding up Websockets 60X" (https://www.willmcgugan.com/blog/tech/post/speeding-up-websockets-60x/). That post contains an even faster (about 50% faster) way to solve it using a pre-calculated XOR lookup table, but that pre-allocates a 64K-entry table at import time, which didn't seem ideal. Still, that is how aiohttp does it (https://github.com/aio-libs/aiohttp/blob/6ec33c5d841c8e845c27ebdd9384bbf72651cbb8/aiohttp/http_websocket.py#L115-L140), so maybe it's worth considering.

The int.from_bytes approach is also the approach used by the websocket-client library (https://github.com/websocket-client/websocket-client/blob/5f32b3c0cfb836c016ad2a5f6caeff2978a6a16f/websocket/_abnf.py#L46-L50).
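For comparison, the lookup-table idea from the blog post can be sketched roughly like this (names are mine; the aiohttp code linked above differs in detail):

```python
# 256 translation tables of 256 bytes each (~64 KB), built at import time:
# _XOR_TABLE[m][b] == b ^ m for every byte value b and mask byte m.
_XOR_TABLE = [bytes(b ^ m for b in range(256)) for m in range(256)]

def apply_mask_table(data: bytes, mask: bytes) -> bytes:
    result = bytearray(data)
    # Every 4th byte shares the same mask byte, so each stride can be
    # XORed in a single C-level bytes.translate call.
    for i in range(4):
        result[i::4] = result[i::4].translate(_XOR_TABLE[mask[i]])
    return bytes(result)
```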

@aaugustin
Member

aaugustin commented Aug 19, 2021

Bummer that the blog post didn't include websockets' C implementation in the benchmark. Last I checked, three years ago, it was 30 times faster than wsaccel (on x86 or amd64; on arm it must be the same). I'm mentioning this to be clear that I care about performance; I invented the approach with the highest performance in this space.

Unlike aiohttp, websockets puts a strong emphasis on simplicity. I often refrain from doing something complicated when aiohttp happily does it. This explains why we landed on different pure Python implementations, which are probably never used since the C extension is always available in practice. I stuck with the trivial implementation because it's obvious and readable.

My question here is: does this matter? From a practical perspective, now that websockets ships wheels for Linux (including ARM, so Raspberry Pis etc. are covered), Mac and Windows, does anyone actually want this on HP/UX, VMS, Solaris, etc.? (And do I feel like making websockets more complex than it should be for them?)

@aaugustin
Member

Reading the blog post again, I considered option 4 back then (after seeing it in another library) but decided to build a C extension instead.

@benhoyt
Contributor Author

benhoyt commented Aug 19, 2021

Fair push-back. And impressive work on speedups.c with the fancy SSE2 code!

However, I'm working on a use case where we can't use C extensions, and probably need to vendor the library as well. We could vendor it and patch it, but that gets messy. We've actually ended up going with websocket-client because we only need the client side, it's about 2/3 the size, and without C extensions does the masking 20x faster.

In terms of simplicity, a one-liner is definitely slightly nicer, but 4 straightforward lines is not exactly terribly complex. I didn't love the translate implementation, partly because it's non-obvious with those [::4] stride and translate operations; I find this "int xor" approach straightforward by comparison. So to me this seems like an obvious win for those (admittedly few) people like us who can't use C extensions, and no loss for those who can.

@aaugustin
Member

Upon further thought, this is very local and unlikely to cause trouble in the future, so the cost of adopting a more efficient implementation is really small. I'll review the more efficient implementations and use one of them. I find it somewhat weird to create a gigantic integer value, but hey, if that gets the job done, why not.

Server-specific code is a rather small part of websockets. The difference in size between websockets and websocket-client is because websocket-client doesn't support the permessage-deflate extension. I think you should consider the effect of compression in your choice :-)

@aaugustin
Member

OK, I think you made the right choice, I'll merge your version (with cosmetic changes).

@aaugustin aaugustin merged commit c7fc0d3 into python-websockets:main Aug 19, 2021
@aaugustin
Member

Thanks for the pull request!

@benhoyt benhoyt deleted the faster-apply_mask branch August 19, 2021 21:07
@benhoyt
Contributor Author

benhoyt commented Aug 19, 2021

Great, thank you! We haven't started implementing the Python side of this yet, so we may well change our minds on websocket-client yet. :-)
