
Speed up Python apply_mask 20x by using int.from_bytes/to_bytes #1034


Merged
merged 3 commits into python-websockets:main from benhoyt:faster-apply_mask
Aug 19, 2021

Conversation

benhoyt
Contributor

@benhoyt benhoyt commented Aug 19, 2021

This speeds up the Python version of utils.apply_mask about 20 times, using int.from_bytes so that the XOR is done in a single Python operation -- in other words, the loop over the bytes is in C rather than in Python.

Note that it is a trade-off as it uses more memory: this version allocates roughly len(data) bytes for each of the intermediate values (e.g., data_int, mask_repeated, mask_int, the XOR result); whereas I believe the original version only allocates for the return value.

Still, most websocket packets aren't huge, and I believe the massive speed gain here makes it worth it. (And people that use the speedups.c version won't be affected.)

Obviously the speedups.c version is still significantly faster, but this change makes the library more usable in environments where it's not feasible to use the C extension.
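The technique can be sketched as follows (a minimal version built from the intermediate names mentioned above; the merged code may differ cosmetically):

```python
def apply_mask(data: bytes, mask: bytes) -> bytes:
    """XOR data with the 4-byte mask, repeated as needed."""
    if len(mask) != 4:
        raise ValueError("mask must contain 4 bytes")
    # Repeat the mask to cover len(data) bytes, then XOR as big integers,
    # so the per-byte loop runs in C rather than in Python.
    data_int = int.from_bytes(data, "big")
    mask_repeated = mask * (len(data) // 4) + mask[: len(data) % 4]
    mask_int = int.from_bytes(mask_repeated, "big")
    return (data_int ^ mask_int).to_bytes(len(data), "big")
```

Note that `to_bytes(len(data), "big")` handles the empty-payload case too, since a zero-length conversion yields `b""`.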

Data Size  ForLoop  IntXor  Speedups
------------------------------------
1KB         78.6us  3.79us     151ns
1MB         79.7ms  4.38ms    55.4us
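For reference, the ForLoop column above is the original byte-at-a-time pure-Python version; a minimal sketch of that style of implementation (the actual original may differ slightly):

```python
# XOR each byte with the corresponding mask byte, cycling through
# the 4-byte mask; the loop runs in the Python interpreter.
def apply_mask_forloop(data: bytes, mask: bytes) -> bytes:
    return bytes(b ^ mask[i % 4] for i, b in enumerate(data))
```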

I got these timings by using commands like the following (with the function call adjusted, and 1024 replaced with 1024*1024 as needed).

python3 -m timeit \
  -s 'from websockets.utils import apply_mask' \
  -s 'data=b"x"*1024; mask=b"abcd"' \
  'apply_mask(data, mask)'

This idea came from Will McGugan's blog post "Speeding up Websockets 60X" (https://www.willmcgugan.com/blog/tech/post/speeding-up-websockets-60x/). That post contains an even faster (about 50% faster) way to solve it using a pre-calculated XOR lookup table, but that pre-allocates a 64K-entry table at import time, which didn't seem ideal. Still, that is how aiohttp does it (https://github.com/aio-libs/aiohttp/blob/6ec33c5d841c8e845c27ebdd9384bbf72651cbb8/aiohttp/http_websocket.py#L115-L140), so maybe it's worth considering.

The int.from_bytes approach is also the approach used by the websocket-client library (https://github.com/websocket-client/websocket-client/blob/5f32b3c0cfb836c016ad2a5f6caeff2978a6a16f/websocket/_abnf.py#L46-L50).
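For comparison, the lookup-table idea from the blog post can be sketched roughly like this (names are mine; the aiohttp code linked above differs in detail):

```python
# 256 translation tables of 256 bytes each (~64 KB), built at import time:
# _XOR_TABLE[m][b] == b ^ m for every byte value b and mask byte m.
_XOR_TABLE = [bytes(b ^ m for b in range(256)) for m in range(256)]

def apply_mask_table(data: bytes, mask: bytes) -> bytes:
    result = bytearray(data)
    # Every 4th byte shares the same mask byte, so each stride can be
    # XORed in a single C-level bytes.translate call.
    for i in range(4):
        result[i::4] = result[i::4].translate(_XOR_TABLE[mask[i]])
    return bytes(result)
```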

@aaugustin
Member

aaugustin commented Aug 19, 2021

Bummer that the blog post didn't include websockets' C implementation in the benchmark. Last I checked, three years ago, it was 30 times faster than wsaccel (on x86 or amd64; on arm it must be the same). I'm mentioning this to be clear that I care about performance; I invented the approach with the highest performance in this space.

Unlike aiohttp, websockets puts a strong emphasis on simplicity. I often refrain from doing something complicated when aiohttp happily does it. This explains why we landed on different pure Python implementations, which are probably never used since the C extension is always available in practice. I stuck with the trivial implementation because it's obvious and readable.

My question here is: does this matter? From a practical perspective, now that websockets ships wheels for Linux (including ARM, so Raspberry Pis etc. are covered), Mac and Windows, does anyone actually want this on HP/UX, VMS, Solaris, etc.? (And do I feel like making websockets more complex than it should be for them?)

@aaugustin
Member

Reading the blog post again, I considered option 4 back then (after seeing it in another library) but decided to build a C extension instead.

@benhoyt
Contributor Author

benhoyt commented Aug 19, 2021

Fair push-back. And impressive work on speedups.c with the fancy SSE2 code!

However, I'm working on a use case where we can't use C extensions, and probably need to vendor the library as well. We could vendor it and patch it, but that gets messy. We've actually ended up going with websocket-client because we only need the client side, it's about 2/3 the size, and without C extensions does the masking 20x faster.

In terms of simplicity, a one-liner is definitely slightly nicer, but 4 straightforward lines is not exactly terribly complex. I didn't love the translate implementation, partly because it's non-obvious with those [::4] stride and translate operations; I find this "int xor" approach straightforward by comparison. So to me this seems like an obvious win for those (admittedly few) people like us who can't use C extensions, and no loss for those who can.

@aaugustin
Member

Upon further thought, this is very local and unlikely to cause trouble in the future, so the cost of adopting a more efficient implementation is really small. I'll review the more efficient implementations and use one of them. I find it somewhat weird to create a gigantic integer value, but hey, if that gets the job done, why not.

Server-specific code is a rather small part of websockets. The difference in size between websockets and websocket-client is because websocket-client doesn't support the permessage-deflate extension. I think you should consider the effect of compression in your choice :-)

@aaugustin
Member

OK, I think you made the right choice, I'll merge your version (with cosmetic changes).

@aaugustin aaugustin merged commit c7fc0d3 into python-websockets:main Aug 19, 2021
@aaugustin
Member

Thanks for the pull request!

@benhoyt benhoyt deleted the faster-apply_mask branch August 19, 2021 21:07
@benhoyt
Contributor Author

benhoyt commented Aug 19, 2021

Great, thank you! We haven't started implementing the Python side of this yet, so we may well change our minds on websocket-client yet. :-)
