Optimize BSON decoding #1667
So if we did this for all platforms, it would be slower on the ones that don't allow unaligned access? Slower than just going byte by byte? Just wondering if it's worth it to have two code paths to maintain.
I also don't see a test for when this value is false, since we don't run on any platforms that would make it so. It's a bit concerning that we don't, even though by inspection it seems obvious, at least with the code as it is, that it's correct. If we did want to add a test, we would have to add a testing backdoor to PlatformUtil to override the default behavior of examining `os.arch`.
There might be some performance penalty, as `ByteBuffer` uses `Unsafe.getLongUnaligned`, which reads bytes individually and composes the value on architectures that do not support unaligned access, potentially adding some overhead.

Nearly all modern cloud providers offer architectures that support unaligned access. The ones that don't are typically limited to embedded systems or legacy hardware. Given how rare such platforms are, I'm actually in favor of removing the platform check altogether - I think the likelihood of hitting such an architecture is extremely low. @jyemin @NathanQingyangXu What do you think?
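(As an aside, the fallback cost mentioned above essentially amounts to assembling the value from single-byte loads. A conceptual illustration, not the JDK's actual `Unsafe.getLongUnaligned` implementation:)

```java
// Conceptual only: on strict-alignment architectures, an unaligned little-endian
// 8-byte read ends up being composed from individual byte loads, roughly like this.
static long readLongLittleEndianByteByByte(byte[] data, int offset) {
    long value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | (data[offset + i] & 0xFFL);
    }
    return value;
}
// On architectures that allow unaligned access, the same read can be a single wide
// load, which is why the penalty only applies to the rare strict-alignment platforms.
```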
I am ok with this. Continuing to expand the CPU list far into the future doesn't make much sense to me.
Yes, let's remove the platform check.
The above bitwise magic is cool, but it would also be great to leave some reference to SWAR (e.g. https://en.wikipedia.org/wiki/SWAR) so a future coder knows how to understand and maintain it (I assume that if they had not known of SWAR, the above code comments wouldn't be super helpful either).
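For reference, the core of the trick is the classic SWAR zero-byte test; a minimal sketch (illustrative, not necessarily the exact constants or shape used in this PR):

```java
// See https://en.wikipedia.org/wiki/SWAR. Eight bytes are tested at once:
// a zero byte (the BSON CString terminator) sets the high bit of its lane.
private static final long ONES  = 0x0101010101010101L;
private static final long HIGHS = 0x8080808080808080L;

// Non-zero iff at least one of the 8 bytes packed in `word` is 0x00.
static long zeroByteMask(long word) {
    return (word - ONES) & ~word & HIGHS;
}

// Index (0-7, from the lowest-order byte) of the first zero byte; only meaningful
// when zeroByteMask(word) != 0. With a little-endian read, byte 0 is the earliest
// byte in the stream, so this is the offset of the terminator within the word.
static int firstZeroByteIndex(long word) {
    return Long.numberOfTrailingZeros(zeroByteMask(word)) >>> 3;
}
```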
I wonder whether it would be a good idea to present some proof that the above bitwise magic helps perf significantly, so the tradeoff between perf and readability becomes an easy decision (or have you provided it in the description of the PR?).

For instance, the following two metrics could be juxtaposed: decoding perf without the SWAR trick, and decoding perf with the SWAR trick applied. Then it would be more convincing that the bitwise magic is really justified.
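A JMH skeleton for that kind of side-by-side comparison might look roughly like this (hypothetical benchmark; class and method names are placeholders, not existing driver benchmarks):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class CStringScanBenchmark {
    private ByteBuffer buffer;

    @Setup
    public void setup() {
        byte[] data = new byte[1024];
        Arrays.fill(data, (byte) 'a');  // non-zero payload
        data[data.length - 1] = 0;      // single CString terminator at the end
        buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    }

    @Benchmark
    public int byteByByte() {
        int i = 0;
        while (buffer.get(i) != 0) {
            i++;
        }
        return i;
    }

    @Benchmark
    public int swar() {
        int i = 0;
        long mask;
        while ((mask = zeroByteMask(buffer.getLong(i))) == 0) {
            i += Long.BYTES;
        }
        return i + (Long.numberOfTrailingZeros(mask) >>> 3);
    }

    // Same zero-byte test as the sketch above.
    private static long zeroByteMask(long word) {
        return (word - 0x0101010101010101L) & ~word & 0x8080808080808080L;
    }
}
```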
For the `findMany` case with Netty transport settings (emptying the cursor), I observed a 28.42% improvement. Results: link

With Java IO (default transport), the gains are more modest:
- `findMany`: ~3%
- Flat BSON decoding: ~8.65%

Results: link
I think we could delay this cryptic SWAR bitwise optimization to the next round of "extra-mile" perf optimization. Currently, even without it the perf improvement has been good enough. Needless to say, the bitwise code logic is hard to understand and difficult to maintain (even for the code author far in the future). Without a compelling perf gain, it seems too risky to adopt such a perf optimization trick. The above benchmark shows mediocre metrics on the default Java IO transport, so to me there is no reason to introduce this scary SWAR usage in the initial round of perf optimization. @jyemin, what do you think?
The Java IO gains are indeed modest, but I can't come up with a hypothesis for why Netty would be so much better in this case.
Since the gains from the other optimizations still seem worthwhile, perhaps we should just revert the SWAR piece, consider it in a follow-on ticket, and get another set of eyes on it (perhaps our Netty expert collaborator from other PRs).
Unresolving this, as there is still an open discussion.
From my perspective, the maintenance burden is quite minimal - the code spans just ~8 lines, with bitwise complexity that might not be immediately obvious but, I believe, would not require deep study. For a ~30% improvement in the Netty case, the trade-off seems worthwhile.
I ran a local benchmark with JIT compilation (C2 on ARM), which gave some visibility into the generated assembly for both Java’s ByteBuffer and Netty’s ByteBuf.
TL;DR: The original byte-by-byte search seems to be inefficient on Netty due to virtual calls, component navigation, and additional checks. Java's `ByteBuffer`, by contrast, was already quite well optimized by the JIT. The SWAR change improved inlining and reduced overhead in Netty, while bringing limited gains for JDK buffers, as there was not much room for improvement and, additionally, `getLongUnaligned` was not inlined.

Before SWAR (Netty):
In its original pre-SWAR byte-by-byte loop, Netty's `ByteBuf` had several sources of overhead:
- Virtual calls: each `getByte` call involved virtual dispatch through the `ByteBuf` hierarchy (e.g., `CompositeByteBuf`, `SwappedByteBuf`), introducing latency from vtable lookups and branching.
- Buffer navigation: for buffers like `CompositeByteBuf`, `getByte` required navigating the buffer structure via `findComponent`, involving array lookups.
- Heavy bounds checking: it seems that Netty's `checkIndex` included multiple conditionals and reference count checks (e.g., `ensureAccessible`), which were more costly than Java's simpler bounds checks.
- No loop unrolling: the JDK's `ByteBuffer` loop was unrolled to 4 `getByte()` calls per iteration, which reduced branching overhead, but Netty's wasn't.

(A rough sketch of both loop shapes is included below.)

Netty pre-SWAR (byte-by-byte) assembly
Before SWAR (JDK):
In contrast, Java's `ByteBuffer` pre-SWAR byte-by-byte loop was already quite efficient, using an inlined `getByte` on a direct `byte[]` with lightweight bounds checks.

After SWAR (Netty):
~~`getLong` eliminated virtual calls, which led to an inlined `getLong` compiled directly to an `ldr` instruction.~~ This addressed Netty's significant overhead.
After SWAR (JDK):
The SWAR version used a native `getLongUnaligned` call, which introduced overhead. The JVM's C2 compiler did not inline `getLongUnaligned`, so it generated a stub, which is evident from the produced JIT assembly:

Non-inlined `getLongUnaligned`: ~10 cycles of overhead
`0x000000011424bc34: bl 0x000000011424bf00 ; Call getLongUnaligned (stub)`

Java's efficient pre-SWAR loop leaves little room for improvement, and SWAR's non-inlined native call limits the gains.

Assembly for JDK's already inlined `getByte()` in pre-SWAR
That being said, I’m in favor of keeping this optimization given the substantial performance improvement for Netty users and the low risk associated with unaligned access. However, I’m also fine with moving it to a separate ticket to unblock the rest of the optimizations. I am reverting the SWAR changes for now.
Is getLong better because the method is not virtual, or because it's invoked 1/8 of the time? (It seems like if getByte is virtual then getLong would have to be as well.)
I don't see it reverted yet, but for me this is a convincing rationale to keep it.
Ah, actually `getLong` also stayed virtual for the `SwappedByteBuf` path, which is our case. The JIT produced code with an inlined `getLong` for `PooledUnsafeDirectByteBuf` but a virtual call for `SwappedByteBuf`.

Java interpretation of compiled code

Most of the gains come from making 1/8 as many calls, reducing virtual call overhead, checks, and component lookups. I also noticed that the JDK's `ByteBuffer` had 4 unrolled `getByte` operations per iteration, which was not the case for Netty.
I have reverted the changes in these commits:
I will create a separate follow-up PR after merging this one to have a focused discussion.