Optimize BSON decoding #1667
Conversation
- Removed redundant String copying when possible, replacing with view-based access - Added faster null-termination lookups to improve decoding performance
JAVA-5814
int pos = buffer.position();
int limit = buffer.limit();

if (UNALIGNED_ACCESS_SUPPORTED) {
So if we did this for all platforms, it would be slower on the ones that don't allow unaligned access? Slower than just going byte by byte? Just wondering if it's worth it to have two code paths to maintain.
I also don't see a test for when this value is false, since we don't run on any platforms that would make it so. It's a bit concerning that we don't, even though by inspection it seems obvious, at least with the code as it is, that it's correct. If we did want to add a test, we would have to add a testing backdoor to PlatformUtil to override the default behavior of examining "os.arch".
There might be some performance penalty, as ByteBuffer uses Unsafe.getLongUnaligned, which reads bytes individually and composes a long on architectures that do not support unaligned access, potentially adding some overhead.
Nearly all modern cloud providers offer architectures that support unaligned access; the ones that don't are typically limited to embedded systems or legacy hardware. Given how rare such platforms are, I'm actually in favor of removing the platform check altogether - I think the likelihood of hitting such an architecture is extremely low. @jyemin @NathanQingyangXu What do you think?
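For context, the gate under discussion boils down to an "os.arch" check. Below is a minimal sketch of that idea; the class name, constant, and architecture list are illustrative assumptions, not the driver's actual PlatformUtil code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the kind of "os.arch"-based gate discussed above.
final class UnalignedAccessCheck {
    // Architectures generally known to tolerate unaligned loads (illustrative list only).
    private static final List<String> UNALIGNED_FRIENDLY_ARCHS =
            Arrays.asList("x86_64", "amd64", "i386", "x86", "aarch64");

    static final boolean UNALIGNED_ACCESS_SUPPORTED = isUnalignedAccessLikelySupported();

    private static boolean isUnalignedAccessLikelySupported() {
        String arch = System.getProperty("os.arch", "").toLowerCase(Locale.ROOT);
        return UNALIGNED_FRIENDLY_ARCHS.contains(arch);
    }

    private UnalignedAccessCheck() {
    }
}
```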
I am OK with this. Continuing to expand the CPU list into the future doesn't make much sense to me.
Yes, let's remove the platform check.
driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputeTest.java (outdated)
JAVA-5814
Add tests. JAVA-5842
// Found the null at pos + offset; reset buffer's position.
return (pos - prevPos) + offset + 1;
}
pos += 8;
The above bitwise magic is cool, but it would also be great to leave a reference to SWAR (e.g. https://en.wikipedia.org/wiki/SWAR) so a future coder knows how to understand and maintain it (I assume that, without knowing about SWAR, the code comments above won't be super helpful either).
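For future readers, here is a minimal self-contained sketch of the classic SWAR zero-byte test this kind of scan is built on. The constants are the standard ones from the SWAR literature; the method names and the exact loop structure in this PR may differ:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of SWAR-based zero-byte (C-string terminator) scanning.
final class SwarSketch {
    private static final long ONES = 0x0101010101010101L;       // 0x01 in every byte lane
    private static final long HIGH_BITS = 0x8080808080808080L;  // 0x80 in every byte lane

    // Non-zero exactly when at least one byte lane of 'word' is zero (classic "haszero" trick).
    static long zeroByteMask(final long word) {
        return (word - ONES) & ~word & HIGH_BITS;
    }

    // Returns the absolute index of the first zero byte at or after the buffer's position, or -1.
    static int indexOfFirstZeroByte(final ByteBuffer buffer) {
        buffer.order(ByteOrder.LITTLE_ENDIAN);  // lane k of getLong(pos) is the byte at pos + k
        int pos = buffer.position();
        final int limit = buffer.limit();
        while (limit - pos >= 8) {
            long mask = zeroByteMask(buffer.getLong(pos));
            if (mask != 0) {
                // The lowest set bit sits in the lane of the first zero byte.
                return pos + (Long.numberOfTrailingZeros(mask) >>> 3);
            }
            pos += 8;  // advance a whole word at a time
        }
        for (; pos < limit; pos++) {  // byte-by-byte tail for the final < 8 bytes
            if (buffer.get(pos) == 0) {
                return pos;
            }
        }
        return -1;
    }

    private SwarSketch() {
    }
}
```

The payoff is one getLong plus a few ALU operations per 8 bytes scanned instead of eight bounds-checked byte reads.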
I am not sure whether it would be a good idea to present some proof that the above bitwise magic helps performance significantly, so that the tradeoff between performance and readability becomes an easy decision (or have you already provided it in the PR description?).
For instance, the following two metrics could be juxtaposed:
- baseline, i.e. current main branch metrics (without any optimization)
- metrics with only the above SWAR trick applied
Then it would be more convincing that the bitwise magic is really justified.
For the findMany case with Netty transport settings (emptying the cursor), I observed a 28.42% improvement. Results: link
With Java IO (default transport), the gains are more modest:
- findMany: ~3%
- Flat BSON decoding: ~8.65%
Results: link
I think we could defer this cryptic SWAR bitwise optimization to the next round of "extra-mile" perf optimization. Even without it, the perf improvement is already good enough. Needless to say, the bitwise code logic is hard to understand and difficult to maintain (even for the code author, far in the future). Without a compelling perf gain, it seems too risky to adopt such an optimization trick. The above benchmark shows mediocre metrics on the default Java IO transport, so to me there is no reason to introduce this scary SWAR usage in the initial round of perf optimization.
@jyemin, what do you think?
The Java IO gains are indeed modest, but I can't come up with a hypothesis for why Netty would be so much better in this case.
Since the gains from the other optimizations still seem worthwhile, perhaps we should just revert the SWAR piece, consider it in a follow-on ticket, and get another set of eyes on it (perhaps our Netty expert collaborator from other PRs).
Unresolving this, as there is still an ongoing discussion.
difficult to maintain
From my perspective, the maintenance burden is quite minimal: the code spans just ~8 lines, with bitwise complexity that might not be immediately obvious but, I believe, would not require deep study. For a ~30% improvement in the Netty case, the trade-off seems worthwhile.
a hypothesis for why Netty would be so much better in this case.
I ran a local benchmark with JIT compilation (C2 on ARM), which gave some visibility into the generated assembly for both Java's ByteBuffer and Netty's ByteBuf.
TL;DR: The original byte-by-byte search seems to be inefficient on Netty due to virtual calls, component navigation, and additional checks. Java's ByteBuffer, by contrast, was already quite optimized by the JIT. The SWAR change improved inlining and reduced overhead in Netty, while bringing limited gains for JDK buffers, as there was not much room for improvement and, additionally, getLongUnaligned was not inlined.
Before SWAR (Netty):
In its original pre-SWAR byte-by-byte loop (sketched below), Netty's ByteBuf had several sources of overhead:
- Virtual calls: each getByte call involved virtual dispatch through the ByteBuf hierarchy (e.g., CompositeByteBuf, SwappedByteBuf), introducing latency from vtable lookups and branching.
- Buffer navigation: for buffers like CompositeByteBuf, getByte required navigating the buffer structure via findComponent, involving array lookups.
- Heavy bounds checking: it seems that Netty's checkIndex included multiple conditionals and reference-count checks (e.g., ensureAccessible), which were more costly than Java's simpler bounds checks.
- No loop unrolling: the JDK's ByteBuffer loop was unrolled to 4 getByte() calls per iteration, which reduced branching overhead, but Netty's was not.
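To make the discussion concrete, the loop shape in question is roughly the following simplified sketch against java.nio.ByteBuffer (the driver scans through its own ByteBuf abstraction, so the names and error handling here are illustrative):

```java
import java.nio.ByteBuffer;

// Simplified shape of the pre-SWAR scan: one bounds-checked read per byte
// until the BSON cstring's null terminator is found.
final class ByteByByteScanSketch {
    static int cStringLength(final ByteBuffer buffer) {
        final int prevPos = buffer.position();
        final int limit = buffer.limit();
        for (int pos = prevPos; pos < limit; pos++) {
            if (buffer.get(pos) == 0) {
                return (pos - prevPos) + 1;  // length including the terminator
            }
        }
        throw new IllegalStateException("BSON cstring is missing its null terminator");
    }

    private ByteByByteScanSketch() {
    }
}
```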
Netty pre-SWAR (byte-by-byte) assembly
0x0000000111798150: orr w29, w12, w15 ; Combine indices for bounds check
0x0000000111798154: tbnz w29, #31, 0x0000000111798638 ; Check for negative index, branch to trap
0x0000000111798158: ldr x12, [x1, #80] ; Load memory address
0x000000011179815c: ldrsb w0, [x12, w0, sxtw #0] ; Load byte (getByte, after navigation)
0x0000000111798160: cbz w0, 0x000000011179833c ; Check if byte is null (0)
0x0000000111798164: cmp w11, w17 ; Compare pos with limit
0x0000000111798168: b.ge 0x00000001117985cc ; Branch if pos >= limit
0x000000011179816c: mov w10, w11 ; Update pos
; findComponent navigation (part of getByte)
0x00000001117981c4: cmp w14, w15 ; Compare index with component bounds
0x00000001117981c8: b.eq 0x000000011179829c ; Branch to component lookup
0x000000011179829c: ldr w0, [x14, #20] ; Load component index
0x00000001117982a0: ldr w12, [x14, #32] ; Load component array
0x00000001117982a4: ldr w2, [x14, #24] ; Load component bounds
0x00000001117982a8: ldr w15, [x14, #16] ; Load component offset
; Virtual call for getByte
0x00000001117983a8: bl 0x0000000111228600 ; Virtual call to getByte
; - com.mongodb.internal.connection.netty.NettyByteBuf::get@5
; UncommonTrap for bounds check (checkIndex)
0x0000000111798638: bl 0x000000011122ec80 ; Call UncommonTrapBlob (checkIndex failure)
; - io.netty.util.internal.MathUtil::isOutOfBounds@15
; - io.netty.buffer.AbstractByteBuf::checkIndex0@14
Before SWAR (JDK):
In contrast, Java's ByteBuffer pre-SWAR byte-by-byte loop was already quite efficient, using an inlined getByte on a direct byte[] with lightweight bounds checks.
After SWAR (Netty):
- ~~Switching to getLong eliminated virtual calls which lead to inlined getLong directly to a ldr instruction.~~
- Bypassed buffer navigation for direct buffers (like PooledUnsafeDirectByteBuf), and reduced bounds checks to one per 8 bytes.
This addressed Netty's significant overhead.
After SWAR (JDK):
The SWAR version used a native getLongUnaligned call, which introduced overhead. The JVM's C2 compiler did not inline getLongUnaligned, so it generated a stub, which is evident from the produced JIT assembly:
Non-inlined getLongUnaligned: ~10 cycles of overhead
0x000000011424bc34: bl 0x000000011424bf00 ; Call getLongUnaligned (stub)
0x000000011424bf00: isb
0x000000011424bf04: mov x12, #0x7b10 ; Metadata for getLongUnaligned
0x000000011424bf08: movk x12, #0xb, lsl #16
0x000000011424bf0c: movk x12, #0x8, lsl #32
0x000000011424bf10: mov x8, #0x17c4 ; Jump address
0x000000011424bf14: movk x8, #0x13d0, lsl #16
0x000000011424bf18: movk x8, #0x1, lsl #32
0x000000011424bf1c: br x8 ; Branch to trampoline
0x000000011424bf20: ldr x8, 0x000000011424bf28 ; Load trampoline address
0x000000011424bf24: br x8 ; Branch to trampoline
0x000000011424bf28: b 0x0000000114b7bbe8 ; Trampoline address
Java's efficient pre-SWAR loop leaves little room for improvement, and SWAR's non-inlined native call limits the gains.
Assembly for JDK's already-inlined `getByte()` in the pre-SWAR loop
; Byte-by-byte loop (org.bson.io.ByteBufferBsonInput::computeCStringLength)
0x00000001302dc760: ldrsb w10, [x10, #16] ; Load byte from byte[] (getByte)
0x00000001302dc764: cbz w10, 0x00000001302dc900 ; Check for null
That being said, I’m in favor of keeping this optimization given the substantial performance improvement for Netty users and the low risk associated with unaligned access. However, I’m also fine with moving it to a separate ticket to unblock the rest of the optimizations. I am reverting the SWAR changes for now.
Switching to getLong eliminated virtual calls which lead to inlined getLong directly to a ldr instruction
Is getLong better because the method is not virtual, or because it's invoked 1/8 of the time? (It seems like if getByte is virtual then getLong would have to be as well.)
I am reverting the SWAR changes for now.
I don't see it reverted yet, but for me this is a convincing rationale to keep it.
Is getLong better because the method is not virtual, or because it's invoked 1/8 of the time? (It seems like if getByte is virtual then getLong would have to be as well.)
Ah, actually getLong also stayed virtual for the SwappedByteBuf path, which is our case. The JIT produced code with getLong inlined for PooledUnsafeDirectByteBuf but kept it virtual for SwappedByteBuf.
Java interpretation of compiled code
if (underlying instanceof UnsafeDirectSwappedByteBuf) {
    UnsafeDirectSwappedByteBuf swappedBuf = (UnsafeDirectSwappedByteBuf) underlying;
    ByteBuf innerBuf = swappedBuf.unwrap();
    if (innerBuf instanceof PooledUnsafeDirectByteBuf) {
        PooledUnsafeDirectByteBuf directBuf = (PooledUnsafeDirectByteBuf) innerBuf;
        readerIndex = directBuf.readerIndex();
        writerIndex = directBuf.writerIndex();
        // processPooledUnsafeDirect has INLINED getLong
        return processPooledUnsafeDirect(directBuf, startIndex, readerIndex, writerIndex);
    } else {
        // Fallback for other types
        readerIndex = innerBuf.readerIndex();
        writerIndex = innerBuf.writerIndex();
        return processGeneric(innerBuf, startIndex, readerIndex, writerIndex);
    }
}
// Our case
else if (underlying instanceof SwappedByteBuf) {
    SwappedByteBuf swappedBuf = (SwappedByteBuf) underlying;
    ByteBuf innerBuf = swappedBuf.unwrap();
    if (innerBuf instanceof CompositeByteBuf) {
        CompositeByteBuf compositeBuf = (CompositeByteBuf) innerBuf;
        readerIndex = compositeBuf.readerIndex();
        writerIndex = compositeBuf.writerIndex();
        // Virtual call inside (not inlined)
        return processComposite(compositeBuf, startIndex, readerIndex, writerIndex);
    }
    // ...
}
Most of the gains come from making 1/8 as many calls, reducing virtual-call overhead, checks, and component lookups. I also noticed that the JDK's ByteBuffer had 4 unrolled getByte operations per iteration, which was not the case for Netty.
driver-core/src/main/com/mongodb/internal/connection/netty/NettyByteBuf.java
driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java (outdated)
driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java (outdated)
driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java
I finished my complete review. Looks good to me. I am especially impressed by the completeness of the testing in ByteBufferBsonInputTest.
Left some minor comments, but none of them blocks the LGTM.
As long as the decision on unaligned memory access is finalized, I can approve this PR.
From my side, this PR LGTM, unless we want to switch to unconditional SWAR usage (dropping PlatformUtil, maybe after some perf testing on a rare CPU that does not support unaligned memory access).
# Conflicts:
#	bson/src/main/org/bson/ByteBuf.java
#	bson/src/main/org/bson/ByteBufNIO.java
#	driver-core/src/main/com/mongodb/internal/connection/CompositeByteBuf.java
#	driver-core/src/main/com/mongodb/internal/connection/netty/NettyByteBuf.java
Summary
This PR improves BSON decoding performance by addressing common sources of inefficiency in string deserialization and buffer handling:
Key changes
Avoiding redundant byte buffer copies: We relax the requirement for read-only buffers and instead pass a backing array view into the String constructor for UTF-8 decoding. While this makes the buffer mutable, it avoids costly intermediate array allocations and memory copies.
Scratch buffer reuse: For direct buffers (which don't expose a backing array), we now reuse a preallocated scratch array rather than allocating a fresh byte[] on every decode. This reduces allocation time and GC pressure.
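A hedged sketch of what these two changes amount to, written against java.nio.ByteBuffer; the class, field, and method names are illustrative, and the driver's actual code (which works on its own ByteBuf abstraction) differs in detail:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative decode path: backing-array view for heap buffers, reusable scratch for direct ones.
final class StringDecodeSketch {
    private byte[] scratch = new byte[64];  // reused across decodes to cut allocations

    String decodeUtf8(final ByteBuffer buffer, final int length) {
        if (buffer.hasArray()) {
            // Heap buffer: hand the backing array view straight to the String constructor,
            // avoiding an explicit intermediate byte[] copy.
            final int offset = buffer.arrayOffset() + buffer.position();
            final String result = new String(buffer.array(), offset, length, StandardCharsets.UTF_8);
            buffer.position(buffer.position() + length);
            return result;
        }
        // Direct buffer (no backing array): copy into a preallocated scratch array
        // instead of allocating a fresh byte[] on every decode.
        if (scratch.length < length) {
            scratch = new byte[length];
        }
        buffer.get(scratch, 0, length);  // relative bulk get; also advances the position
        return new String(scratch, 0, length, StandardCharsets.UTF_8);
    }
}
```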
~~Bulk scanning for null terminators: When searching for C-style null terminators in UTF-8 decoding, we now scan in 8-byte chunks on platforms that support unaligned memory access.~~ (Reverted; details of the discussion: Optimize BSON decoding #1667 (comment))
NOTE: THE BELOW BENCHMARK RESULTS ARE OUTDATED, as the SWAR optimization is being split out into a separate PR.
perf task:
Perf analyzer results: Link
perf-netty task:
Perf analyzer results: Link
Primary review: @NathanQingyangXu
JAVA-5842