Optimize String write #1651
Conversation
- Remove extra bounds checking.
- Add ASCII fast-loop similar to String.getBytes().
That's related to #1629 🙏
…ncoding tests.
- Adjust logic to handle non-zero ByteBuffer.arrayOffset, as some Netty pooled ByteBuffer implementations return an offset != 0.
- Add unit tests for UTF-8 encoding across buffer boundaries and for malformed surrogate pairs.
- Fix issue with a leaked reference count on ByteBufs in the pipe() method (2 non-released reference counts were retained).
JAVA-5816
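For context on the ByteBuffer.arrayOffset point above: a heap buffer handed out by a pooled allocator may be a slice of a larger backing array, so writes into buffer.array() must add arrayOffset(). A minimal sketch, not the driver's code (the pool sizes and class name are made up for illustration):

import java.nio.ByteBuffer;

public class ArrayOffsetSketch {
    public static void main(String[] args) {
        // Simulate a pooled buffer: a slice taken out of a larger backing array.
        ByteBuffer pool = ByteBuffer.allocate(64);
        pool.position(16);
        ByteBuffer slice = pool.slice();                 // arrayOffset() == 16, position() == 0

        byte[] backing = slice.array();                  // shared with the whole pool
        // Writing at backing[slice.position()] would clobber the first pooled region;
        // the absolute index into the backing array must include the slice's offset.
        int index = slice.arrayOffset() + slice.position();
        backing[index] = (byte) 'x';
        slice.position(slice.position() + 1);

        System.out.println("arrayOffset=" + slice.arrayOffset() + ", wrote at index " + index);
    }
}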
if (c < 0x80) {
    if (remaining == 0) {
        buf = getNextByteBuffer();
I still suggest giving a shot at the PR I made, which uses a separate index to access single bytes in the internal NIO buffer within the Netty buffers, for two reasons (see the sketch after this list):
- A NIO ByteBuffer can benefit from additional optimizations from the JVM, since it's a known type to it.
- Netty buffer reads/writes both move the internal indexes forward and force Netty to verify accessibility of the buffer for each operation, which has some Java Memory Model effects (e.g. any subsequent load has to happen for real, each time!).
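To make the contrast concrete, here is a rough sketch of the two access patterns. It uses Unpooled buffers and made-up method names, not the driver's ByteBuf wrapper or the PR's actual code, and assumes a buffer with nioBufferCount() == 1 and enough capacity:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NettyVsNioIndexSketch {

    // Relative Netty writes: every writeByte checks buffer accessibility and bumps writerIndex.
    static void writeAsciiRelative(ByteBuf buf, String s) {
        for (int i = 0; i < s.length(); i++) {
            buf.writeByte(s.charAt(i));
        }
    }

    // Absolute NIO writes: grab the internal NIO view once, keep a plain int index in a
    // local variable, and publish the writerIndex only at the end.
    static void writeAsciiAbsolute(ByteBuf buf, String s) {
        int start = buf.writerIndex();
        ByteBuffer nio = buf.internalNioBuffer(start, s.length()); // writable view over the same memory
        int base = nio.position();
        for (int i = 0; i < s.length(); i++) {
            nio.put(base + i, (byte) s.charAt(i));                 // absolute put: no per-call bookkeeping
        }
        buf.writerIndex(start + s.length());
    }

    public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(64);
        writeAsciiRelative(buf, "hello ");
        writeAsciiAbsolute(buf, "world");
        System.out.println(buf.toString(StandardCharsets.US_ASCII));
        buf.release();
    }
}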
Good point - accessing the NIO buffer directly sounds like a potential win. I’m aiming to keep this PR focused and incremental for easier review and integration. We could consider Netty-specific optimizations in a follow-up PR, once we have Netty benchmarks running in CI.
Looks good. A couple of comments and one suggestion to help improve the readability of writeCharacters.
JAVA-5816
LGTM!
}
// We get here when the buffer is not backed by an array, or when the string contains at least one non-ASCII character.
return writeOnBuffers(str,
Can't we have a fast path for ASCII within this too?
It would grant more chances to get inlined (since it is a smaller method) and to be unrolled more...
If we can have a JMH bench, it would be fairly easy (I can do it) to peek into the assembly produced to verify it.
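For what it's worth, a minimal JMH harness along these lines could drive that comparison. The class name and payload shapes are made up, and it exercises the public BsonOutput.writeString path through BasicOutputBuffer as a stand-in, since ByteBufferBsonOutput needs a BufferProvider to construct:

import java.util.concurrent.TimeUnit;
import org.bson.io.BasicOutputBuffer;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class WriteStringBench {

    @Param({"16", "128", "1024"})
    int length;

    @Param({"ascii", "utf8_3byte"})
    String charset;

    String value;
    BasicOutputBuffer out;

    @Setup(Level.Iteration)
    public void setUp() {
        // '€' encodes to 3 bytes in UTF-8; 'a' stays on the ASCII fast path.
        char c = "ascii".equals(charset) ? 'a' : '\u20AC';
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(c);
        }
        value = sb.toString();
        out = new BasicOutputBuffer(length * 3 + 8);
    }

    @Benchmark
    public int writeString() {
        out.truncateToPosition(0);   // reuse the same buffer each invocation
        out.writeString(value);
        return out.getPosition();
    }
}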
I’d expect the fast path for buffers to be in the else branch of if (curBuffer.hasArray()). However, once we detect UTF-8 characters there, we call a fallback writeOnBuffers (maybe we could rename it to writeUtf8OnBuffers).
Are you suggesting we add a fast path similar to writeOnArrayAscii, but using dynamic buffer allocation and falling back to writeOnBuffers/writeUtf8OnBuffers when a UTF-8 character is encountered?
Are you suggesting we add a fast path similar to writeOnArrayAscii, but using dynamic buffer allocation and falling back to writeOnBuffers/writeUtf8OnBuffers when a UTF-8 character is encountered?
Yep, since I see the ASCII path there already takes care of switching the buffer to write against, instead of performing a lookup for each byte written.
A tighter loop increases the chance of it being loop unrolled, although... the fact that we can change the destination buffer during the loop can affect this - both for the array case and this one.
Yeah, it makes sense. Thanks for the suggestion. I implemented writeOnBuffersAsccii and ran local benchmarks.
The implementation of writeOnBuffersAsccii:
private int writeOnBuffersAsccii(final String str,
final boolean checkNullTermination,
final int stringPointer,
final int bufferLimit,
final int bufferPos,
final ByteBuf buffer) {
int remaining;
int sp = stringPointer;
int curBufferPos = bufferPos;
int curBufferLimit = bufferLimit;
ByteBuf curBuffer = buffer;
final int length = str.length();
while (sp < length) {
remaining = curBufferLimit - curBufferPos;
char c = str.charAt(sp);
if (checkNullTermination && c == 0) {
throw new BsonSerializationException(
format("BSON cstring '%s' is not valid because it contains a null character " + "at index %d", str, sp));
}
if (c >= 0x80) {
break;
}
if (remaining == 0) {
curBuffer = getNextByteBuffer();
curBufferPos = 0;
curBufferLimit = curBuffer.limit();
}
curBuffer.put((byte) c);
position++;
sp++;
curBufferPos++;
}
return sp;
}
I printed the assembly to see the JIT’s behavior. It looks like the loop wasn’t unrolled by the JIT - there’s only one charAt and one put per iteration in the main loop. However, there seems to be an additional charAt in the buffer allocation path (after getNextByteBuffer), but that’s a separate code path and mostly an edge case.
I’ve shared a GitHub Gist with the shortened assembly (keeping the key parts) and a pseudo-Java interpretation to show how the assembly might map back to the logic: Gist. Local perf showed modest gains, likely limited by dynamic buffer allocation, as you noted. I’ll run more tests on a dedicated perf instance to confirm. If I missed anything in the assembly, please let me know!
I’m merging this PR for the current improvements, but I agree tighter loops or manual unrolling could be explored further, keeping in mind the maintainability trade-off.
I cannot see the assembly there, only the undecoded binary - did you miss putting the hsdis disassembler (https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler) in your class path?
That would help me a lot.
I mean something which uses the compile command too, as in TechEmpower/FrameworkBenchmarks#9800 (comment).
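For example, fork arguments along these lines (added to a JMH class such as the one sketched earlier) print assembly only for the method of interest, assuming the hsdis library is installed for the JVM; the method pattern here is an assumption matching the snippet above:

import org.openjdk.jmh.annotations.Fork;

// Limits disassembly to the hot method instead of dumping the whole run.
@Fork(value = 1, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:CompileCommand=print,*ByteBufferBsonOutput::writeOnBuffersAsccii"
})
class PrintOnlyTheHotMethod {
    // benchmark methods as in the earlier sketch
}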
You’re right - my earlier Gist showed raw hex. I recompiled on Oracle JDK 17.0.7 with hsdis for readable assembly. Thanks!
The main loop (0x0000000113a12c40–0x0000000113a12d3c) seems to have no unrolling:
- Condition: 0x0000000113a12c40 (cmp w4, w15, line 453: sp < length).
- charAt: one instance at 0x0000000113a12c60–0x0000000113a12c88 (line 455).
- Checks: null termination (0x0000000113a12c8c, line 456), ASCII (0x0000000113a12c94, line 460), buffer space (0x0000000113a12c98, line 463).
- put: one instance at 0x0000000113a12ca4–0x0000000113a12d20 (line 468).
- Index updates: 0x0000000113a12d24–0x0000000113a12d34 (lines 469–471).
- Branch back: 0x0000000113a12d3c: b.lt 0x0000000113a12c60, looping back to 0x0000000113a12c40.
A second charAt seems to appear in the getNextByteBuffer path, not the main loop, after compilation. I’ve created a new Gist with the readable assembly.
Mmm, it looks to me that the unrolling had a factor of two; I will re-read it, since I am more used to x86 asm :)
Anyway, I suggest looking at the PR I sent for this same optimization: having the check for remaining buffer space inside the loop would bloat the loop body, reducing the chances that C2 will unroll it many times.
You should keep the loop as simple and branch-free as possible.
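To illustrate the suggestion, here is a sketch of a hot loop with the remaining-space check hoisted out of the per-byte body. It uses plain NIO ByteBuffers and made-up names rather than the driver's types, and omits the null-termination check for brevity:

import java.nio.ByteBuffer;
import java.util.function.Supplier;

public final class AsciiChunkedWriteSketch {

    // Writes the leading ASCII run of s across a chain of buffers and returns how many
    // chars were written. Each inner pass is bounded up front by whichever runs out first
    // (the string or the current buffer), so the per-byte remaining-space check disappears.
    static int writeAsciiPrefix(final String s, final ByteBuffer first, final Supplier<ByteBuffer> nextBuffer) {
        ByteBuffer buf = first;
        int sp = 0;
        final int length = s.length();
        while (sp < length) {
            if (!buf.hasRemaining()) {
                buf = nextBuffer.get();   // buffer switching happens between chunks only
            }
            final int chunkEnd = sp + Math.min(length - sp, buf.remaining());
            for (int i = sp; i < chunkEnd; i++) {
                final char c = s.charAt(i);
                if (c >= 0x80) {
                    return i;             // caller continues with the UTF-8 path from here
                }
                buf.put((byte) c);
            }
            sp = chunkEnd;
        }
        return sp;
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocate(4);
        ByteBuffer b = ByteBuffer.allocate(16);
        int written = writeAsciiPrefix("hello, world", a, () -> b);
        System.out.println("wrote " + written + " ASCII chars");
    }
}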
Another reason why the Netty-version loop body is too fat is that the JIT doesn't trust final fields, and since you can get a new mongo ByteBuf at each iteration, the required amount of pointer chasing (mongo buf, Netty swapped buf, Netty buf, Unsafe...) is way too much; this massively prevents unrolling.
The reason why I have unwrapped the buffers down to Netty or NIO is to be at the same level as a byte[]: the loop body can just use whatever is saved in a register (the reference to the address field in the Netty buffer), assume it to be constant, and trust it.
# Conflicts:
#   driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonOutputTest.java
JAVA-5816
1 << 21, 1 << 22, 1 << 23, 1 << 24, 1 << 24),
        byteBuffers.stream().map(ByteBuf::capacity).collect(toList()));
} finally {
    byteBuffers.forEach(ByteBuf::release);
Netty was logging warnings due to a direct Netty ByteBuf not being released properly.
Yep, leaving paranoid leak detection enabled for tests is key to keeping leaks from slipping through.
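For reference, a hedged sketch of what that looks like in a test (the driver's actual test setup may configure this differently):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionSketch {
    public static void main(String[] args) {
        // PARANOID samples every allocation, so an unreleased buffer gets reported
        // (noticeable overhead - fine for tests, not for production).
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(16);
        try {
            buf.writeByte('x');
        } finally {
            buf.release(); // forgetting this is what triggers the Netty leak warnings
        }
    }
}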
Summary
This PR optimizes string writing in BSON by minimizing redundant checks and internal method indirection in hot paths.
Key Changes:
- Remove extra bounds checking around put(); the logic follows the same fast-path approach used in String.getBytes() for ASCII.
- Use str.charAt() instead of Character.toCodePoint() to avoid unnecessary surrogate checks when they are not needed (e.g., for characters within the ASCII or 2-byte UTF-8 code unit range). Fall back only when multi-unit characters (e.g., 3-byte UTF-8) are detected (a minimal sketch follows this list).
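A standalone sketch of that shape, writing into a byte[] rather than the driver's buffer chain and using made-up method names rather than the driver's:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public final class Utf8WriteSketch {

    // Encodes s into dst using only charAt() for the 1- and 2-byte ranges; anything that
    // needs 3+ bytes (including surrogate pairs) is handed to a slower fallback, so the
    // hot loop never touches Character.toCodePoint(). Returns the number of bytes written.
    static int encode(final String s, final byte[] dst) {
        int out = 0;
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            if (c < 0x80) {                          // ASCII: 1 byte
                dst[out++] = (byte) c;
            } else if (c < 0x800) {                  // 2-byte sequence, still no surrogate logic
                dst[out++] = (byte) (0xC0 | (c >> 6));
                dst[out++] = (byte) (0x80 | (c & 0x3F));
            } else {
                return encodeFallback(s, i, dst, out);
            }
        }
        return out;
    }

    // Fallback for 3/4-byte sequences; delegating to getBytes keeps the sketch short.
    static int encodeFallback(final String s, final int from, final byte[] dst, final int out) {
        final byte[] rest = s.substring(from).getBytes(StandardCharsets.UTF_8);
        System.arraycopy(rest, 0, dst, out, rest.length);
        return out + rest.length;
    }

    public static void main(String[] args) {
        final String s = "abc\u00E9\u20AC";          // ASCII, a 2-byte char, a 3-byte char
        final byte[] dst = new byte[s.length() * 4];
        final int n = encode(s, dst);
        System.out.println(Arrays.equals(Arrays.copyOf(dst, n), s.getBytes(StandardCharsets.UTF_8)));
    }
}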
Performance analysis
To ensure accurate performance comparison and reduce noise, 11 versions (Comparison Versions) were aggregated and compared against a stable region of data around the Base Mainline Version. The percentage difference and z-score of the mean of the Comparison Versions were calculated relative to the Base Mainline Version’s stable region mean.
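The comparison itself reduces to simple arithmetic; a hedged sketch with made-up numbers, not the perf tooling:

public final class PerfComparisonSketch {

    // Percentage difference of the comparison mean relative to the baseline's stable-region mean.
    static double percentDiff(double comparisonMean, double baselineMean) {
        return (comparisonMean - baselineMean) / baselineMean * 100.0;
    }

    // z-score of the comparison mean against the baseline's stable-region mean and standard deviation.
    static double zScore(double comparisonMean, double baselineMean, double baselineStdDev) {
        return (comparisonMean - baselineMean) / baselineStdDev;
    }

    public static void main(String[] args) {
        double baselineMean = 100.0, baselineStdDev = 2.5, comparisonMean = 104.0; // illustrative only
        System.out.printf("diff=%.2f%%, z=%.2f%n",
                percentDiff(comparisonMean, baselineMean),
                zScore(comparisonMean, baselineMean, baselineStdDev));
    }
}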
The following tables show improvements across two configurations:
ASCII Benchmark Suite (Regular JSON Workloads)
Perf analyzer results: Link
Augmented Benchmark Suite (UTF-8 Strings with 3-Byte Characters)
To evaluate performance on multi-byte UTF-8 input, the large_doc.json, small_doc.json, and tweet.json datasets were modified to use UTF-8 characters encoded with 3 bytes (code units). These changes were introduced on mainline via an Evergreen patch, and benchmark results were collected from:
Perf analyzer results: Link
JAVA-5816