Optimize String write #1651
Can't we have a fast path for ASCII within this too?
It would get more chances to be inlined (since it is a smaller method) and to be unrolled more...
If we can have a JMH bench it would be fairly easy (I can do it) to peek into the assembly produced to verify it.
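(A minimal JMH sketch of the kind of benchmark being asked for here; the class, method, and payloads are illustrative and not taken from the PR, and the per-char write loop only stands in for the driver's actual String write path.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
public class StringWriteBench {

    @Param({"ascii", "utf8"})
    String kind;

    String value;
    ByteBuffer buffer;

    @Setup
    public void setup() {
        // One pure-ASCII payload and one that hits a multi-byte char early.
        value = "ascii".equals(kind)
                ? "abcdefghijklmnopqrstuvwxyz0123456789".repeat(8)
                : "àbcdefghijklmnopqrstuvwxyz0123456789".repeat(8);
        buffer = ByteBuffer.allocate(4096);
    }

    @Benchmark
    public ByteBuffer write() {
        buffer.clear();
        // Per-char ASCII loop with a UTF-8 fallback, standing in for the write path under test.
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x80) {
                buffer.put((byte) c);
            } else {
                buffer.put(value.substring(i).getBytes(StandardCharsets.UTF_8));
                break;
            }
        }
        return buffer;
    }
}
```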
I'd expect the fast path for buffers to be in the `else` branch of `if (curBuffer.hasArray())`. However, once we detect UTF-8 characters there, we call a fallback, `writeOnBuffers` (maybe we could rename it to `writeUtf8OnBuffers`).
Are you suggesting we add a fast path similar to `writeOnArrayAscii`, but using dynamic buffer allocation and falling back to `writeOnBuffers`/`writeUtf8OnBuffers` when a UTF-8 character is encountered?
Yep, since I see the ASCII path there already takes care of switching the buffer to write against, instead of performing a lookup per each byte written.
A tighter loop increases the chance it gets loop unrolled, although... the fact that we can change the buffer we write to during the loop can affect this - both for the array case and this one.
Yeah, it makes sense. Thanks for the suggestion. I implemented `writeOnBuffersAsccii` and ran local benchmarks. The implementation of `writeOnBuffersAsccii`:
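(The snippet this comment referred to is not preserved in this extract; below is a rough reconstruction of an ASCII-over-buffers fast path, with the buffer-advance helper passed in as a `Supplier` in place of the PR's `getNextByteBuffer`, so the details may differ from the actual implementation.)

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

final class AsciiWriteSketch {
    /**
     * Writes the ASCII prefix of {@code str} across a chain of ByteBuffers,
     * moving to the next buffer when the current one is full. Returns the index
     * of the first non-ASCII char (or str.length()), so the caller can resume
     * with the general UTF-8 path from there.
     */
    static int writeOnBuffersAscii(String str, ByteBuffer first, Supplier<ByteBuffer> nextBuffer) {
        ByteBuffer buffer = first;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c >= 0x80) {
                return i;                  // fall back to the UTF-8 path from index i
            }
            if (!buffer.hasRemaining()) {
                buffer = nextBuffer.get(); // stands in for getNextByteBuffer()
            }
            buffer.put((byte) c);
        }
        return str.length();
    }
}
```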
I printed the assembly to see the JIT's behavior. It looks like the loop wasn't unrolled by the JIT - there's only one `charAt` and `put` per iteration in the main loop. However, there seems to be an additional `charAt` in the buffer allocation path (after `getNextByteBuffer`), but that's a separate code path and mostly an edge case.
I've shared a GitHub Gist with the shortened assembly (keeping the key parts) and a pseudo-Java interpretation to show how the assembly might map back to the logic: Gist. Local perf showed modest gains, likely limited by the dynamic buffer allocation, as you noted. I'll run more tests on a dedicated perf instance to confirm. If I missed anything in the assembly, please let me know!
I'm merging this PR for the current improvements, but I agree tighter loops or manual unrolling could be explored further, keeping in mind the maintainability trade-off.
I cannot see the assembly there, but the undecoded binary instead - did you miss the hsdis `.so` (https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler) in your class path?
That would help me a lot.
I mean something which uses a compile command too, as in TechEmpower/FrameworkBenchmarks#9800 (comment).
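(For reference, one way to wire that up with JMH is to fork the run with `CompileCommand=print` scoped to the hot method; this needs hsdis installed, and the include pattern below refers to the illustrative benchmark class from the earlier sketch, not to anything in the PR.)

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PrintAsmRunner {
    public static void main(String[] args) throws Exception {
        // The forked JVM prints the JIT-compiled assembly of the benchmarked method only.
        Options opts = new OptionsBuilder()
                .include("StringWriteBench")
                .forks(1)
                .jvmArgsAppend(
                        "-XX:+UnlockDiagnosticVMOptions",
                        "-XX:CompileCommand=print,*StringWriteBench.write")
                .build();
        new Runner(opts).run();
    }
}
```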
You're right - my earlier Gist showed raw hex. I recompiled on Oracle JDK 17.0.7 with hsdis for readable assembly. Thanks!
The main loop (0x0000000113a12c40–0x0000000113a12d3c) seems to have no unrolling. A second `charAt` seems to appear in the `getNextByteBuffer` path, not in the main loop, after compilation. I've created a new Gist with the readable assembly.
Mmm, it looks to me like the unrolling had a factor of two; I will re-read it since I am more used to x86 asm :)
Anyway, I suggest looking at the PR I sent for this same optimization: having the check for remaining buffer space in the loop would bloat the loop body, reducing the chances that C2 will unroll it many times.
You should keep the loop as simple and branch-free as possible.
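(A sketch of the shape being suggested, reusing the hypothetical helper from the earlier reconstruction: the buffer-capacity check is hoisted out of the per-byte loop by processing one chunk at a time, so the inner loop body is only a `charAt`, an ASCII test, and a `put`.)

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

final class AsciiWriteChunkedSketch {
    static int writeOnBuffersAsciiChunked(String str, ByteBuffer first, Supplier<ByteBuffer> nextBuffer) {
        ByteBuffer buffer = first;
        int i = 0;
        while (i < str.length()) {
            if (!buffer.hasRemaining()) {
                buffer = nextBuffer.get();      // buffer switch stays outside the hot loop
            }
            int chunkEnd = i + Math.min(str.length() - i, buffer.remaining());
            for (; i < chunkEnd; i++) {         // tight inner loop: no capacity check per byte
                char c = str.charAt(i);
                if (c >= 0x80) {
                    return i;                   // hand off to the UTF-8 path from here
                }
                buffer.put((byte) c);
            }
        }
        return str.length();
    }
}
```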
Another reason why the Netty version's loop body is too fat is that the JIT doesn't trust final fields, and since you can get a new mongo ByteBuf at each iteration, the required amount of pointer chasing (mongo buf, Netty swapped buf, Netty buf, Unsafe...) is way too much; this massively prevents unrolling.
The reason why I have unwrapped the buffers down to Netty or NIO is to be at the same level as the byte[] case: the level at which the loop body can just use whatever is saved in a register (the reference to the address field in the Netty buffer), assume it to be constant, and trust it.
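(A minimal illustration of that point, using only NIO types and hypothetical names rather than the driver's or Netty's internals: the caller unwraps to the NIO buffer, or its backing byte[], once, so the hot loop only reads locals the compiler can keep in registers; the string is assumed to be already-validated ASCII that fits in the buffer, to keep the loop minimal.)

```java
import java.nio.ByteBuffer;

final class HoistedWriteSketch {
    // Assumes str is pure ASCII and fits in the remaining space of nioBuffer.
    static void writeAsciiHoisted(String str, ByteBuffer nioBuffer) {
        if (nioBuffer.hasArray()) {
            byte[] array = nioBuffer.array();                      // resolved once, outside the loop
            int base = nioBuffer.arrayOffset() + nioBuffer.position();
            for (int i = 0; i < str.length(); i++) {
                array[base + i] = (byte) str.charAt(i);            // loop touches only locals
            }
            nioBuffer.position(nioBuffer.position() + str.length());
        } else {
            for (int i = 0; i < str.length(); i++) {               // direct buffer: still a single local reference
                nioBuffer.put((byte) str.charAt(i));
            }
        }
    }
}
```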