-
Notifications
You must be signed in to change notification settings - Fork 904
Implementation of auto batch #3265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please review the auto batch implementation.
|
I have checkouted this branch. As i see in my game :
|
@1scaR1 Thanks for testing, what do you mean by this one ?
|
This should have been fixed by the last commit |
Yes, there are no zooms now) Last commit also fixed:
|
Scale9Sprite issue was due to refactoration of variable name, I forgot to change all of them 😆 , now the other two issues should have been fixed @1scaR1 Please also tell me how performance benefit from the auto batch in your game |
@pandamicro, Yes, all issues are fixed. But in Safari on a scene with a lot of same sprites(for about 500 and more), on different layers, i have lags with scrolling in Parallax or ScrollView.(Without this PR i have the same lags;) ) |
function objEqual (a, b) { | ||
if (!a || !b) return false; | ||
|
||
var keys = Object.keys(a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use "for var in" to avoid gc
+10086 |
In this repo, Scale9Sprite already support auto batch, creator has a different Scale9Sprite implementation, we need to support it in creator repo |
There is cpu profilers in Chrome and Safari: These profiles shows the common performance consumers:
The scene consists of ScrollView with 8 Layouts in it. Each layout has a layer with about 100 sprites in average. Also there is a background in scene which has ~600 sprites. And with it we can see performance difference: as you can see, render of sprites (on screens above) on background costs about 8% of cpu time in Chrome and 14% in Safari(thats the main difference in them). Before this PR render of background sprites costed about 30% of cpu time in both browsers. But render of sprites in ScrollView is the same before and after pull request(about ~50%). I think that batching doesn't affect sprites in ScrollView. Also in this PR cpu time of rendering of sprites consumed by |
@1scaR1 Many thanks for your test ! I will continue to improve the implementation |
Should be the first ones are ignored by updating transform in buffer logic, should be fixed now |
@pandamicro Now all is perfect! |
Some more test data: On iOS it's not very stable, but it shows a little bit improvement using v4, and on Android the performance suffers, thought I must say Mi 2S is a very old device, new Android 5.0 devices should do a lot better than that. But I think between v2 and v4, v4 use too much memory so it's not very stable on mobile and I think we should desactivate v4 on mobile, activate it on desktop. |
if (ACTIVATE_V4) { | ||
CACHING_BUFFER = true; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be combine it into one variable CACHING_BUFFER
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
I think that the PR can be merged. I propose to remove render cmd of ccui.ScrollView. They don't benefit to render process. |
var len = children.length; | ||
var minZ = Number.MAX_VALUE; | ||
var maxZ = -Number.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should comment minZ and maxZ out
Found issues that causing v4 slow on mobile, will update tomorrow |
Will this be part of 3.11? I.e. backwards compatible with 3.10 and before? Also please give info on what's slow on mobile, would be interesting to know 👍 BTW, I made some more work on our own branch of cocos2d-html5 and improved performance a bit. First of all, I changed the way the z values are assigned: I removed the assignZ stuff that is called in director.drawScene, and instead assign the z values directly in rendererWebGL.rendering. The draw calls arrive in the correct order anyways, so you can compute the z values from there and if you want to draw stuff out-of-order you can still do that after. That also means that I assign the correct z values to the stackMatrix of each renderCmd there directly. Our game (Idle Raiders) now does about 1.5 milliseconds less work each frame as a result. Next, I just did the && thing I mentioned above to avoid additional transformations. I have tested it a lot and can't detect any errors, and it was a giant performance save in scenes with lots of nodes (we have about 2000 nodes in total in our hierarchy, as far as I remember), so I leave it for now. I should mention though, that if you want to do this as well, you need to change the z assigning also. Because if you don't, you will get flickering because z values are sometimes updated too late (this is actually what made me do the change above). Also, we are now using a single index buffer for everything that wants to render quads... if you think about it, it's just a list of indices that stays the same for all render cmds that want to draw quads (as long as those take care of uploading the stuff into buffers correctly), the only thing that changes for renderCmds is the number of quads they draw (and MAYBE index at which they start), both of which can be handed as arguments to gl.drawElements. This also sped up the entire application by about 5% and was a pretty small change. If you want to take a look at it you can look at all commits from April 15 and 16 in the gotta-batch-them-all branch here: https://github.com/Talisca/cocos2d-html5/commits/gotta-batch-them-all Overall for our game, it improved idle time from 24.43% to 53.30%. Some more details here: https://gist.github.com/Heishe/3814703722e498e14fdc0095c201fcff I also have some ideas for future optimizations about all rendering in general, but I don't know if I'm even gonna have time to work on it. I can summarize it: All renderCmds create too many WebGL buffers, and they change state too often. Stuff like gl.vertexAttribPointer is very expensive (I'm not exaggerating, it takes up 11% of our total rendering time in our game), so I'd try to make sure that vertexAttribPointer is only called when necessary. If you look at various renderCmds, they have very similar vertex layouts, so for many you don't need to call vertexAttribPointer again if you have called it before with the same arguments. Also having 3000+ array buffers hanging around (for example because each sprite creates its own) and binding them for each draw call is very slow also (bindBuffer alone takes up 7% of all rendering time in our game). It probably also puts some load on the graphics driver in the background. I would heavily suggest that instead of each renderCmd creating their own buffers for rendering, they should use some kind of global buffer allocator provided by the cc framework which only returns subsections of larger buffers (optimally just a single one, but you can split it up if you must. Just take note that a single 1 megabyte large buffer can fit a ton of sprites and their QUAD data). The only thing you have to change in the renderCmds is where they upload their data to webgl and then use correct offsets in gl.drawArrays and gl.drawElements. I actually think this might not be all that complicated. |
@heishe Thanks again, you provided some very interesting ideas.
Yes, this PR is totally backward compatible, and I will try my best to ensure its stability and merge it in v3.11 About mobile performance, on iOS, it's similar with desktop, it's just every thing take a little more time, so improvements applies equally on desktop and iOS. But on poor Android devices, things just don't get improved, I think we are using too much memory on Android so the rendering save up is just invisible.
Make sense
Indeed, only some very rare case can cause problems, I haven't got time to test, but feel free to remove it in your own branch, it's not a proper fix anyway, and if it doesn't cause any issue, means your game don't trigger it, so it's safe. Last few things you mentioned are all good orientations, I think the most important thing is to reduce WebGL buffer instances, sharing buffers is a great idea. Auto batching v5Now to my part, I spent the weekend to implemented a new version of auto batching, actually most of the time, I was struggling fix v4 issues. At last I came to conclusion that v4 can't be maintained, it's too complicated and it's just impossible to take care of all different use cases. The core problem of v4 is to check based on buffers, it was guessing how to split into virtual buffers, when to regenerate buffers and so on. In v5, I use more informations to simplify my life, the general idea is still the same as v4: to lazy update buffers only when necessary. But rules are different in v5:
In this way, when render commands don't change, the auto batching buffers don't change, even if the render commands change, it will cleverly reuse as much as possible the old buffers. Here is the result: Future improvements
|
@heishe You idea to use only one element buffer is really clever ! I merged it into my branch too. |
@pandamicro , May be move info, such as triangles indices and it's count, back to Sprite render cmd? |
@@ -603,8 +608,9 @@ return { | |||
gl.vertexAttribPointer(cc.VERTEX_ATTRIB_MVMAT0 + i, 4, gl.FLOAT, false, bytesPerRow * 4, matrixOffset + bytesPerRow * i); //stride is one row | |||
} | |||
|
|||
gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, _currentBuffer.buffer.elementBuffer); | |||
gl.drawElements(gl.TRIANGLES, _currentBuffer.totalIndexSize, gl.UNSIGNED_SHORT, _currentBuffer.indexOffset); | |||
var elemBuffer = getQuadIndexBuffer(count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool now you just using one index buffer everywhere, very clever!
But my question is, how to change the rendering order without customizing index buffer? using z buffer?
I don't understand. For polygon sprite, we need the renderer to support triangle render commands (mesh), the last improvement may not be very good here, because it consider only quad commands, triangle commands have different index buffer layout. Same for Scale9Sprite, which should have 12 vertices for 54 index buffer size (18 triangles). So maybe finally we need to revert the last commit. |
@pandamicro I mean that generation of index buffer for sprite is business of sprite cmd itself. |
If you want to have automatic batching of everything, you need to have a rendering scheme that I described at the end in my first big post (but it's a big refactor), where the rendering itself isn't implemented by render Cmds, but rather by a central renderer that just gets info on what data it needs for drawing from the various render Cmds. This is because renderCmds keep rendering information strictly local to them, while for batching, you need to have global information from multiple renderCmds. Actually, with this automatic batching stuff you are already partially doing this: You check 'what the node needs to draw' (by just checking for its type) in order to know if you can batch it, and then you somehow extract the data required to render it and put it into common buffers for batching. If you keep extending the current system you will naturally arrive at this point for automatic batching, where renderCmds will have their used resources (buffers,textures), fixed function state (blending, etc.), the shaders they use to draw, and vertex layouts (etc.) available as information somehow and a 'central renderer' (your automatic batching rendering code) will use it to draw everything automatically. The difference will be that renderCmds will still need custom render logic for non-batched drawing even though it isn't really necessary. Actually recently I have found this presentation that describes the concept perfectly. It calls these collections of draw information 'draw items'. It's all for a high end 3D renderer and DirectX 11+/ OpenGL 3+, but the concepts are of course all the same for 2D and WebGL: |
Yes, previously, I collect index buffer layout from render commands, so it's more flexible. |
I tested v5 for two days, and it's working well, and this PR became too big so I'm merging this PR. I will have some followed up PRs to improve the performance from different ways. |
No description provided.