Implementation of auto batch #3265

pandamicro · 2016-04-11T03:39:46Z

No description provided.

pandamicro · 2016-04-11T14:16:44Z

@dabingnn @heishe @jareguo

Please review the auto batch implementation.

It's based on @heishe 's PR Gotta batch them all #3248 with some improvements: like moving batch rendering logic into RendererWebGL, render command only upload data.
It now only supports Sprite and all Node type using Sprite

1scaR1 · 2016-04-11T14:33:12Z

I have checkouted this branch. As i see in my game :

Scale9Sprite doesn't work completely : only some parts of image are drawn
Something wrong with positions of spites in Buttons on click
After scene transition image on screen a little bit zooms in.
Suddenly image on screen starts scale infinitely
Sprite frames from the same sprite sheet are not all drawn on one Layer.

pandamicro · 2016-04-11T15:11:12Z

@1scaR1 Thanks for testing, what do you mean by this one ?

Sprite frames from the same sprite sheet are not all drawn on one Layer.

1scaR1 · 2016-04-11T15:22:18Z

I looks like this in my game:

Original engine:

This PR:

Walls and terrains are from one sprite sheet:

Also, as you can see, the background is not full.(It fills also from sprite sheet).

I also found that on layer(and this layer is in clipping node and clipping node is in ScrollView), that does't have anything except sprites from sprite frames, all seems to be ok.

pandamicro · 2016-04-11T15:24:41Z

After scene transition image on screen a little bit zooms in.

Suddenly image on screen starts scale infinitely

This should have been fixed by the last commit

1scaR1 · 2016-04-11T15:28:16Z

Yes, there are no zooms now) Last commit also fixed:

Something wrong with positions of spites in Buttons on click

pandamicro · 2016-04-11T15:31:55Z

Scale9Sprite issue was due to refactoration of variable name, I forgot to change all of them 😆 , now the other two issues should have been fixed @1scaR1

Please also tell me how performance benefit from the auto batch in your game

1scaR1 · 2016-04-11T15:49:03Z

@pandamicro, Yes, all issues are fixed.
It's strange, but i don't see improvements(visible for eyes) with common usage of my game.
In profiler of Google Chrome i see that page have more idle time(for about ~10-20%).

But in Safari on a scene with a lot of same sprites(for about 500 and more), on different layers, i have lags with scrolling in Parallax or ScrollView.(Without this PR i have the same lags;) )

jareguo · 2016-04-11T16:17:07Z

cocos2d/core/renderer/RendererWebGL.js

+function objEqual (a, b) {
+    if (!a || !b) return false;
+
+    var keys = Object.keys(a);


can use "for var in" to avoid gc

jareguo · 2016-04-11T17:19:27Z

+10086
But could you port to Scale9Sprite? (then Creator can also use it)

pandamicro · 2016-04-12T03:07:03Z

But could you port to Scale9Sprite? (then Creator can also use it)

In this repo, Scale9Sprite already support auto batch, creator has a different Scale9Sprite implementation, we need to support it in creator repo

… autobatch

1scaR1 · 2016-04-12T08:43:09Z

There is cpu profilers in Chrome and Safari:

Safari:
Chrome:

These profiles shows the common performance consumers:

_forwardBatch function(and it's strange that function objEqual in it costs a lot of time).
Rendering of sprites in ScrollView

The scene consists of ScrollView with 8 Layouts in it. Each layout has a layer with about 100 sprites in average. Also there is a background in scene which has ~600 sprites. And with it we can see performance difference: as you can see, render of sprites (on screens above) on background costs about 8% of cpu time in Chrome and 14% in Safari(thats the main difference in them).

Before this PR render of background sprites costed about 30% of cpu time in both browsers. But render of sprites in ScrollView is the same before and after pull request(about ~50%).

I think that batching doesn't affect sprites in ScrollView. Also in this PR cpu time of rendering of sprites consumed by _forwardBatch function. So total render time is almost the same.
Thats why, i assume, this PR doesn't seriously affect on performance in my game .

pandamicro · 2016-04-12T08:58:30Z

@1scaR1 Many thanks for your test ! I will continue to improve the implementation

pandamicro · 2016-04-14T10:04:52Z

Sprites seems to be first o last(i can't understand which) in batch buffers are not moving with it.

Should be the first ones are ignored by updating transform in buffer logic, should be fixed now

1scaR1 · 2016-04-14T10:14:25Z

@pandamicro Now all is perfect!
Average time of rendering reduced in my game from 40-35%(without batching) to 17-15%(with latest version of batching).

pandamicro · 2016-04-14T15:56:38Z

Some more test data:

On iOS it's not very stable, but it shows a little bit improvement using v4, and on Android the performance suffers, thought I must say Mi 2S is a very old device, new Android 5.0 devices should do a lot better than that.

But I think between v2 and v4, v4 use too much memory so it's not very stable on mobile and I think we should desactivate v4 on mobile, activate it on desktop.

pandamicro · 2016-04-14T16:06:11Z

@dabingnn @jareguo @1scaR1 @heishe Please recheck, if there is no other problem, I will merge this PR

1scaR1 · 2016-04-14T17:36:11Z

cocos2d/core/renderer/RendererWebGL.js

+if (ACTIVATE_V4) {
+    CACHING_BUFFER = true;
+}
+


May be combine it into one variable CACHING_BUFFER?

1scaR1 · 2016-04-14T17:43:30Z

I think that the PR can be merged.
I don't have any artifacts in all parts of my game. And I didn't find mistakes on bound cases.

I propose to remove render cmd of ccui.ScrollView. They don't benefit to render process.
I've tested it without them, and all seems to be ok.

jareguo · 2016-04-15T01:58:00Z

cocos2d/core/base-nodes/CCNodeCanvasRenderCmd.js

        var len = children.length;
+        var minZ = Number.MAX_VALUE;
+        var maxZ = -Number.MAX_VALUE;


you should comment minZ and maxZ out

pandamicro · 2016-04-15T15:59:53Z

Found issues that causing v4 slow on mobile, will update tomorrow

TravisGesslein · 2016-04-16T13:11:32Z

Will this be part of 3.11? I.e. backwards compatible with 3.10 and before? Also please give info on what's slow on mobile, would be interesting to know 👍

BTW, I made some more work on our own branch of cocos2d-html5 and improved performance a bit. First of all, I changed the way the z values are assigned: I removed the assignZ stuff that is called in director.drawScene, and instead assign the z values directly in rendererWebGL.rendering. The draw calls arrive in the correct order anyways, so you can compute the z values from there and if you want to draw stuff out-of-order you can still do that after. That also means that I assign the correct z values to the stackMatrix of each renderCmd there directly. Our game (Idle Raiders) now does about 1.5 milliseconds less work each frame as a result.

Next, I just did the && thing I mentioned above to avoid additional transformations. I have tested it a lot and can't detect any errors, and it was a giant performance save in scenes with lots of nodes (we have about 2000 nodes in total in our hierarchy, as far as I remember), so I leave it for now. I should mention though, that if you want to do this as well, you need to change the z assigning also. Because if you don't, you will get flickering because z values are sometimes updated too late (this is actually what made me do the change above).

Also, we are now using a single index buffer for everything that wants to render quads... if you think about it, it's just a list of indices that stays the same for all render cmds that want to draw quads (as long as those take care of uploading the stuff into buffers correctly), the only thing that changes for renderCmds is the number of quads they draw (and MAYBE index at which they start), both of which can be handed as arguments to gl.drawElements. This also sped up the entire application by about 5% and was a pretty small change.

If you want to take a look at it you can look at all commits from April 15 and 16 in the gotta-batch-them-all branch here: https://github.com/Talisca/cocos2d-html5/commits/gotta-batch-them-all

Overall for our game, it improved idle time from 24.43% to 53.30%. Some more details here: https://gist.github.com/Heishe/3814703722e498e14fdc0095c201fcff
Note that I haven't integrated this PR into our own branch so far, so it's based on the code I accidentally PR-ed to you a week ago.

I also have some ideas for future optimizations about all rendering in general, but I don't know if I'm even gonna have time to work on it. I can summarize it: All renderCmds create too many WebGL buffers, and they change state too often. Stuff like gl.vertexAttribPointer is very expensive (I'm not exaggerating, it takes up 11% of our total rendering time in our game), so I'd try to make sure that vertexAttribPointer is only called when necessary. If you look at various renderCmds, they have very similar vertex layouts, so for many you don't need to call vertexAttribPointer again if you have called it before with the same arguments.

Also having 3000+ array buffers hanging around (for example because each sprite creates its own) and binding them for each draw call is very slow also (bindBuffer alone takes up 7% of all rendering time in our game). It probably also puts some load on the graphics driver in the background.

I would heavily suggest that instead of each renderCmd creating their own buffers for rendering, they should use some kind of global buffer allocator provided by the cc framework which only returns subsections of larger buffers (optimally just a single one, but you can split it up if you must. Just take note that a single 1 megabyte large buffer can fit a ton of sprites and their QUAD data). The only thing you have to change in the renderCmds is where they upload their data to webgl and then use correct offsets in gl.drawArrays and gl.drawElements. I actually think this might not be all that complicated.

pandamicro · 2016-04-17T15:54:25Z

@heishe Thanks again, you provided some very interesting ideas.

Will this be part of 3.11? I.e. backwards compatible with 3.10 and before? Also please give info on what's slow on mobile, would be interesting to know

Yes, this PR is totally backward compatible, and I will try my best to ensure its stability and merge it in v3.11

About mobile performance, on iOS, it's similar with desktop, it's just every thing take a little more time, so improvements applies equally on desktop and iOS. But on poor Android devices, things just don't get improved, I think we are using too much memory on Android so the rendering save up is just invisible.

I removed the assignZ stuff that is called in director.drawScene, and instead assign the z values directly in rendererWebGL.rendering.

Make sense

Next, I just did the && thing I mentioned above to avoid additional transformations. I have tested it a lot and can't detect any errors

Indeed, only some very rare case can cause problems, I haven't got time to test, but feel free to remove it in your own branch, it's not a proper fix anyway, and if it doesn't cause any issue, means your game don't trigger it, so it's safe.

Last few things you mentioned are all good orientations, I think the most important thing is to reduce WebGL buffer instances, sharing buffers is a great idea.

Auto batching v5

Now to my part, I spent the weekend to implemented a new version of auto batching, actually most of the time, I was struggling fix v4 issues. At last I came to conclusion that v4 can't be maintained, it's too complicated and it's just impossible to take care of all different use cases.

The core problem of v4 is to check based on buffers, it was guessing how to split into virtual buffers, when to regenerate buffers and so on.

In v5, I use more informations to simplify my life, the general idea is still the same as v4: to lazy update buffers only when necessary. But rules are different in v5:

Generate auto batch sections and buffers in _forwardBatch, same as before.
When children order dirty (means command list change), invoke _refreshVirtualBuffers to update the auto batch sections.
Save the previous command list, compare with the current one in _refreshVirtualBuffers, it can:
- Reuse as much as possible the old buffers, by split it into different sub virtual buffers but using the same data and webgl buffer.
- Clear invalid _vBuffer in render commands that need to be re-batched.
- Batch other render commands without _vBuffer property when possible, including invalid ones and new commands.
Virtual buffers will be regenerated or disposed in the following case:
- When commands in the old auto batch section are all removed or only one of them left.
- When new command is added and can be batched together with an existing auto batch section.
_forwardCheck assumes all virtual buffers is correct, and try to use them as is, no modifications made here.

In this way, when render commands don't change, the auto batching buffers don't change, even if the render commands change, it will cleverly reuse as much as possible the old buffers. Here is the result:

Future improvements

Remove matrix data in vertex buffer, now matrix data are duplicated for each vertex, but we can calculate the final vertex position in CPU, it will bring more calculations in CPU but will save a great amount of memory. It can also reduce vertexAttribPointer call by 1 per batch.
I will also try to reduce WebGL buffers, activate depth test and batch more render commands, but not in v3.11

pandamicro · 2016-04-18T18:27:29Z

@heishe You idea to use only one element buffer is really clever ! I merged it into my branch too.

1scaR1 · 2016-04-18T19:07:48Z

@pandamicro , May be move info, such as triangles indices and it's count, back to Sprite render cmd?
Now, this commit removes possibility to implement PolygonSprite, like in cocos2d-x, without changing of renderer, isn't it?

jareguo · 2016-04-19T05:54:10Z

cocos2d/core/renderer/RendererWebGL.js

@@ -603,8 +608,9 @@ return {
            gl.vertexAttribPointer(cc.VERTEX_ATTRIB_MVMAT0 + i, 4, gl.FLOAT, false, bytesPerRow * 4, matrixOffset + bytesPerRow * i); //stride is one row
        }

-        gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, _currentBuffer.buffer.elementBuffer);
-        gl.drawElements(gl.TRIANGLES, _currentBuffer.totalIndexSize, gl.UNSIGNED_SHORT, _currentBuffer.indexOffset);
+        var elemBuffer = getQuadIndexBuffer(count);


Cool now you just using one index buffer everywhere, very clever!
But my question is, how to change the rendering order without customizing index buffer? using z buffer?

pandamicro · 2016-04-19T07:58:40Z

Now, this commit removes possibility to implement PolygonSprite, like in cocos2d-x, without changing of renderer, isn't it?

I don't understand.

For polygon sprite, we need the renderer to support triangle render commands (mesh), the last improvement may not be very good here, because it consider only quad commands, triangle commands have different index buffer layout. Same for Scale9Sprite, which should have 12 vertices for 54 index buffer size (18 triangles).

So maybe finally we need to revert the last commit.

1scaR1 · 2016-04-19T08:22:36Z

@pandamicro I mean that generation of index buffer for sprite is business of sprite cmd itself.
I think renderer must deal with quad(in cocos2d-x now it's a subclass of triangle command) & triangle commands generated by sprite cmd and then insert then in common batch buffer by own rules.

TravisGesslein · 2016-04-19T08:36:35Z

If you want to have automatic batching of everything, you need to have a rendering scheme that I described at the end in my first big post (but it's a big refactor), where the rendering itself isn't implemented by render Cmds, but rather by a central renderer that just gets info on what data it needs for drawing from the various render Cmds.

This is because renderCmds keep rendering information strictly local to them, while for batching, you need to have global information from multiple renderCmds. Actually, with this automatic batching stuff you are already partially doing this: You check 'what the node needs to draw' (by just checking for its type) in order to know if you can batch it, and then you somehow extract the data required to render it and put it into common buffers for batching.

If you keep extending the current system you will naturally arrive at this point for automatic batching, where renderCmds will have their used resources (buffers,textures), fixed function state (blending, etc.), the shaders they use to draw, and vertex layouts (etc.) available as information somehow and a 'central renderer' (your automatic batching rendering code) will use it to draw everything automatically. The difference will be that renderCmds will still need custom render logic for non-batched drawing even though it isn't really necessary.

Actually recently I have found this presentation that describes the concept perfectly. It calls these collections of draw information 'draw items'. It's all for a high end 3D renderer and DirectX 11+/ OpenGL 3+, but the concepts are of course all the same for 2D and WebGL:

https://www.dropbox.com/sh/4uvmgy14je7eaxv/AAAFYxdx3r2MZloWPYrQLA3Ka/Designing%20a%20Modern%20GPU%20Interface.pptx?dl=0

pandamicro · 2016-04-19T08:39:44Z

I mean that generation of index buffer for sprite is business of sprite cmd itself.
I think renderer must deal with quad(in cocos2d-x now it's a subclass of triangle command) & triangle commands generated by sprite cmd and then insert them in common batch buffer.

Yes, previously, I collect index buffer layout from render commands, so it's more flexible.

pandamicro · 2016-04-20T16:57:04Z

I tested v5 for two days, and it's working well, and this PR became too big so I'm merging this PR. I will have some followed up PRs to improve the performance from different ways.

pandamicro mentioned this pull request Apr 11, 2016

Gotta batch them all #3248

Closed

pandamicro changed the title ~~Draft implementation of auto batch~~ Implementation of auto batch Apr 11, 2016

jareguo reviewed Apr 11, 2016
View reviewed changes

pandamicro added 4 commits April 12, 2016 11:05

Draft implementation of auto batch

400859e

Bind attribute MVMAT and Fix matrix attribute set issue

b845a16

Improve auto batch implementation

66231d5

Improve details and add comments & Add reference to @heishe

cf37f56

pandamicro force-pushed the autobatch branch from 3b3ca60 to cf37f56 Compare April 12, 2016 03:07

Merge branch 'develop' of git://github.com/cocos2d/cocos2d-html5 into…

4d689aa

… autobatch

Small performance improvement for auto batching

f8f873e

Fix transform issue with first sprite batched

8201f3e

Disable autobatch v4 on mobile devices

eb590d7

1scaR1 reviewed Apr 14, 2016
View reviewed changes

jareguo reviewed Apr 15, 2016
View reviewed changes

Rearrange auto batch flags: ACTIVATE_AUTO_BATCH + CACHING_BUFFER

a459947

Add SimplePool and fix status sync issue with sprite in auto batch

3b9b712

pandamicro added 2 commits April 17, 2016 23:10

Auto batch v5

4cf480e

Activate CACHING_BUFFER by default

d0e3ea6

pandamicro added 2 commits April 18, 2016 17:02

Fix GLProgram's uniform cache issue

c7a0c05

Use only one index buffer for auto batching

5c252e6

jareguo reviewed Apr 19, 2016
View reviewed changes

pandamicro merged commit 7c3b0d8 into cocos2d:develop Apr 20, 2016

pandamicro mentioned this pull request Apr 20, 2016

Add GlobalVertexBuffer for Sprites to share #3268

Merged

pandamicro mentioned this pull request Apr 28, 2016

Fix issues caused by performance improvement #3273

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation of auto batch #3265

Implementation of auto batch #3265

pandamicro commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

jareguo Apr 11, 2016

jareguo commented Apr 11, 2016

pandamicro commented Apr 12, 2016

1scaR1 commented Apr 12, 2016

pandamicro commented Apr 12, 2016

pandamicro commented Apr 14, 2016

1scaR1 commented Apr 14, 2016

pandamicro commented Apr 14, 2016

pandamicro commented Apr 14, 2016

1scaR1 Apr 14, 2016

jareguo Apr 15, 2016

pandamicro Apr 15, 2016

1scaR1 commented Apr 14, 2016

jareguo Apr 15, 2016

pandamicro commented Apr 15, 2016

TravisGesslein commented Apr 16, 2016 •

edited

Loading

pandamicro commented Apr 17, 2016 •

edited

Loading

pandamicro commented Apr 18, 2016

1scaR1 commented Apr 18, 2016 •

edited

Loading

jareguo Apr 19, 2016

pandamicro commented Apr 19, 2016

1scaR1 commented Apr 19, 2016 •

edited

Loading

TravisGesslein commented Apr 19, 2016 •

edited

Loading

pandamicro commented Apr 19, 2016

pandamicro commented Apr 20, 2016

Implementation of auto batch #3265

Implementation of auto batch #3265

Conversation

pandamicro commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

pandamicro commented Apr 11, 2016

1scaR1 commented Apr 11, 2016

jareguo Apr 11, 2016

Choose a reason for hiding this comment

jareguo commented Apr 11, 2016

pandamicro commented Apr 12, 2016

1scaR1 commented Apr 12, 2016

pandamicro commented Apr 12, 2016

pandamicro commented Apr 14, 2016

1scaR1 commented Apr 14, 2016

pandamicro commented Apr 14, 2016

pandamicro commented Apr 14, 2016

1scaR1 Apr 14, 2016

Choose a reason for hiding this comment

jareguo Apr 15, 2016

Choose a reason for hiding this comment

pandamicro Apr 15, 2016

Choose a reason for hiding this comment

1scaR1 commented Apr 14, 2016

jareguo Apr 15, 2016

Choose a reason for hiding this comment

pandamicro commented Apr 15, 2016

TravisGesslein commented Apr 16, 2016 • edited Loading

pandamicro commented Apr 17, 2016 • edited Loading

Auto batching v5

Future improvements

pandamicro commented Apr 18, 2016

1scaR1 commented Apr 18, 2016 • edited Loading

jareguo Apr 19, 2016

Choose a reason for hiding this comment

pandamicro commented Apr 19, 2016

1scaR1 commented Apr 19, 2016 • edited Loading

TravisGesslein commented Apr 19, 2016 • edited Loading

pandamicro commented Apr 19, 2016

pandamicro commented Apr 20, 2016

TravisGesslein commented Apr 16, 2016 •

edited

Loading

pandamicro commented Apr 17, 2016 •

edited

Loading

1scaR1 commented Apr 18, 2016 •

edited

Loading

1scaR1 commented Apr 19, 2016 •

edited

Loading

TravisGesslein commented Apr 19, 2016 •

edited

Loading