Skip to content

Implementation of auto batch #3265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 20, 2016
Merged

Implementation of auto batch #3265

merged 17 commits into from
Apr 20, 2016

Conversation

pandamicro
Copy link
Contributor

No description provided.

@pandamicro pandamicro mentioned this pull request Apr 11, 2016
@pandamicro
Copy link
Contributor Author

@dabingnn @heishe @jareguo

Please review the auto batch implementation.

  • It's based on @heishe 's PR Gotta batch them all #3248 with some improvements: like moving batch rendering logic into RendererWebGL, render command only upload data.
  • It now only supports Sprite and all Node type using Sprite

@pandamicro pandamicro changed the title Draft implementation of auto batch Implementation of auto batch Apr 11, 2016
@1scaR1
Copy link
Contributor

1scaR1 commented Apr 11, 2016

I have checkouted this branch. As i see in my game :

  • Scale9Sprite doesn't work completely : only some parts of image are drawn
  • Something wrong with positions of spites in Buttons on click
  • After scene transition image on screen a little bit zooms in.
  • Suddenly image on screen starts scale infinitely
  • Sprite frames from the same sprite sheet are not all drawn on one Layer.

@pandamicro
Copy link
Contributor Author

@1scaR1 Thanks for testing, what do you mean by this one ?

  • Sprite frames from the same sprite sheet are not all drawn on one Layer.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 11, 2016

I looks like this in my game:

  • Original engine:

2016-04-11 18 12 28

  • This PR:

2016-04-11 18 14 24

Walls and terrains are from one sprite sheet:
repair_level-tileset0

Also, as you can see, the background is not full.(It fills also from sprite sheet).

I also found that on layer(and this layer is in clipping node and clipping node is in ScrollView), that does't have anything except sprites from sprite frames, all seems to be ok.

@pandamicro
Copy link
Contributor Author

  • After scene transition image on screen a little bit zooms in.
  • Suddenly image on screen starts scale infinitely

This should have been fixed by the last commit

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 11, 2016

Yes, there are no zooms now) Last commit also fixed:

  • Something wrong with positions of spites in Buttons on click

@pandamicro
Copy link
Contributor Author

Scale9Sprite issue was due to refactoration of variable name, I forgot to change all of them 😆 , now the other two issues should have been fixed @1scaR1

Please also tell me how performance benefit from the auto batch in your game

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 11, 2016

@pandamicro, Yes, all issues are fixed.
It's strange, but i don't see improvements(visible for eyes) with common usage of my game.
In profiler of Google Chrome i see that page have more idle time(for about ~10-20%).

But in Safari on a scene with a lot of same sprites(for about 500 and more), on different layers, i have lags with scrolling in Parallax or ScrollView.(Without this PR i have the same lags;) )

function objEqual (a, b) {
if (!a || !b) return false;

var keys = Object.keys(a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use "for var in" to avoid gc

@jareguo
Copy link
Contributor

jareguo commented Apr 11, 2016

+10086
But could you port to Scale9Sprite? (then Creator can also use it)

@pandamicro
Copy link
Contributor Author

But could you port to Scale9Sprite? (then Creator can also use it)

In this repo, Scale9Sprite already support auto batch, creator has a different Scale9Sprite implementation, we need to support it in creator repo

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 12, 2016

There is cpu profilers in Chrome and Safari:

  • Safari:
    2016-04-12 11 05 38
  • Chrome:
    2016-04-12 11 13 10

These profiles shows the common performance consumers:

  • _forwardBatch function(and it's strange that function objEqual in it costs a lot of time).
  • Rendering of sprites in ScrollView

The scene consists of ScrollView with 8 Layouts in it. Each layout has a layer with about 100 sprites in average. Also there is a background in scene which has ~600 sprites. And with it we can see performance difference: as you can see, render of sprites (on screens above) on background costs about 8% of cpu time in Chrome and 14% in Safari(thats the main difference in them).

Before this PR render of background sprites costed about 30% of cpu time in both browsers. But render of sprites in ScrollView is the same before and after pull request(about ~50%).

I think that batching doesn't affect sprites in ScrollView. Also in this PR cpu time of rendering of sprites consumed by _forwardBatch function. So total render time is almost the same.
Thats why, i assume, this PR doesn't seriously affect on performance in my game .

@pandamicro
Copy link
Contributor Author

@1scaR1 Many thanks for your test ! I will continue to improve the implementation

@pandamicro
Copy link
Contributor Author

Sprites seems to be first o last(i can't understand which) in batch buffers are not moving with it.

Should be the first ones are ignored by updating transform in buffer logic, should be fixed now

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 14, 2016

@pandamicro Now all is perfect!
Average time of rendering reduced in my game from 40-35%(without batching) to 17-15%(with latest version of batching).

@pandamicro
Copy link
Contributor Author

Some more test data:

On iOS it's not very stable, but it shows a little bit improvement using v4, and on Android the performance suffers, thought I must say Mi 2S is a very old device, new Android 5.0 devices should do a lot better than that.

But I think between v2 and v4, v4 use too much memory so it's not very stable on mobile and I think we should desactivate v4 on mobile, activate it on desktop.

@pandamicro
Copy link
Contributor Author

@dabingnn @jareguo @1scaR1 @heishe Please recheck, if there is no other problem, I will merge this PR

if (ACTIVATE_V4) {
CACHING_BUFFER = true;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be combine it into one variable CACHING_BUFFER?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 14, 2016

I think that the PR can be merged.
I don't have any artifacts in all parts of my game. And I didn't find mistakes on bound cases.

I propose to remove render cmd of ccui.ScrollView. They don't benefit to render process.
I've tested it without them, and all seems to be ok.

var len = children.length;
var minZ = Number.MAX_VALUE;
var maxZ = -Number.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should comment minZ and maxZ out

@pandamicro
Copy link
Contributor Author

Found issues that causing v4 slow on mobile, will update tomorrow

@TravisGesslein
Copy link

TravisGesslein commented Apr 16, 2016

Will this be part of 3.11? I.e. backwards compatible with 3.10 and before? Also please give info on what's slow on mobile, would be interesting to know 👍

BTW, I made some more work on our own branch of cocos2d-html5 and improved performance a bit. First of all, I changed the way the z values are assigned: I removed the assignZ stuff that is called in director.drawScene, and instead assign the z values directly in rendererWebGL.rendering. The draw calls arrive in the correct order anyways, so you can compute the z values from there and if you want to draw stuff out-of-order you can still do that after. That also means that I assign the correct z values to the stackMatrix of each renderCmd there directly. Our game (Idle Raiders) now does about 1.5 milliseconds less work each frame as a result.

Next, I just did the && thing I mentioned above to avoid additional transformations. I have tested it a lot and can't detect any errors, and it was a giant performance save in scenes with lots of nodes (we have about 2000 nodes in total in our hierarchy, as far as I remember), so I leave it for now. I should mention though, that if you want to do this as well, you need to change the z assigning also. Because if you don't, you will get flickering because z values are sometimes updated too late (this is actually what made me do the change above).

Also, we are now using a single index buffer for everything that wants to render quads... if you think about it, it's just a list of indices that stays the same for all render cmds that want to draw quads (as long as those take care of uploading the stuff into buffers correctly), the only thing that changes for renderCmds is the number of quads they draw (and MAYBE index at which they start), both of which can be handed as arguments to gl.drawElements. This also sped up the entire application by about 5% and was a pretty small change.

If you want to take a look at it you can look at all commits from April 15 and 16 in the gotta-batch-them-all branch here: https://github.com/Talisca/cocos2d-html5/commits/gotta-batch-them-all

Overall for our game, it improved idle time from 24.43% to 53.30%. Some more details here: https://gist.github.com/Heishe/3814703722e498e14fdc0095c201fcff
Note that I haven't integrated this PR into our own branch so far, so it's based on the code I accidentally PR-ed to you a week ago.

I also have some ideas for future optimizations about all rendering in general, but I don't know if I'm even gonna have time to work on it. I can summarize it: All renderCmds create too many WebGL buffers, and they change state too often. Stuff like gl.vertexAttribPointer is very expensive (I'm not exaggerating, it takes up 11% of our total rendering time in our game), so I'd try to make sure that vertexAttribPointer is only called when necessary. If you look at various renderCmds, they have very similar vertex layouts, so for many you don't need to call vertexAttribPointer again if you have called it before with the same arguments.

Also having 3000+ array buffers hanging around (for example because each sprite creates its own) and binding them for each draw call is very slow also (bindBuffer alone takes up 7% of all rendering time in our game). It probably also puts some load on the graphics driver in the background.

I would heavily suggest that instead of each renderCmd creating their own buffers for rendering, they should use some kind of global buffer allocator provided by the cc framework which only returns subsections of larger buffers (optimally just a single one, but you can split it up if you must. Just take note that a single 1 megabyte large buffer can fit a ton of sprites and their QUAD data). The only thing you have to change in the renderCmds is where they upload their data to webgl and then use correct offsets in gl.drawArrays and gl.drawElements. I actually think this might not be all that complicated.

@pandamicro
Copy link
Contributor Author

pandamicro commented Apr 17, 2016

@heishe Thanks again, you provided some very interesting ideas.

Will this be part of 3.11? I.e. backwards compatible with 3.10 and before? Also please give info on what's slow on mobile, would be interesting to know

Yes, this PR is totally backward compatible, and I will try my best to ensure its stability and merge it in v3.11

About mobile performance, on iOS, it's similar with desktop, it's just every thing take a little more time, so improvements applies equally on desktop and iOS. But on poor Android devices, things just don't get improved, I think we are using too much memory on Android so the rendering save up is just invisible.

I removed the assignZ stuff that is called in director.drawScene, and instead assign the z values directly in rendererWebGL.rendering.

Make sense

Next, I just did the && thing I mentioned above to avoid additional transformations. I have tested it a lot and can't detect any errors

Indeed, only some very rare case can cause problems, I haven't got time to test, but feel free to remove it in your own branch, it's not a proper fix anyway, and if it doesn't cause any issue, means your game don't trigger it, so it's safe.

Last few things you mentioned are all good orientations, I think the most important thing is to reduce WebGL buffer instances, sharing buffers is a great idea.

Auto batching v5

Now to my part, I spent the weekend to implemented a new version of auto batching, actually most of the time, I was struggling fix v4 issues. At last I came to conclusion that v4 can't be maintained, it's too complicated and it's just impossible to take care of all different use cases.

The core problem of v4 is to check based on buffers, it was guessing how to split into virtual buffers, when to regenerate buffers and so on.

In v5, I use more informations to simplify my life, the general idea is still the same as v4: to lazy update buffers only when necessary. But rules are different in v5:

  • Generate auto batch sections and buffers in _forwardBatch, same as before.
  • When children order dirty (means command list change), invoke _refreshVirtualBuffers to update the auto batch sections.
  • Save the previous command list, compare with the current one in _refreshVirtualBuffers, it can:
    • Reuse as much as possible the old buffers, by split it into different sub virtual buffers but using the same data and webgl buffer.
    • Clear invalid _vBuffer in render commands that need to be re-batched.
    • Batch other render commands without _vBuffer property when possible, including invalid ones and new commands.
  • Virtual buffers will be regenerated or disposed in the following case:
    • When commands in the old auto batch section are all removed or only one of them left.
    • When new command is added and can be batched together with an existing auto batch section.
  • _forwardCheck assumes all virtual buffers is correct, and try to use them as is, no modifications made here.

In this way, when render commands don't change, the auto batching buffers don't change, even if the render commands change, it will cleverly reuse as much as possible the old buffers. Here is the result:

Future improvements

  1. Remove matrix data in vertex buffer, now matrix data are duplicated for each vertex, but we can calculate the final vertex position in CPU, it will bring more calculations in CPU but will save a great amount of memory. It can also reduce vertexAttribPointer call by 1 per batch.
  2. I will also try to reduce WebGL buffers, activate depth test and batch more render commands, but not in v3.11

@pandamicro
Copy link
Contributor Author

@heishe You idea to use only one element buffer is really clever ! I merged it into my branch too.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 18, 2016

@pandamicro , May be move info, such as triangles indices and it's count, back to Sprite render cmd?
Now, this commit removes possibility to implement PolygonSprite, like in cocos2d-x, without changing of renderer, isn't it?

@@ -603,8 +608,9 @@ return {
gl.vertexAttribPointer(cc.VERTEX_ATTRIB_MVMAT0 + i, 4, gl.FLOAT, false, bytesPerRow * 4, matrixOffset + bytesPerRow * i); //stride is one row
}

gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, _currentBuffer.buffer.elementBuffer);
gl.drawElements(gl.TRIANGLES, _currentBuffer.totalIndexSize, gl.UNSIGNED_SHORT, _currentBuffer.indexOffset);
var elemBuffer = getQuadIndexBuffer(count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool now you just using one index buffer everywhere, very clever!
But my question is, how to change the rendering order without customizing index buffer? using z buffer?

@pandamicro
Copy link
Contributor Author

Now, this commit removes possibility to implement PolygonSprite, like in cocos2d-x, without changing of renderer, isn't it?

I don't understand.

For polygon sprite, we need the renderer to support triangle render commands (mesh), the last improvement may not be very good here, because it consider only quad commands, triangle commands have different index buffer layout. Same for Scale9Sprite, which should have 12 vertices for 54 index buffer size (18 triangles).

So maybe finally we need to revert the last commit.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 19, 2016

@pandamicro I mean that generation of index buffer for sprite is business of sprite cmd itself.
I think renderer must deal with quad(in cocos2d-x now it's a subclass of triangle command) & triangle commands generated by sprite cmd and then insert then in common batch buffer by own rules.

@TravisGesslein
Copy link

TravisGesslein commented Apr 19, 2016

If you want to have automatic batching of everything, you need to have a rendering scheme that I described at the end in my first big post (but it's a big refactor), where the rendering itself isn't implemented by render Cmds, but rather by a central renderer that just gets info on what data it needs for drawing from the various render Cmds.

This is because renderCmds keep rendering information strictly local to them, while for batching, you need to have global information from multiple renderCmds. Actually, with this automatic batching stuff you are already partially doing this: You check 'what the node needs to draw' (by just checking for its type) in order to know if you can batch it, and then you somehow extract the data required to render it and put it into common buffers for batching.

If you keep extending the current system you will naturally arrive at this point for automatic batching, where renderCmds will have their used resources (buffers,textures), fixed function state (blending, etc.), the shaders they use to draw, and vertex layouts (etc.) available as information somehow and a 'central renderer' (your automatic batching rendering code) will use it to draw everything automatically. The difference will be that renderCmds will still need custom render logic for non-batched drawing even though it isn't really necessary.

Actually recently I have found this presentation that describes the concept perfectly. It calls these collections of draw information 'draw items'. It's all for a high end 3D renderer and DirectX 11+/ OpenGL 3+, but the concepts are of course all the same for 2D and WebGL:

https://www.dropbox.com/sh/4uvmgy14je7eaxv/AAAFYxdx3r2MZloWPYrQLA3Ka/Designing%20a%20Modern%20GPU%20Interface.pptx?dl=0

@pandamicro
Copy link
Contributor Author

I mean that generation of index buffer for sprite is business of sprite cmd itself.
I think renderer must deal with quad(in cocos2d-x now it's a subclass of triangle command) & triangle commands generated by sprite cmd and then insert them in common batch buffer.

Yes, previously, I collect index buffer layout from render commands, so it's more flexible.

@pandamicro
Copy link
Contributor Author

I tested v5 for two days, and it's working well, and this PR became too big so I'm merging this PR. I will have some followed up PRs to improve the performance from different ways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants