Skip to content

Add GlobalVertexBuffer for Sprites to share #3268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 25, 2016

Conversation

pandamicro
Copy link
Contributor

@pandamicro pandamicro commented Apr 20, 2016

This PR is a continue of #3265

It contains mainly two important improvements:

  1. Moving MVMatrix * a_position calculation from GPU to CPU (Saving a great amount of buffer for auto batching)
  2. Provide a global unified buffer for all Sprites to share

There are some details to discuss:

I found a strange result inconsistence between desktop browser and mobile browsers, for the sprites global buffer, I have two option to update them:

  1. Whenever a sprite hosted in the global buffer is dirty, use bufferSubData to refresh the WebGL buffer, so probably one gl bufferSubData call with the whole buffer per frame.
  2. Separately invoking bufferSubData for each dirty sprite during its rendering function, so multiple (one for each dirty sprite) bufferSubData with small data sections per frame.

It turns out on desktop it's faster using option 1 than using option 2, while on mobile (iOS Safari and Android Chrome) it's just the contrary, option 2 is faster (significantly).
It turns out on desktop it's faster using option 2 than using option 1, while on mobile (iOS Safari and Android Chrome) it's just the contrary, option 1 is faster (significantly).

Secondly, this PR doesn't bring any performance improvement on desktop Chrome, it's even not as good as before. But it do improve the performance on mobile especially on Android.

Does anyone know what's happening here ?

@dabingnn @jareguo @1scaR1 @heishe

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 20, 2016

I have explored this question in Internet, and what i found:

  1. The main reason is performance of mobile CPU and GPU: when glBufferSubData called, webgl must finish to process all vertex's in previous buffer(apply shaders and etc.). Also it increases execution time of drawElements.
  2. better to use TRIANGLE_STRIP in drawElements, to reduce size of vertex buffer.
  3. enable double of triple buffering to reduce performance cost of glBufferSubData
  4. reduce calculations on shaders(i think, it's the biggest problem)

Here is interesting guide from Apple https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html
It's for OpenGL ES, but it's very similar to Web GL.
GL ES has a great function for such things - glMapBufferRange. But there is no such function int Web GL:(

@TravisGesslein
Copy link

TravisGesslein commented Apr 20, 2016

It's important to note that the key to sharing a singular buffer is to reduce gl.bindBuffer calls and (perhaps more importantly) reduce vertexAttribPointer calls overall. If they just stay at the same amount as before and the only thing that changes is the shared buffers, the whole performance will likely be slightly slower (but shouldn't be much) because OpenGL drivers must have some overhead in the background to make sure that updating one part of the buffer doesn't mess with potential rendering of another part of the buffer (these things are likely to be going on at the same time).

As a first step, it would probably work to cache them similar to how you cache texture calls with cc.glBindTextures. At least in our game we notice that a lot of 2D load uses the same vertex layouts (or could use it) and not changing that with every draw as long as the last draw call uses the same buffer & vertex layout would probably give a couple of easy % of performance.

Also with shared buffers, make sure to free "slots" when sprites are unused (for example, when they are removed from their parents and aren't added back to another parent for an entire frame)

Regarding @1scaR1 advice from above: Haven't tried #2, but #3 is all about making job for the driver easier: If you render from a separate buffer than what you use to upload, it's easier for the driver to make sure they don't conflict. Remember, when you call bufferSubData, the previous rendering that uses that same buffer you are now uploading to might still be ongoing (because the GPUs generally render everything parallel to whatever your CPU does). So what you can do is manually double or triple buffer your buffers.

The idea is that in frame #1 you upload data from buffer #1 and draw with buffer #1, in frame #2 you upload to buffer #2 and draw with buffer #2, and in frame 3 you can switch back to #1. This is double buffering. Triple buffering (doing this with 3 separate buffers) is supposed to be even faster in some cases but I can't remember why right now.

  1. should only be a problem when you're GPU limited, which I suppose can easily happen on mobile GPUs (as opposed to CPU limited, which is probably far more likely to be the case for 2D games in PC).

I don't know what's going on in mobile vs PC @pandamicro , I pretty much have no experience with mobile platforms unfortunately.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 21, 2016

I've found in my game, that in scene with a lot of sprites, which can be batched(~500-600), animations has a lot of lags.
Lags are both with #3265 and this PR.
What i mean by lags:

  • I see separate frames when playing animation from plist files.(without these PR's-3265 and 3268-it's smooth)
  • I see sprites suddenly freezes when moving for long distances(300+ px)
  • Sometimes fps reduces from 56-60 to 40-44(with no reason)

@pandamicro
Copy link
Contributor Author

pandamicro commented Apr 21, 2016

Thanks for your feedbacks guys, I'm travailing today so haven't got time to digest all the informations, but will take more time tomorrow. I fixed an important issue today which will cause vertex buffer error and produce artifacts when removing and then adding back sprites.

I see separate frames when playing animation from plist files.(without these PR's-3265 and 3268-it's smooth)

@1scaR1 When you have those lags, can you break in rendering function of RendererWebGL.js, and take a look at how many _gbuffers are active ? By the way, if you have online test links, I can profile it with Chrome dev tool to see what's happening

@pandamicro
Copy link
Contributor Author

As a first step, it would probably work to cache them similar to how you cache texture calls with cc.glBindTextures.

Thanks ! It's exactly what I need to do to complete this PR, I will update the test result here.

Also with shared buffers, make sure to free "slots" when sprites are unused (for example, when they are removed from their parents and aren't added back to another parent for an entire frame)

Yes, I'm doing this with SpriteWebGLRenderCmd's freeBuffer in Sprite's cleanup function, and GLobalVertexBuffer will automatically reclaim this part of buffer and make it free for other commands.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 22, 2016

@pandamicro Hm, this lags were the result of unusual state of my notebook of browser. After I reboot, I don't see any lags and i can't reproduce scene where i can see lags of such type.
If i find this state, i will write here.

@pandamicro
Copy link
Contributor Author

pandamicro commented Apr 24, 2016

@heishe here is my findings today:

  1. I have rewrote gl.bindBuffer to check whether the current buffer is the same one, and skip if yes. Turns out it doesn't bring noticeable difference in performance, which is normal, I think bindBuffer itself doesn't do anything heavy
  2. I have tried and failed to reduce vertexAttribPointer calls, when I use unified buffer, the offset I set to each pointer will change for every draw call, so can't be avoided. I tried not using unified buffer, but either it doesn't render things or it can't be avoided. Here is what I tried:
var _currPointers = {};

WebGLRenderingContext.prototype.glVertexAttribPointer = WebGLRenderingContext.prototype.vertexAttribPointer;
WebGLRenderingContext.prototype.vertexAttribPointer = function (index, size, type, normalized, stride, offset) {
    var obj = _currPointers[index];
    if (!obj) {
        obj = {
            size: 0,
            type: this.BYTE,
            normalized: false,
            stride: 0,
            offset: 0
        };
        _currPointers[index] = obj;
    }
    if (obj.size === size && 
        obj.type === type &&
        obj.normalized === normalized &&
        obj.stride === stride &&
        obj.offset === offset) {
        return;
    }

    obj.size = size;
    obj.type = type;
    obj.normalized = normalized;
    obj.stride = stride;
    obj.offset = offset;
    this.glVertexAttribPointer(index, size, type, normalized, stride, offset);
};

Is there anyway to set the attribute pointers only once (if that will never change) during configuration of a buffer or is that necessary to set attribute pointers for each draw call ?

@pandamicro
Copy link
Contributor Author

By the way, I was suggested to activate depth test and try to batch more sprites together, I think it worth try, but have you guys some hint about this solution ? About the cost of depth test and the draw call reduction benefit from auto batching.

@pandamicro
Copy link
Contributor Author

pandamicro commented Apr 24, 2016

I think the difference between desktop and mobile is due the fact previously mentioned, WebGL can't use the same buffer data simultaneously.

There are two competing issues with buffer sizes.

  1. Larger buffers means putting multiple objects in one buffer. This allows you to render more objects without having to change buffer object state. Thus improving performance.
  2. Larger buffers means putting multiple objects in one buffer. However, mapping a buffer means that the entire buffer cannot be used (unless you map it persistently). So if you have many objects in one, and you need to map the data for one object, then the others will not be usable while you are modifying that one object.

Quote from https://www.opengl.org/wiki/Vertex_Specification_Best_Practices

So I think on Desktop the rendering is so fast that it can process multiple buffers very quickly, so if I put all data into one buffer, the render commands must be processed one after another which is slower. But on mobile, the GPU performance is significantly lower, so the benefit of using one buffer is showing, especially on some middle-low end Android devices.

@TravisGesslein
Copy link

TravisGesslein commented Apr 24, 2016

  • Regarding vertexAttribPointer: You don't need to change the offsets via vertexAttribPointer. You can instead target a different offset in the used buffer by using the [first] parameter in drawArrays (or the [offset] parameter in drawElements). Conceptually vertexAttribPointer is just there to tell the driver how the data is formatted inside the buffers, not "where it starts" for each draw call, so to speak.

So if you have 2 QUADs in one buffer, and you want to draw the first quad (4 vertices) of that buffer, you only have to set up the vertexAttribPointer data once and then drawArrays for the first call would be drawArrays(GL_TRIANGLE_STRIP, 0,4) and the drawing the second one would be (GL_TRIANGLE_STRIP,4,4) and it would fetch the correct data from the buffer without you having to change the offsets via vertexAttribPointer.

But that only works for unified buffer because you bind the current ARRAY_BUFFER to a certain vertex attribute slot (first parameter of vertexAttribPointer) whenever you call vertexAttribPointer, and you can't switch that "during" a drawArrays or drawElements call. I think it IS possible to assign different ARRAY_BUFFERS to each attribute slot (so doing something like bindBuffer(...)-> vertexAttribPointer(0,...)->bindBuffer(...)->vertexAttribPointer(1,...) etc.)) but I don't think that helps in this case.

  • Regarding depth tests: Overall depth testing is probably one of the most optimized parts of the pipeline, when you draw many opaque things out of order (i.e. not back-to-front) it is actually supposed to increase performance because the renderer can skip shader executions for stuff that is drawn "behind" already drawn stuff.

The problem is that if you want to ensure correct rendering of your game data, out of order rendering in general only works for fully opaque sprites (that is, sprites that don't have alpha in their texture at all OR where alpha is only used for pixels that are completely transparent or transparent enought hat you cant notice a difference), because correct blending requires correctly ordered rendering.

So overall if the entire thing is a win or loss depends quite heavily on application. In 3D it's an easy win in general because there really aren't that many transparent sections in a 3D world overall (and for those, usually there is a separate rendering path that draws everything in the correct order), but my subjective feeling is that 2D uses alpha a lot in order to make images look smoother (so just disabling that might make images look bad) and for antialiasing for things like text rendering.

I think in this case you just have to try. The general strategy in terms of rendering is that you can draw everything that is opaque first, and after that you can do a separate render path where you draw all the objects that use semi-transparency in order. Of course you have to detect if textures (or the textureRect region of a texture that is used by a sprite) for example use partial transparency. You can do this when loading textures and setting textureRects (which shouldn't happen all that often), and you could do it with some kind of shader (that draws to a 1x1 buffer that you read back on the CPU) or you could do it in client side javascript entirely (don't know how slow/fast that is). OR you could just let the user define it, set a flag in all cc.Sprites for example that say whether or not that sprite uses partial transparency.

  • Regarding PC vs mobile buffers: The quote you mentioned specifically deals with the glMapBuffer functions of desktop GL which are not supported with WebGL, I don't know if that advice applies to raw bufferData/bufferSubData calls in general.

Overall regarding options 1 vs 2 in your initial post at the top, I think it's just because bottlenecks are in different places for PCs vs mobile: When you do one bufferSubData call to update the entire buffer, the major load for desktops is going to be on memory bandwidth and the connection of the PCI bus that is used to transfer the data to the GPU. This is something that PC hardware has been optimized to do very well for years now (used to transfer and draw millions of vertices and similar data every second for a long time now). So for 2D games and simple 3D games, PCs get more easily bottlenecked by everything that requires lots of CPU power (because CPUs in contrast have not been getting better as quickly as other parts of PCs), notably something like WebGL calls of any kind. So only using one bufferSubData vs multiple makes the job easy for PCs. Meanwhile on mobile, GPUs are generally weak and memory bandwidth is extremely low compared to PCs, so the bottleneck is in actually copying data around. More WebGL calls are slower for mobile just as they are for PC, but overall the reduction of "memory you touch" by using multiple bufferSubDatas to transfer the minimal amount of memory is just a more important optimization for mobile.

No idea what to do here. If you're fine with supporting two implementations you could just select what the renderer does depending on whether you detect a mobile or PC browser.

@pandamicro
Copy link
Contributor Author

pandamicro commented Apr 25, 2016

Sorry I was mistaken about the difference between desktop and mobile, the desktop is performing better with bufferSubData call per sprite, and mobile is performing better with one single bufferSubData call for the entire global vertex buffer each frame.

You can instead target a different offset in the used buffer by using the [first] parameter in drawArrays (or the [offset] parameter in drawElements). Conceptually vertexAttribPointer is just there to tell the driver how the data is formatted inside the buffers, not "where it starts" for each draw call, so to speak.

I haven't succeed to do this, I specified the offset in drawArrays, but if I remove vertexAttribPointer call, it just render nothing, and if keep them, it renders strange things, the vertex data it use in this way is not correct.

// The draw call for a sprite, the _bufferOffset is offset in byte in the buffer,  
// as drawArrays requires the vertex offset, I divided it by vertex data byte size per unit which is 96
gl.drawArrays(gl.TRIANGLE_STRIP, this._bufferOffset / this.vertexBytesPerUnit, 4);

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

  1. According to

Quote from https://www.opengl.org/wiki/Vertex_Specification_Best_Practices

I think, in cocos2d is better to use separate buffer for vertex, color and UV's. Most often changed parameter in engine is position of sprites. Color and UV changes recently relatively to vertex position. It can reduce buffer size to update.
2. gl.bindBuffer and gl.vertexAttribPointer can be moved out rendering cycle. For example:

        //Here check all batch buffer parts with _forwardCheck
         for (i = 0, len = locCmds.length; i < len; ++i) {
         ....
        //Bind whole global buffer
        gl.bindBuffer(gl.ARRAY_BUFFER, _GLOBAL_VERTEX_BATCH_BUFFER);

        cc.glEnableVertexAttribs(cc.VERTEX_ATTRIB_FLAG_POS_COLOR_TEX);

        gl.vertexAttribPointer(0, 3, gl.FLOAT, false, 24, 0);
        gl.vertexAttribPointer(1, 4, gl.UNSIGNED_BYTE, true, 24, 12);
        gl.vertexAttribPointer(2, 2, gl.FLOAT, false, 24, 16);

        gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, _GLOBAL_ELEMENT_BATCH_BUFFER);
        //Here draw all cmd's with _batchRendering or calling rendering func
        // and do not call all functions above in _batchRendering
        for (i = 0, len = locCmds.length; i < len; ++i) {
        ....

As the result we have two cycles within all cmd, but do not call heavy gl function for each cmd.
I think performance should be compared.

@pandamicro
Copy link
Contributor Author

OK, I got it working, now vertexAttribPointer is only called when bound buffer is changed, I will submit my code later, this improves the process time ratio from 15.55% to 10.65% on desktop chrome. On mobile, the improvement is smaller but visible.

@pandamicro
Copy link
Contributor Author

I think, in cocos2d is better to use separate buffer for vertex, color and UV's. Most often changed parameter in engine is position of sprites. Color and UV changes recently relatively to vertex position. It can reduce buffer size to update.

I think we need to test the result later, for v3.11 I'm gonna move on to do the performance test case.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

@pandamicro , what do you think about my second proposal? You introduced one big buffer for all sprites, but not using it's benefits in batch rendering(or there is no samples how to use it in batching).

@pandamicro
Copy link
Contributor Author

@pandamicro , what do you think about my second proposal? You introduced one big buffer for all sprites, but not using it's benefits in batch rendering(or there is no samples how to use it in batching).

The next commit will remove unnecessary gl.bindBuffer and gl.vertexAttribPointer calls

@pandamicro
Copy link
Contributor Author

@1scaR1 Please take a look at the last commit, if you can test it with your game, that would be great !

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

@pandamicro Hmm, how can I use your changes concerning vertexAttribPointer and global buffer?
It needs to call rendering func of sprite.
I need to disable ACTIVATE_AUTO_BATCH, am i right?

@pandamicro
Copy link
Contributor Author

@pandamicro Hmm, how can I use your changes concerning vertexAttribPointer and global buffer?
It needs to call rendering fun of sprite.
I need to disable ACTIVATE_AUTO_BATCH, am i right?

Don't need to, in fact, global buffer is for all sprites that couldn't be batched, the breaking ones, like all label ttf.

So you can just pull my branch, and nothing need to be changed

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

This commits brings no performance improvement, because this

               if (_resetPointers || _bufferchanged) {
                    cc.glEnableVertexAttribs(cc.VERTEX_ATTRIB_FLAG_POS_COLOR_TEX);
                    gl.vertexAttribPointer(cc.VERTEX_ATTRIB_POSITION, 3, gl.FLOAT, false, 24, 0);
                    gl.vertexAttribPointer(cc.VERTEX_ATTRIB_COLOR, 4, gl.UNSIGNED_BYTE, true, 24, 12);
                    gl.vertexAttribPointer(cc.VERTEX_ATTRIB_TEX_COORDS, 2, gl.FLOAT, false, 24, 16);
                }

still called each frame, while there is nothing happens in game.
I didn't found any code about _resetPointers || _bufferchanged setting to false.

P.S.
All this PR improves performance by reducing time of rendering by 6-8%.

P.P.S.
This PR only for sprites that are not batched? What about global buffer for batched sprites?

@pandamicro
Copy link
Contributor Author

Sorry, there was a state reset I missed. Now it should be good.

This PR only for sprites that are not batched? What about global buffer for batched sprites?

Yes, batched sprites still use separated buffers, I haven't refactor that part, and it shouldn't have big difference cause there shouldn't be much of them.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

Usage of vertexAttribPointer is reduced by 2%. But i found strange behavior of rendering function:
If there is nothing happens on scene, locTexture becomes null, so _resetPointers sets to true, so vertexAttribPointer calls again.
Is it normal?

P.S. One purpose: May be move all buffer staff to renderer? So sprite render cmd will tell render only it's info about verts and sprite state(blend, texture, offset in buffer, etc)

@pandamicro
Copy link
Contributor Author

Thanks for testing

Usage of vertexAttribPointer is reduced by 2%. But i found strange behavior of rendering function:
If there is nothing happens on scene, locTexture becomes null, so _resetPointers sets to true, so vertexAttribPointer calls again.
Is it normal?

This is due to the fact that we permit Sprite with no texture, while it will go into the else section of the rendering function. If I don't force update the attribute pointers, the rendering will be messed up. It's a ugly work around, I really want to skip the sprites without texture, but it will break compatibility...

P.S. One purpose: May be move all buffer staff to renderer? So sprite price will tell render only it's info about verts and sprite state(blend, texture, offset in buffer, etc)

Totally, that should be done in the next step. And unified buffer should be activated for batched sprites too.

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

@pandamicro But i don't have any Sprites with no texture on Scene...

@pandamicro
Copy link
Contributor Author

But i don't have any Sprites with no texture on Scene...

Do you have any empty label ? You can actually break in the else section, and see which render command is setting the _resetPointers, just take a look at the type of command._node

@1scaR1
Copy link
Contributor

1scaR1 commented Apr 25, 2016

I found that my game spawn a lot of incorrect sprites(with incorrect sprite frame name) on background, by debugging rendering process:)
Now vertexAttribPointer usage reduced by 6-7% more.
Some empty Sprites are spawned by Buttons.(May be not if some state of button wasn't presented-disabled or clicked.)

@pandamicro
Copy link
Contributor Author

Some empty Sprites are spawned by Buttons.(May be not if some state of button wasn't presented-disabled or clicked.)

That should be a bug, I fixed it but haven't pushed yet

@pandamicro
Copy link
Contributor Author

So I have tried to use unified buffer for all sprites including auto batched sprites (vertexAttribPointer call reduced), but performance dropped on all devices I got. So won't dig further for now

@pandamicro pandamicro merged commit df32063 into cocos2d:develop Apr 25, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants