-
Notifications
You must be signed in to change notification settings - Fork 904
Add GlobalVertexBuffer for Sprites to share #3268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I have explored this question in Internet, and what i found:
Here is interesting guide from Apple https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html |
It's important to note that the key to sharing a singular buffer is to reduce gl.bindBuffer calls and (perhaps more importantly) reduce vertexAttribPointer calls overall. If they just stay at the same amount as before and the only thing that changes is the shared buffers, the whole performance will likely be slightly slower (but shouldn't be much) because OpenGL drivers must have some overhead in the background to make sure that updating one part of the buffer doesn't mess with potential rendering of another part of the buffer (these things are likely to be going on at the same time). As a first step, it would probably work to cache them similar to how you cache texture calls with cc.glBindTextures. At least in our game we notice that a lot of 2D load uses the same vertex layouts (or could use it) and not changing that with every draw as long as the last draw call uses the same buffer & vertex layout would probably give a couple of easy % of performance. Also with shared buffers, make sure to free "slots" when sprites are unused (for example, when they are removed from their parents and aren't added back to another parent for an entire frame) Regarding @1scaR1 advice from above: Haven't tried #2, but #3 is all about making job for the driver easier: If you render from a separate buffer than what you use to upload, it's easier for the driver to make sure they don't conflict. Remember, when you call bufferSubData, the previous rendering that uses that same buffer you are now uploading to might still be ongoing (because the GPUs generally render everything parallel to whatever your CPU does). So what you can do is manually double or triple buffer your buffers. The idea is that in frame #1 you upload data from buffer #1 and draw with buffer #1, in frame #2 you upload to buffer #2 and draw with buffer #2, and in frame 3 you can switch back to #1. This is double buffering. Triple buffering (doing this with 3 separate buffers) is supposed to be even faster in some cases but I can't remember why right now.
I don't know what's going on in mobile vs PC @pandamicro , I pretty much have no experience with mobile platforms unfortunately. |
I've found in my game, that in scene with a lot of sprites, which can be batched(~500-600), animations has a lot of lags.
|
Thanks for your feedbacks guys, I'm travailing today so haven't got time to digest all the informations, but will take more time tomorrow. I fixed an important issue today which will cause vertex buffer error and produce artifacts when removing and then adding back sprites.
@1scaR1 When you have those lags, can you break in |
Thanks ! It's exactly what I need to do to complete this PR, I will update the test result here.
Yes, I'm doing this with SpriteWebGLRenderCmd's |
@pandamicro Hm, this lags were the result of unusual state of my notebook of browser. After I reboot, I don't see any lags and i can't reproduce scene where i can see lags of such type. |
@heishe here is my findings today:
Is there anyway to set the attribute pointers only once (if that will never change) during configuration of a buffer or is that necessary to set attribute pointers for each draw call ? |
By the way, I was suggested to activate depth test and try to batch more sprites together, I think it worth try, but have you guys some hint about this solution ? About the cost of depth test and the draw call reduction benefit from auto batching. |
I think the difference between desktop and mobile is due the fact previously mentioned, WebGL can't use the same buffer data simultaneously.
Quote from https://www.opengl.org/wiki/Vertex_Specification_Best_Practices So I think on Desktop the rendering is so fast that it can process multiple buffers very quickly, so if I put all data into one buffer, the render commands must be processed one after another which is slower. But on mobile, the GPU performance is significantly lower, so the benefit of using one buffer is showing, especially on some middle-low end Android devices. |
So if you have 2 QUADs in one buffer, and you want to draw the first quad (4 vertices) of that buffer, you only have to set up the vertexAttribPointer data once and then drawArrays for the first call would be drawArrays(GL_TRIANGLE_STRIP, 0,4) and the drawing the second one would be (GL_TRIANGLE_STRIP,4,4) and it would fetch the correct data from the buffer without you having to change the offsets via vertexAttribPointer. But that only works for unified buffer because you bind the current ARRAY_BUFFER to a certain vertex attribute slot (first parameter of vertexAttribPointer) whenever you call vertexAttribPointer, and you can't switch that "during" a drawArrays or drawElements call. I think it IS possible to assign different ARRAY_BUFFERS to each attribute slot (so doing something like bindBuffer(...)-> vertexAttribPointer(0,...)->bindBuffer(...)->vertexAttribPointer(1,...) etc.)) but I don't think that helps in this case.
The problem is that if you want to ensure correct rendering of your game data, out of order rendering in general only works for fully opaque sprites (that is, sprites that don't have alpha in their texture at all OR where alpha is only used for pixels that are completely transparent or transparent enought hat you cant notice a difference), because correct blending requires correctly ordered rendering. So overall if the entire thing is a win or loss depends quite heavily on application. In 3D it's an easy win in general because there really aren't that many transparent sections in a 3D world overall (and for those, usually there is a separate rendering path that draws everything in the correct order), but my subjective feeling is that 2D uses alpha a lot in order to make images look smoother (so just disabling that might make images look bad) and for antialiasing for things like text rendering. I think in this case you just have to try. The general strategy in terms of rendering is that you can draw everything that is opaque first, and after that you can do a separate render path where you draw all the objects that use semi-transparency in order. Of course you have to detect if textures (or the textureRect region of a texture that is used by a sprite) for example use partial transparency. You can do this when loading textures and setting textureRects (which shouldn't happen all that often), and you could do it with some kind of shader (that draws to a 1x1 buffer that you read back on the CPU) or you could do it in client side javascript entirely (don't know how slow/fast that is). OR you could just let the user define it, set a flag in all cc.Sprites for example that say whether or not that sprite uses partial transparency.
Overall regarding options 1 vs 2 in your initial post at the top, I think it's just because bottlenecks are in different places for PCs vs mobile: When you do one bufferSubData call to update the entire buffer, the major load for desktops is going to be on memory bandwidth and the connection of the PCI bus that is used to transfer the data to the GPU. This is something that PC hardware has been optimized to do very well for years now (used to transfer and draw millions of vertices and similar data every second for a long time now). So for 2D games and simple 3D games, PCs get more easily bottlenecked by everything that requires lots of CPU power (because CPUs in contrast have not been getting better as quickly as other parts of PCs), notably something like WebGL calls of any kind. So only using one bufferSubData vs multiple makes the job easy for PCs. Meanwhile on mobile, GPUs are generally weak and memory bandwidth is extremely low compared to PCs, so the bottleneck is in actually copying data around. More WebGL calls are slower for mobile just as they are for PC, but overall the reduction of "memory you touch" by using multiple bufferSubDatas to transfer the minimal amount of memory is just a more important optimization for mobile. No idea what to do here. If you're fine with supporting two implementations you could just select what the renderer does depending on whether you detect a mobile or PC browser. |
Sorry I was mistaken about the difference between desktop and mobile, the desktop is performing better with bufferSubData call per sprite, and mobile is performing better with one single bufferSubData call for the entire global vertex buffer each frame.
I haven't succeed to do this, I specified the offset in drawArrays, but if I remove vertexAttribPointer call, it just render nothing, and if keep them, it renders strange things, the vertex data it use in this way is not correct.
|
I think, in cocos2d is better to use separate buffer for vertex, color and UV's. Most often changed parameter in engine is position of sprites. Color and UV changes recently relatively to vertex position. It can reduce buffer size to update.
As the result we have two cycles within all cmd, but do not call heavy gl function for each cmd. |
OK, I got it working, now |
I think we need to test the result later, for v3.11 I'm gonna move on to do the performance test case. |
@pandamicro , what do you think about my second proposal? You introduced one big buffer for all sprites, but not using it's benefits in batch rendering(or there is no samples how to use it in batching). |
The next commit will remove unnecessary |
@1scaR1 Please take a look at the last commit, if you can test it with your game, that would be great ! |
@pandamicro Hmm, how can I use your changes concerning |
Don't need to, in fact, global buffer is for all sprites that couldn't be batched, the breaking ones, like all label ttf. So you can just pull my branch, and nothing need to be changed |
This commits brings no performance improvement, because this
still called each frame, while there is nothing happens in game. P.S. P.P.S. |
Sorry, there was a state reset I missed. Now it should be good.
Yes, batched sprites still use separated buffers, I haven't refactor that part, and it shouldn't have big difference cause there shouldn't be much of them. |
Usage of P.S. One purpose: May be move all buffer staff to renderer? So sprite render cmd will tell render only it's info about verts and sprite state(blend, texture, offset in buffer, etc) |
Thanks for testing
This is due to the fact that we permit Sprite with no texture, while it will go into the
Totally, that should be done in the next step. And unified buffer should be activated for batched sprites too. |
@pandamicro But i don't have any Sprites with no texture on Scene... |
Do you have any empty label ? You can actually break in the |
I found that my game spawn a lot of incorrect sprites(with incorrect sprite frame name) on background, by debugging rendering process:) |
That should be a bug, I fixed it but haven't pushed yet |
So I have tried to use unified buffer for all sprites including auto batched sprites (vertexAttribPointer call reduced), but performance dropped on all devices I got. So won't dig further for now |
This PR is a continue of #3265
It contains mainly two important improvements:
There are some details to discuss:
I found a strange result inconsistence between desktop browser and mobile browsers, for the sprites global buffer, I have two option to update them:
bufferSubData
to refresh the WebGL buffer, so probably one glbufferSubData
call with the whole buffer per frame.bufferSubData
for each dirty sprite during its rendering function, so multiple (one for each dirty sprite)bufferSubData
with small data sections per frame.It turns out on desktop it's faster using option 1 than using option 2, while on mobile (iOS Safari and Android Chrome) it's just the contrary, option 2 is faster (significantly).It turns out on desktop it's faster using option 2 than using option 1, while on mobile (iOS Safari and Android Chrome) it's just the contrary, option 1 is faster (significantly).
Secondly, this PR doesn't bring any performance improvement on desktop Chrome, it's even not as good as before. But it do improve the performance on mobile especially on Android.
Does anyone know what's happening here ?
@dabingnn @jareguo @1scaR1 @heishe