
24 April, 2008

How the GPU works - part 3 (optimization!)

This is the last one. So after writing way too much, let's get practical and recap what "good" looks like for a GPU:

Coherency. Doing the same operations on huge amounts of data arranged sequentially. That's why we want to minimize the number of draw calls, and minimize the GPU state changes between them. Unfortunately, which changes are most expensive is a hardware-dependent matter, but we'll surely sort draw calls by render target first and shader second.
All sources of incoherent access are bad. That's why we want to use interleaved vertex buffers, we want to make them as small as possible, and we want to use swizzled textures, with mipmaps. But there are also other things that can cause problems, like random access to shader constants (via indexing into arrays, something that happens for example when you're doing bone animation with GPU skinning, as in the sketch below). Or dynamic branching where nearby data (for pixel shaders, usually blocks of 8x8 pixels) frequently don't all take the same execution path.
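To make the constant-indexing case concrete, here's a minimal HLSL skinning sketch (the names and the 64-bone palette are hypothetical): each vertex indexes the bone array with its own indices, so consecutive vertices can touch scattered constant registers.

// Minimal GPU-skinning sketch (hypothetical names/layout), illustrating
// dynamic indexing into a constant array: boneIndices varies per vertex,
// so consecutive vertices may read scattered constant registers.
float4x3 boneMatrices[64]; // palette of bone transforms

float3 SkinPosition(float4 position, int4 boneIndices, float4 boneWeights)
{
    float3 skinned = 0;
    for (int i = 0; i < 4; i++) // 4-bone blend
        skinned += mul(position, boneMatrices[boneIndices[i]]) * boneWeights[i];
    return skinned;
}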

Balance. Think about the GPU as a huge pipeline (it is one). It's not the total amount of work it does that counts, but how well the stages are balanced: the cost of a draw call on the GPU is not the sum of the work done by all stages, but the cost of the slowest stage alone. Think about it even outside the GPU: consider moving things from the CPU to the GPU vertex shader or vice versa, and between the vertex shader and the pixel shader. Consider the other stages too, not only the shader pipelines, especially the ones that deal with memory (e.g. vertex fetching and render target writing). Blending can be the bottleneck! In that case you can often reduce the overdraw by doing more! E.g. don't draw big alpha-keyed quads, use real geometry. Or draw fewer particles, but fancier ones!

Note: when profiling, remember that within a draw call it might happen that sometimes one part of the pipeline is the one stalling it, and at other times another part causes the problem (e.g. because of memory cache behaviour). So while there is usually a single stage that is slow and that we have to "rebalance", it's also true that doing less work in general helps a little even in the other stages.


Latency hiding. Shader pipelines have a way to hide memory latencies (the same one employed by modern CPUs as well): they keep more threads in execution than there are arithmetic units, so each unit always has something to do even if many threads are stalled on a long-latency operation. On the GPU you get more threads in flight if your shader uses fewer registers, so fewer registers is good. And each memory operation gives you a number of ALU instructions "for free" (remember: balancing; texture units and ALU units operate in parallel). Both moving computations into lookup tables and vice versa can be a good idea (e.g. using N analytical lights or an environment lighting cubemap, as in the sketch below); it all depends on whether your shader is ALU or texture bound. Don't guess, profile. If your pixel shader is texture bound, for example, and you can't do anything about it, you could at least make your vertex shader faster (if it's ALU bound and it's a bottleneck for some primitives) by moving some of its computations into the pixel shader.
Dependencies between ALU and texture fetches, or between texture and texture (i.e. texture fetches with UV coordinates that depend on other textures or on computations), can be a problem on some platforms.
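For example, here are the two alternatives above as a minimal HLSL sketch (hypothetical names, assuming a 4-light setup and a precomputed irradiance cubemap): the first burns ALU, the second trades it for a texture fetch. Which one wins depends on which unit is your bottleneck.

// Two ways to get a diffuse lighting term (hypothetical setup): pick
// per-shader depending on whether you're ALU or texture bound.
samplerCUBE irradianceMap;            // precomputed diffuse lighting
float3 lightDir[4];                   // N = 4 analytical lights
float3 lightColor[4];

float3 DiffuseALU(float3 n)           // ALU bound: N dots and mads
{
    float3 c = 0;
    for (int i = 0; i < 4; i++)
        c += saturate(dot(n, lightDir[i])) * lightColor[i];
    return c;
}

float3 DiffuseLookup(float3 n)        // texture bound: one cubemap fetch
{
    return texCUBE(irradianceMap, n).rgb;
}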

Don't work. The best thing you can do is no work at all. Kill things as early as possible in the pipeline (i.e. before the pixel shader!). Do not draw things that are occluded (e.g. use occlusion queries). Do not shade your vertices twice (correctly use and optimize for the post-transform cache). Do not overdraw (correctly use and optimize for the early-Z rejection stage of modern GPUs). Draw only where you need to (e.g. don't rasterize large triangles that are mostly empty, i.e. with alpha = 0; execute post effects only where you need them). Think about early-stencil (to reduce overdraw, for example, in particle systems). Consider dynamic branching (wisely).

Work less. Again, this is obvious, but after saying all that, remember that you should still try to do the least possible work overall; then balance; then, if you can't optimize further, you might consider adding work (features) to stages that are still free (that could do more, hidden by other latencies).
Doing less work means, for example, reading (and writing) less memory (smaller vertex buffers, smaller texture formats), using fewer registers for the interpolation of vertex outputs (remember that interpolators are always float4 and that shader compilers can't pack them for you to use fewer registers, as sketched below; the same applies to constants, especially arrays), and using fewer ALU instructions to do the same work.
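A minimal sketch of the interpolator packing point (hypothetical names): two float2 UV sets packed into one float4 interpolator use a single register instead of two half-empty ones.

// Interpolators are full float4 registers: packing two UV sets into one
// output uses a single register instead of two half-empty ones.
struct VSOutput
{
    float4 position : POSITION;
    float4 uvPacked : TEXCOORD0; // xy = diffuse UVs, zw = lightmap UVs
};

// In the pixel shader:
//   float2 uvDiffuse  = input.uvPacked.xy;
//   float2 uvLightmap = input.uvPacked.zw;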
Draw fewer pixels: draw expensive stuff to smaller render targets and compose them back with the full-res one (again, particles!). Draw fewer vertices: use fewer triangles for smaller objects (LOD). Use cheaper shaders for small objects (see my post, "how to properly lod pixel shaders"), and so on.
Avoid transforming your vertices multiple times by making optimal use of the vertex post-transform cache (i.e. use indexed, cache-optimized triangle lists). Avoid wasted pixel shader computations by maximizing the number of "full quads", i.e. blocks of 2x2 pixels covered by a single triangle. If your triangles get too small, the rasterizer will emit many quads (the quad is the pixel shading unit; pixel shaders don't work on single pixels because they need neighbors to compute derivatives) with pixels masked out so they aren't written to the frame buffer, thus wasting processing power.
Cull more! Consider predication and occlusion queries. Use hi-Z (early z culling), use hi-Stencil! Everywhere!

Give hints to your shader compiler. For example, if your vertex position's .w coordinate is always one, don't read it from the input stream: read a float3 and, when you need to do homogeneous transforms, build a float4 with the fourth component set to one (exceptions to this rule apply, e.g. if you read a texture, do no computations on it, and just return it, then explicitly setting alpha to zero, even if it's always zero in the texture itself, does slow the shader down).
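In HLSL, the hint looks something like this (a minimal sketch, hypothetical names):

// Read only xyz from the stream and rebuild w = 1 in the shader:
// the compiler then knows the fourth component statically.
float4x4 worldViewProj;

float4 TransformPosition(float3 positionIn : POSITION) : POSITION
{
    return mul(float4(positionIn, 1.0), worldViewProj);
}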
Always use the smallest vector type you need. Most GPUs, even old ones (this dates back to how fixed pipelines were made, which allowed separate processing of the RGB color and the alpha value), can process a scalar and a vector operation at the same time (some can pair a float3 with a float1, some a float4 with a float1, some others can also do a float2 with a float2...).
Use appropriate types: e.g. if your shader input is an integer, and you'll use that integer to index an array, declaring it as a float is slower (but doing operations on it as a float can be faster; check your compiled code). Always prefer vector computations over multiple scalar ones, and always use intrinsic functions (i.e. don't write your own normalization function), as in the sketch below.
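A minimal sketch of both points (the shading math is hypothetical, just an illustration):

// Prefer intrinsics and the smallest types that express the computation:
// the vector work and the scalar work below can be co-issued on hardware
// that pairs a vector and a scalar ALU operation.
float3 ShadeSketch(float3 n, float3 l, float3 albedo, float gloss)
{
    float3 nn = normalize(n);          // intrinsic, not a hand-rolled rsqrt
    float  k  = gloss * 0.5 + 0.5;     // scalar op, can pair with vector ops
    return albedo * (saturate(dot(nn, l)) * k);  // vector * scalar
}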

Other tips and tricks...
  • Check your compiled shader code for potential wasted work. For example, when dealing with integers, compilers generate bad code: don't use integer computations, and don't use integer modulo; most of the time you can use a float multiply and divide instead (see the first sketch after this list).
  • Use what your platform gives you, e.g. signed textures; swizzling is usually free, some modifiers also are (*2, *4...), and offsetting UV coordinates in texture fetches by a constant can be free. Check your documentation!
  • Usually 2D textures are the fastest option (1D, 3D, and cube ones are slower).
  • Depending on the platform, halves may or may not be faster than full floats (sometimes only normalize is faster, e.g. on NVidia hardware) and/or they may require fewer registers.
  • Consider dynamic branching (wisely: it can be slower than not using it), especially if you're sure it's coherent most of the time (see above). A neat idea: mipmap your shadowmaps (with care... you can't filter them like normal textures), then use a low mipmap level and dynamic branching to avoid expensive computations (e.g. percentage closer filtering) in areas that are trivially non-shadowed (or trivially shadowed, depending on whether you compute the min or the max of neighboring pixels in your mipmaps). See the second sketch after this list.
  • Usually, texture fetches that depend on computations (UV coordinates that are not directly given by interpolated vertex shader outputs) always have to be evaluated (i.e. they can't be skipped by dynamic branching), otherwise the shader can't correctly compute the derivatives used for texture filtering. This is platform dependent (ATI on PC surely has this problem). So beware that those fetches, inside dynamic branches, could be automatically moved outside them (most of the time you can move the computations into the vertex shader and pass them via interpolators, or explicitly set the LOD level or gradient in the texture access, as in the last sketch after this list; some simple computations do not incur this problem anyway).
  • Know your platform. Usually the platform-independent rules (like everything I've told you in these posts) are the things that matter most, but exceptions can really kill your performance (some operations can be really slow, in unexpected ways). Profile and benchmark to learn which things are most expensive on your platform.
  • Profiling is hard. Use profiling tools that read the internal counters of the GPU. Finding the pipeline bottleneck by modifying the load of a stage can be done, but should be done with care. For example, to check whether your draw call is limited by the vertex shader, you might think of replacing it with a simpler version; but if that simpler version changes the way pixels are drawn, you are also impacting the pixel shader. The correct way is to profile the full shader in a special configuration of its inputs (e.g. bone matrices set to the identity for a skinning shader), such that you can later hardcode that special configuration, removing computation while yielding the same results.
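Here's the float-only modulo from the first bullet (a minimal sketch; whether it actually beats integer code is platform dependent, check your compiled output):

// Integer-free modulo: on shader models that emulate integers,
// x % n compiles to worse code than this multiply/floor sequence.
float FloatMod(float x, float n)
{
    return x - n * floor(x / n);
}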
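And a sketch of the shadowmap idea (hypothetical names; it assumes the mip chain is built with a min() filter so the coarse test is conservative, and that mip level 4 is coarse enough):

// Dynamic branch around expensive PCF, driven by a low mip of the
// shadowmap whose mip chain stores the min occluder depth per region.
sampler2D shadowMap;     // mips built with min(), not averaging
float2 shadowTexelSize;  // 1.0 / shadowmap resolution

float ShadowPCF(float2 uv, float z)
{
    // the expensive path: a 4x4 box PCF kernel, all explicit-LOD fetches
    float sum = 0;
    for (int y = -1; y <= 2; y++)
        for (int x = -1; x <= 2; x++)
        {
            float d = tex2Dlod(shadowMap,
                float4(uv + float2(x, y) * shadowTexelSize, 0, 0)).r;
            sum += (z <= d) ? 1.0 : 0.0;
        }
    return sum / 16.0;
}

float ShadowTerm(float2 uv, float z)
{
    // coarse test on a low mip: if this pixel is nearer than the nearest
    // occluder of the whole region, it's trivially lit, skip the PCF
    float coarseMin = tex2Dlod(shadowMap, float4(uv, 0, 4)).r;
    [branch]
    if (z <= coarseMin)
        return 1.0;
    return ShadowPCF(uv, z);
}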
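Finally, the explicit-LOD workaround for dependent fetches inside branches (a minimal sketch, hypothetical names): tex2Dlod carries no derivative computation, so the fetch can legally be skipped by the branch.

// tex2D needs screen-space derivatives and so may be hoisted out of a
// dynamic branch; tex2Dlod does not, so it can actually be skipped.
sampler2D detailMap;

float3 DetailSketch(float2 uv, float blend)
{
    float3 detail = 0;
    [branch]
    if (blend > 0.01)
        detail = tex2Dlod(detailMap, float4(uv * 8, 0, 0)).rgb * blend;
    return detail;
}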

7 comments:

Unknown said...

Hello, I am a member of the Ceske-Hry.cz game developer community (Czech and Slovak only). I haven't found your email on this blog, so I am putting this here. I'd like to ask you for permission to translate your article "How the GPU works" into Czech and publish it on the Ceske-Hry.cz website with your approval. Thanks for your reply.

Unknown said...

No problem. If you want, there's a messenger applet embedded in the blog (top left) that you can use to contact me on MSN.

I would also recommend that you check out my newer GPU-vs-CPU article: http://c0de517e.blogspot.com/2008/07/gpu-versus-cpu.html

DEADC0DE said...

ah, anyway, if you want someone who does not give out his email address to reply to you, the best option is to include your own email address in the message instead :)

Anonymous said...
This comment has been removed by a blog administrator.
O Rapaz Invisível said...

Hey there. I've bumped into your blog while searching online about z-buffering. Mainly I am curious about what entity (or entities) can cause the effect known as "z-fighting" in the rendering process. Is it caused by wrong calculations inside the GPU? By wrong modeling in an application (before anything is even sent to the GPU)? By the graphics card drivers? I'm a total n00b in this matter but I'm curious about this particular effect (I want to know if it indicates a hardware defect in a graphics card), so I thought maybe you could help shed some light on this :) Thanks a lot in advance!

DEADC0DE said...

the invisible boy: it's not a sign of a defective graphics card... http://en.wikipedia.org/wiki/Z-fighting

Unknown said...

Answer for "The Invisible Boy" about z-fighting:
Your camera has a near and a far plane; together they bound the view frustum. The depth range between the near and far planes is represented by your z-buffer with a certain precision, e.g. 16 bits. Depending on the distance between the near and far planes and on the precision of your z-buffer, it can happen that two or more triangles map to the same z-buffer value, which causes that flickering / noisy rasterization. To prevent it you can shrink the frustum (bring the near and far planes closer together) or increase the z-buffer precision, e.g. from 16 to 32 bits.