
28 April, 2008

Parallel linking

At last! A smart programmer by the name of Ian Lance Taylor has developed a parallel linker for GCC (x86), Gold. I hope this idea spreads fast (to Visual Studio...), as C/C++ link times are the main bottleneck to fast code iteration nowadays.

Linking is so slow that most companies use "bulk" builds (and I was really surprised by that; until a year ago I thought that incremental linkers, good precompiled headers and sane dependency management were good enough), i.e. merging all the source code (per directory, or per package) into single, big source files (by using the preprocessor and includes).
Even if this slows down compilation (which is very fast with multicore or distributed systems like IncrediBuild or distcc/ccache), requires regeneration of those bulk files if you add sources to the project, and has the nasty side effect of easily breaking includes in non-bulk builds, it speeds up linking so much that it is almost compulsory to have in a large project.
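To make the "bulk" idea concrete, here is a minimal sketch of a bulk-file generator; the file names and the exact output format are invented for illustration, not taken from any particular build system:

```python
# Sketch of a "bulk" (unity) build file generator. All file names are
# hypothetical; real build systems generate one bulk file per package.
def make_bulk_file(sources):
    """Return the text of a single translation unit that #includes every
    .cpp in `sources`, so the whole package compiles and links as one
    object file."""
    lines = ["// Auto-generated bulk build file - do not edit."]
    for src in sources:
        lines.append('#include "%s"' % src)
    return "\n".join(lines) + "\n"

print(make_bulk_file(["player.cpp", "enemy.cpp", "weapons.cpp"]))
```

The generated file #includes implementation files, not headers, which is exactly why a symbol accidentally resolved through another included .cpp can break the non-bulk build.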

Is writing text the best way to code?

I do think so. But it seems that many people do not agree with me, as there are a lot of experiments with graphical programming languages. One of the best I've ever seen is Subtextual, but I still don't see the point of those languages.

There are a couple of educational languages that use graphics to ease programming, and some of them are also very nice (check out this one, Alice); for children this is probably the best way to start (even if I started coding at eight, in Basic, and I don't remember having much trouble with the syntax itself). But for most newbies the hard thing to understand is the logic of coding, not the syntax of the languages. Syntax is easy. I can't find any simpler way to express a condition than the sentence: if x>10 then print "the number is greater than ten"

In general, I find that text is a great way of expressing logical statements. And textual input is the best way of editing it, of course. While it is true that code is data (and we have known that since the beginning of computer science, from lambda calculus), I don't see any value in editing it as structured data. Modern IDEs already use that equivalence, and almost no one today is editing code as raw text: even the simplest code editor has syntax highlighting and code folding, many do refactoring, many others use reflection or parsing to run tests, to do coverage analysis, static code checking... We do take advantage of the structure of the code.

Last but not least, beware of false analogies. I was mainly motivated to write this post by this other one about Subtextual. There, the author writes:

[...] One of the things you do most frequently in an IDE like this is type some code, which at least temporarily puts your program into a totally invalid state. As you're typing "def ", your module is syntactically invalid [...] If you're using a tool to edit something other than a program, like, say, Inkscape, as you move between different states in your drawing (add a line, change a gradient, resize a shape) each one is a valid SVG document if you were to save it. [...]"

Now, the nice thing about analogies is that they take two truths and stitch them together. As you're starting from two truths, people usually don't pay much attention to the intricacies of that mechanism, and often see relations that are not there. Here we are comparing coding using text to (vector) image editing, showing how textual coding goes from one valid program to another passing through many invalid ones, while image editing always maintains a valid image. But this is not true at all. When I edit a photo in Photoshop, I go through many invalid ones, meaning that they don't represent anything believable; they are wrong. I can cut a piece of skin, paste it into a layer, and move it to another position in the image, in order to cover a scar on the body. And until I've finished blending the new layer, the image is really invalid to me. Of course in image editing we don't have a formal means to assess validity, but still, when making our analogies, we should be sure to work at the same level of abstraction. We can do that by using the correct, higher one in the image editing part of the analogy, or, to make my point in a stronger way, by lowering the one on the programming side: as Inkscape goes from one valid SVG document to another, an IDE goes from one valid ASCII document to another.

Finally, let me make an analogy of my own. Coding is about expressing ideas (only algorithmic ones) in a language (a very small, rigid but extensible one). I don't see any better way of doing that than writing text, just as I wouldn't if I had to write a novel in a natural language, or a theorem in maths.

Huang-Rowley did it again: Siggraph 2008

I usually don't post news here (dunno why, maybe I should have a weekly or so interesting links post), but this is a nice one:
Ke Sen Huang maintains a huge list of links to papers from various conferences; his work is really helpful.
Another very nice conference for realtime rendering is Smartgraphics; last year's proceedings are here.

24 April, 2008

How the GPU works - part 3 (optimization!)

This is the last one. So, after writing way too much, let's get practical and recap what good looks like for a GPU:

Coherency. Doing the same stuff on huge amounts of data arranged in a sequential fashion. That's why we want to minimize the number of draw calls, and minimize the GPU state changes between them. Unfortunately, which changes are most expensive is a hardware-dependent matter; surely we'll sort draw calls per render target first and shader second.
All sources of incoherent access are bad. That's why we want to use interleaved vertex buffers, we want to make them as small as possible, and we want to use swizzled textures, with mipmaps. But there are also other things that can cause problems, like random access to shader constants (via indexing into arrays, something that happens for example when you're doing bone animation with GPU skinning). Or dynamic branching where nearby data (for pixel shaders, usually blocks of 8x8 pixels) frequently don't all take the same execution path.

Balance. Think about the GPU as a huge pipeline (it is). It's not the total number of things it does that counts, but how the stages are balanced: it's not the sum of the work done by all stages that makes the cost of a draw call on the GPU, but only the cost of the slowest stage. Think about it even outside the GPU: consider moving things from the CPU to the GPU vertex shader or vice versa, and between the vertex shader and the pixel shader. Consider other stages too, not only the shader pipelines, especially if they deal with memory (i.e. vertex fetching and render target writing). Blending can be the bottleneck! In that case you can often reduce the overdraw by doing more! I.e. don't draw big alpha-keyed quads, use real geometry. Or draw fewer particles, but fancier ones!

Note: when profiling, remember that within a draw call it might happen that sometimes one part of the pipeline is the one stalling it and some other times another part causes the problem (i.e. because of memory cache behaviour). So there is usually a single stage that is slow and that we have to "rebalance", but it's also true that doing less work in general helps a little even in the other stages.
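The "only the slowest stage counts" rule can be sketched with a toy cost model (the stages and cycle counts below are made up for illustration):

```python
# Toy model of a pipelined GPU draw call: once the pipeline is full,
# elements come out at the rate of the slowest stage, so the per-element
# cost is the max of the stage costs, not their sum.
def drawcall_cost(stage_costs):
    """Cycles per element for a full pipeline (arbitrary units)."""
    return max(stage_costs.values())

stages = {"vertex_fetch": 2, "vertex_shader": 4, "rasterizer": 1,
          "pixel_shader": 9, "blend": 3}
print(drawcall_cost(stages))        # pixel shader bound: 9

# Halving every stage EXCEPT the bottleneck changes nothing:
cheaper = {k: (v if k == "pixel_shader" else v // 2)
           for k, v in stages.items()}
print(drawcall_cost(cheaper))       # still 9
```

This is why optimizing a non-bottleneck stage buys you nothing on its own; at best it frees headroom to add features there for free.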

Latency hiding. Shader pipelines have a way to hide memory latencies (the same way employed by modern CPUs as well). They hide them by having more threads in execution than actual arithmetic units, so each unit always has something to do even if many threads are stalled on a long latency. On the GPU you have more threads if your shader uses fewer registers. So fewer registers is good. And each memory operation gives you a number of ALU instructions "for free" (remember balancing: texture units and ALU units operate in parallel). Both moving computations into lookup tables and vice versa can be a good idea (i.e. using N analytical lights or an environment lighting cubemap); it all depends on whether your shader is ALU or texture bound. Don't guess, profile. If your pixel shader is texture bound, for example, and you can't do anything about it, you could at least make your vertex shader faster (if it's ALU bound and it's a bottleneck for some primitives) by moving some of its computations into the pixel shader.
Dependencies between ALU and texture fetches, or between texture and texture (i.e. texture fetches with UV coordinates that depend on other textures or computations), can be a problem on some platforms.
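The registers-vs-threads trade-off is easy to see in a back-of-the-envelope model. All numbers here (register file size, latencies) are invented; real GPUs differ, but the shape of the math is the same:

```python
# Hypothetical occupancy model: fewer registers per thread means more
# threads in flight, which means more latency hiding.
def threads_in_flight(register_file_size, registers_per_thread):
    """How many threads a pipeline can keep resident at once."""
    return register_file_size // registers_per_thread

def latency_hidden(threads, alu_cycles_per_thread, fetch_latency):
    """A fetch is fully hidden when the other threads' ALU work covers
    the stalled thread's memory wait."""
    return (threads - 1) * alu_cycles_per_thread >= fetch_latency

REGISTER_FILE = 256                 # registers per pipeline (made up)
print(threads_in_flight(REGISTER_FILE, 4))   # 64 threads
print(threads_in_flight(REGISTER_FILE, 32))  # only 8 threads
print(latency_hidden(64, 8, 400))   # True: 63*8 = 504 >= 400
print(latency_hidden(8, 8, 400))    # False: 7*8 = 56 < 400
```

Same shader logic, eight times the register pressure: the fetch latency that used to be free now stalls the whole pipeline.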

Don't work. The best thing you can do is not do any work at all. Kill stuff as soon as possible in the pipeline (i.e. before the pixel shader!). Do not draw things that are occluded (i.e. use occlusion queries). Do not shade your vertices twice (correctly use and optimize the post-transform cache). Do not overdraw (correctly use and optimize the early-Z rejection stage of modern GPUs). Draw only where you need to (i.e. don't have large triangles that are mostly empty, i.e. with alpha = 0; execute post effects only where you need them). Think about early-stencil (to reduce overdraw, for example, in particle systems). Consider dynamic branching (wisely).

Work less. Again, this is obvious, but after saying all that, always remember that you should still try to do the least possible work overall, then balance, and then, if you can't optimize further, you might consider adding work (features) to stages that are still free (and could do more work hidden by other latencies).
Doing less work means, for example, reading (and writing) less in memory accesses (smaller vertex buffers, smaller texture formats), using fewer registers for interpolation of vertex outputs (remember that they are always float4 and that, in that case, shader compilers can't pack stuff for you to use fewer registers; the same applies to constants, especially to arrays), and using fewer ALU instructions to do the same work.
Draw fewer pixels: draw expensive stuff to smaller render targets and compose them back with the full-res one (again, particles!). Draw fewer vertices: use fewer triangles for smaller objects (LOD). Use cheaper shaders for small objects (see my post, "how to properly lod pixel shaders") and so on.
Avoid transforming your vertices multiple times by optimally using the vertex post-transform cache (i.e. use indexed, cache-optimized triangle lists). Avoid wasted pixel shader computations by maximizing the number of "full quads", i.e. blocks of 2x2 pixels covered by a single triangle. If your triangles get too small, the rasterizer will emit many quads (quads are the pixel shading unit; pixel shaders don't work on single pixels because they need neighbors to compute derivatives) with pixels masked out so they aren't written to the frame buffer, thus wasting processing power.
Cull more! Consider predication and occlusion queries. Use hi-Z (early Z culling), use hi-stencil! Everywhere!
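The quad waste from small triangles can be quantified with a tiny model (the pixel counts below are illustrative, not measured):

```python
# Rough illustration of quad overshading: the pixel shader runs on 2x2
# quads, so a triangle covering few pixels still pays for whole quads.
def shading_cost(covered_pixels, quads_touched):
    """Return (lanes shaded, efficiency): 4 lanes per touched quad,
    efficiency = visible pixels / shaded lanes."""
    shaded = quads_touched * 4
    return shaded, covered_pixels / shaded

# A big triangle: 10000 pixels over 2500 fully covered quads.
print(shading_cost(10000, 2500))    # (10000, 1.0) - no waste
# A tiny triangle: 3 visible pixels straddling 2 quads.
print(shading_cost(3, 2))           # (8, 0.375) - most lanes masked out
```

This is the quantitative reason behind LOD: as triangles shrink toward pixel size, efficiency collapses even though the visible pixel count stays the same.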

Give hints to your shader compiler. For example, if your vertex position's .w coordinate is always one, don't read it from the input stream: read a float3 and, when you need to do homogeneous transforms, build a float4 with the fourth component set to one (exceptions to this rule apply; i.e. if you read a texture, do no computations on it, and just return it, then explicitly setting alpha to zero, even if it's always zero in the texture itself, does slow down the shader).
Always use the smallest vector type you need. Most GPUs (even old ones; this dates back to the way fixed pipelines were made, which allowed separate processing of the RGB color and the alpha value) can process a scalar and a vector operation at the same time (some can process a float3 and a float1 in parallel, some can process a float4 and a float1, some others can also do a float2 and a float2...).
Use appropriate types, i.e. if your shader input is an integer, and you'll use that integer to index an array, declaring it as a float is slower (but doing operations on it as a float can be faster; check your compiled code). Always prefer vector computations over multiple scalar ones, and always use intrinsic functions (i.e. don't write your own normalization function).

Other tips and tricks...
  • Check your compiled shader code for potential wasted work. For example, compilers generate bad code when dealing with integers: don't use integer computations, don't use integer modulo; most of the time you can use a float multiply and divide instead.
  • Use what your platform gives you, i.e. signed textures; swizzling is usually free, some modifiers also are (*2, *4...), offsetting UV coordinates in texture fetches by a constant could be free; check your documentation!
  • Usually 2d textures are the fastest option (1d and 3d and cube ones are slower).
  • Depending on the platform, halfs may or may not be faster than full floats (sometimes only normalize is faster, i.e. on NVidia hardware) and/or require fewer registers.
  • Consider dynamic branching (wisely: it could be slower than not using it), especially if you're sure the branch is coherent most of the time (see above). A neat idea: mipmap your shadowmaps (with care... you can't filter them as normal textures), then use a low mipmap level and dynamic branching to avoid expensive computations (i.e. percentage closer filtering) in areas that are trivially non-shadowed (or that trivially are shadowed, depending on whether you compute the min or the max of neighboring pixels in your mipmaps).
  • Usually, texture fetches that depend on computations (UV coordinates that are not directly given by interpolated vertex shader outputs) always have to be evaluated (i.e. they can't be discarded in dynamic branching), otherwise the shader can't correctly compute derivatives for texture filtering. This is platform dependent (ATI on PC surely has this problem). So beware that those fetches, inside dynamic branches, could be automatically moved outside them (most of the time you can move the computations into the vertex shader and pass them via interpolators, or explicitly set the LOD level or gradient in the texture access; some simple computations do not incur this problem anyway).
  • Know your platform. Usually platform-independent rules (like everything I've told you in these posts) are the things that matter most. But exceptions can really kill your performance (some operations can be really slow, in an unexpected way). Profile and benchmark to know which things are more expensive on your platform.
  • Profiling is hard. Use profiling tools that read the internal counters of the GPU. Finding the pipeline bottleneck by modifying the stage load can be done, but should be done with care. For example, to check if your draw call is limited by the vertex shader you could think of replacing it with a simpler version. But if that simpler version changes the way pixels are drawn, you are also impacting the pixel shader. The correct way to do it is to profile with the full shader in a special configuration of its inputs (i.e. bone matrices set to the identity for a skinning shader) such that you can later hardcode that special configuration, removing computation but yielding the same results.

At last, at least

NVidia published an article about using SAS with FX Composer.

I did like the old (1.x) FX Composer, and I was really disappointed by the new one, but at least now I know how to use SAS annotations and scripting. I was never able to find anything about them on the internet (they are/were standard Microsoft annotations, but there's nothing useful about them in the DirectX SDK documentation).
Strange that they wrote this now; wasn't Collada FX supposed to be used instead of SAS in the new FX Composer?

Note: non-semantic stuff is well documented in the DirectX SDK

Singletons: the new superglue

Singleton is the most common, if not the only, design pattern I see in my game programming work. It's so popular because it provides a nice encapsulation of a nasty concept: global, shared state.

But everyone knows that global, shared state is a bad thing; it was a bad thing to have even when no one cared about thread safety. So we just wrapped it in that nice design pattern...
Global state makes two classes communicate whenever they both happen to use it, rather than only when you explicitly design the dependency. Anyone who can see it can access it, both for reading and writing, and only a very small number of system components should be accessible in such a broad way.

Generally speaking, I don't like patterns too much. They provide you just the right amount of knowledge you need to make a bad design while thinking it's good. The worst errors I've ever seen do not come from having no knowledge, but from having just the right amount of it: too little to really understand what you're doing, but enough to think you do. That amount of knowledge is evil.
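Here is a minimal sketch of the hidden coupling described above; all the class names and the scenario are invented for illustration:

```python
# Two classes that never mention each other, silently coupled through a
# singleton. Names (Settings, Spawner, Menu) are hypothetical.
class Settings:                      # a typical singleton wrapper
    _instance = None

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.difficulty = 1

class Spawner:                       # silently *reads* the global state
    def enemies_to_spawn(self):
        return 2 * Settings.instance().difficulty

class Menu:                          # silently *writes* the same state
    def on_hard_mode(self):
        Settings.instance().difficulty = 3

spawner, menu = Spawner(), Menu()
print(spawner.enemies_to_spawn())    # 2
menu.on_hard_mode()
print(spawner.enemies_to_spawn())    # 6 - changed by a class it never sees
```

Neither class signature reveals the dependency: you can only discover it by reading every method body that touches the singleton, which is exactly the "superglue" problem.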

Note: This is a very nice article on why static variables are bad in general. This of course applies even more to global ones and thus to singletons.
Note: Another very good article against singletons (actually, better than the first one) is this one.

13 April, 2008

How the GPU works - part 2

So how does the actual shader execution unit work? This is where things get very platform (GPU) specific, but let's try to get the general picture without infringing any NDA. Even for developers it's not always easy to find in-depth information on all the details; luckily, most of the time they aren't needed either.

For the interested reader, a nice starting point is the recently disclosed ATI-AMD and Intel documentation about their GPUs. The Intel ones are more interesting than you might imagine, given that they are low-end graphics chips. On the NVidia side, the best documents you can find (as of now) are the CUDA/G80 ones (and as CUDA is kinda popular now, there are interesting investigations done even by third parties)...

Enough said, let's start. Every GPU has a number of arithmetic units (ALUs) and texture units. As in every modern processor, memory access is a couple of orders of magnitude slower than instruction processing, so what we want to do is keep our execution units always full, in order to amortize those costs, hiding high latencies behind high throughput.

That is not something new at all; CPUs started this trade a long time ago. A single instruction was split into a number of simpler stages, and those stages were arranged in a deep pipeline. If the pipeline is always full we have a high latency, as every instruction has to go through all those pipeline stages, but if we have no bubbles in the pipeline we get a high throughput: the mean number of instructions per second that we're able to process is high. Simpler stages meant more gigahertz; deep pipelines meant that a pipeline stall was, and is, incredibly expensive (i.e. a branch misprediction). Even more similar to what GPUs do is hyperthreading: we get more "hardware threads" per functional CPU core because that way the CPU has different independent streams of instructions to compute, and if one is stalled on a memory access, there's another one to keep its ALUs busy...

GPUs employ the same ideas, but in a far more radical way than CPUs do. The stream execution model is such that we want to execute the same instruction sequence (shaders) on a huge amount of data (vertices or pixels), and all the computations are independent (geometry shaders do have access to topological information, i.e. how a vertex is connected to other vertices, but the computation on one vertex has no influence on the ones done for the other vertices).

So what we do is partition the input data into big groups; within one group, all the data has to be processed in the same way (with the same shader/pipeline configuration).
Execution in a group happens in parallel: if we have a shader of ten instructions and the group is made of one hundred different inputs, the ALUs compute the first instruction for each of the one hundred inputs, then the second, and so on. Usually more than a single ALU works on a given group (ALUs are split into different pipelines, and each pipeline can process a different group/shader). The problem with this approach is that we need to store not only all the different inputs that make up a group, but also all the intermediate data needed during the execution of a shader.

An input for a vertex shader, for example, can be made of only four floats (the vertex position), but the shader itself could require a higher number of floats as temporary storage during its execution. That's why, when shaders are compiled, the maximum number of used registers is also recorded, for the GPU to know. Each register is made of four floats. Each GPU pipeline has a limited number of registers available to process its execution group, so the size of that group is limited by the space each input requires for processing.

In other words, the more registers a shader needs, the fewer parallel threads a group is made of, and the less latency hiding we get. That's a key concept of GPU processing. This execution grouping is also the very reason why dynamic branching on the GPU is not really dynamic, but happens in groups: if a given group of pixels or vertices all go through the same branch, only that one is evaluated; otherwise both are, and the right result is applied to each input using conditional moves.
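The group-level branching rule can be sketched in a few lines; the group size and cycle counts are invented for illustration:

```python
# Toy model of "dynamic" branching on a GPU execution group: a branch
# is cheap only if all threads in the group agree; otherwise both sides
# run and the result is selected per thread (conditional moves).
def branch_cost(conditions, then_cost, else_cost):
    """Cost of a branch for one execution group, given each thread's
    condition and the cost of each path."""
    if all(conditions):
        return then_cost             # whole group takes the then-path
    if not any(conditions):
        return else_cost             # whole group takes the else-path
    return then_cost + else_cost     # divergent: pay for both paths

group = [True] * 64                  # coherent group of 64 threads
print(branch_cost(group, 10, 100))   # 10: only the taken path executes
mixed = [True] * 63 + [False]        # a single thread diverges...
print(branch_cost(mixed, 10, 100))   # 110: the whole group pays for both
```

One divergent thread is enough to make the whole group pay for both paths, which is why branching only helps when it is coherent over the group (e.g. over a block of nearby pixels).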

Of course, in real GPUs there are also other limiting factors: usually, even if there are enough registers, only a fixed number of pixel quads or vertices can be in flight at any given moment, and pipeline bubbles occur when we need a change in the pipeline configuration, as execution within a group has to be exactly the same. Unfortunately, knowing when those state changes happen depends on the specific GPU platform, and sometimes things get weird.

Understanding the latencies and pipeline stages in the entire GPU is crucial in order to write effective shaders. To further hide memory latencies, some GPUs can execute different instructions of the same shader on the different threads of a group, so if one thread is busy on a memory access, ALU processing can be immediately scheduled for other threads that have already finished the same access. That also means that for each memory access you get a number of ALU instructions "for free", as they're hidden by the memory access anyway, and vice versa.

That's also true for other latencies in the GPU. For example, writing to render targets always takes a number of cycles; even if your pixel shader executes in fewer than those cycles, that pipeline stage is a bottleneck for pixel processing anyway, so your shader won't go any faster. The same applies to interpolator latencies, triangle setup/clipping, and vertex fetching. Usually the memory-related latencies are so big that the only thing you should care about is balancing ALU work against fetching/writing, but still, for some simple shaders, the other latencies can be the limiting factors.

The key to GPU performance is balancing this huge pipeline: every stage has to be always busy, and if it is, enormous throughput can be obtained. The key to shader performance is balancing ALU count with texture/vertex fetches while, at the same time, trying to keep the register count as small as possible. Pipeline bubbles are performance killers. Bubbles are caused by configuration changes. Those are the general rules. Exceptions to those rules can really kill you, but in general, this is what you should aim for.

Next time we'll see (if South Park does not drain all my free time) some shader coding advice, now that we've got the idea of how everything works.

Small gems

Guy Steele: Growing a language
Brandon Morse from GeneratorX

09 April, 2008

My work: rendering crowd

What I've done in the last month: Xbox 360, 2000+ visible instances (10000+ total), 32 different animations, 4 LOD levels (from 4500 down to 50 vertices), 16 textures, 100% 3D, no lighting, 1-bone skinning (done each frame, for all LODs): 3.5-2.5 ms (everything still has to be tuned; there's probably still some room for improvement, and it varies a lot depending on the camera position etc...)

Problems? Yes, this week I have to do the same on PS3...

Faking nextgen, pt.2

Another very interesting PS2-nextgen presentation; this time, SH lighting, HDR and DOF. There are some inaccuracies in the presentation, but this stuff is cool (as a fool in a swimming pool).

07 April, 2008

How the GPU works - part 1

NOTE: this series of posts is not meant for complete beginners. If you don't know anything about hardware graphics programming, a good primer is this article by Humphreys and Luebke.

In this and the following posts, I want to analyze the behaviour of a modern GPU (let's say, DirectX 9 class), going from the broad architecture to the shader execution pipeline to some specific shader optimization guidelines, as I promised some time ago in a "next post preview" post. This is going to be kinda hard, as there are many different architectures, and I think that I'll be editing this post over and over to correct errors in it (if I don't get lazy too fast). So let's begin:

A GPU follows, even in these days of extreme programmability, a very specific and, if used properly, powerful computation model called "stream processing". That is a parallel programming paradigm where the same computational kernel is executed over a uniform stream of elements. It enables both easy, huge parallelism and the hiding of latencies (usually, memory accesses) by employing long pipelines.

Modern GPUs have different stages; in each of those stages, the data stream is transformed according to some rules that we can set. The problem is that changing the kind of operation to be performed is, in this computational model, really expensive, as it creates "pipeline bubbles" that easily kill performance.

Configuration information is fed by our program (CPU-side) to the GPU through a command (ring) buffer. All the calls we make to the Direct3D device, if they are not redundant (i.e. trying to set on the GPU a configuration that is already set), are written into the command buffer when a draw call is issued. When the command buffer is full, or when a frame has to be presented, the commands are sent to the GPU for execution. That's why the GPU is usually at least one frame behind the CPU.
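A sketch of the CPU-side mechanism, with redundant state filtering in front of the command buffer; the API here (`set_state`, `draw`) is invented for illustration, not the actual Direct3D interface:

```python
# Toy device wrapper: state sets are shadowed so redundant ones never
# reach the command buffer; the buffer itself is just a list of commands
# the GPU will parse later.
class Device:
    def __init__(self):
        self.shadow = {}             # last value written for each state
        self.commands = []           # the command (ring) buffer

    def set_state(self, name, value):
        if self.shadow.get(name) == value:
            return                   # redundant: filtered out, no command
        self.shadow[name] = value
        self.commands.append(("set", name, value))

    def draw(self, mesh):
        self.commands.append(("draw", mesh))

dev = Device()
dev.set_state("shader", "phong")
dev.set_state("shader", "phong")     # second set is filtered out
dev.draw("teapot")
print(dev.commands)  # [('set', 'shader', 'phong'), ('draw', 'teapot')]
```

The shadowing is cheap CPU work that saves both command buffer space and, more importantly, pointless pipeline reconfiguration on the GPU side.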

What kind of configuration can we send to the GPU? Basically there are five broad categories: state changes; vertex buffer/vertex declaration/texture sampler changes; shader changes; shader constant changes; and render target changes.

State changes are used to set up the "fixed" elements of the pipeline: computational units that are not programmable, only configurable in a number of ways. Vertex buffers are the main source of streaming data for the vertex shaders, while texture samplers are, together with the interpolated data produced by the vertex shaders, the main source of data for the pixel shaders. Shader constants are a small pool of fast-to-access memory holding constants that further configure the shader that's going to be executed. Render targets are the final output buffers that will hold the result of the computation (usually, an image).
Note that I've grouped vertex buffers, vertex declarations and texture samplers together because this is the way newer cards view data input, in a unified way (some newer cards also don't have independent pixel and vertex units, but unified shading units that can be dynamically allocated to one task or the other). Older GPUs had a fixed way to get data to the vertex shader, by fetching it from vertex buffers in an order dictated by the vertex declaration, while the pixel shaders could fetch anywhere in a texture using a sampler. Textures were, and are, more flexible, providing more expensive but random access, while vertex buffers provide fast sequential access to data.

The bad thing is that this is actually only a logical view of how a GPU works; internally, the execution is way more convoluted and differs from one GPU to another. It's impossible, in general, to predict which logical changes will trigger which physical ones.
For example, on some platforms the shader constants are directly injected into the shader code by patching it. On some others, the vertex fetch is injected into the shader code, again by patching the shader according to the vertex declaration being used. Also, the fixed parts of the pipeline can depend not only on the states, but also on the configuration of the programmable parts.
Using texcoord and color interpolators together could require a state change in the interpolation unit; using halfs and full floats as input could require a similar change in the fetching one; certain kinds of texture fetches can be automatically optimized by replacing them with cheaper ones when possible, but this also requires a pipeline change.
Shader patching is especially nasty: you should know if and when it happens on your target GPU and avoid it (by making copies of the shader, each used with a different configuration, or by eliminating the different configurations that cause patching).

As a general rule, changing the source of the stream data (vertex buffers, textures) to another one with the same attributes is not expensive (it only updates the memory pointer the GPU is fetching data from). Changing shader constants can be cheap, but it depends on the architecture, on how those constants are fed to the GPU (via shader patching? bad). Everything else is expensive: changing shaders surely is (that's why engines tend to sort draw calls by "material"), and changing states can be. Changing render targets is always the most expensive operation, as it requires a pipeline flush (obviously, every draw command touching a render target has to be _finished_ before we can change it, so we have to empty the pipeline). As I said, some operations happening in the shaders can also cause the pipeline to stall, on some architectures.
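The "sort by material" idea follows directly from those relative costs: order draw calls by the most expensive change first. A minimal sketch (the draw call records and names are invented):

```python
# Sort draw calls so the most expensive changes (render target, then
# shader) happen as rarely as possible. Records are hypothetical.
drawcalls = [
    {"target": "shadowmap", "shader": "depth", "mesh": "tree"},
    {"target": "screen",    "shader": "phong", "mesh": "teapot"},
    {"target": "shadowmap", "shader": "depth", "mesh": "rock"},
    {"target": "screen",    "shader": "skin",  "mesh": "hero"},
    {"target": "screen",    "shader": "phong", "mesh": "floor"},
]
drawcalls.sort(key=lambda d: (d["target"], d["shader"]))

def state_changes(calls, key):
    """Count how many times consecutive calls differ on `key`."""
    return sum(1 for a, b in zip(calls, calls[1:]) if a[key] != b[key])

print(state_changes(drawcalls, "target"))  # 1 render target switch
print(state_changes(drawcalls, "shader"))  # 2 shader switches
```

Real engines pack target/shader/texture IDs into a single integer sort key for speed, but the ordering principle is the same.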

That said, let's see how the GPU works, again, on a logical level:

* The CPU sends a command buffer to the GPU
* The GPU starts to process it. This usually happens in parallel with the CPU, if no special synchronization primitives are used and if the CPU is not attempting to modify (lock) GPU resources (textures, vertex buffers) while the GPU is trying to render them. Buffers that are dynamically updated by the CPU should be at least double buffered.
* The GPU parses all the configuration commands, until a draw call is issued
* The draw call executes: data starts to be fetched from the vertex buffers in the order specified by the vertex declaration. Streams of floats, with different meanings, are fetched and organized in blocks to be executed by the vertex unit. A vertex declaration could for example say to fetch four floats with a "position" semantic and four with a "normal" semantic from a single (interleaved) buffer in GPU memory. Fetching is, of course, cached (pre-transform cache). Fetching can be indexed with an index buffer, so we can avoid sending and processing the same vertex twice if it's repeated in our primitives.
* The vertex unit executes the vertex shader. The vertex shader receives as input the shader constants, which stay the same during the draw call, and the data fetched from the buffers. Many vertices are processed in parallel; for each processed vertex, a new set of attributes is computed by the shader, and each attribute again has a semantic. For example, for each [position, normal] input (plus the constants) the shader can compute an output made of a "position" and a "color". Computing the position is compulsory, as it tells the GPU where to draw the pixels for that primitive. If the vertices were fetched with indexing, a post-transform cache is enabled: if a vertex index in the primitives is still in the cache, vertex shader execution is skipped and a previous result is used. This is very important, and it's why primitives should always be ordered to maximize post-transform cache performance.
* Primitives are assembled. If we wanted to draw a triangle list, every three processed vertices a triangle is assembled and rasterized. Rasterization produces pixels, always arranged in 2x2 quads; GPUs don't do scanline rasterization, as quads are needed to compute derivatives of pixel attributes (derivatives are used by the ddx and ddy shader instructions and for texture filtering). Each pixel holds the vertex shader output attributes, interpolated from the primitive's vertices to the pixel location. Interpolation depends on the attribute semantic (i.e. the register used; each semantic maps to a different register). Texcoord semantics are usually perspective corrected and interpolated with high precision. Color semantics are interpolated with lower precision, not perspective corrected, and sometimes clamped to the [0...1] range.
* Primitives are culled and clipped. Primitives that are back-facing to the camera, or outside the screen borders and frustum clip planes, are discarded. Blocks of pixels that are surely behind already drawn ones are discarded too (nowadays all GPUs do a Z-buffer check early in the pipeline, before the pixel shader executes, by using a special, "hierarchical" Z-buffer).
* The pixel unit executes the pixel shader. As with the vertices, the pixels are processed in parallel; the shader inputs are the interpolated attributes and the shader constants. The pixel shader emits a pixel color, an alpha and a depth value.
* The shaded pixel is tested for rejection by a configurable stage based on its properties (alpha and depth value) and some state-configured rules. Usually we want to discard pixels that are behind already drawn primitives, and this is done by comparing the current pixel depth value with the one stored in the Z-buffer (a.k.a. depth buffer). If our pixel is nearer to the point of view than the old one, it's accepted and the Z-buffer is updated.
* The shaded pixel is blended with the existing output image (render targets); the blending is again a fixed-function unit configured with render states.
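The post-transform cache mentioned in the vertex stage can be sketched as a small FIFO keyed on vertex indices; the cache size and the index lists below are hypothetical:

```python
# Toy post-transform cache: indexed vertices skip the vertex shader
# when their index is still in a small FIFO cache.
from collections import deque

def shaded_vertices(indices, cache_size=16):
    """Count vertex shader executions for an indexed primitive stream."""
    cache = deque(maxlen=cache_size)
    shaded = 0
    for i in indices:
        if i not in cache:
            shaded += 1              # cache miss: run the vertex shader
            cache.append(i)          # oldest entry falls out of the FIFO
    return shaded

# Two triangles sharing an edge, drawn as an indexed triangle list:
print(shaded_vertices([0, 1, 2, 2, 1, 3]))  # 4: vertices 1, 2 reused free
# The same six index slots with no sharing pay for all of them:
print(shaded_vertices([0, 1, 2, 3, 4, 5]))  # 6
```

Index-reordering tools maximize exactly this hit rate, which is why cache-optimized indexed triangle lists matter so much for vertex-bound scenes.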

In some GPUs there are also other stages that I've not described here. Keep in mind that each and every one of these stages can be the bottleneck for your rendering. And that those are logical stages: they correspond to the major physical components of a GPU, but a real GPU implements them in hardware with more sub-stages, each part of the whole pipeline, and a stall in any of them slows down all the others that depend on the stalled one. Knowing your hardware helps a lot, but unfortunately sometimes it's simply not possible (on PC, where graphics cards vary), and we mostly have to reason at the logical level.
The first thing you have to do is identify the bottleneck, by using profiling tools or by artificially lowering the amount of work done by each stage to check whether that is the one we have to optimize. The second thing to remember is that state changes can happen in each of those stages, and while it's true that pipeline bubbles are deadly in the long pipelines used to execute shaders, even state changes in the fixed-function components can slow down execution; and state changes do not happen only at each draw call, but can also implicitly happen within a draw call, if the kind of work to be performed changes.

Next time we'll see, in more detail, how a shader is actually executed... Stay tuned.

01 April, 2008

C# DirectX wrappers

Managed DirectX was discontinued quite a long time ago. But I still want to be able to quickly prototype stuff in C# and F# without having to use the dumb XNA framework, which also lacks DX10 capabilities.
Luckily a couple of projects come to the rescue: