C0DE517E: How the GPU works

NOTE: this series of posts is not meant for complete beginners. If you don't know anything about hardware graphics programming, a good primer is this article by Humphreys and Luebke.

In this and the following posts, I want to analyze the behaviour of a modern GPU (let's say, DirectX 9 class), going from the broad architecture to the shader execution pipeline to some specific shader optimization guidelines, as I promised some time ago in a "next post preview" post. This is going to be kinda hard, as there are many different architectures, and I think I'll be editing this post over and over to correct errors in it (if I don't get lazy too fast). So let's begin:

A GPU follows, even in these days of extreme programmability, a very specific and, if used properly, powerful computation model called "stream processing". It is a parallel programming paradigm where the same computational kernel is executed over a uniform stream of elements. This enables both easy, huge parallelism and the hiding of latencies (usually memory accesses) by employing long pipelines.
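
To make the paradigm concrete, here is a minimal CPU-side sketch in Python. It is only an analogy, not how a GPU is programmed: `kernel`, `run_stream` and the thread pool are names and choices made up for this illustration. What it shows is the one property that matters: every element is processed by the same kernel, independently of the others, so the number of workers can change without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(element, scale):
    """The same small computation, applied to every element of the stream."""
    x, y = element
    return (x * scale, y * scale)

def run_stream(stream, scale, workers=4):
    # Elements do not depend on each other, so the stream can be split
    # across any number of workers. pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: kernel(e, scale), stream))

# A uniform stream: every element has the same layout (two floats).
stream = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
result = run_stream(stream, scale=2.0)
```

Here `scale` plays the role of a shader constant: it is the same for the whole run, while the stream elements change.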

Modern GPUs have different stages, and in each of those stages the data stream is transformed according to some rules that we can set. The problem is that changing the kind of operation to be performed is really expensive in this computational model, as it creates "pipeline bubbles" that easily kill performance.

Configuration information is fed by our program (CPU-side) to the GPU through a command (ring) buffer. All the calls we make to the Direct3D device, if they are not redundant (trying to set a configuration that is already set in the GPU), are written to the command buffer when a draw call is found. When the command buffer is full or when a frame has to be presented, the commands are sent to the GPU for execution. That's why the GPU is usually at least one frame behind the CPU.
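
The behaviour described above can be sketched with a toy model. This is not the Direct3D runtime or any real driver: the `CommandBuffer` class, its method names and the `capacity` value are all hypothetical, chosen only to show redundant-state filtering, deferred writes at draw time, and flushing when the buffer is full or a frame is presented.

```python
class CommandBuffer:
    """Toy model of a CPU-side command buffer (hypothetical, not a real API)."""

    def __init__(self, capacity=8):
        self.capacity = capacity   # assumed size, counted in commands
        self.pending = {}          # states set since the last draw call
        self.current = {}          # states the GPU is already configured with
        self.commands = []         # commands waiting to be sent
        self.submitted = []        # batches already handed to the GPU

    def set_state(self, key, value):
        # Nothing is written yet, we only remember the latest value.
        self.pending[key] = value

    def draw(self, mesh):
        # At draw time, write only the states that actually change.
        for key, value in self.pending.items():
            if key not in self.current or self.current[key] != value:
                self._write(("set", key, value))
                self.current[key] = value
        self.pending.clear()
        self._write(("draw", mesh))

    def present(self):
        # Presenting a frame always sends what has been recorded so far.
        self._write(("present",))
        self.flush()

    def flush(self):
        if self.commands:
            self.submitted.append(self.commands)
            self.commands = []

    def _write(self, command):
        # A full buffer is sent to the GPU before recording more commands.
        if len(self.commands) >= self.capacity:
            self.flush()
        self.commands.append(command)


cb = CommandBuffer()
cb.set_state("shader", "lit")
cb.draw("mesh_a")
cb.set_state("shader", "lit")   # redundant: never reaches the buffer
cb.draw("mesh_b")
cb.present()
```

After this sequence the single submitted batch contains one `set`, two `draw` commands and the `present`: the second `set_state` call was filtered out.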

What kind of configuration can we send to the GPU? Basically there are five broad categories: state changes, vertex buffer/vertex declaration/texture sampler changes, shader changes, shader constant changes, render target changes.

State changes are used to set up the "fixed" elements of the pipeline: computational units that are not programmable, only configurable in a number of ways. Vertex buffers are the main source of streaming data for the vertex shaders, while texture samplers are, together with the interpolated data produced by vertex shaders, the main source of data for the pixel shaders. Shader constants are a small pool of fast-to-access memory that holds constants used to further configure the shader that's going to be executed. Render targets are the final output buffers that will hold the result of the computation (usually an image).
Note that I've grouped together vertex buffers, vertex declarations and texture samplers because this is the way newer cards view the data input, in a unified way (some newer cards also don't have independent pixel and vertex units, but have unified shading units that can be dynamically allocated to one task or the other). Older GPUs had a fixed way to get data for the vertex shader, by fetching it from vertex buffers in an order dictated by the vertex declaration, while the pixel shaders could fetch anywhere in a texture using a sampler. Textures were and are more flexible, providing more expensive but random access, while vertices provide fast sequential access to data.

The bad thing is that this is only a logical view of how a GPU works. Internally the execution is way more convoluted, and it differs from one GPU to another. It's impossible to predict which logical changes will trigger which physical ones.

For example, on some platforms the shader constants are directly injected into the shader code, patching it. On some others, the vertex fetch is injected into the shader code, again by patching the shader according to the vertex declaration that's being used. Also, the fixed parts of the pipeline could depend not only on the states, but also on the configuration of the programmable parts.

Using texcoord and color interpolators together could require a state change in the interpolation unit. Using half and full floats as input could require a similar change in the fetching unit. Certain kinds of texture fetches can be automatically optimized by replacing them with cheaper ones when possible, but this also requires a pipeline change.

Shader patching is especially nasty. You should know if and when it happens on your target GPU and avoid it (by making copies of the shader, each used with a different configuration, or by eliminating the different configurations that cause patching).

As a general rule, changing the source of the stream data (vertex buffers, textures) to another one that has the same attributes is not expensive (it only updates the memory pointer from which the GPU is fetching data). Changing shader constants can be cheap, but it depends on the architecture and on how those constants are fed to the GPU (via shader patching? bad). Everything else is expensive: changing shaders surely is (that's why engines tend to sort draw calls by "materials"), and changing states can be. Changing render targets is always the most expensive operation, as it requires a pipeline flush (obviously, every draw command regarding that render target has to be _finished_ before we can change it, so we have to empty the pipeline). As I said, on some architectures some operations happening in the shaders can also cause the pipeline to stall.
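
As a rough illustration of why engines sort draw calls by material, here is a small Python sketch. The draw-call dictionaries, the field names and both helper functions are invented for this example, and it only counts shader switches, it does not measure their real cost. The sort key puts the shader first (expensive to change) and the texture second (cheap to change).

```python
def count_shader_changes(draw_calls):
    """Count how many times the bound shader has to be switched."""
    changes = 0
    bound = None
    for call in draw_calls:
        if call["shader"] != bound:
            changes += 1
            bound = call["shader"]
    return changes

def sort_by_material(draw_calls):
    # Group by the expensive state first, then by the cheaper one.
    return sorted(draw_calls, key=lambda c: (c["shader"], c["texture"]))

draw_calls = [
    {"mesh": "rock1", "shader": "lit",   "texture": "stone"},
    {"mesh": "lake",  "shader": "water", "texture": "waves"},
    {"mesh": "rock2", "shader": "lit",   "texture": "moss"},
    {"mesh": "pond",  "shader": "water", "texture": "waves"},
]

unsorted_changes = count_shader_changes(draw_calls)
sorted_changes = count_shader_changes(sort_by_material(draw_calls))
```

In submission order the shader is switched at every draw call, while after sorting it is switched once per material group. Sorting like this assumes the draw order does not matter for correctness, which is not true for things like alpha-blended geometry.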

That said, let's see how the GPU works, again, on a logical level:

* The CPU sends a command buffer to the GPU
* The GPU starts to process it. This usually happens in parallel with the CPU, if no special synchronization primitives are used and if the CPU is not attempting to modify (lock) GPU resources (textures, vertex buffers) while the GPU is trying to render them. Buffers that are dynamically updated by the CPU should be at least double buffered.
* The GPU parses all the configuration commands, until a draw call is issued
* The draw call executes, and data starts to be fetched from the vertex buffers in the order specified by the vertex declaration. Streams of floats with different meanings are fetched and organized in blocks to be executed by the vertex unit. A vertex declaration could, for example, say to fetch four floats that have a "position" semantic and four that have a "normal" semantic from a single (interleaved) buffer in GPU memory. Fetching is, of course, cached (pre-transform cache). Fetching can be indexed with an index buffer, so we can avoid sending and processing the same vertex twice if it's repeated in our primitives.
* The vertex unit executes the vertex shader. The vertex shader receives as input the shader constants, which stay the same for the whole draw call, and the data fetched from the buffers. Many vertices are processed in parallel, and from each processed vertex a new set of attributes is computed by the shader. Each attribute has, again, a semantic. For example, for each [position, normal] input (plus the constants) the shader can compute an output that is made of a "position" and a "color". Computing the position is compulsory, as it tells the GPU where to draw the pixels for that primitive. If the vertices were fetched with indexing, a post-transform cache is enabled. Vertices with the same index in the primitives, if that index is still in the cache, skip vertex shader execution and the previous result is reused. This is very important, and it's why primitives should always be ordered to maximize post-transform cache performance.
* Primitives are assembled. If we wanted to draw a triangle list, a triangle is assembled and rasterized every three processed vertices. Rasterization produces pixels, always arranged in 2x2 quads. GPUs don't do scanline rasterization: quads are needed to compute derivatives of pixel attributes, and derivatives are used by the ddx and ddy shader instructions and for texture filtering. Each pixel holds the vertex shader output attributes, interpolated from the primitive vertices to the pixel location. Interpolation depends on the attribute semantic (the register used, as each semantic maps to a different register). Texcoord semantics are usually perspective corrected and interpolated with high precision. Color semantics are interpolated with lower precision, not perspective corrected, and sometimes clamped to the [0...1] range.
* Primitives are culled and clipped. Primitives that are back-facing with respect to the camera, or outside the screen borders and frustum clip planes, are discarded. Blocks of pixels that are surely behind already drawn ones are discarded too (nowadays all GPUs do a Z-buffer check early in the pipeline, before the pixel shader executes, by using a special "hierarchical" Z-buffer).
* The pixel unit executes the pixel shader. As with the vertices, the pixels are processed in parallel, and the shader inputs are the interpolated attributes and the shader constants. The pixel shader emits the pixel's color, alpha and depth value.
* The shaded pixel is tested for rejection by a configurable stage, based on its properties (alpha and depth value) and some state-configured rules. Usually we want to discard pixels that are behind already drawn primitives, and this is done by comparing the current pixel depth value with the one stored in the Z-buffer (a.k.a. depth buffer). If our pixel is nearer to the point of view than the old one, it's accepted and the Z-buffer is updated.
* The shaded pixel is blended with the existing output image (render targets). The blending is again a fixed function unit configured with render states.
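
The post-transform cache step in the list above is the easiest one to experiment with on paper. The sketch below simulates it in Python under loud assumptions: the cache is modelled as a plain FIFO of vertex indices and the size of 4 entries is arbitrary, while real hardware differs in both size and replacement policy. The function names are made up for this example. It counts how many times the vertex shader would run for the same set of triangles submitted in two different orders.

```python
from collections import deque

def count_vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader executions for an index stream (assumed FIFO cache)."""
    cache = deque(maxlen=cache_size)   # appending to a full deque drops the oldest
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1                  # cache miss: the vertex shader executes
            cache.append(index)
    return runs

def strip_triangles(quads):
    """Triangle list for a row of quads: top vertices 0..quads, bottom row after."""
    row = quads + 1
    tris = []
    for i in range(quads):
        tris.append((i, i + row, i + 1))
        tris.append((i + 1, i + row, i + row + 1))
    return tris

def flatten(tris):
    return [v for tri in tris for v in tri]

tris = strip_triangles(4)              # 8 triangles, 10 unique vertices
in_order = flatten(tris)
# Same triangles, but alternating between the two halves of the strip.
scattered = flatten([t for pair in zip(tris[:4], tris[4:]) for t in pair])

runs_in_order = count_vertex_shader_runs(in_order, cache_size=4)
runs_scattered = count_vertex_shader_runs(scattered, cache_size=4)
```

With neighbouring triangles submitted one after the other, each of the 10 vertices is shaded once. With the scattered order, shared vertices get evicted before they are needed again and the shader runs more often, for exactly the same geometry.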

In some GPUs there are also some other stages that I've not described here. Keep in mind that each and every one of these stages can be the bottleneck for your rendering. Also, those are logical stages: they correspond to the major physical components of a GPU, but a real GPU implements them in hardware with more sub-stages. Each one is part of the whole pipeline, so a stall in any of them slows down all the others that depend on the stalled one. Knowing your hardware helps a lot, but unfortunately sometimes it's simply not possible (on PC, where graphics cards vary) and we mostly have to reason at a logical level.

The first thing you have to do is to identify the bottleneck, by using profiling tools or by artificially lowering the amount of work done by each stage to check if that stage is the one we have to optimize. The second thing to remember is that state changes can happen in each of those stages. Pipeline bubbles are deadly in the long pipelines used to execute shaders, but even state changes in the fixed function components can slow down the execution. And state changes do not happen only at each draw call: they can also happen implicitly inside a draw call, if the kind of work to be performed changes.
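
The "lower the work of one stage and see what happens" approach can be sketched as a small experiment loop. Everything here is a simplification made for illustration: the stage names and costs are invented numbers, and the model assumes an idealized pipeline where the slowest stage alone sets the frame time. On real hardware you would measure actual frame times instead, for example after shrinking the render target to reduce pixel work.

```python
def frame_time(stage_costs):
    # Idealized assumption: stages overlap perfectly, the slowest sets the pace.
    return max(stage_costs.values())

def find_bottleneck(stage_costs, reduction=0.5):
    """Scale down each stage in turn and record how much the frame time improves."""
    baseline = frame_time(stage_costs)
    gains = {}
    for stage in stage_costs:
        trial = dict(stage_costs)
        trial[stage] *= reduction      # artificially lower this stage's work
        gains[stage] = baseline - frame_time(trial)
    return max(gains, key=gains.get), gains

# Made-up per-frame costs, in milliseconds.
costs = {
    "vertex fetch": 1.0,
    "vertex shader": 2.5,
    "rasterizer": 1.5,
    "pixel shader": 6.0,
    "blending": 2.0,
}

bottleneck, gains = find_bottleneck(costs)
```

Only the stage that limits the frame shows a gain when its work is reduced. Halving any other stage leaves the frame time unchanged, which is why optimizing a stage that is not the bottleneck buys nothing.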

Next time, we'll see in more detail how a shader is actually executed... Stay tuned.

9 comments:

Alessandro Monopoli said...: Very nice article! Can't wait for part 2 :); April 8, 2008 at 12:41 AM
Unknown said...: Thanks for the article. Nevertheless I hope it will become useless in the not so distant future, when we'll have CPUs with 128 cores and whatever else is needed to implement some alternative software rendering algorithms. I mean it would be cool not being so tied to the hardware and one rendering technique (no matter how efficient it is).; April 8, 2008 at 12:23 PM
DEADC0DE said...: remigiusz: I hope not. GPUs are way faster for what they have to do than CPUs exactly because they are more specialized, they use that nice, extremely efficient computational model. Anyway even if we have CPUs with N cores, GPUs with the same technology will be faster for what they have to do. I don't think that the two things should really converge into one. And also, the nice thing of the GPU model is how they hide memory latencies. Not sure how CPUs are going to deal with that problem, eventually to have a good performance they will have to be programmed in a "streaming" fashion as well, and thus studying that computational model and its data structures is always useful, even in the future.; April 8, 2008 at 1:52 PM
Anonymous said...: This comment has been removed by a blog administrator.; February 17, 2009 at 9:18 PM
jcc said...: great articles,
do you know of any good books on GPU architectures?
thanks!!; October 28, 2009 at 7:55 AM
Niels Olson said...: Looked briefly for your email address, didn't find it. So posting this here:

> over an uniform

s/an/a/

Say it out loud.; April 12, 2011 at 5:06 PM
Unknown said...: This comment has been removed by the author.; March 5, 2014 at 12:48 AM
Unknown said...: I've posted a "GPU primer" article. Check it out: http://gpuprimer.blogspot.com; March 5, 2014 at 12:51 AM
Unknown said...: Original link to the introductory paper at the beginning of this article is dead, can be found here: http://web.archive.org/web/20080206001716/http://www.cs.virginia.edu/~gfx/papers/paper.php?paper_id=59; February 28, 2019 at 11:45 AM

07 April, 2008

How the GPU works - part 1
