C0DE517E: 07.08

31 July, 2008

GPU versus CPU

Some days ago, a friend of mine at work asked me what was the big difference in the way GPUs and CPUs operate. Even if I went into a fairly deep description of the inner workings of GPUs in some older posts, I want to elaborate specifically on that question.

Let's start with a fundamental concept: latency, that is the time that we have to wait, after submitting an instruction, to have its results computed. If we have only one computational stage, then effectively the reciprocal of the latency is the amount of instruction we can process in an unit time.

So we want them to be small right? Well it turns out, that they were in the last years growing instead! But still our processors seem to run faster than before, why? Because they are good at hiding those latencies!

How? Simple, let's say that instead of having a single computational stage, you have more stages, a pipeline of workers. Then you might move an instruction being processed from one stage to the other (conceptually) like on a conveyor belt, and while you're processing it the other stages can accept more instructions. Any given instruction will have to go through the whole pipeline, but the rate of instruction processing can be much higher than latency, and it's called throughput.

Why we did like those kinds of designs? Well, in the era of the gigahertz wars (that now has largely scaled back), it was an easy way of having higher frequencies. If a single instruction was split in a number of tiny steps, then each of them could be simpler, thus requiring less stuff to be done, thus enabling designers to have higher frequencies, as each small step required less time.

Unfortunately, if something stalls this pipeline, if we can't fetch more instructions to process to keep it always full, then our theorical performance can't be reached, and our code will run slower than on less deeply pipelined architectures.

The causes of those stalls are various, we could have a "branch misprediction", we were thinking some work was needed, but we were wrong, we started processing instructions that are not useful. Or we could not be able to find instructions to process that are not dependant on results of the ones that are currently being processed. The worse example of this latter kind of stall is on memory accesses. Memory is slow, and it's evolving at a slower pace than processors too, so the gap is becoming bigger and bigger (there wasn't any twenty years ago, for example on the Commodore 64, its processors did not need caches too).

If one instruction is a memory fetch, and we can't find any instruction to process after it that does not depend on that memory fetch, we are stalled. Badly. That's why hyper-threading and similar architectures exist. That's why memory does matter, and why cache-friendly code is important.

CPUs become better and better at this job of optimizing their pipelines. Their architectures, and decoding stages (taking instructions and decomposing them in stages, scheduling them in the pipeline and rearranging them, that's called out-of-order instruction execution), are so complicated that's virtually impossible to predict at a cycle level the behaviour of our code. Strangely, transistor numbers did evolve according to Moore's law, but we did not use those transistors to have more raw power, but mostly to have more refined iterations of those pipelines and of the logic that controls them.

Most people say that GPUs computational power is evolving at a faster pace than Moore's law predicted. That is not true, as that law did not account for frequency improvements (i.e. thinner chip dies), so it's not about computational power at all! The fact that CPUs computational power did respect that law means that we were wasting those extra transistors, in other words, that those transistors did not linearly increase the power.

Why GPUs are different? Well, let me do a little code example. Let's say we want to compute this:

for i=0 to intArray.length do boolArray[i] = (intArray[i] * 10 + 10) > 0

GPUs will actually refactor the computation to be more like the following (plus a lot of unrolling...):

for i=0 to intArray.length do tempArray[i] = intArray[i]
for i=0 to intArray.length do tempArray[i] = tempArray[i] * 10
for i=0 to intArray.length do tempArray[i] = tempArray[i] + 10
for i=0 to intArray.length do boolArray[i] = tempArray[i] > 0

(this example would be much easier in functional pseudocode than in imperative one, but anyway...)

Odd! Why are we doing this? Basically, what we want to do is to hide latency in width, instead of in depth! Having to perform the same operation on a huge number of items, we are sure that we always have enough to do to hide latencies, without much effort. And it's quite straightforward to turn transistors in computational power too, we simply will have more width, and more computational units working in parallel on the tempArray! In fact, that kind of operation, a "parallel for", is a very useful primitive to have in your multithreading library... :)

Many GPUs work exactly like that. The only big difference is that the "tempArray" is implemented in GPU registers, so it has a fixed size, and thus work has to be subdivided in smaller pieces.

There are some caveats.

The first one is that if we need more than one temp register to execute our operation (because our computation is as simple as the one of my example!) then our register array will contain less independant operating threads (because each one requires a given space), and so we will have less latency hiding. That's why the number of registers that we use in a shader is more important than the number of instructions (now we can clearly see them as passes!) that our shader needs to perform!

Second, this kind of computation is inherently SIMD, even if GPUs do support different execution paths on the same data (i.e. branches) those are still limited in a number of ways.

Another one is that our computations have to be independant, there's no communication between processing threads, we can't compute operations like:

for i=0 to boolArray.length do result = result LOGICAL_OR boolArray[i]

That one is called in the steam processing lingo, a gather operation (or if you're familiar with functional programming, a reduce or fold), the inverse of which is called a scatter operation. Lucily for the GPGPU community, a workaround to do those kinds of computations on the GPU exists and is to map our data to be processed into a texture/rendertarget, use register threads to process multiple pixels in parallel and use texture reads, that can be arbitrary, to gather data. Scatter is still very hard, and there are limitations to the number of texture reads too, for example that code will be processed usually by doing multiple reductions, from a boolArray of size N to one of size N/2 (N/4 really, as textures are bidimensional) until reaching the final result... but that's too far away from the original question...

Are those two worlds going to meet? Probably. CPUs already do not have a single pipeline, so they're not all about depth. Plus both CPUs and GPUs have SIMD data types and operations. And now multicore is the current trend, and we will see have more and more cores, that will be simpler and simpler (i.e. the IBM Cell or the Intel Larrabee). On the other hand, GPUs are becoming more refined in their scheduling abilities, i.e. the Xbox 360 one does not only hide latency in depth, but also can choose which instructions from which shader to schedule in order to further hide memory latencies across multiple passes (basically implementing fibers)... NVidia G80 has computational units with independent memory storages...

Still I think that GPU processing is inherently more parallel than CPU, so a specialized unit will always be nice to have, we are solving a very specific problem, we have a small computational kernel to apply to huge amounts of data... On the other hands, pushing too much the stream computing paradigm on the CPUs is not too useful, as there are problems that do not map well on it, because they don't work on huge amounts of data nor they perform uniform operations...

30 July, 2008

Celebration of light

It's summer, and as always in Vancouver there's a firework competition that lasts a week. It's called celebration of light.

I was there, with my friends, looking at the show. After a while I said to Fabio: "I wonder if they have some kind of software for prototyping that or if they spend a lot of money...", "I was thinking about writing that from the very beginning of the show" he interrupted me.

Geeks.

28 July, 2008

Commando

We are near a milestone of our project, I don't have any serious issue to fix, everything is going fine...

But one day I still had to enter the "commando" mode (see resign patterns if you don't know about them yet, they are important). There was a bunch of code, done by someone else, and never tested as we were lacking art assets to do that. Assets were finally delivered, game programmers try to enable that rendering feature, it fails, the problem fall back on me.

Now as, specifically, the game programmer that assigned the bug to me is a friend of mine too, I wanted to solve that as fast as possible, so he could stop slacking and return to coding asap. Problem was that I did not know the code, well, the entire subsystem really, it's a new technology that we're just integrating. I will design and probably implement the "correct" version of that functionality, but for now we wanted just to hack something to see, to evaluate GPU performance and let artists have something to work on.

Luckily I had the priviledge of working, in my previous company, with a guy, Paolo Milani, that could handle those kinds of situations perfectly. He is a powerful weapon (in the wrong hands), he can do almost anything (being good all round, at coding, hacking, designing, maths etc), but he was mostly used, due to lack of money, time, and too much ignorance, to do in a couple of hours the work of weeks. That of course resulted in code that no other human could ever understand, but still, sometimes those skills are helpful.

How you could notice him entering the commando mode? Simple:

The mouse wheel accellerated up to 10.000rpm.
The GPU fans started spinning because of the heat generated by Visual Studio text rendering.
With the other hand, while scrolling furiosly, code was being added "on the fly" all over the solution.
You could see the number of compile errors decrese in realtime, until reaching zero that marks the end of an iteration.
Looking at the Xbox 360 monitor, you could see over minutes the game taking shape... First flat shaded bounding boxes, then the bikes, the track, diffuse textures, animations, sound...

I'm not that good. I've never seen anyone else that good. Still, this morning, half asleep in bed, I was thinking about our (overly complex) math library, simd intrinsics, the wrapper we have for portability, the vector classes... then I turned on the other side, hugged the head of my girlfriend, and for a split second I surprised myself thinking if that head did inherit from our base vector class, were the data was, if it was properly aligned...

Vertex shader LOD for Pixel shaders

I already blogged a couple of times about LODs for pixel shaders, so this is a quick update on the subject. Very quick, I'll say everything in one sentece, so be prepared and don't miss it:

Having geometrical LODs (less vertices/polygons) has also a (not small) impact on pixel shader performance, as the GPUs always process pixels in 2x2 quads, and so partial quads of a rasterized polygon waste pixel shader resources (as "unused" pixels in the quad will be processed and the discarded)

24 July, 2008

Quick shader tip

Don't use the const keyword. It's broken in some compilers (i.e. it leads to bad code) and it's not helping at all in optimizing the shader. Const is only helpful for the programmer, not for the compiler anyway (this is also true for C++). The compiler is smart enough to find what it really const and what not, at it has access to the whole sourcecode (no linking). The only exception of course are the global variables, that being by default uniform parameters, are always assumed non-const even if the shader does not change them.

22 July, 2008

Kill the hype

Since the infamous Carmack's interview on PCPerspective, (some of) the realtime rendering world has been rediscovering voxels (as point based rendering is something that we weren't doing yet anyway).

Noone tells us why. Why having less information (about topology) should be better than having more? Well, if you have so much data that you can't fit in memory, I can easily see the advantage, but that doesn't seem to be our problem as of now in most cases.

And weren't we all excited about DX10 geometry shaders exactly because we could have access to that kind of data?

I simply hate the hype. I hope that soon someone (more influential than me) says in an interview how cool Nurbs are, so we will be the two opposite ends of the hype, fully parametric surfaces versus raw 3d points.

The other (and related) hype is about raytracing techniques. I consider most of the realtime raytracing research to be dangerous for raytracing. Why we love raytracing? Because it allows to answer random visibility queries. Why we love to be able to do that? Because it enables us to use more refined methods of integrating the rendering equation. Faster ones, more adaptive, if you want. That still did become popular in the non-realtime world just a few moments ago...

Realtime raytracing research is mostly focused on the opposite direction, restricting the queries to coherent ones, so restricting also the effects that we can simulate to the ones that rasterization already does so well.

It seems that the only thing that you gain is the ability of rendering more accurate specular reflections, very, very, very slowly. Very useful, indeed, it's exactly the thing that artists ask me to bring them all day...

P.S. That was unfair, in fact just the ability of computing per pixel shadows in a robust way, without having to mess with shadow map parametrizations etc, is a very nice feature. But it's not enough.

18 July, 2008

ShaderX 6

Just finished to read it (very lazily, I'm also reading Geometric Algebra for Computer Science, that looks promising).

My picks from it:

Rendering Filtered Shadow with Exponential Shadow Maps (that you should already know...)
Interactive Global Illumination with Precomputed Radiance Maps (very nice extension to lightmaps...)

Also very intresting:

Stable Rendering of Cascaded Shadow Maps (a lot of nasty, useful details)
Practical Methods for a PRT-Based Shader Using Spherical Harmonics (even more nasty details)
Care and Feeding of Normal Vectors (that, again, you should already know...)
Computing Per-Pixel Object Thickness in a Single Render Pass (easy and nice)
Deferred Rendering Using a Stencil Routed K-Buffer
A Fexible Material System in Design

17 July, 2008

Normals without normals preview

It's 0.55, that is kinda late for me, as tomorrow I have to work and I won't ever sleep less than 8-9 hours per night. That's also why I you won't see me at work before 10:30.

Anyways I've finished the very first sketch in FXComposer of a nice-ish idea I had to compute smooth normals on a surface, when you don't have them (HINT: that might happen because you're displacing the surface with a non differentiable function for example... that might happen if you're computing that function using numerical methods...)

You probably won't be able to see anything intresting in the attached screenshot, but as I said, it's late, so you have to believe me, it's kinda cool, and I think it could have various uses... If it turns out to be a good solution, I'll publish the details :)

14 July, 2008

Hue shifting code snippet (with trivia)

Recently, to add variety to instances of a crowd system, I experimented with cheap methods to do hue shifting (as the pixel shader is very cheap at the moment and has a few ALU cycles to spare, as they are hidden by color texture access latency)... After 3 mostly-failed attempts I ended up with the following (actually, it's a test I did in the weekend, I'm not 100% sure it's error-free as I didn't test it much yet... LOL!):

float2 sc;

sincos(IN.random_recolor, sc.x, sc.y);

sc.y = 1.f - sc.y;

sc /= float2(sqrt(3.f), 3.f);

float3 xVec = float3(1.f, sc.xx * float2(-1,1)) + (sc.yyy * float3(-2,1,1));

float3x3 recolorMatrix = float3x3(xVec.xyz, xVec.zxy, xVec.yzx);

float3 recolored = mul(tex2D(colorTexture, IN.UV), recolorMatrix);

Have you figured out what it does? Try, even if the code is kinda cryptic, you should be able to understand the underlying idea... I've changed the names of variables to make is less obvious and protect the innocent (in my real code, I don't waste an interpolator only for the recoloring random float for example)... (hint: start from the bottom...)

Done? Well, if your guess was along the lines of "a rotation on the 45° positive axis in RGB space" then you're right! Mhm if it was not, then either you're wrong, or I did a mistake in the code :)

Bonus question: what kind of errors does it make (compared to a real hue-shift i.e. the one that photoshop implements)? Hints follow... What kind of errors could it make? Being a hue shift, it's wrong if it changes the other two components of the HSL space instead of the hue (we could argue that is an error even if it's non-linear in the hue, but as we want to encode a random shift we don't care much, even if a strong non-linearity if feed with an uniform random variable leads to a preference towards some colors). So we have to see if it changes the saturation or the luminosity... Which of the two is more a probably going to be a problem? Which of the two that code gets more wrong?

Second bonus question: how many ALU cycles does that technique take?

13 July, 2008

Slowdown

The blog will slowdown a little bit this month, I'm rather busy with photography, two small code projects (shaders), and other personal matters. But there's plenty to read, I posted some links to interesting RSS feeds in the past, and if you haven't read all the old posts, that is the perfect occasion to do that...

09 July, 2008

Note to myself

I've spent 1.5 days to find that this line:

return &GetPrevReadBuffer(mWritePose)[mWritePose];

should have been this one instead:

return &GetPrevReadBuffer(mWritePose)[mWritePose * NUMVECTOR4PERPOSEMATRIX];

And I previously had other problems with that too, basically I'm using a double-buffered (and interleaved... in a complicated way) three-dimensional vector4 array to store animation data...

Note to myself: DON'T ever do that again, DON'T use pointer arithmethic, never. Wrap arrays in classes. I didn't do that because (to make things more complicated) I have three different representations of that data, the "simulation" side sees them as scale-quaternion-translation classes, the replay sees them as compressed versions of the same, and the rendering expands the compressed versions into affine matrices...

Now I have to go back debugging, because there's still a problem in the interleaving and interplation code that even if I've added debug asserts and test cases everywhere, is still hiding itself somewhere. AAAAAAAAAAAAAAAAAAAAA!!!!

p.s. Direct access in arrays is bad from a performance standpoint too. If you wrap your arrays with setters and getters then it's easier to change the in-memory layout of your elements later to optimize for cache misses... There are many cases where good code design also helps performances, not directly but making changes after profiling more easy!

07 July, 2008

off topic

translation

because I feel like that

Search this blog