
14 August, 2008

Test-Driven-Development

  1. Test
  2. If it didn't compile, add some keywords
  3. Goto 1
This is what test-driven development usually amounts to in games. It's not that bad: we do (or should) prefer iteration and experimentation over any form of design. Yes, I know that unit tests are very useful for refactoring, and thus make some kinds of iteration easier, but still, it's not enough.
This doesn't mean that automated testing is not important; quite the contrary, you should have plenty of scripts that automate the game and gather statistics. But unit tests are only good for some shared libraries, and I don't think they will ever be successful in this field.

I'm about to leave for Italy, dunno if I'll have time to post other articles. There's stuff from Siggraph that is worth posting, I have a nice code optimization tutorial to publish, the "normals without normals" technique, plus a few other code snippets. Those things will probably have to wait until mid-September, when I'll be back from holidays...

12 August, 2008

Ribbons are the new cubes!

Are you making a demo? Don't forget your splines, they are the cool thing now...
Lifeforce
Nematomorpha
The Seeker
Atrium
Scarecrow
Invoke

The "progressively appearing" geometry trick is also commonly used to draw them:
Route 1066
Falling down
Media error
Tactical battle loop

Cubes seem to be cool only if you instance a crazy number of them now:
Debris
Momentum

Another cool trend: 2d metaballs
Nucleophile
Incognito (near the end, this one features ribbons too)

But plain old spheres are not forgotten either
Kindercrasher

Plain old particle systems are out...

10 August, 2008

Small update

I've finished reading the Larrabee paper, linked on the realtimerendering blog. Very nice, interesting in general even if you're not doing rendering... And it has a few very nice references too.

It seems that my old Pentium U/V pipe cycle-counting abilities will be useful again... yeah!

I'm wondering how it can succeed commercially... It's so different from a GPU that it will require a custom rendering path in your application to be used properly, and since nothing you do on Larrabee is replicable on other GPUs, I wonder how many will bother... Maybe, if its price is in the range of standard GPUs and its speed with DirectX (or a similar API) is comparable... or if they manage to include it in a console. Anyway, it's exciting, and a little bit scary too. We'll see.

I've also found a nice, old article about Xenos (the 360 GPU) that could be an interesting read if you don't have access to the 360 SDK.

Warning: another anti-C++ rant follows (I've warned you, don't complain if you don't like what you read or if you find it boring...)

Last but not least, I've been watching a nice presentation by Stroustrup that he gave at the University of Waterloo, on C++0x. It's not new, but it's very interesting. It shows again how C++ is at an evolutionary dead end.

Key things you'll learn from it: the C++ design process is incredibly slow and constrained, and C++ won't ever deprecate features, so it can only grow (even if Bjarne would like to do so, he says he was unable to convince the compiler vendors...), not change. That means that all the problems and restrictions imposed by C compatibility, and by straight errors in the first version of the language, won't be addressed. It also means that C++ is almost at its end, as it's already enormous, it can't shrink, and there is a limit to the number of things a programmer can know about any language. C++ is already so complicated that some university professors use its overload resolution rules as "tricky" questions during exams...

You will also hear the word "performance" every minute or so. We can't do that because we care about performance, we are not stupid Windows programmers! Well, Bjarne, if going "low level" means caring about performance, then why aren't we all using assembly? Maybe because writing programs in assembly was so painful that it not only became impractical, it also hampered performance: it was hard enough to write a working program at all, let alone profile and optimize it... Try today to write a complete program in assembly that's faster than the same program written in C (on a modern out-of-order processor, I mean; of course on a C64 assembly is still a nice choice)... So the equation higher level language == less performance is very simple and very wrong in my opinion, and we have historical proof of that. C++ is dead, it's only the funeral that's long and painful (especially when incredilink takes five minutes to link our pretty-optimized-for-build-times solution).

I can give C++ a point for supporting all the design-wise optimizations pretty well (i.e. the mature optimizations, the ones you have to do early on, which are really the only ones that matter; for function-level optimizations you could as well use assembly in a few places, if you have the time, which is more likely in a language that does not waste all of it in the compile/link cycle), while other languages still don't allow some of them (e.g. it's hard to predict memory locality in C#, and thus to optimize a design to be cache efficient, and there's no easy way to write custom memory managers to overcome that either).
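To show what I mean by that last point, here is a minimal arena allocator sketch in C++ (illustrative only, the names are mine): everything carved out of it sits contiguously in one buffer, which is exactly the kind of locality control that's hard to get in C#.

#include <cstddef>
#include <vector>

// Minimal arena sketch (illustrative, not production code): allocations come out
// of one contiguous buffer, so a design can keep the data it touches together
// and stay cache friendly. 'align' must be a power of two.
class Arena
{
public:
    explicit Arena(std::size_t bytes) : buffer(bytes), offset(0) {}

    void* allocate(std::size_t size, std::size_t align)
    {
        std::size_t aligned = (offset + align - 1) & ~(align - 1);
        if (aligned + size > buffer.size()) return 0; // out of space
        offset = aligned + size;
        return &buffer[aligned];
    }

private:
    std::vector<unsigned char> buffer;
    std::size_t offset;
};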

Still, C++ does not support them all, and that's why when performance really matters we use compiler-specific extensions to C++, i.e. alignment/packing and vector data types... The Wikipedia C++0x page does not list the C99 restrict keyword as a feature of the language; I did not do any further research on that, so I hope it's only a mistake in the article... Even the multithreading support they want to add seems to be pretty basic (even compared to existing and well supported extensions like OpenMP), which is quite disappointing for a language that's supposed to be performance driven, even more so considering that you'll probably get a stable and widespread implementation of it ten years from now...
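For the record, this is roughly what those extensions look like (a sketch assuming MSVC and GCC; the macro names and the function are mine):

// Compiler-specific alignment and aliasing extensions (MSVC vs GCC spellings),
// the kind of thing standard C++ still has no equivalent for.
#ifdef _MSC_VER
  #define ALIGN16  __declspec(align(16))
  #define RESTRICT __restrict
#else
  #define ALIGN16  __attribute__((aligned(16)))
  #define RESTRICT __restrict__
#endif

// A 16-byte aligned type, suitable for SIMD loads/stores.
struct ALIGN16 Vec4 { float x, y, z, w; };

// restrict promises the compiler that the two arrays never alias,
// which lets it schedule/vectorize the loop more aggressively.
void scale(float* RESTRICT dst, const float* RESTRICT src, int n, float k)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}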

P.S. it's also nice to know that the standard committee prefers library functions to language extensions, and prefers building an extensible language over natively providing a specific functionality. Very nice! It would be an even nicer idea if C++ were not one of the messiest languages to extend... Anyone who has had the privilege of seeing an error message from a std container should agree with me. And that is only the standard library, the one made together with the language; it's not even a third-party effort to extend it... Boost is, and it's nice, and it's also clear proof that you have to be incredibly expert to make even a trivial extension, and kinda expert to use and understand one after someone more expert than you has made it! Well, I'll stop here, otherwise I'll turn this "small update" post into another "C++ is bad" one...

07 August, 2008

Commenting on graphical shader systems

This is a comment on this neat post by Christer Ericson (so you're supposed to follow that link before reading this). I've posted the comment on my blog because it lets me elaborate more on it, and also because I think the subject is important enough...

So basically what Christer says is that graphical (i.e. graph/node based) shader authoring systems are bad. Shaders are performance critical and should be authored by programmers. Also, graphs make global shader changes way more difficult (i.e. "remove feature X from all the shaders" becomes impossible, because each shader is a completely unrelated piece of code made with a graph).

He proposes an "ubershader" solution: a shader that has a lot of capabilities built in, which then gets automagically specialized by tools into a number of trimmed-down ones (by removing any unused stuff from a given material instance).
I think he is very right, and I will push it further…

It is true that shaders are performance critical: they are basically a tiny kernel in a huuuge loop, so tiny optimizations make a big difference, especially if you manage to save registers!

The ubershader approach is nice; in my former company we pushed it further. I made a parser that generated a 3dsmax material plugin (script) for each (annotated) .fx file: some components in the UI were true parameters, others were changing #defines, and when the latter changed the shader had to be rebuilt. Everything was done directly in 3dsmax, and it worked really well.

To deal with incompatible switches, in my system I had shader annotations that could disable switches based on the status of other ones in the UI (and a lot of #error directives to be extra sure that the shader was not generated with mutually incompatible features). And it was really, really easy; it's not a huge tool to make and maintain. I supported #defines of "bool", "enum" and "float" type. The whole annotated .fx parser -> 3dsmax material GUI was something like 500 lines of maxscript code.
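To make the idea concrete, here's a small invented fragment in plain preprocessor style (the switch names are made up, this is not our actual shader code): the parser exposed defines like these as UI widgets, and the #error guards refused to generate mutually incompatible combinations.

// Illustrative ubershader switches (invented names). "bool" and "enum" switches
// map to #defines, so toggling them in the 3dsmax UI triggers a shader rebuild.
#define USE_NORMALMAP   1     // bool switch
#define USE_PARALLAX    0     // bool switch
#define SPECULAR_MODEL  1     // enum switch: 0 = none, 1 = Blinn, 2 = anisotropic

// #error guards catch mutually incompatible feature selections at generation time.
#if USE_PARALLAX && !USE_NORMALMAP
  #error "Parallax mapping requires a normal map"
#endif
#if SPECULAR_MODEL == 2 && !USE_NORMALMAP
  #error "Anisotropic specular needs the tangent-space data of the normal map"
#endif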

We didn't have just one ubershader made this way, but a few, because it doesn't make sense to add too many features to a single shader when you're trying to simulate two completely different material categories... But this is not enough! First of all, optimizing every path is still too hard. Moreover, you don't have control over the number of possible shaders in a scene.

Worse yet, you lose some information. Let's say that the artists are authoring everything well, caring about performance measures and so on; in fact our internal artists were extremely good at this. But what if you wanted to change all the grass materials in your whole game to use another technique?

You could not, because the materials are generic selections of switches, with no semantics! You could remove something from all the shaders, but it's difficult to replace some materials with another implementation. You could add some semantic information to your materials, but you still have no guarantees on the selection of features the artists chose to express a given instance of the grass, so it becomes problematic.

That's why we intended to use that system only as a prototype, to let artists find the stuff they needed easily, and then coalesce everything into a fixed set of shaders!
In my new company we are using a fixed set of shaders, generated by programmers, usually by including a few implementation files and setting some #defines; that is basically the very same idea, minus the early-on rapid-prototyping capabilities.

I want to remark that the coders-do-the-shaders approach is not good only because performance matters. IT IS GOOD EVEN FROM AN ART STANDPOINT. Artists and coders should COLLABORATE. They both have different views and different ideas; only together can they find really great solutions to rendering problems.

Last but not least, having black boxes to be connected encourages the use of a BRDF called "the-very-ignorant-pile-of-bad-hacks": an empirical BRDF made of a number of Phong-ish lobes modulated by a number of Fresnel-ish parameters that in the end produces a lot of computation and a huge number of parameters that drive artists crazy, and still can't be tuned to look really right...

The idea of having the coders write the code, wrap it in nice tools, and give the tools to the artists is not only bad performance-wise, it's bad engineering-wise (most of the time you spend more resources making and maintaining those uber-tools than you would spend by having a dedicated software engineer working closely with artists on shaders), and it's bad art-wise (as connecting boxes has very limited expressive power).

31 July, 2008

GPU versus CPU

Some days ago, a friend of mine at work asked me what the big difference is in the way GPUs and CPUs operate. Even if I went into a fairly deep description of the inner workings of GPUs in some older posts, I want to elaborate specifically on that question.

Let's start with a fundamental concept: latency, that is, the time we have to wait, after submitting an instruction, for its result to be computed. If we have only one computational stage, then the reciprocal of the latency is effectively the number of instructions we can process in a unit of time.

So we want latencies to be small, right? Well, it turns out that in recent years they have been growing instead! But still our processors seem to run faster than before. Why? Because they are good at hiding those latencies!
How? Simple: instead of having a single computational stage, you have more stages, a pipeline of workers. You move an instruction being processed from one stage to the next (conceptually) like on a conveyor belt, and while one stage is working on it the other stages can accept more instructions. Any given instruction still has to go through the whole pipeline, but the rate of instruction processing, called throughput, can be much higher than the latency alone would allow (e.g. with five stages of one nanosecond each, latency is five nanoseconds, but one instruction can complete every nanosecond).

Why did we like those kinds of designs? Well, in the era of the gigahertz wars (which by now have largely scaled back), it was an easy way to get higher frequencies. If a single instruction is split into a number of tiny steps, then each of them can be simpler, requiring less work to be done, thus enabling designers to reach higher frequencies, as each small step requires less time.

Unfortunately, if something stalls this pipeline, if we can't fetch enough instructions to keep it always full, then our theoretical performance can't be reached, and our code will run slower than on a less deeply pipelined architecture.
The causes of those stalls are various. We could have a "branch misprediction": we thought some work was needed, but we were wrong, and we started processing instructions that are not useful. Or we might not be able to find instructions to process that do not depend on the results of the ones currently being processed. The worst example of this latter kind of stall is on memory accesses. Memory is slow, and it's evolving at a slower pace than processors, so the gap keeps getting bigger and bigger (there wasn't any gap twenty years ago; on the Commodore 64, for example, the processor did not need caches at all).

If one instruction is a memory fetch, and we can't find any instruction after it to process that does not depend on that fetch, we are stalled. Badly. That's why hyper-threading and similar architectures exist. That's why memory matters, and why cache-friendly code is important.
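A toy C++ illustration of the two situations (an invented example, nothing more): in the first loop every load depends on the previous one, so the core mostly waits on memory; in the second the loads are independent and sequential, so they can be prefetched and overlapped.

#include <cstddef>
#include <vector>

struct Node { Node* next; int value; };

// Latency-bound: each load depends on the previous one (a pointer chase),
// so the pipeline stalls on every cache miss.
int sumList(const Node* n)
{
    int sum = 0;
    while (n) { sum += n->value; n = n->next; }
    return sum;
}

// Throughput-friendly: loads are independent and sequential, so the hardware
// can prefetch and overlap them while the additions execute.
int sumArray(const std::vector<int>& a)
{
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i];
    return sum;
}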

CPUs have become better and better at this job of optimizing their pipelines. Their architectures, and their decoding stages (taking instructions, decomposing them, scheduling them in the pipeline and rearranging them; that's called out-of-order instruction execution), are so complicated that it's virtually impossible to predict the behaviour of our code at a cycle level. Strangely, transistor counts did evolve according to Moore's law, but we did not use those transistors to get more raw power; we mostly used them to build more refined iterations of those pipelines and of the logic that controls them.


Most people say that GPU computational power is evolving at a faster pace than Moore's law predicts. That is not true, as that law did not account for frequency improvements (i.e. thinner chip dies), so it's not about computational power at all! The fact that CPU computational power did respect that law means that we were wasting those extra transistors; in other words, those transistors did not linearly increase the power.


Why are GPUs different? Well, let me do a little code example. Let's say we want to compute this:


for i=0 to intArray.length do boolArray[i] = (intArray[i] * 10 + 10) > 0


GPUs will actually refactor the computation to be more like the following (plus a lot of unrolling...):


for i=0 to intArray.length do tempArray[i] = intArray[i]
for i=0 to intArray.length do tempArray[i] = tempArray[i] * 10
for i=0 to intArray.length do tempArray[i] = tempArray[i] + 10
for i=0 to intArray.length do boolArray[i] = tempArray[i] > 0


(this example would be much easier in functional pseudocode than in imperative one, but anyway...)

Odd! Why are we doing this? Basically, what we want is to hide latency in width, instead of in depth! Having to perform the same operation on a huge number of items, we are sure that we always have enough to do to hide latencies, without much effort. And it's quite straightforward to turn transistors into computational power too: we simply add more width, more computational units working in parallel on the tempArray! In fact, that kind of operation, a "parallel for", is a very useful primitive to have in your multithreading library... :)
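As a CPU-side sketch of the same idea (using OpenMP; compile with -fopenmp or /openmp), the example above could be written as a parallel for, letting the runtime spread the independent iterations across cores:

// Minimal "parallel for" sketch with OpenMP: every iteration is independent,
// so the range can be split across cores, hiding latency in width on the CPU too.
void process(const int* intArray, bool* boolArray, int count)
{
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        boolArray[i] = (intArray[i] * 10 + 10) > 0;
}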

Many GPUs work exactly like that. The only big difference is that the "tempArray" is implemented in GPU registers, so it has a fixed size, and thus the work has to be subdivided into smaller pieces.

There are some caveats.
The first one is that if we need more than one temp register to execute our operation (because our computation is rarely as simple as the one in my example!) then our register array will hold fewer independently operating threads (because each one requires a given amount of space), and so we will have less latency hiding. That's why the number of registers a shader uses is more important than the number of instructions (which we can now clearly see as passes!) it has to perform. For example, with a hypothetical register file of 16K registers, threads needing 8 registers each leave 2048 of them in flight, while threads needing 32 leave only 512.
Second, this kind of computation is inherently SIMD; even if GPUs do support different execution paths on the same data (i.e. branches), those are still limited in a number of ways.
Another caveat is that our computations have to be independent: there's no communication between processing threads, so we can't compute operations like:

for i=0 to boolArray.length do result = result LOGICAL_OR boolArray[i]

That one is called, in stream processing lingo, a gather operation (or, if you're familiar with functional programming, a reduce or fold); its inverse is called a scatter operation. Luckily for the GPGPU community, a workaround exists to do those kinds of computations on the GPU: map the data to be processed into a texture/rendertarget, use the pixel shader threads to process multiple pixels in parallel, and use texture reads, which can be arbitrary, to gather data. Scatter is still very hard, and there are limitations on the number of texture reads too; for example, that code would usually be processed by doing multiple reduction passes, from a boolArray of size N to one of size N/2 (N/4 really, as textures are bidimensional) until reaching the final result... but that's too far away from the original question...
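For clarity, here is that reduction pattern sketched in plain C++ on the CPU (the texture fetches replaced by array reads, an illustration rather than real GPGPU code): each pass halves the working set until a single value is left.

#include <cstddef>
#include <vector>

// Pairwise reduction sketch: each pass ORs element i with element i + half and
// halves the working set. On a GPU each pass would be a render to a smaller
// target, with the two reads done as texture fetches.
bool reduceOr(std::vector<bool> data) // taken by value, we overwrite it pass by pass
{
    if (data.empty()) return false;
    std::size_t n = data.size();
    while (n > 1)
    {
        std::size_t half = (n + 1) / 2;    // handles odd sizes
        for (std::size_t i = 0; i < n / 2; ++i)
            data[i] = data[i] || data[i + half];
        n = half;
    }
    return data[0];
}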

Are those two worlds going to meet? Probably. CPUs already do not have a single pipeline, so they're not all about depth. Plus, both CPUs and GPUs have SIMD data types and operations. And now multicore is the current trend, so we will have more and more cores, which will be simpler and simpler (i.e. the IBM Cell or the Intel Larrabee). On the other hand, GPUs are becoming more refined in their scheduling abilities; the Xbox 360 one, for example, does not only hide latency in depth, but can also choose which instructions from which shader to schedule in order to further hide memory latencies across multiple passes (basically implementing fibers)... NVidia's G80 has computational units with independent memory storage...

Still, I think that GPU processing is inherently more parallel than CPU processing, so a specialized unit will always be nice to have: we are solving a very specific problem, applying a small computational kernel to huge amounts of data... On the other hand, pushing the stream computing paradigm too hard on the CPUs is not too useful, as there are problems that do not map well onto it, because they don't work on huge amounts of data, nor do they perform uniform operations...