Almost two years ago I wrote a small series of articles for an Italian programming magazine about 3d engine design. Along with the articles, I wrote a small prototype, in C# (OpenGL/CG via the Tao Framework).
Those articles were never published (we didn't agree on the money, in the end) and then I forgot about them for a long time, until a few weeks ago when I started reasoning about them again. So to help myself, I'll write one of my trademark badly written and way too long posts... Enjoy.
Note: my posts are really too long and, for sure, badly written. Luckily I usually put in italics the stuff that is... let's say less essential. Still, if you have time, I don't see why you should skip any of this ugly and not proofread nonsense...
First of all, I have to provide a little bit of background... At that time I was towards the end of my work on a brand new, built-from-scratch 3d engine. We were a small team, five people for a 3d engine that was running on five platforms, 360, Ps3, Wii, Psp and PC, plus the pipeline and artist tools AND the game rendering (that is to say, both the integration of the new engine and the coding of all the effects, shaders etc).
It was an incredible accomplishment, we did an impressive amount of work. Most of us were new to nextgen console development too, we didn't have many devkits, a lot of work was done even before having them, and the platforms were still kinda new; it's a story that, told now, sounds like a grandpa's tale about the war, when people suffered from starvation. But the end results were not pretty.
I've learned a lot from that experience, both about how to do things (when I started that job I was just out of university; I did have a strong background in rendering, and a lot of not-too-useful knowledge, but my last working realtime code was years old, as during my master's I abandoned everything C/C++/assembly related to focus only on playing with different languages) and even more about how not to do them...
Basically the engine was made as an abstraction layer over the native APIs (sounds good, but it was far from being untangled from the rest of the code, so it wasn't really a layer) with a scenetree on top of it. The scenetree was made of "fat" objects, containing everything they needed to render (or do whatever they needed to do) plus other things (to avoid too many virtual calls/code paths, base classes were bigger than needed). Objects had the usual render, update, prerender, etc. calls, they somehow set renderstates and textures, and somewhere issued draw calls. It was this, plus tons of other things, from serialization to math, from networking to sound, plus other awful stuff added to make the awful stuff less slow, plus bloat and duplicated code.
With this in mind, the ideas I had for my design were:
1. Stateless
2. As simple as possible/as fast as possible
3. Not generic, but extendable; people have to write their own stuff to render
4. No scene tree
4 is easy: you just don't make it a scenegraph-based engine. 3 is fundamental, but it's easy to follow. 2 is a lot about implementation, more than the overall design idea. 1 was the real thing to reason about.
In a moment I'll write about it, but first, a couple of disclaimers. First of all, nothing that follows is entirely new (or probably, new at all). I find this design interesting because it's rare to see: if you only learn from books and from open source 3d engines, most probably you've never seen anything similar. But it's not new for sure. Second, every engine I worked with or coded had its own defects, sometimes even big ones, but they were all better than what follows just because they were real tech with which real games shipped. My little engine was stuff for a few articles, less than a testbed, absolutely not proven in the real world. Last but not least, my comments on the pitfalls of that engine at that time do not imply anything about its current state :)
Let's lay down the basic design:
We need objects that do the rendering of a given graphical feature. Let's call those "Renderers".
Renderers do not directly interact with the underlying graphic API (or its abstraction layer). In fact no one does, other than some very specific classes that sit behind a very specific system. You can't set streams, textures or render targets, and of course you can't set render states either.
Renderers should not directly do anything substantial. No culling code, no fancy CPU or GPU vertex skinning... Those are rendering features that should be coded in feature managers (e.g. a bounding sphere culling manager), and Renderers should register themselves to use those. Feature managers are about code that does not need to deal directly with the graphic API (i.e. with meshes, textures etc); as we said, almost no one has access to it.
Feature managers are nice to have because they reduce code duplication, allow more modularity, and often make things faster too, as they can keep the data structures they need (e.g. bounding volumes for visibility) close together in memory instead of having them scattered all over multiple fat objects. Also, with a little care their updates can happen on separate threads.
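To make the idea a bit more concrete, here's a rough sketch of what a feature manager could look like. I'm writing it in C++ (my prototype was in C#, but the idea is the same), and all the names and details here are made up for illustration, not lifted from the prototype:

```cpp
#include <cstdint>
#include <vector>

struct Sphere { float x, y, z, radius; };

// Hypothetical feature manager: all bounding spheres packed in one array,
// so the visibility pass touches contiguous memory instead of chasing
// pointers through fat scene objects.
class BoundingSphereCullingManager {
public:
    using Handle = std::uint32_t;

    // A renderer registers its volume once and keeps the handle.
    Handle registerSphere(const Sphere& s) {
        spheres.push_back(s);
        visible.push_back(true);
        return static_cast<Handle>(spheres.size() - 1);
    }

    void updateSphere(Handle h, const Sphere& s) { spheres[h] = s; }

    // Runs once per frame (possibly on its own thread): one tight loop
    // over all registered volumes against the current view volume.
    void cullAgainst(const Sphere& view) {
        for (std::size_t i = 0; i < spheres.size(); ++i) {
            const float dx = spheres[i].x - view.x;
            const float dy = spheres[i].y - view.y;
            const float dz = spheres[i].z - view.z;
            const float r  = spheres[i].radius + view.radius;
            visible[i] = (dx * dx + dy * dy + dz * dz) <= r * r;
        }
    }

    bool isVisible(Handle h) const { return visible[h]; }

private:
    std::vector<Sphere> spheres;  // packed, cache friendly
    std::vector<bool>   visible;  // result of the last cull pass
};
```

The renderer never sees the array, only its handle; the culling pass is one tight loop over packed data.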
Renderers should be mostly written as needed for the game, and they should manage the interaction with the game (managing internal state changes caused by it, hopefully in a threadsafe way, but that's another story). They manage the resources needed to render. Put together features. Issue RenderCommands. A renderer class could be the SoccerPlayerRenderer that derives from a generic GpuSkinnedPlayer, which uses the BoundingSphereClipper and GpuSkinning feature managers.
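In the same hypothetical sketch, a renderer is little more than glue: it registers itself with the feature managers, keeps the handles, and pushes commands every frame:

```cpp
#include <cstdint>
#include <vector>

// Minimal placeholders for things sketched elsewhere in this post.
using RenderCommand = std::uint64_t;
struct RenderCommandQueue { std::vector<RenderCommand> commands; };
struct BoundingSphereClipper {
    std::uint32_t registerVolume() { return 0; }
    bool isVisible(std::uint32_t) const { return true; }
};
struct GpuSkinning { std::uint32_t registerSkeleton() { return 0; } };

// Hypothetical renderer for one graphical feature of the game. It owns no
// API objects: it registers resources with feature managers (and with the
// interpreters, see further down), keeps the handles, and emits commands.
class SoccerPlayerRenderer {
public:
    SoccerPlayerRenderer(BoundingSphereClipper& clipper, GpuSkinning& skinning)
        : clipper(clipper), skinning(skinning) {
        cullHandle = clipper.registerVolume();
        skinHandle = skinning.registerSkeleton();
        // mesh/material handles would come from the interpreters.
    }

    // Pull state from the game side (pose, position...), update features.
    void update(/* game state */) {}

    // Push commands; no graphics API call happens here.
    void render(RenderCommandQueue& queue) const {
        if (!clipper.isVisible(cullHandle)) return;
        RenderCommand cmd = 0;
        // ...pack pass/material/mesh handles into the bit string here...
        queue.commands.push_back(cmd);
    }

private:
    BoundingSphereClipper& clipper;
    GpuSkinning&           skinning;
    std::uint32_t cullHandle = 0;
    std::uint32_t skinHandle = 0;
};
```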
RenderCommands are fixed-size strings of bits (e.g. 64 or 128 bits).
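For example, a 64 bit command could be laid out like this. The exact split is completely arbitrary, just to give the idea:

```cpp
#include <cstdint>

// One possible 64 bit layout (made up for illustration): the most
// significant bits select the scene/render target, then shader, texture,
// mesh, and whatever is left for per-draw parameters.
//
//  63      56 55      44 43      32 31      16 15       0
// +----------+----------+----------+----------+----------+
// |  scene   |  shader  | texture  |   mesh   |  params  |
// |  8 bits  | 12 bits  | 12 bits  | 16 bits  | 16 bits  |
// +----------+----------+----------+----------+----------+

using RenderCommand = std::uint64_t;

inline RenderCommand makeCommand(std::uint64_t scene, std::uint64_t shader,
                                 std::uint64_t texture, std::uint64_t mesh,
                                 std::uint64_t params) {
    return (scene << 56) | (shader << 44) | (texture << 32) |
           (mesh << 16) | params;
}
```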
RenderCommands are pushed into a RenderCommandQueue. The queue is typed (templated) on a RenderCommandInterpreter.
The RenderCommandInterpreter interprets the RenderCommand and issues graphic API calls, from state setting to draw calls. It can and should perform state caching to avoid issuing duplicated commands. No state shadowing is thus required in the graphic API or its abstraction layer.
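In sketch form (hypothetical names again), an interpreter is a tiny thing, and the state caching really is just remembering the last value it saw:

```cpp
#include <cstdint>

// Hypothetical interpreter interface: each interpreter receives its slice
// of the command and is the only thing allowed to talk to the graphic API.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

class TextureInterpreter : public RenderCommandInterpreter {
public:
    void interpret(std::uint64_t bits) override {
        if (bits == lastBits) return;   // trivial state caching
        lastBits = bits;
        // bindTexture(textures[bits]); // the actual API call would go here
    }
private:
    std::uint64_t lastBits = ~0ull;     // "nothing bound yet"
};
```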
The engine will provide a number of RenderCommandInterpreters. The most basic one is the MainDispatcher, which contains an array of RenderCommandInterpreters, takes a fixed number of most significant bits out of the RenderCommand, uses those to index the array, and dispatches the rest of the string of bits to the selected interpreter.
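A possible sketch of it, with the number of top bits as a compile-time constant so the shift amounts are fixed:

```cpp
#include <cstdint>
#include <vector>

// Same interface as in the earlier sketch.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// The top TopBits of the command pick one registered interpreter, the
// remaining bits are handed to it. TopBits is a compile-time constant so
// the shift amounts are fixed.
template <unsigned TopBits>
class MainDispatcher : public RenderCommandInterpreter {
    static_assert(TopBits > 0 && TopBits < 64, "TopBits must be in 1..63");
public:
    void registerInterpreter(std::uint64_t index, RenderCommandInterpreter* i) {
        if (interpreters.size() <= index) interpreters.resize(index + 1, nullptr);
        interpreters[index] = i;
    }

    void interpret(std::uint64_t command) override {
        const std::uint64_t index = command >> (64 - TopBits);  // top bits
        const std::uint64_t rest  = command & (~0ull >> TopBits);
        interpreters[index]->interpret(rest);
    }

private:
    std::vector<RenderCommandInterpreter*> interpreters;
};
```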
The most common subclass of the MainDispatcher is the SceneInterpreter, which, before dispatching the command, sets a rendertarget, also associated with the index it uses to select the RenderCommandInterpreter.
Another common RenderCommandInterpreter is the SubcommandDispatcher, which, like the MainDispatcher, contains different RenderCommandInterpreters, but instead of selecting one based on some bits of the command, it associates different bit substrings of the RenderCommand with each of them. That means it chops the RenderCommand, extracting substrings at fixed positions, and passes each substring to a registered RenderCommandInterpreter (so it associates the latter with the former).
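A naive sketch of it, with runtime shifts just to show the chopping (the implementation note below explains why you would not actually want variable shifts on the consoles):

```cpp
#include <cstdint>
#include <vector>

// Same interface as in the earlier sketch.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// Every registered interpreter owns a fixed [shift, shift+width) slice of
// the command, and gets only that slice.
class SubcommandDispatcher : public RenderCommandInterpreter {
public:
    void registerInterpreter(unsigned shift, unsigned width,
                             RenderCommandInterpreter* i) {
        slices.push_back({shift, width, i});
    }

    void interpret(std::uint64_t command) override {
        for (const Slice& s : slices) {
            const std::uint64_t mask = (s.width >= 64)
                ? ~0ull : ((1ull << s.width) - 1);
            s.interpreter->interpret((command >> s.shift) & mask);
        }
    }

private:
    struct Slice { unsigned shift, width; RenderCommandInterpreter* interpreter; };
    std::vector<Slice> slices;
};
```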
You've probably started to get the idea. Other than those, the other RenderCommandInterpreters, that will operate on parts of the RenderCommand, will be things like MeshInterpreter, TextureInterpreter, ShaderInterpreter, ShaderParamBlockInterpreter (or you might prefer to collapse the former three into a MaterialInterpreter...), etc...
Implementation note: the dispatchers should either be templated or hardcoded to avoid virtual function calls and bit shifts by variable amounts, both of which are very slow on the PowerPC-based platforms (Ps3, 360, Wii...). Templating the SubcommandDispatcher is tricky, as you can't template on a variable number of parameters in C++, so you're limited to chopping the string at one point and containing two interpreters, one for the head and one for the tail of the chopped string. By concatenating SubcommandDispatchers in a kinda ugly template definition, you get the ability to dispatch arbitrary substrings to many interpreters... In C# generics work only on types and not on values, so you can't template the bit positions; hardcoding is the way. And it's also simpler: letting the users hardcode those decisions makes the code shorter, so I would strongly advise not to use templates there.
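The head/tail chopping could look roughly like this. It's only a sketch of the idea, and as said, the nested type names you end up writing are not pretty:

```cpp
#include <cstdint>

// Same interface as in the earlier sketches.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// Chop at one compile-time point: Head gets the bits above SplitBit, Tail
// gets the bits below. Nesting another SplitDispatcher as the Tail (or
// Head) type chains the chopping to as many interpreters as needed.
template <unsigned SplitBit, class Head, class Tail>
class SplitDispatcher : public RenderCommandInterpreter {
    static_assert(SplitBit > 0 && SplitBit < 64, "SplitBit must be in 1..63");
public:
    SplitDispatcher(Head& head, Tail& tail) : head(head), tail(tail) {}

    void interpret(std::uint64_t command) override {
        head.interpret(command >> SplitBit);                 // upper part
        tail.interpret(command & ((1ull << SplitBit) - 1));  // lower part
    }

private:
    Head& head;
    Tail& tail;
};
```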
The MainDispatcher is peculiar because instead of chopping the RenderCommand into subcommands and sending them to the appropriate handlers (interpreters), it selects a totally different interpreter configuration.
This is because you get a fixed number of bits for each feature; usually those bits will be used directly by the interpreter to select a given API state, e.g. a mesh, so the number of bits limits the number of meshes that you can have, and you might want to register more for the main rendering pass than for generating shadows (that's why the MainDispatcher is usually subclassed into the RenderCommandInterpreter that manages the rendertarget).
Using fixed strings of bits is not only a compact way of encapsulating all the state needed for a drawcall, it's also nice because it makes them easy to sort, by ordering the RenderCommandInterpreter bits so that the ones managing states that are more expensive to change come first (most significant). State caching is trivial: if an interpreter receives the same substring of bits twice in a row, it does not have to do anything (usually).
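Which also means that sorting the whole queue is literally just an integer sort:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Commands sharing the most expensive state (the most significant bits)
// end up adjacent, and within that, commands sharing the next state down,
// and so on along the bit layout.
void sortCommands(std::vector<std::uint64_t>& queue) {
    std::sort(queue.begin(), queue.end());
}
```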
Renderers will initialize rendering resources (i.e. meshes), register them into interpreters and features, and grab handles out of them (note that the same resource can be registered into two interpreters, e.g. you might have one interpreter type for the meshes used in the shadow rendering pass and another for the ones in the main pass). Those handles will be used to compose commands to push into the queue (most of the time, the handle will be an index into an array inside the interpreter, and will be exactly the same as the bit substring that is used to compose the RenderCommand that uses it).
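Sketching the interpreter side of that registration flow (hypothetical names, as before):

```cpp
#include <cstdint>
#include <vector>

struct Mesh { /* vertex/index buffers... */ };

// The renderer hands a resource in once, gets a handle back, and that
// handle is exactly the bit substring it later packs into its commands.
class MeshInterpreter {
public:
    using Handle = std::uint32_t;

    Handle registerMesh(Mesh* mesh) {
        meshes.push_back(mesh);
        return static_cast<Handle>(meshes.size() - 1);
    }

    void interpret(std::uint64_t bits) {
        if (bits == lastBits) return;   // state caching
        lastBits = bits;
        // setStreams(*meshes[bits]);   // the actual API call would go here
    }

private:
    std::vector<Mesh*> meshes;
    std::uint64_t lastBits = ~0ull;
};

// Renderer side, reusing the hypothetical makeCommand layout from earlier:
//   meshHandle = meshInterpreter.registerMesh(&myMesh);
//   RenderCommand cmd = makeCommand(scenePass, shaderHandle,
//                                   textureHandle, meshHandle, 0);
//   queue.push(cmd);
```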
As a side effect of being stateless, multithreading is easier too. First, all the renderers should do an internal update, grabbing data from the game and updating feature managers accordingly. Then all the feature managers can execute their update in parallel. At that point, renderers can render in parallel by pushing rendercommands into a per-thread queue. Queues can be sorted independently, and then merge-sorted together into a single big one. From there on, parallel execution can again happen, in various ways. Probably the simplest one is to just parallelize the MainDispatcher: since that's associated with the render target, we can construct command buffers (native ones) for each of them in parallel, and then send everything to the GPU to execute.
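A minimal sketch of the sort-and-merge step, assuming one plain vector of commands per renderer thread:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using RenderCommand = std::uint64_t;

// Each renderer thread fills its own queue; the per-queue sorts can run in
// parallel, then everything is merged into one ordered stream (the merge
// itself is shown sequentially here for brevity).
std::vector<RenderCommand> mergeQueues(std::vector<std::vector<RenderCommand>>& perThread) {
    std::vector<RenderCommand> merged;
    for (auto& queue : perThread) {
        std::sort(queue.begin(), queue.end());
        std::vector<RenderCommand> tmp;
        tmp.reserve(merged.size() + queue.size());
        std::merge(merged.begin(), merged.end(), queue.begin(), queue.end(),
                   std::back_inserter(tmp));
        merged.swap(tmp);
    }
    return merged;
}
```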
Last but not least, even if I didn't design/implement it, I suspect that with a little care hotloading/swapping and streaming could be easy to do in this system, mainly because we have handles everywhere instead of directly managing the resources...