Almost two years ago I wrote a small series of articles for an Italian programming magazine about 3d engine design. Along with the articles, I wrote a small prototype, in C# (OpenGL/CG via the Tao Framework).
Those articles were never published (we didn't agree on the money, in the end) and then I forgot about them for a long time, until a few weeks ago when I started reasoning about them again. So to help myself, I'll write one of my trademark badly written and way too long posts... Enjoy.
Note: my posts are really too long and, for sure, badly written. Luckily I usually put in italics the stuff that is... let's say less essential. Still, if you have time, I don't see why you should skip any of this ugly and not proofread nonsense...
First of all, I have to provide a little bit of background... At that time I was towards the end of my work on a brand new, built-from-scratch 3d engine. We were a small team, five people for a 3d engine that was running on five platforms, 360, Ps3, Wii, Psp and PC, plus the pipeline and artist tools AND the game rendering (that is to say, both the integration of the new engine and the coding of all the effects, shaders etc).
It was an incredible accomplishment, we did an impressive amount of work. Most of us were new to nextgen console development too, we didn't have many devkits, a lot of work was done even before having them, and the platforms were still kinda new; it's a story that, told now, sounds like a grandpa's tale about the war, when people suffered from starvation. But the end results were not pretty.
I've learned a lot from that experience, both about how to do things (when I started that job I was just out of university; I did have a strong background in rendering, and a lot of not-too-useful knowledge, but my last working realtime code was years old, as during my master's I abandoned everything C/C++/assembly related to focus only on playing with different languages) and even more about how not to do them...
Basically the engine was made as an abstraction layer over the native APIs (sounds good, but it was far from being untangled from the rest of the code, so it wasn't really a layer) with a scenetree on top of it. The scenetree was made of "fat" objects, containing everything they needed to render (or do whatever they needed to do) plus other things (to avoid too many virtual calls/code paths, base classes were bigger than needed). Objects had the usual render, update, prerender, etc. calls, they somehow set renderstates and textures, and somewhere issued draw calls. It was this, plus tons of other things, from serialization to math, from networking to sound, plus other awful stuff added to make the awful stuff less slow, plus bloat and duplicated code.
With this in mind, the ideas I had for my design were:
1. Stateless
2. As simple as possible/as fast as possible
3. Not generic, but extendable; people have to write their own stuff to render
4. No scene tree
4 is easy: you just don't make it a scenegraph-based engine. 3 is fundamental, but it's easy to follow. 2 is a lot about implementation, more than the overall design idea. 1 was the real thing to reason about.
In a moment I'll write about it, but first, a couple of disclaimers. First of all, nothing that follows is entirely new (or probably, new at all). I find this design interesting because it's rare to see: if you only learn from books and from open source 3d engines, most probably you've never seen anything similar. But it's not new for sure. Second, every engine I worked with or coded had its own defects, sometimes even big ones, but they were all better than what follows just because they were real tech with which real games shipped. My little engine was stuff for a few articles, less than a testbed, absolutely not proven in the real world. Last but not least, my comments on the pitfalls of that engine at that time do not imply anything about its current state :)
Let's lay down the basic design:
We need objects that do the rendering of a given graphical feature. Let's call those "Renderers".
Renderers do not directly interact with the underlying graphic API (or its abstraction layer). In fact no one does, other than some very specific classes that sit behind a very specific system. You can't set streams, textures or render targets, and of course you can't set render states either.
Renderers should not directly do anything substantial. No culling code, no fancy CPU or GPU vertex skinning... Those are rendering features that should be coded in feature managers (e.g. a bounding sphere culling manager), and Renderers should register themselves to use those. Feature managers are about code that does not need to deal directly with the graphic API (i.e. with meshes, textures etc); as we said, almost no one has access to it.
Feature managers are nice to have because they reduce code duplication, allow more modularity, and often make things faster too, as they can keep the data structures they need (e.g. bounding volumes for visibility) close together in memory instead of having them scattered all over multiple fat objects. Also, with a little care their updates can happen on separate threads.
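To make the idea a bit more concrete, here's a rough sketch of what a feature manager could look like. I'm writing it in C++ (my prototype was in C#, but the idea is the same), and all the names and details here are made up for illustration, not lifted from the prototype:

```cpp
#include <cstdint>
#include <vector>

struct Sphere { float x, y, z, radius; };

// Hypothetical feature manager: all bounding spheres packed in one array,
// so the visibility pass touches contiguous memory instead of chasing
// pointers through fat scene objects.
class BoundingSphereCullingManager {
public:
    using Handle = std::uint32_t;

    // A renderer registers its volume once and keeps the handle.
    Handle registerSphere(const Sphere& s) {
        spheres.push_back(s);
        visible.push_back(true);
        return static_cast<Handle>(spheres.size() - 1);
    }

    void updateSphere(Handle h, const Sphere& s) { spheres[h] = s; }

    // Runs once per frame (possibly on its own thread): one tight loop
    // over all registered volumes against the current view volume.
    void cullAgainst(const Sphere& view) {
        for (std::size_t i = 0; i < spheres.size(); ++i) {
            const float dx = spheres[i].x - view.x;
            const float dy = spheres[i].y - view.y;
            const float dz = spheres[i].z - view.z;
            const float r  = spheres[i].radius + view.radius;
            visible[i] = (dx * dx + dy * dy + dz * dz) <= r * r;
        }
    }

    bool isVisible(Handle h) const { return visible[h]; }

private:
    std::vector<Sphere> spheres;  // packed, cache friendly
    std::vector<bool>   visible;  // result of the last cull pass
};
```

The renderer never sees the array, only its handle; the culling pass is one tight loop over packed data.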
Renderers should be mostly written as needed for the game, and they should manage the interaction with the game (managing internal state changes caused by it, hopefully in a threadsafe way, but that's another story). They manage the resources needed to render. Put together features. Issue RenderCommands. A renderer class could be the SoccerPlayerRenderer that derives from a generic GpuSkinnedPlayer, which uses the BoundingSphereClipper and GpuSkinning feature managers.
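In the same hypothetical sketch, a renderer is little more than glue: it registers itself with the feature managers, keeps the handles, and pushes commands every frame:

```cpp
#include <cstdint>
#include <vector>

// Minimal placeholders for things sketched elsewhere in this post.
using RenderCommand = std::uint64_t;
struct RenderCommandQueue { std::vector<RenderCommand> commands; };
struct BoundingSphereClipper {
    std::uint32_t registerVolume() { return 0; }
    bool isVisible(std::uint32_t) const { return true; }
};
struct GpuSkinning { std::uint32_t registerSkeleton() { return 0; } };

// Hypothetical renderer for one graphical feature of the game. It owns no
// API objects: it registers resources with feature managers (and with the
// interpreters, see further down), keeps the handles, and emits commands.
class SoccerPlayerRenderer {
public:
    SoccerPlayerRenderer(BoundingSphereClipper& clipper, GpuSkinning& skinning)
        : clipper(clipper), skinning(skinning) {
        cullHandle = clipper.registerVolume();
        skinHandle = skinning.registerSkeleton();
        // mesh/material handles would come from the interpreters.
    }

    // Pull state from the game side (pose, position...), update features.
    void update(/* game state */) {}

    // Push commands; no graphics API call happens here.
    void render(RenderCommandQueue& queue) const {
        if (!clipper.isVisible(cullHandle)) return;
        RenderCommand cmd = 0;
        // ...pack pass/material/mesh handles into the bit string here...
        queue.commands.push_back(cmd);
    }

private:
    BoundingSphereClipper& clipper;
    GpuSkinning&           skinning;
    std::uint32_t cullHandle = 0;
    std::uint32_t skinHandle = 0;
};
```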
RenderCommands are fixed-size strings of bits (e.g. 64 or 128 bits).
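For example, a 64 bit command could be laid out like this. The exact split is completely arbitrary, just to give the idea:

```cpp
#include <cstdint>

// One possible 64 bit layout (made up for illustration): the most
// significant bits select the scene/render target, then shader, texture,
// mesh, and whatever is left for per-draw parameters.
//
//  63      56 55      44 43      32 31      16 15       0
// +----------+----------+----------+----------+----------+
// |  scene   |  shader  | texture  |   mesh   |  params  |
// |  8 bits  | 12 bits  | 12 bits  | 16 bits  | 16 bits  |
// +----------+----------+----------+----------+----------+

using RenderCommand = std::uint64_t;

inline RenderCommand makeCommand(std::uint64_t scene, std::uint64_t shader,
                                 std::uint64_t texture, std::uint64_t mesh,
                                 std::uint64_t params) {
    return (scene << 56) | (shader << 44) | (texture << 32) |
           (mesh << 16) | params;
}
```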
RenderCommands are pushed into a RenderCommandQueue. The queue is typed (templated) on a RenderCommandInterpreter.
The RenderCommandInterpreter interprets the RenderCommand and issues graphic API calls, from state setting to draw calls. It can and should perform state caching to avoid issuing duplicated commands. No state shadowing is thus required in the graphic API or its abstraction layer.
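In sketch form (hypothetical names again), an interpreter is a tiny thing, and the state caching really is just remembering the last value it saw:

```cpp
#include <cstdint>

// Hypothetical interpreter interface: each interpreter receives its slice
// of the command and is the only thing allowed to talk to the graphic API.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

class TextureInterpreter : public RenderCommandInterpreter {
public:
    void interpret(std::uint64_t bits) override {
        if (bits == lastBits) return;   // trivial state caching
        lastBits = bits;
        // bindTexture(textures[bits]); // the actual API call would go here
    }
private:
    std::uint64_t lastBits = ~0ull;     // "nothing bound yet"
};
```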
The engine will provide a number of RenderCommandInterpreters. The most basic one is the MainDispatcher, which contains an array of RenderCommandInterpreters, takes a fixed number of most significant bits out of the RenderCommand, uses those to index the array, and dispatches the rest of the string of bits to the selected interpreter.
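A possible sketch of it, with the number of top bits as a compile-time constant so the shift amounts are fixed:

```cpp
#include <cstdint>
#include <vector>

// Same interface as in the earlier sketch.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// The top TopBits of the command pick one registered interpreter, the
// remaining bits are handed to it. TopBits is a compile-time constant so
// the shift amounts are fixed.
template <unsigned TopBits>
class MainDispatcher : public RenderCommandInterpreter {
    static_assert(TopBits > 0 && TopBits < 64, "TopBits must be in 1..63");
public:
    void registerInterpreter(std::uint64_t index, RenderCommandInterpreter* i) {
        if (interpreters.size() <= index) interpreters.resize(index + 1, nullptr);
        interpreters[index] = i;
    }

    void interpret(std::uint64_t command) override {
        const std::uint64_t index = command >> (64 - TopBits);  // top bits
        const std::uint64_t rest  = command & (~0ull >> TopBits);
        interpreters[index]->interpret(rest);
    }

private:
    std::vector<RenderCommandInterpreter*> interpreters;
};
```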
The most common subclass of the MainDispatcher is the SceneInterpreter, which, before dispatching the command, sets a rendertarget, also associated with the index it uses to select the RenderCommandInterpreter.
Another common RenderCommandInterpreter is the SubcommandDispatcher, which, like the MainDispatcher, contains different RenderCommandInterpreters, but instead of selecting one based on some bits of the command, it associates different bit substrings of the RenderCommand with each of them. That means it chops the RenderCommand, extracting substrings at fixed positions, and passes each substring to a registered RenderCommandInterpreter (so it associates the latter with the former).
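A naive sketch of it, with runtime shifts just to show the chopping (the implementation note below explains why you would not actually want variable shifts on the consoles):

```cpp
#include <cstdint>
#include <vector>

// Same interface as in the earlier sketch.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// Every registered interpreter owns a fixed [shift, shift+width) slice of
// the command, and gets only that slice.
class SubcommandDispatcher : public RenderCommandInterpreter {
public:
    void registerInterpreter(unsigned shift, unsigned width,
                             RenderCommandInterpreter* i) {
        slices.push_back({shift, width, i});
    }

    void interpret(std::uint64_t command) override {
        for (const Slice& s : slices) {
            const std::uint64_t mask = (s.width >= 64)
                ? ~0ull : ((1ull << s.width) - 1);
            s.interpreter->interpret((command >> s.shift) & mask);
        }
    }

private:
    struct Slice { unsigned shift, width; RenderCommandInterpreter* interpreter; };
    std::vector<Slice> slices;
};
```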
You've probably started to get the idea. Other than those, the other RenderCommandInterpreters, that will operate on parts of the RenderCommand, will be things like MeshInterpreter, TextureInterpreter, ShaderInterpreter, ShaderParamBlockInterpreter (or you might prefer to collapse the former three into a MaterialInterpreter...), etc...
Implementation note: the dispatchers should either be templated or hardcoded to avoid virtual function calls and bit shifts by variable amounts, both of which are very slow on the PowerPC-based platforms (Ps3, 360, Wii...). Templating the SubcommandDispatcher is tricky, as you can't template on a variable number of parameters in C++, so you're limited to chopping the string at one point and containing two interpreters, one for the head and one for the tail of the chopped string. By concatenating SubcommandDispatchers in a kinda ugly template definition, you get the ability to dispatch arbitrary substrings to many interpreters... In C# generics work only on types and not on values, so you can't template the bit positions; hardcoding is the way. And it's also simpler: letting the users hardcode those decisions makes the code shorter, so I would strongly advise not to use templates there.
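The head/tail chopping could look roughly like this. It's only a sketch of the idea, and as said, the nested type names you end up writing are not pretty:

```cpp
#include <cstdint>

// Same interface as in the earlier sketches.
class RenderCommandInterpreter {
public:
    virtual ~RenderCommandInterpreter() = default;
    virtual void interpret(std::uint64_t bits) = 0;
};

// Chop at one compile-time point: Head gets the bits above SplitBit, Tail
// gets the bits below. Nesting another SplitDispatcher as the Tail (or
// Head) type chains the chopping to as many interpreters as needed.
template <unsigned SplitBit, class Head, class Tail>
class SplitDispatcher : public RenderCommandInterpreter {
    static_assert(SplitBit > 0 && SplitBit < 64, "SplitBit must be in 1..63");
public:
    SplitDispatcher(Head& head, Tail& tail) : head(head), tail(tail) {}

    void interpret(std::uint64_t command) override {
        head.interpret(command >> SplitBit);                 // upper part
        tail.interpret(command & ((1ull << SplitBit) - 1));  // lower part
    }

private:
    Head& head;
    Tail& tail;
};
```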
The MainDispatcher is peculiar because instead of chopping the RenderCommand into subcommands and sending them to the appropriate handlers (interpreters), it selects a totally different interpreter configuration.
This is because you get a fixed number of bits for each feature; usually those bits will be used directly by the interpreter to select a given API state, e.g. a mesh, so the number of bits limits the number of meshes that you can have, and you might want to register more for the main rendering pass than for generating shadows (that's why the MainDispatcher is usually subclassed into the RenderCommandInterpreter that manages the rendertarget).
Using fixed strings of bits is not only a compact way of encapsulating all the state needed for a drawcall, it's also nice because it makes them easy to sort, by ordering the RenderCommandInterpreter bits so that the ones managing states that are more expensive to change come first (most significant). State caching is trivial: if an interpreter receives the same substring of bits twice in a row, it does not have to do anything (usually).
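Which also means that sorting the whole queue is literally just an integer sort:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Commands sharing the most expensive state (the most significant bits)
// end up adjacent, and within that, commands sharing the next state down,
// and so on along the bit layout.
void sortCommands(std::vector<std::uint64_t>& queue) {
    std::sort(queue.begin(), queue.end());
}
```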
Renderers will initialize rendering resources (i.e. meshes), register them into interpreters and features, and grab handles out of them (note that the same resource can be registered into two interpreters, e.g. you might have one interpreter type for the meshes used in the shadow rendering pass and another for the ones in the main pass). Those handles will be used to compose commands to push into the queue (most of the time, the handle will be an index into an array inside the interpreter, and will be exactly the same as the bit substring that is used to compose the RenderCommand that uses it).
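Sketching the interpreter side of that registration flow (hypothetical names, as before):

```cpp
#include <cstdint>
#include <vector>

struct Mesh { /* vertex/index buffers... */ };

// The renderer hands a resource in once, gets a handle back, and that
// handle is exactly the bit substring it later packs into its commands.
class MeshInterpreter {
public:
    using Handle = std::uint32_t;

    Handle registerMesh(Mesh* mesh) {
        meshes.push_back(mesh);
        return static_cast<Handle>(meshes.size() - 1);
    }

    void interpret(std::uint64_t bits) {
        if (bits == lastBits) return;   // state caching
        lastBits = bits;
        // setStreams(*meshes[bits]);   // the actual API call would go here
    }

private:
    std::vector<Mesh*> meshes;
    std::uint64_t lastBits = ~0ull;
};

// Renderer side, reusing the hypothetical makeCommand layout from earlier:
//   meshHandle = meshInterpreter.registerMesh(&myMesh);
//   RenderCommand cmd = makeCommand(scenePass, shaderHandle,
//                                   textureHandle, meshHandle, 0);
//   queue.push(cmd);
```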
As a side effect of being stateless, multithreading is easier too. First, all the renderers should do an internal update, grabbing data from the game and updating feature managers accordingly. Then all the feature managers can execute their update in parallel. At that point, renderers can render in parallel by pushing rendercommands into a per-thread queue. Queues can be sorted independently, and then merge-sorted together into a single big one. From there on, parallel execution can again happen, in various ways. Probably the simplest one is to just parallelize the MainDispatcher: since that's associated with the render target, we can construct command buffers (native ones) for each of them in parallel, and then send everything to the GPU to execute.
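A minimal sketch of the sort-and-merge step, assuming one plain vector of commands per renderer thread:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using RenderCommand = std::uint64_t;

// Each renderer thread fills its own queue; the per-queue sorts can run in
// parallel, then everything is merged into one ordered stream (the merge
// itself is shown sequentially here for brevity).
std::vector<RenderCommand> mergeQueues(std::vector<std::vector<RenderCommand>>& perThread) {
    std::vector<RenderCommand> merged;
    for (auto& queue : perThread) {
        std::sort(queue.begin(), queue.end());
        std::vector<RenderCommand> tmp;
        tmp.reserve(merged.size() + queue.size());
        std::merge(merged.begin(), merged.end(), queue.begin(), queue.end(),
                   std::back_inserter(tmp));
        merged.swap(tmp);
    }
    return merged;
}
```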
Last but not least, even if I didn't design/implement it, I suspect that with a little care hotloading/swapping and streaming could be easy to do in this system, mainly because we have handles everywhere instead of directly managing the resources...