10 March, 2009

My little rendering engine

Almost two years ago I wrote a small series of articles for an Italian programming magazine, about 3d engine design. Along with the articles, I wrote a small prototype, in C# (OpenGL/Cg via the Tao Framework).

Those articles were never published (we didn't agree on the money, in the end) and then I forgot about it for a long time, until a few weeks ago when I started reasoning again about it. So to help myself, I'll write one of my trademark badly written and way too long posts... Enjoy.

Note: my posts are really too long and for sure, badly written. Luckily I usually put in italics stuff that is... let's say less essential. Still if you have time I don't see why you should skip any of my ugly and not proofread nonsense...

First of all, I have to provide a little bit of background... At that time I was towards the end of my work on a brand new, built from scratch 3d engine. We were a small team, five people for a 3d engine that was running on five platforms, 360, Ps3, Wii, Psp and PC, plus the pipeline and artists tools AND the game rendering (that's to say, both the integration of the new engine and the coding of all the effects, shaders etc).

It was an incredible accomplishment, we did an impressive amount of work, most of us were new to nextgen console development too, we didn't have many devkits, a lot of work was done even before having them, the platforms were still kinda new, it's a story that if told now, sounds like the one of a grandpa telling of the war, when people suffered from starvation. But the end results were not pretty.

I've learned a lot from that experience, both about how to do things (when I started that job I was just out of university, I did have a strong background in rendering, and a lot of not too useful knowledge, but my last working realtime code was from years before, as during my master's I abandoned everything C/C++/assembly related to focus only on playing with different languages) and even more about how not to do them...

Basically the engine was made as an abstraction layer over the native APIs (sounds good, but it was far from being untangled from the rest of the code, so it wasn't really a layer) with a scenetree on top of it. The scenetree was made of "fat" objects, containing everything they needed to render (or do what they needed to do) plus other things (to avoid too many virtual calls/code paths base classes were bigger than needed). Objects did have the usual render, update, prerender, etc. calls, they did somewhat set renderstates and textures, and somewhat somewhere issue draw calls. It was this, plus tons of other things, from serialization to math, from networking to sound, plus other awful stuff added to make awful stuff less slow, bloat and duplicated code.

With this in mind, the ideas I had for my design were:
  1. Stateless
  2. As simple as possible/as fast as possible
  3. Not generic, but extendable, people have to write their own stuff to render
  4. No scene tree

4 is easy, you just don't make it a scenegraph based engine. 3 is fundamental, but it's easy to follow. 2 is a lot about implementation, more than the overall design idea. 1 was the real thing to reason about.

In a moment I'll write about it, but first, a couple of disclaimers. First of all, nothing that follows is entirely new (or probably, new at all). I find this design interesting because it's rare to see it: if you only learn from books and from opensource 3d engines, most probably you never saw anything similar. But it's not new for sure. Second, every engine I worked with or coded had its own defects, sometimes even big ones, but they were all better than what follows just because they were real tech with which real games shipped. My little engine was stuff for a few articles, less than a testbed, absolutely not proven in the real world. Last but not least, my comments on the pitfalls of that engine at that time do not imply anything on the current state of it :)

Let's lay down the basic design:

  • We need objects that do the rendering, of a given graphical feature. Let's call those "Renderers".
  • Renderers do not directly interact with the underlying graphic API (or its abstraction layer). In fact no one does, other than some very specific classes that are behind a very specific system. You can't set streams, textures, render targets, and of course even less you can set render states.
  • Renderers should not directly do anything substantial. No culling code, no fancy cpu vertex skinning or gpu one... Those are rendering features that should be coded in feature managers (i.e. bounding sphere culling manager), and Renderers should register themselves to use those. Feature managers are about code that does not need to deal directly with the graphic API (i.e. meshes, textures etc), as we said, almost no one has access to it.
  • Feature managers are nice to have because they reduce code duplication, allow more modularity, and make a lot of times things faster too, as they can keep close in memory the datastructures they need (i.e. bounding volumes for visibility) instead of having them scattered all over multiple fat objects. Also, their update with a little care can happen on separate threads.
  • Renderers should be mostly written as needed for the game, and they should manage interaction with game (managing internal state changes caused by it, hopefully in a threadsafe way but that's another story). They manage resources needed to render. Put together features. Issue RenderCommands. A renderer class could be the SoccerPlayerRenderer that derives from a generic GpuSkinnedPlayer, that uses BoundingSphereClipper and GpuSkinning feature managers.
  • RenderCommands are fixed sized strings of bits (i.e. 64bits or 128bits).
  • RenderCommands are pushed into a RenderCommandQueue. The queue is typed (templated) on a RenderCommandInterpreter.
  • The RenderCommandInterpreter interprets the RenderCommand and issues graphic API calls, from state setting to draw calls. It can and it should perform state caching to avoid issuing duplicated commands. No state shadowing is thus required in the graphic API or its abstraction layer.
  • The engine will provide a number of RenderCommandInterpreters. The most basic one is the MainDispatcher, that contains an array of RenderCommandInterpreters, and takes a fixed number of most significant bits out of the RenderCommand and uses those to index the array, and dispatch the rest of the string of bits to it.
  • The most common subclass of the MainDispatcher is the SceneInterpreter, that before dispatching the command, sets a rendertarget, also associated with the index it uses to select the RenderCommandInterpreter.
  • Another common RenderCommandInterpreter is the SubcommandDispatcher, that as the MainDispatcher contains different RenderCommandInterpreters, but instead of selecting one based on some bits of the command, it associates different bits substrings of the RenderCommand to each of them. That means, it chops the RenderCommand extracting substrings in fixed positions, and passes each substring to a registered RenderCommandInterpreter (so it associates the latter with the former).
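To make the dispatching idea in the list above concrete, here is a minimal C++ sketch of a MainDispatcher. It is only an illustration under assumptions: the 64-bit command width, the 4 selector bits, and the names `PassInterpreter`, `kPassBits` and `MakeCommand` are made up for the example, and the interpreter just records what it receives instead of calling a graphics API.

```cpp
#include <array>
#include <cstdint>
#include <vector>

using RenderCommand = std::uint64_t;

// Hypothetical layout: the 4 most significant bits select the interpreter.
constexpr unsigned kPassBits = 4;
constexpr unsigned kPayloadBits = 64 - kPassBits;
constexpr std::uint64_t kPayloadMask = (std::uint64_t{1} << kPayloadBits) - 1;

// Stand-in for a per-pass interpreter; a real one would issue API calls.
struct PassInterpreter {
    std::vector<std::uint64_t> received;  // payloads seen, for illustration only
    void Interpret(std::uint64_t payload) { received.push_back(payload); }
};

struct MainDispatcher {
    std::array<PassInterpreter, (1u << kPassBits)> passes;

    void Interpret(RenderCommand cmd) {
        // Shift and mask use compile-time constants, no variable shifts.
        const std::uint64_t pass = cmd >> kPayloadBits;
        passes[pass].Interpret(cmd & kPayloadMask);
    }
};

// Builds a command from a pass index and the remaining bits.
constexpr RenderCommand MakeCommand(std::uint64_t pass, std::uint64_t payload) {
    return (pass << kPayloadBits) | (payload & kPayloadMask);
}
```

A SceneInterpreter in the sense of the list would do the same selection, plus binding the render target associated with the selected index before forwarding the payload.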

You've probably started to get the idea. Other than those, the other RenderCommandInterpreters, that will operate on parts of the RenderCommand, will be things like MeshInterpreter, TextureInterpreter, ShaderInterpreter, ShaderParamBlockInterpreter (or you might prefer to collapse the former three into a MaterialInterpreter...), etc...

Implementation note: The dispatchers should either be templated or hardcoded to avoid virtual function calls and bit shifts by variable amounts, both of which are very slow on PowerPC based platforms (Ps3, 360, Wii...). Templating SubcommandDispatcher is tricky as you can't template on a variable number of parameters in C++, so you're limited to chopping the string at one point and containing two interpreters, one for the head and one for the tail of the chopped string. By concatenating SubcommandDispatchers in a kinda ugly template definition, you get the ability to dispatch arbitrary substrings to many interpreters... In C# generics work only on types and not on values, so you can't template the bit positions, hardcoding is the way. And it's also simpler: letting the users hardcode those decisions makes the code shorter, so I would strongly advise not to use templates there.
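For the curious, the head/tail chopping described in the note could look roughly like the following C++ sketch. The template parameters, the `RecordingInterpreter` leaf and the 16-bit layout are assumptions made for the example; it shows the chaining trick, not a recommendation over hardcoding.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical leaf interpreter: records the bits it is given.
struct RecordingInterpreter {
    std::vector<std::uint64_t> seen;
    void Interpret(std::uint64_t bits) { seen.push_back(bits); }
};

// Splits the low TotalBits of a command at a fixed point: the top HeadBits
// go to Head, the remaining TotalBits - HeadBits go to Tail.
template <unsigned TotalBits, unsigned HeadBits, class Head, class Tail>
struct SubcommandDispatcher {
    static_assert(HeadBits > 0 && HeadBits < TotalBits && TotalBits <= 64,
                  "invalid split");
    static constexpr unsigned kTailBits = TotalBits - HeadBits;

    Head head;
    Tail tail;

    void Interpret(std::uint64_t bits) {
        // All shift amounts are compile-time constants.
        head.Interpret((bits >> kTailBits) & ((std::uint64_t{1} << HeadBits) - 1));
        tail.Interpret(bits & ((std::uint64_t{1} << kTailBits) - 1));
    }
};

// Chaining two dispatchers: 16 bits = 4 (shader) | 12,
// and those 12 bits = 6 (texture) | 6 (mesh).
using TextureAndMesh =
    SubcommandDispatcher<12, 6, RecordingInterpreter, RecordingInterpreter>;
using DrawDispatcher =
    SubcommandDispatcher<16, 4, RecordingInterpreter, TextureAndMesh>;
```

The nested type names already hint at why this gets ugly quickly with more than a handful of interpreters.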

The MainDispatcher is peculiar because instead of chopping the RenderCommand into subcommands and sending them to the appropriate handlers (interpreters), it selects a totally different interpreter configuration.

This is because you get a fixed number of bits for each feature, usually those bits will be used directly by the interpreter to select a given API state, i.e. a mesh, so the number of bits limits the number of meshes that you can have, and you might want to register more for the main rendering pass than for generating shadows (that's why, usually the MainDispatcher is subclassed into the RenderCommandInterpreter that manages the rendertarget).

Using fixed strings of bits is not only a compact way of encapsulating all the state needed for a drawcall, it's also nice as it allows to easily sort them, by ordering the RenderCommandInterpreter bits, placing first (most significant) the ones that manage states that are more expensive to change. State caching is trivial, if an interpreter receives twice the same substring of bits to process, it does not have to do anything (usually).
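A small C++ sketch of why sorting plus trivial state caching pays off. The 16-bit layout (shader bits above mesh bits) and the names `Make`, `ShaderInterpreter` and `CountShaderSwitches` are assumptions for illustration; the interpreter only counts the switches it would have to issue.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using RenderCommand = std::uint16_t;  // tiny command, for illustration only

// Hypothetical layout: [shader:8][mesh:8]. The shader is assumed to be the
// more expensive state to change, so it gets the most significant bits.
constexpr RenderCommand Make(unsigned shader, unsigned mesh) {
    return static_cast<RenderCommand>(((shader & 0xFFu) << 8) | (mesh & 0xFFu));
}

// Interpreter with trivial state caching: it counts the shader switches
// that would actually reach the API.
struct ShaderInterpreter {
    int last = -1;  // nothing bound yet
    int switches = 0;
    void Interpret(unsigned shader) {
        if (static_cast<int>(shader) == last) return;  // same bits, nothing to do
        last = static_cast<int>(shader);
        ++switches;
    }
};

int CountShaderSwitches(std::vector<RenderCommand> queue, bool sorted) {
    // Sorting the raw integers groups commands by their most significant bits.
    if (sorted) std::sort(queue.begin(), queue.end());
    ShaderInterpreter shaders;
    for (RenderCommand cmd : queue) shaders.Interpret(cmd >> 8);
    return shaders.switches;
}
```

With four commands that alternate between two shaders, the unsorted queue causes four switches and the sorted one only two.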

Renderers will initialize rendering resources (i.e. meshes), register them into interpreters and features and grab handles out of them (note that the same resource can be registered into two interpreters, i.e. you might have two different types for the interpreter used for meshes in the shadow rendering pass and another for the ones in the main pass). Those handles will be used to compose commands to push in the queue (most of the times, the handle will be an index into an array of the interpreter, and will be exactly the same as the bit substring that is used to compose the RenderCommand that uses it).

As a side effect of being stateless, multithreading is easier too. First all the renderers should do an internal update, grabbing data from the game and updating feature managers accordingly. Then all the feature managers can execute their update in parallel. At that point, renderers can render in parallel by pushing rendercommands in a per-thread queue. Queues can be sorted independently, and then merge-sorted together in a single big one. From there on, parallel execution can again happen, in various ways. Probably the simplest one is to just parallelize the MainInterpreter, in case that's associated with the render target, we can construct command buffers (native ones) for each of them in parallel, and then send everything to the GPU to execute.
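The sort-then-merge step can be sketched in a few lines of C++. This is single-threaded on purpose, to keep the example short: in a real engine each per-thread sort could run on its own worker. The names `Queue` and `MergeQueues` are assumptions for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using RenderCommand = std::uint64_t;
using Queue = std::vector<RenderCommand>;

// Sorts each per-thread queue on its own (this is the part that could run
// in parallel), then folds them into one sorted queue with linear merges.
Queue MergeQueues(std::vector<Queue> perThread) {
    Queue merged;
    for (Queue& q : perThread) {
        std::sort(q.begin(), q.end());
        Queue next;
        next.reserve(merged.size() + q.size());
        std::merge(merged.begin(), merged.end(), q.begin(), q.end(),
                   std::back_inserter(next));
        merged.swap(next);
    }
    return merged;
}
```

Because commands are plain integers, no custom comparison is needed anywhere.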

Last but not least, even if I didn't design/implement it, I suspect that with a little care hotloading/swapping and streaming can be easy to do in this system, mainly because we have handles everywhere instead of managing directly the resources...


PypeBros said...

I've really got to read this one more in depth. Thanks for taking the time to tell us about your experience.

Anonymous said...

Where does the camera fit into all that ?! Is it a RenderCommand ?

DEADC0DE said...

Camera information and anything else that's per-scene can easily go in the top-level interpreter, the one I called SceneInterpreter. RenderCommands are all the same, they are the equivalent of a draw call, encapsulating ALL the state needed for that draw call. As explicitly saving that state in a command would take an enormous space, the commands are really indices in state managers, that keep that information. Those indices are concatenated into a long (usually 64 or 128 bit wide) fixed string of bits, for performance reasons and to also provide an easy way of sorting.

DEADC0DE said...

Oh, and there can be RenderCommandInterpreters with no associated bits, in that case they don't do anything per-drawcall (rendercommand) but only at the beginning/end of scene drawing (RenderCommandInterpreters have a begin-endscene callback, as some of them need to handle double buffering of cpu-updated-gpu-resources) and vice versa, there can be RenderCommandInterpreters that have bits associated but do nothing, and in that case they provide only those bits as a sort key (i.e. you might want to use that technique to ensure your drawcalls get sorted in buckets, or get roughly sorted in z-order reserving a few bits in the command to store the object to camera distance)
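As a rough C++ illustration of "bits that are only a sort key", the distance to the camera can be quantized into a few bits that no interpreter acts on. The bit count, the linear mapping and the name `DepthSortBits` are assumptions for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical: 4 bits of the command are reserved for coarse depth buckets.
constexpr unsigned kDepthBits = 4;

// Maps a distance in [nearPlane, farPlane] to a bucket in [0, 2^kDepthBits - 1].
// Smaller buckets sort first, which gives a rough front-to-back order.
std::uint32_t DepthSortBits(float distance, float nearPlane, float farPlane) {
    const float t = (distance - nearPlane) / (farPlane - nearPlane);
    const float clamped = std::min(std::max(t, 0.0f), 1.0f);
    const std::uint32_t maxBucket = (1u << kDepthBits) - 1;
    return static_cast<std::uint32_t>(clamped * static_cast<float>(maxBucket));
}
```

Where those bits sit in the command decides how much they matter: above the material bits they dominate the order, below them they only break ties.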

Anonymous said...

Hello kenpex. I enjoyed your post. I realize that "My little rendering engine" is old code of yours, and only discussed here as a discussion example, but the idea of scenegraph-free rendering has become more important to me of late.

From what I can see in your scheme, you keep the scene description very separate from rendering, and only allow the interpreters to actually communicate with the API. I was curious how you would handle a game object with a material that references a texture. Could you describe, theoretically, how you encapsulate texture states in the RenderCommands in your scheme.

Hopefully, that is not too ambiguous a question.

DEADC0DE said...

jai: it's actually very easy. each command is made of different parts, fixed (in their position) substrings of bits.
Those bits are used to communicate to a subsystem which configuration to use for that draw call.

An example:
Let's assume we have only two command interpreters in our system, one for textures, the other for meshes. Let's say our commands are 8-bit wide, and that the bits are arranged like this:

Command: xxxxyyyy

xxxx are bits handled by the TextureInterpreter.

yyyy are bits handled by the MeshInterpreter.

The object renderer, in its initialization will create some resources, i.e.

Texture tex = LoadTexture("foo.jpg");
textureHandle = TextureInterpreter::Register(tex);

Mesh mesh = LoadMesh("bar.collada");
meshHandle = MeshInterpreter::Register(mesh);

RenderCommand cmd = 0;
cmd |= TextureInterpreter::CreateSubcommand(textureHandle);
cmd |= MeshInterpreter::CreateSubcommand(meshHandle);


Now of course this is an example, it's not real code. In real life you might want to organize your structures in a different way, for sure. You might not want to have a single texture for example but let the TextureInterpreter (or whatever name you want to call it) manage states that represent all the sampler bindings. You want it to handle sampler states too (i.e. mipmapping etc), so you won't register a single texture, you will register into it a more complicated structure. Maybe you'll want your textures to be reference counted, etc etc etc... Also you don't want to create the command each frame, if it doesn't change etc etc... It's just an example. Hopefully this will shed some light on how simple that stuff is.
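For readers who want something they can compile, here is one way the 8-bit xxxxyyyy example could be fleshed out in C++. It is a sketch under assumptions: resources are represented by their file names only, the interpreters are instances rather than static classes, and `HandleInterpreter` and `Decode` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using RenderCommand = std::uint8_t;  // layout: xxxxyyyy

// Generic 4-bit interpreter; Shift is where its bits live in the command.
template <unsigned Shift>
class HandleInterpreter {
public:
    // Registers a resource (here just its file name) and returns a handle,
    // which is simply the index into the internal array.
    unsigned Register(const std::string& resource) {
        assert(resources_.size() < 16 && "only 4 bits of handles available");
        resources_.push_back(resource);
        return static_cast<unsigned>(resources_.size() - 1);
    }
    // Shifts the handle bits into this interpreter's position.
    static RenderCommand CreateSubcommand(unsigned handle) {
        return static_cast<RenderCommand>((handle & 0xFu) << Shift);
    }
    // Extracts this interpreter's bits and looks the resource up again.
    const std::string& Decode(RenderCommand cmd) const {
        return resources_.at((cmd >> Shift) & 0xFu);
    }
private:
    std::vector<std::string> resources_;
};

using TextureInterpreter = HandleInterpreter<4>;  // xxxx
using MeshInterpreter = HandleInterpreter<0>;     // yyyy
```

Composing a command then works exactly like in the pseudocode above: OR together the subcommands returned by each interpreter.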

Anonymous said...

OK, you're right that's pretty simple. Thanks for the additional information. In your scheme, is the TextureInterpreter actually doing the bind() in the API, or is that happening inside the RenderCommandQueue?

DEADC0DE said...

The interpreters are the only ones that call API stuff. The queue is just a (sortable) queue, no logic there.

DEADC0DE said...

as I wrote in the post, the nice thing about commands is that they sort easily and meaningfully, the queues make multithreading very simple, and having the interpreters call the API means that caching/shadowing of API states is trivial and fast, as it happens at a high level

Anonymous said...

OK, I've reread your post with the comments and I think I understand more clearly now. One last question: are the Interpreters the consumer of the RenderQueue and the Game is the producer?

DEADC0DE said...

No, not directly. Game sends updates to the renderers (in some thread friendly way, i.e. using a buffer), renderers create rendercommands.

The best idea is (note, each step in a thread is a sync point, i.e. waits on all the parallel operations):

Game Thread:
- game objects update, writes into a (double) buffer

Render Thread:
- in parallel (that's to say, spawning new threads, or better, work units for the threadpool): each renderer reads the renderer configuration from the buffer, updates the "feature" managers (i.e. culling information, animations etc)
- in parallel: "feature" managers update, i.e. skinning computations, culling etc
- in parallel: renderers push rendercommands into per thread queues
- queues are merged and sorted
- API calls are made (command buffers are built, this can happen again in parallel but it's tricky)

DEADC0DE said...

p.s. if you implement this system in your engine, credits would be appreciated ;)

~Main said...

How does your setup expand for per-frame memory coherence? IE it's one thing to pass down a command that contains a pointer to a texture & VB to use in rendering, it's another thing to guarantee that the data will be resident in memory once the command gets to the hardware.

I'm guessing for each of your resources, they are owned by the render side of things, and the sim gets handles to it? Where some handles change based upon temporary objects? (IE dynamic VBs)

DEADC0DE said...

Main: I don't know if I understood your question.

Probably not, but what I understood is along the lines of: "your system defers the execution of the API calls out of the renderer, how can I do stuff that requires to know when a given command gets executed, i.e. cpu updated textures or streams?"

If that was the question, the answer is simple. How would you do it normally? Most probably inserting fences and double buffering right? So here's the same! I didn't implement this stuff, but I see a couple of ways.

First way, you just design your MeshInterpreter or whatever to support dynamic meshes and do all the double buffering and fence stuff. Or you write a DynamicMeshInterpreter and use a bit in the CommandStream to decide whether to dispatch the mesh handle bits to one or the other...

The second is more convoluted, but it's an example of how flexible the base idea is. Instead of modifying the MeshInterpreter, you register two meshes instead of one, and handle the double-buffering in the Renderer by emitting different commands each frame (pointing at the first or second copy of the mesh). You still need the fences tho! Well that's easy, you create a GpuSyncInterpreter or whatever, register two fence objects into it, and reserve some bits in the command for them.

This second solution is not really good tho, as the fences will be needed just by a few drawcalls, so reserving bits for them in the command is a waste... Also, commands get sorted by their bit strings, and sorting on fences is not very reasonable...

Last note, unrelated. I wrote in the previous pseudocode snippet:

cmd |= TextureInterpreter::CreateSubcommand(textureHandle);

this means that the interpreter creates the command bits relative to it... that usually means that it just shifts the handle bits into the correct position, but you can also do other things, to get a different ordering... it's not something that I would recommend tho, but it's possible

Anonymous said...

p.s. if you implement this system in your engine, credits would be appreciated ;)

If I utilize any of these concepts I will both credit, and post here to share.


Orchaldir said...

How does this handle batching? I just started reading about it, so maybe it should be pretty clear.

DEADC0DE said...

Orchaldir: It's all about batching as the commands are pushed in a queue that is then sorted to minimize state changes (one of the basic ideas there is to associate least significant bits of the command to managers that handle states that are the least expensive to change...)

DEADBEEF said...


Interesting post. I still didn't fully understand how this handles dynamic resources. Like dyn. VBs or textures. Do you have to provide additional parameters which might be pointers or some handles to memory pools whatever?
Another thing is how are the hierarchies evaluated? Is there still an implicit "scenegraph" at least between the objects?
Also template specialization idea is not obvious. I mean we're getting all these flag combinations in run-time how can you actually redirect the call to particular dispatcher at compile-time without some managers with virtual calls or big switches?

DEADC0DE said...

deadbeef:
- dynamic resources are easy, the catch is that when I say that the state is immutable, I mean immutable in a frame, not across frames. Of course for dynamic meshes and textures you might want to add some logic to handle double or more buffers to avoid gpu/cpu locks... this can be done inside or outside the managers, as you want, at different levels of abstraction
- implicit scenegraph, yes of course, there would be a lot of them, probably you'll need one for culling for example. hierarchical animations are not too common, and should be handled by the game-side animation player, not by the rendering. Anyway all those implicit scenegraphs as you call them, are totally optional, hierarchies are a powerful concept that could be useful, but there's no need to have them as the main data structure
- templates, they are less than obvious, and honestly it's better to avoid them, just hardcode as you need the dispatching function that takes the entire command bit string and takes decisions, splits bits and dispatches them to managers as needed... each user of the engine should hardcode that function based on their needs, I don't see much of a value in generalizing those composition primitives (yes, even if I did)

Michaël G. (Bakura) said...

Interesting reading. How do you handle effects in this design? For instance, let's say you want to render shadow maps for each light in the scene?

Is there a kind of ShadowMapFeature (so MeshRenderer will register into this) that will create, for each light, render commands?

What about the so-called ShaderParamBlockParameter? How are shader variables updated with this system?

Sorry for those silly questions.

DEADC0DE said...


Shadowmaps should be handled by your "scene" context. In my post I used some arbitrary names and terminology, I said that you will have a thing called "SceneInterpreter" that is associated with the most-significant bits of your command, and sets the rendertargets etc... So if you want some objects to be shadowmapped, you submit commands to render them twice, a set of command for the "main" scene, and a set of commands for the "shadow" one. Then commands get sorted and rendering happens in the appropriate buffers. You might want to include in the commands the appropriate materials for the two passes, or specialize your SceneInterpreter so when it encounters a command that has encoded in its bits the "shadow" scene setting, it also sets (overrides) the appropriate shaders...
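In C++ terms, "submit the commands twice" could look like the sketch below. The 32-bit command, the two scene bits at the top and the names `Scene`, `Make`, `SubmitShadowCaster` and `SceneOf` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using RenderCommand = std::uint32_t;

// Hypothetical scene indices, stored in the 2 most significant bits.
// The shadow scene has the smaller index, so after sorting its commands
// come before the main scene that samples the shadow map.
enum Scene : std::uint32_t { kShadowScene = 0, kMainScene = 1 };
constexpr unsigned kSceneShift = 30;
constexpr std::uint32_t kDrawMask = (1u << kSceneShift) - 1;

constexpr RenderCommand Make(Scene scene, std::uint32_t drawBits) {
    return (static_cast<std::uint32_t>(scene) << kSceneShift) | (drawBits & kDrawMask);
}

// A shadow-casting object submits one command per scene it appears in.
void SubmitShadowCaster(std::vector<RenderCommand>& queue, std::uint32_t drawBits) {
    queue.push_back(Make(kMainScene, drawBits));
    queue.push_back(Make(kShadowScene, drawBits));
}

Scene SceneOf(RenderCommand cmd) {
    return static_cast<Scene>(cmd >> kSceneShift);
}
```

The two commands could also carry different material bits per pass, which is the first of the two options described above.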

About the parameters... Think about DirectX10/11. Instead of having single parameter granularity, you create a block of parameters. You set all the parameters in the block, and encode the block handle in the command

Demiurge said...

I've opened a discussion on gamedev on rendering design, a little brainstorming, in which everyone can share their opinion.
This post is one of the readings that got my brain thinking about different ways of handling the same problem - rendering - so I would like you to see what happens in the discussion.

Thank you!

Gabriel (aka Joren, Demiurge...)

DEADC0DE said...

Demiurge: nice. Now that I read this thing again, I have to say I did a pretty bad job at explaining it. It goes too much into irrelevant implementation details, and too little into things that are actually important, like minimizing cache misses.

Demiurge said...

have shared your idea, writing down what you think about.
The powerful beauty of sharing and writing down reasoning is that after some time you can read back what you wrote and understand deeply your thoughts!
Also explaining it to other people is a great litmus test.
What I am doing on Gamedev is simply a shared brainstorming, and I found in your post some very interesting ideas.
What I want to explore is a solution powerful and flexible, something that could give the developer a great degree of freedom but also SPEED!
Maybe you can join the discussion, or if you haven't coded your rendering engine, we can discuss about it and find all the caveats and solutions!

DEADC0DE said...

From what I can see, it's good, the only drawback that I see in it is that you are basically working with handles all the times, so you pay that on the cache when going from the hashes to the resources...

An alternative to avoid that is to record abstracted commands that embed pointers to the native resources. In my engine test, a draw command is a short bit string made of handles, that is both the command and the sorting key for it.

I.e a command is, for example
Framebuffer handle...Texture handle...mesh handle

An alternative is to record commands/pointers + a sort key for all of them. That takes more space, but avoids the indirection. To do the same draw, you'll record something like

settexture...pointer + sortkey
setmesh...pointer + sortkey

If the sort is stable, then you can rearrange your recorded abstracted commands (something you can't do with the native ones) and not pay any cache misses. The downside is that your record buffer can be longer (more misses!), and the whole thing is less abstracted (that could be good!).

Notice that in this scheme, all the sortkeys can be stored in a separate array, as they're only used in the sorting pass, that makes sense. Also you could still cull redundant commands when recording, thus making sure your recorded stuff is not too big. Deriving the right sortkeys can be a bit of a problem though.
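A possible C++ shape for this alternative, as a sketch only: the `Op` enum, the `Recorded` struct and the `RecordBuffer` class are assumptions, and the resource pointers are opaque placeholders for native objects.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical recorded command: an opcode plus a pointer to a native resource.
enum class Op { SetTexture, SetMesh, Draw };
struct Recorded {
    Op op;
    const void* resource;  // used directly at playback, no handle lookup
};

struct RecordBuffer {
    std::vector<Recorded> commands;
    std::vector<std::uint32_t> sortKeys;  // separate array, read only while sorting

    void Push(Op op, const void* resource, std::uint32_t key) {
        commands.push_back({op, resource});
        sortKeys.push_back(key);
    }

    // Returns command indices in execution order. The sort is stable, so
    // commands sharing a key keep their recording order
    // (set texture, set mesh, draw).
    std::vector<std::size_t> SortedOrder() const {
        std::vector<std::size_t> order(commands.size());
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::stable_sort(order.begin(), order.end(),
                         [this](std::size_t a, std::size_t b) {
                             return sortKeys[a] < sortKeys[b];
                         });
        return order;
    }
};
```

All the commands that belong to one draw must share the same key here, which is one face of the "deriving the right sortkeys" problem.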

DEADC0DE said...


It might seem that the scheme I described, as it uses handles and requires branches for state-culling, would be very expensive when it comes to the actual issue of the rendering calls, after sorting.

It is not. Allow me to demonstrate. Let's say that we organize the bits in this way:

[framebuffer][zbufferbits][renderstates][low frequency shader params][textures][high frequency shader params / material][mesh]

The trivial decoding of this would require the extraction of each bit substring, probably comparing it with a stored cache of the last value, if different issue the string to the "manager" that uses it as an index into an array where it has stored the relevant settings.

So a lot of branches and cache misses, right?

Well, not really. Let's note for example, that the framebuffer does not change often at all. We could make our test so that the most frequent case (new == cached) is the predicted one, thus saving the misprediction.

The zbufferbits are there just as a sortkey, no branches, no cache misses, they are not linked to a resource.

The renderstates follow the suit of the framebuffer. Most of the times, you will have a set of renderstates for opaque object, a set for transparent, and some other special ones. Not many, and they won't change often. That's so true actually, that you could just unify the previous-value-cache for both the framebuffer and the renderstates, checking the substring made of the bits of both. Even better, you can unify the two managers, thus saving even some bits...

Low frequency shader params... Those would be stuff like camera matrices, and per-frame stuff. Same reasoning...

What about the material params and the mesh? Well, you might predict there that most of the time you will fail the cache there... Actually, for the meshes it's not useful to do the check at all.

What about the cache misses? For the meshes, that we expect to change at each command, you can pay a lot... Well, yes, it's true.

But let's reason about that, how can we mitigate the situation?
Two ways are possible. The first is to reserve a chunk of memory for the meshes, allocate them there, and store in the command not a handle to the mesh (index in an array of resource-pointers) but directly the offset of the mesh in this chunk of memory. 28/27 bits might be enough.

Or you might in some cases know that you're going to draw a bunch of meshes together, and so store them next to each other in the pointer array... Effectively, you could even merge them together, and use start/end indices to draw the submeshes, or so on.
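To tie the reasoning of this comment together, here is a compact C++ sketch of such a decode loop. The 32-bit layout is simplified with respect to the one above, and the `Decoder` struct with its counters is an assumption for the example: it counts state changes instead of issuing API calls.

```cpp
#include <cstdint>
#include <vector>

using RenderCommand = std::uint32_t;

// Hypothetical layout:
// [framebuffer + renderstates : 8][material : 8][mesh offset : 16]
struct Decoder {
    std::uint32_t lastPassBits = ~0u;  // framebuffer and renderstates share one cache
    std::uint32_t lastMaterial = ~0u;
    int passChanges = 0;
    int materialChanges = 0;
    std::vector<std::uint32_t> meshOffsets;  // stand-in for the emitted draws

    void Decode(RenderCommand cmd) {
        const std::uint32_t passBits = cmd >> 24;
        const std::uint32_t material = (cmd >> 16) & 0xFFu;
        const std::uint32_t meshOffset = cmd & 0xFFFFu;

        // Rarely changes after sorting: one compare covers both managers.
        if (passBits != lastPassBits) {
            lastPassBits = passBits;
            ++passChanges;
        }
        if (material != lastMaterial) {
            lastMaterial = material;
            ++materialChanges;
        }
        // The mesh is expected to change at every command, so there is no
        // cache check: the bits are an offset into a preallocated mesh pool,
        // not an index into an array of pointers.
        meshOffsets.push_back(meshOffset);
    }
};
```

On a sorted queue the first branch is almost never taken, which is the cheap case for the branch predictor.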

Demiurge said...

This is a really good point!
What if you simply store a pointer to a "data" structure that contains all the data for the current call?
I think that this can solve some problems - and you can allocate the different datas to a pool and use some bits to index them.

Can be helpful?

DEADC0DE said...

Demiurge, not really. Then for each command you have to access that data structure, the accesses are random after sorting, so you will maximize your cache misses. And you still have to branch if you want to do redundant state filtering...

Demiurge said...

If I remember well in this post

the drawing is performed with a map with the pair key-data...maybe this can present the same problem?
Also, the data you will submit to the renderer will be always different for each draw call INSIDE a can you lower cache misses?

Sorry for all these questions, I'm only trying to understand your design that has some common points with the design I have in mind and that I'll develop when all the spot will be clear!

DEADC0DE said...

Demiurge: Sure, Christer Ericson describes exactly the same mechanism that I describe in this post (afaik).

About your question if I understand it correctly you're confusing internal data structures with GPU access.

You can have 100 draw calls, each of them drawing objects that are sparse in memory.

If you don't have cache misses to produce those draw calls, you're fine. So let's say that you have 100 pointers to meshes.

The CPU cache misses that you have to avoid, are the ones required to fetch the pointers, not the data the pointers point to.

That data is never accessed by the CPU, the pointer is directly used to construct the GPU instruction stream.

Every engine has to design to minimize the cache misses of the CPU data. For example a lot of 3d engines use a scenegraph, and meshes are stored inside more general objects.

You might need to go through many hops before reaching the draw data (shader pointers, constant pointers, texture pointers, mesh pointers).

This design, by encoding everything compactly, in some way encourages better programming practices, at the high level. You won't compose the bit strings at runtime most of the times, because it's understood as a complex operation, but you will preprocess them, and just select which ones to submit at runtime.

But introducing another abstraction layer, via the handles, we have to manage cache misses there. And that's what I wanted to clarify in my comment.

BTW, the same handle->GPU resource problem existed in OpenGL. NVidia even started a project to go around that:

Demiurge said...

I understood...maybe a good solution is to treat the commandbuffer more of a stack, so you add the api-dependent resources directly in the stack. When you want to fire a command, you watch also the number of parameters inside the stack: you can see it as a file, in which you store the data you need.
Or maybe you can create a separate buffer with only the raw-params of all the calls you need, and for every call you do you move a pointer to the current area of that buffer.
This will eliminate the problem of handles and lower the cache misses.
Another good solution is to track the resources per type in the device, then use an index to retrieve the api-dependent resource used in the call...

DEADC0DE said...

Demiurge: variable sized commands are possible, but they will make sorting a pain, so nah. And I don't understand how using indices in the device can be any better than using handles.

Unknown said...

this is an old post i know, but the architecture is definitely still relevant.. I'm a bit confused about how FeatureManagers do NOT call graphics api when they implement features for the renderers.

maybe you could use an example similar to the one you posted above

Texture tex = LoadTexture("foo.jpg");

textureHandle = TextureInterpreter::Register(tex);

Mesh mesh = LoadMesh("bar.collada");

meshHandle = MeshInterpreter::Register(mesh);

RenderCommand cmd;

cmd |= TextureInterpreter::CreateSubcommand(textureHandle);

cmd |= MeshInterpreter::CreateSubcommand(meshHandle);


DEADC0DE said...

This is probably one of my worst written posts, sorry for that. I went to explain things in terms of the classes I made in code and not the general concept.

The general concept is really really easy. You just want to roll out your own command buffer instead of the DirectX or OpenGL one, and then translate one to the other.

The reason to do this, which might seem wasteful (and it might be for some usages), is that your own buffer is meant to be sortable, so you can emit draws from many threads in any order, and then still be able to merge and sort into the right order. Another objective is to be able to easily filter out redundant changes between one draw to the other.

The implementation that I described encodes the draws in this command buffer by using a fixed number of bits for each command, and these bits are both your sort key, and the encoding of the entire state of a drawcall.

Mostly this is accomplished by thinking of these bits as indices into arrays of state, for example, the first 3 bits can be an index that decides which combination of rendertargets to use, then your next 5 could be your shader constants selection and so on.

When you generate this command buffer you first ask the various managers for the bit encoding (index) for the state you want. For example, you ask the Texture manager for an index encoding the six textures that you want to bind, the manager goes into its internal arrays, checks if it already has that combination and if so returns its index, otherwise it creates a new one and returns that. The same for shaders, shader constants and so on... Most of the state does not change, so this can be done once, not at every frame. Most of the state that does change will just allocate a unique index for itself and so it can change the state associated with the index without having to allocate a new one each frame.
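A minimal C++ sketch of such a manager, assuming textures are identified by plain integer ids and that a combination is always exactly six of them. `TextureSet`, `TextureManager`, `GetIndex` and `Get` are names made up for the example.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical: a texture is just an integer id here; a real manager would
// hold native texture objects and sampler states.
using TextureSet = std::array<std::uint32_t, 6>;

class TextureManager {
public:
    // Returns the index for this combination of six textures, reusing an
    // existing entry when the same combination was requested before.
    std::uint32_t GetIndex(const TextureSet& set) {
        const auto found = lookup_.find(set);
        if (found != lookup_.end()) return found->second;
        const std::uint32_t index = static_cast<std::uint32_t>(sets_.size());
        sets_.push_back(set);
        lookup_.emplace(set, index);
        return index;
    }
    // Used at translation time: the command bits index straight into the array.
    const TextureSet& Get(std::uint32_t index) const { return sets_.at(index); }

private:
    std::vector<TextureSet> sets_;                // indexed by the command bits
    std::map<TextureSet, std::uint32_t> lookup_;  // only touched at registration
};
```

The map lookup is the slow path and happens once per state combination; the per-frame path only reads the array.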

Then, for every frame, after the culling and whatnot, when you want to emit the draw, you just combine all these indices into a command and enqueue it. No DirectX calls happen in any of these processes...

The DirectX interaction starts to happen only when the command buffer is done, and sorted, and we start translating it into actual GPU commands... Of course this means that you have a lot of indirections, when you have to translate the command buffer into the actual draws, but consider that many of the bits won't change from a draw to the other (as you sort), so no random access into arrays needs to be done.

It's all fast and nice actually and it allows for lots of optimizations here and there. The only big issue with this that I can see is that you need to wait to emit DirectX calls until all the internal command buffer calls are done, as this needs to be sorted and potentially a draw emitted at the end can be sorted to the beginning (in the end, that was our objective). So even if all the internal command enqueueing happens in parallel, the sort happens in parallel, and the DirectX translation can happen in parallel, you still have a synchronization barrier at a point which could be avoided and which could create some issues...