27 March, 2011

Stable Cascaded Shadow Maps - Ideas

Stable CSM intro

A "stable" cascade is nothing more than a fixed projection of your entire world onto a giant texture, of which each frame we render a fixed-size window that fits around the projection of the view frustum, making sure that we always slide this window by an integral number of texels.
As the "window" has to fit the frustum in all cases, one way to determine its size is to fit the frustum in a sphere and then size the window using the radius of that sphere.
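
To make the idea concrete, here is a minimal sketch of the window placement. It assumes an orthographic light projection, a square shadowmap of `resolution` texels and frustum corners already transformed into light space. The function names, the crude centroid-based sphere and the `radius_step` rounding are my own illustrative choices, not taken from any particular engine.

```python
import math

def bounding_sphere(points):
    """Crude bounding sphere: centroid plus the largest distance to it."""
    count = len(points)
    center = tuple(sum(p[axis] for p in points) / count for axis in range(3))
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def stable_window(corners_light_space, resolution, radius_step=1.0 / 16.0):
    """Return (min_x, min_y, size) of the square shadow window in light space."""
    center, radius = bounding_sphere(corners_light_space)
    # Round the radius up to a coarse step so float noise cannot change the
    # window size from frame to frame (radius_step is an arbitrary choice).
    radius = math.ceil(radius / radius_step) * radius_step
    # Reserve one texel of slack: snapping moves the window by up to a texel.
    texel = 2.0 * radius / (resolution - 1)
    size = texel * resolution
    # Snap the window origin down to a whole number of texels.
    min_x = math.floor((center[0] - radius) / texel) * texel
    min_y = math.floor((center[1] - radius) / texel) * texel
    return min_x, min_y, size
```

The window size depends only on the (rounded) sphere radius, so it stays constant while the camera moves and rotates, and the origin always lands on the same world-space texel grid.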

Implementing CSM, especially on consoles, is not that easy. For an open world game you'll notice that you need quite a lot of resolution to get decent results, and cascade rendering can quickly become a problem. On 360, from what I've seen, resolving big shadowmaps from EDRAM to shared memory is very expensive too, so it becomes important to pack the shadowmaps aggressively. Some random "good" ideas are:
  • Render shadows to a deferred shadow buffer, which makes it possible to render one cascade at a time. It also makes it much easier to cross-fade cascades, and makes it possible to render shadows at half-res (that is a good idea... upsampling with bilateral filtering or similar). It's possible to use hi-stencil and hi-z (on ps3, also depth range) in various ways to accelerate this.
  • Tune cascade shadow filtering to try to match the filter size across different resolutions (that's to say, filter the far cascades less).
  • Shadow a pixel using the best cascade that contains that pixel, instead of relying on the frustum split planes (this makes it a bit harder to fade between cascades, but not too much). Use scissors or clipping planes to avoid rendering stuff that is already rendered in previous cascades into the coarser ones. Microsoft has a pair of nice articles.
  • Compute light near-far planes to be tight around the frustum but avoid culling objects before the near plane (a.k.a. "pancaking": clamp depth in the vertex shader. It's not a big deal as the projection is orthographic, but it can screw up self-shadowing of such clipped objects, so you need to give a bit of "buffer" space to the near plane). The downside is that you get more raster pressure as the hi-z will not reject the objects that are compressed on the near plane... you can solve that either by giving a small linear range to the pancaked objects or by marking and using stencil/hi-stencil where they get drawn.
  • Cull small objects aggressively from distant cascades. Avoid rendering objects in far cascades if they were rendered completely in the previous ones.
  • Pack shadowmaps! Do not render things behind the frustum and maximize the area in front of it! This and this article have some good ideas. You can also pack two shadowmaps into a two-channel 16-bit target if double-depth fill is not giving you a big speedup.
Still, after doing all this, you might end up needing more performance...
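
As a rough illustration of the "best cascade that contains the pixel" idea from the list above, here is a small sketch. It assumes cascades are described by hypothetical `(min_x, min_y, size)` windows in light space, ordered from finest to coarsest, and that the point is already in light space. The `margin` parameter is my own addition, meant to leave room for the filter kernel near the window border.

```python
def select_cascade(light_xy, cascades, margin=0.0):
    """Return (index, (u, v)) for the finest cascade containing the point.

    cascades: list of (min_x, min_y, size) windows, finest first.
    Returns None when no cascade covers the point.
    """
    x, y = light_xy
    for index, (min_x, min_y, size) in enumerate(cascades):
        u = (x - min_x) / size
        v = (y - min_y) / size
        inside_u = margin <= u <= 1.0 - margin
        inside_v = margin <= v <= 1.0 - margin
        if inside_u and inside_v:
            return index, (u, v)
    return None
```

In a real renderer this test would of course live in the shader that resolves the deferred shadow buffer; the Python version only shows the selection logic.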


I'm playing Crysis 2. Nice game, it starts a bit weak with an overly forced story but it improves A LOT later on. Graphically it's great, as I'm sure you've all noticed. Long story short, I still probably love Modern Warfare and Red Dead a bit more, but it does not disappoint. Somehow the art direction in Crysis 2 looks a bit "hyperrealistic" to me most of the time, with very soft and exaggerated ambient fill, accentuated even more by the huge bloom. But well, technically it is impressive and it is surely a good game.

Now of course if you're a rendering engineer, the first thing you do with such a game is to walk slowly everywhere and check out the rendering techniques. And so did I. Some notes:
  • LODs pop noticeably, small objects are faded out pretty aggressively. Still, during "normal" gameplay it's not too evident.
  • DOF is pretty smart. It seems to filter with a "ring" pattern that I guess is both an optimization and a way to simulate bokeh. It looks like what you get from a catadioptric mirror lens, but it's also reasonable because most lenses will have a sharp out of focus either before or after the focal plane, as the bokeh shape of one is the inverse of the other (so if a lens has a nice gaussian-like out of focus after the focal plane, it will get a harsh negative-gaussian one before). It also manages to correctly blur objects before the focal plane, kudos for that.
  • Huge screenspace bloom/lens flares.
  • Motion blur (camera only?)
  • Decent post-filtering AA, even if with some defects (ghosting of objects in motion), not the best I've seen but good.
  • Shadows. Stable CSM. A weird circular filter is applied to them. No fading between cascades. A dithering pattern that seems to be linked to the light space. Far cascades are updated every other frame.
Ok. So the last item caught my attention. How to do that? Well, it's not that hard if you think about it. If you observe the update of the CSM, you'll notice that even when you rotate the view your far cascades move only by a few texels, so we could just add a bit of space there and assume that updating these cascades every other frame won't create problems.


But what if we want to be accurate? Well, it turns out it's not really hard at all! We know which window we rendered last frame, and where we should render this frame. Most of the new frame was already rendered in the last one, so we could just shift the data into the right place.

It turns out we don't even need that: if we want to apply this incremental update only once and then re-render, we can just shift the "zero" of the shadowmap uv and wrap. We still need to render the new data and resolve it, but that is only a border a few texels wide! Even culling the objects to render only the ones that fall in that border is really trivial.
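
A minimal sketch of the bookkeeping, assuming the window origin is tracked in whole world-space texels (which the stable snapping guarantees) and the shadowmap is a square of `resolution` texels. Function names and the rectangle convention are illustrative assumptions of mine.

```python
def physical_texel(world_texel, resolution):
    """Wrapped addressing: a world texel always maps to the same physical texel."""
    return world_texel[0] % resolution, world_texel[1] % resolution

def dirty_strips(old_origin, new_origin, resolution):
    """World-texel rectangles (x0, y0, x1, y1), half-open, that need rendering."""
    ox, oy = old_origin
    nx, ny = new_origin
    dx, dy = nx - ox, ny - oy
    if abs(dx) >= resolution or abs(dy) >= resolution:
        # The windows do not overlap at all: nothing can be reused.
        return [(nx, ny, nx + resolution, ny + resolution)]
    strips = []
    # Vertical strip: columns that entered the window, over its full height.
    if dx > 0:
        strips.append((ox + resolution, ny, nx + resolution, ny + resolution))
    elif dx < 0:
        strips.append((nx, ny, ox, ny + resolution))
    # Horizontal strip: new rows, limited to the columns both windows share
    # so that the corner is not rendered twice.
    x0 = max(nx, ox)
    x1 = min(nx, ox) + resolution
    if dy > 0:
        strips.append((x0, oy + resolution, x1, ny + resolution))
    elif dy < 0:
        strips.append((x0, ny, x1, oy))
    return strips
```

Because of the wrap, a strip can straddle the physical edge of the texture, in which case it has to be split into two viewports (or scissor rects) before rendering.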

Really, we could do an incremental update for every cascade... forever! If it wasn't for two things: moving objects, and the fact that we can't fix our cascade (light) near/far z, as to maximize the resolution we usually need to fit it each frame (or so).

We could alleviate the latter problem by having the "shifting" shader also re-range the "cached" last frame data into the new near/far range. The moving objects problem can be solved by rendering them into a separate buffer or into a copy of the buffer. Both solutions though need more memory and bandwidth (resolve time on 360), so they are good only if that is not already a major bottleneck (that's to say, if you packed your cascades well).
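
The re-ranging itself is just a linear remap. Here is a sketch, assuming the shadowmap stores depth normalized linearly between the light near and far planes (which holds for an orthographic projection). The function name and the clamping to [0, 1] are my own assumptions.

```python
def rerange_depth(depth, old_near, old_far, new_near, new_far):
    """Remap a cached normalized depth from the old light range to the new one."""
    # Back to a distance along the light direction...
    light_z = old_near + depth * (old_far - old_near)
    # ...then normalize against the new near/far planes.
    remapped = (light_z - new_near) / (new_far - new_near)
    # Values that fall outside the new range are clamped, like pancaking does.
    return min(max(remapped, 0.0), 1.0)
```

Keep in mind that each remap requantizes the stored depth, so repeating it over many frames would slowly accumulate error.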

26 March, 2011

Debugging DirectX9 is so stressful!

I've been working on console games for the past five years, so I don't know much about the PC tech these days. Now I'm on a console/pc title and I just had to debug the PC build.

Oh. My. God.

I got so stressed I actually almost got sick that night. And then yes, it turned out to be a really small bug that I could totally have debugged more easily without hooking up these tools anyway.

Pix for Windows is a joke, but still it's the best tool I tried... It's a bit better on DX10/11 (faster refresh).

NVidia perfHud is rather useless (even if I hear it's better than Pix for profiling, which I believe, as Pix currently is unable to do any profiling at all) and the Intel GPA thing did not seem to really work at all (it took 20 minutes just to load the capture and it gave me some weird results; still, it looks better than Pix, it's promising I guess). Update: newer versions of GPA seem to work fine, and actually it's now my preferred DX9 tool!

ApiTrace is a new tool which might be good... I had a look at one of the early versions which did not work for DX9, now it seems to have added support for all APIs...

ATI has a Gpu PerfStudio thing which is decent, but it deprecated DX9, the current version is for DX10/11 only.

For some things I would even say the old 3d reaper (or ripper) and DXexplorer are better tools!

I really fucking hope that the new Nvidia Parallel NSight (a.k.a. Nexus) and ATI Gpu PerfStudio 2 are great, I could not try them as they're dx10/11 only and I'm currently on dx9. Overall it really shows how much the industry is committed to PC these days...

15 March, 2011

DOF Test

Scaled to 300% to ease viewing
It does motion blur as well. Guess how many ms on 360 :)

09 March, 2011

Do you have "failed builds"?

Sometimes we are so used to our industry workflows that we "accept" things that are terribly wrong without questioning them anymore. It's like when the media start broadcasting false facts, or misusing words, and slowly the wrong becomes right.

What does it mean that a "build failed"? An entire build fails, catastrophically? Not even a single source file compiled? Or maybe it's only a bit of the frontend that did not compile? Or a single art asset? It's like saying that a car does not work just because the air conditioning does not turn on.

Ban the "broken build" concept. Ban "game crashes". The audio system failed? Well I guess we have a build of the game without the audio. The rendering crashes? Well I guess we have to disable that (and maybe use a minimal "debug" version instead i.e. animation skeletons and collision meshes).

A game is a complex collection of components. So why, if just one component does not work, do we consider the entire thing "bad"? Decouple, my friend!

07 March, 2011

Tell the internet that you're not a moron...

...because it will assume you are. Especially if you work on a franchise iteration and you change anything.

Fight Night Champion went from 60fps to 30fps. The most generous reaction among professional reviewers is that it was a step back done in order to have better lighting and graphics. Most of the general internet public (or the part of it that is vocal, on the forums and comment sections of websites) just took it as a downgrade impacting both graphics and gameplay.

Of course none of this is true. Fight Night Round 4 was already a game with very highly rated graphics, so there would have been no need to impact the fluidity of the gameplay in order to have even better lighting.

The lighting was designed from day zero to be able to run at 60fps. Going to 30 in gameplay does not really buy us much, as the worst-case performance scenario was the non-interactive sequences, which were 30fps in Round 4 too.

At some point during pre-production, we started building tests for 30fps gameplay: first videos in After Effects (adding motion blur via optical flow), then, after these proved to be interesting, a prototype in game and blind testing.

Most of our testers and producers liked the gameplay of the 30fps with motion blur version better than the 60fps one. Note that the game itself still runs at 60 (120hz for the physics). Even our users think the same: most did notice that now the punches "feel" more powerful and the game more "cinematic".

The motion blur implementation itself is extremely good, blurring correctly outside the silhouettes of the skinned characters. To the point that when we photoshopped the blur effect into some early screenshots, we were not really able to achieve an effect as good as the real in-game one.

Still, when you release technical details that no one really understands, people just assume that you're a moron and that they know better. They like the feeling of the new game better, but they hate the 30fps rating... This is just an example, but it happens all the time for many features that you change.

Bottom line? Change things, but be bold about them and take responsibility. Show what you've done and why, show people that you're not a moron, that you tried everything they are thinking of doing plus more, and made your choices for some real, solid reasons. Otherwise the internet will just assume you're a moron...

P.S. This is just my view as a developer who cares about quality and makes choices in order to maximize quality. To tell the truth I don't think that quality matters to a company per se. What matters is the kind of quality that sells. That's to say, you can do all the blind testing in the world and be 100% sure that a given choice is the best quality-wise, then you go out and people just don't buy it. Quality does not always sell, and sometimes the worst choice is the most popular (and by popular I mean in sales, not in the internet chatter, i.e. see how much hate there is for Call of Duty on the net and how much it sells). Now for that side of things, that's to say marketing, I don't know anything. Obviously FNC shipped at 30fps so marketing thought it was ok, but I don't have any data nor experience. This other blog post might shed some light...

02 March, 2011

Alternatives to object handles

I'm tired, so I'll write this as a "note to self" (even more than how I usually do here). 

Hot-swapping (and similar concepts) is often implemented with object handles and a "manager". Refcounting is often used to manage object lifetime.

A possible implementation is to have all the resources stored in the manager, let's say in an array, and the handles could be the index into this array. Or the handle could be a pointer to a shared object that points to the resource on the heap (same as C++ TR1 shared_ptr, but sharing also the pointer, not only the refcount) and the manager can just have an array of pointers to the resources on the heap.
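
For reference, the array-plus-index flavour could look something like this sketch. The class and method names are hypothetical, and Python is used only to show the logic; a real implementation would be C++ with the same structure.

```python
class ResourceManager:
    """Resources live in an array; a handle is just an index into it."""

    def __init__(self):
        self._resources = []
        self._refcounts = []
        self._free_slots = []

    def add(self, resource):
        """Store a resource and return a handle holding one reference."""
        if self._free_slots:
            handle = self._free_slots.pop()
            self._resources[handle] = resource
            self._refcounts[handle] = 1
        else:
            handle = len(self._resources)
            self._resources.append(resource)
            self._refcounts.append(1)
        return handle

    def acquire(self, handle):
        self._refcounts[handle] += 1
        return handle

    def release(self, handle):
        self._refcounts[handle] -= 1
        if self._refcounts[handle] == 0:
            self._resources[handle] = None
            self._free_slots.append(handle)

    def get(self, handle):
        """The extra indirection paid on every access."""
        return self._resources[handle]

    def hot_swap(self, handle, new_resource):
        """Every holder of the handle sees the new resource on its next get()."""
        self._resources[handle] = new_resource
```

Hot-swapping is a single array write here, which is exactly why this layout is popular; the price is the `get()` lookup on every use.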

This works well, but it introduces an indirection that will cause cache misses every time we have to access the actual resource, especially if the resource is small and could be stored directly in the structures that need it, for example shader constants that can be represented just with the GPU pointer that contains the data to be set in the ring buffer (i.e. NVidia OpenGL bindless API see

Also, in general reference counting is not the best idea to manage object lifetime. It leaks memory if we have cyclic references, leaks that are not hard to detect with a debugging allocator but that can be complex to fix. It's also not too fast when destructing objects, as an object destruction can trigger a chain of RC destructions or RC decrements that can cause more cache misses.

  • Storing the objects to be "patched" in the manager. We avoid handles, but when a pointer to a resource is obtained, the location of the pointer is stored in the manager (via a multimap: resource pointer to list of locations that point to it).
    • Pros:
      • No extra indirection, nor extra space used in the objects that hold the resource (other than the resource itself)
      • Still it can be wrapped in something having the interface of a shared_ptr, so the solution allows to go back to a more standard object handle implementation if wanted.
      • Can move resources in memory easily! Sort them to have cache-coherent access (if possible)
      • It's even an alternative to intrusive_ptr (Boost), which is a good (performance-wise) way of doing RC (without hot-swapping) that also does not incur a space penalty in the objects that point to resources, but that can only be used to point to objects that implement a given interface.
    • Cons:
      • It will have way worse performance every time you change one of the pointers to a resource!
      • Requires a complex data structure, difficult to implement and balance.
      • In general it will require more space.
      • It will patch objects... not the cleanest thing... nor the most robust: you can easily forget to register a temporary copy of a handle in the manager, or you really have to be sure that no such temporaries exist when you swap things...
      • It will have worse performance on object creation and most probably destruction (handles and RC are not great at destruction either; it's hard to say because they can trigger a chain of RC destructions or decrements that results in cache misses).
      • It will be harder and slower to make thread-safe, even assuming that the hotswapping happens while the application is single-threaded.
  • "Hardcoded" Garbage Collection. Very similar to having an "update" function that triggers the caching of resource pointers to resource caches in the various classes, but without the need of having to waste space... Every class that holds references to GC objects implements a "walking" function that will mark them. Every GC object will need a flag to check if it was marked. We still need a manager with a list of all the GC objects.
    • Pros:
      • Safe even in presence of cycles.
      • Not too much of an extra burden if you're writing methods in every class for things like reflection (actually, if you have any form of reflection of the fields in your objects, this will be "free") or serialization...
      • At no extra cost you also will know which objects are holding references to your resources...
    • Cons:
      • Changing a resource is almost as expensive as a GC, we will need to walk through all the references!
      • Different interface, can't change your mind easily.
      • Error-prone: you can easily forget to update one of the tracking functions, and you still have to make sure you never create untracked copies of a handle, i.e. a temporary (or well, you have to be sure none of them are around by the time you execute the GC/hot swapping).
      • You will need to be a genius to have the GC or the reference-updating process run in parallel or incrementally.
  • List of references. Similar to the first idea, but instead of the multimap we just store a list of all the locations that contain resources. When we have to change them, we go through the whole list and see which ones need patching.
    • Pros:
      • No extra indirection, nor extra space used in the objects that hold the resource (other than the resource itself)
      • Still it can be wrapped in something having the interface of a shared_ptr, so the solution allows to go back to a more standard object handle implementation if wanted.
    • Cons:
      • Changing a resource is still expensive, we will need to walk through all the references.
      • We probably still would need to maintain a separate list of resources, other than the list of patching locations, if we need to be able to enumerate and find them... In general it will require more space.
      • It will patch objects... not the most clean thing... see above...
      • Again, harder and slower to make thread-safe
  • Permutation array. We add a second indirection level, to be able to sort resources to cause fewer cache misses. A handle is an index into an array of indices (permutation) into the array that stores the resources.
    • Pros:
      • Resources can be easily moved around in memory, and be sorted to be cache-friendly.
    • Cons:
      • Many times we don't have a coherent memory access order...
      • Many times the resources are actually small, so the array of indices won't cache-miss less than the actual array of resources!
      • Yes, this one is not a great idea most of the time...
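
To make the first alternative in the list (storing the locations to be "patched" in the manager) a bit more tangible, here is a rough sketch. Python has no raw pointers, so a "location" is modelled as a hypothetical `(holder, field)` pair, and the multimap is keyed by the identity of the resource. `PatchRegistry`, `Material` and their methods are invented for illustration only.

```python
from collections import defaultdict

class Material:
    """Example holder: stores the resource directly, with no handle in between."""

    def __init__(self):
        self.texture = None

class PatchRegistry:
    """Multimap from a resource to every location that points at it."""

    def __init__(self):
        # id(resource) -> list of (holder, field) locations
        self._locations = defaultdict(list)

    def bind(self, holder, field, resource):
        """Write the pointer and register its location (the costly part)."""
        old = getattr(holder, field, None)
        if old is not None:
            # Assumes every non-None pointer was written through bind().
            self._locations[id(old)].remove((holder, field))
        setattr(holder, field, resource)
        self._locations[id(resource)].append((holder, field))

    def hot_swap(self, old, new):
        """Patch every registered location that still points at `old`."""
        for holder, field in self._locations.pop(id(old), []):
            setattr(holder, field, new)
            self._locations[id(new)].append((holder, field))
```

Reading `material.texture` is now a plain field access, while every pointer change goes through `bind()` and pays for the multimap update, which is the trade-off described in the cons above. Any copy of the pointer made outside `bind()` is invisible to `hot_swap()`.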
    But maybe this is all wrong, as I said, I'm tired. I'll think about it more another day... Comments as always are welcome.

    p.s. Hybrids are also possible I guess... Like dividing the locations roughly into buckets (i.e. depending on part of the msb of the pointed resource) thus trying to minimize the cost of mutating pointers in the first solution (list add and delete would be needed only if the pointer changes to a resource in a different bucket).