C0DE517E: 09.11

30 September, 2011

Rendering 101

This is how I explain (videogame) rendering to non-rendering (engineers... even if hopefully it's still understandable for some non-programmers as well).

If you have suggestions on how to make it more clear, post a comment!

28 September, 2011

Fight Night Champion Rendering @ GDC

I wrote so much for Fight Night Champion over the years, but not much can be found on the internet. I crafted at least five presentations for it, three mostly internal for the team on Fight Night 4 analysis and pre-production, one on the various prototypes and ideas done to exit pre-production, to share ideas with the other EA teams, and one for GDC after the game was done, which was refined and presented by my (awesome!) colleague Vicky Ferguson when I left EA (you should see the recording of her presentation, and she also presented for FN4 at the previous GDC which was very cool too).

[update] The presentation is now public!

Unfortunately the slides of this last one were not published (a couple could be found here, but they are pretty useless)... until now.

At the beginning I thought about writing more detalied blog articles about many of the experiments that were done, but many of them do not really need that much space, some others were rediscovered and published by others (i.e. pre-blurred SSS, bent normals SSAO), and I think FNC was more about the process than the final tech (in fact, the lengthy presentation never goes into the detail of what at the end were the final FNC shaders!), so...

This is a "director's cut" of the GDC slides, it's way more verbose (not meant for presentation) and with much more material.

Enjoy! [update: moved out of Scribd]

[update] Bonus links:

27 September, 2011

Bad ideas don't require much explanation: Caching Stable Cascaded Shadowmaps

Note:

Ok, some explanation (as I was asked recently about this). The idea is that for static CSM and static objects, you just shift every frame your fixed "window" into an "infinite" plane. So why don't we just do that in the actual CSM? We can shift the old data (or just implicitly shift by wrapping around and keeping an origin marker in the shadowmap coordinates) and just render the tiny strip that corresponds to the part of the window that is novel (corresponding to the intersection between the old clip volume and the new one).

Culling is easy too, it's just a boolean test between the two volumes. The only real issue is that the near-far planes change every frame in a stable CSM, thus the old shadow zbuffer data would be in a different coordinate space than the new one (not only a shift but also a change in the projection matrix).

This can be solved in a number of ways, the worse is to "reinterpret" the old values in the new projection as that loses precision. The best would probably be to write an index in the stencil part and use that index to address an array keeping the last 256 near-far values for the previous frames. 256 values will go pretty fast, so you might need to re-render the entire shadowmap when you're out of them... But really I think there could be stupid tricks to avoid that, like adding some "borders" to the CSM and shifting it every N-texels instead of every single one. Or computing the near-far for volumes that are snapped at shadowmap-sized positions, so every actual shadowmap rendered would at worse interesect four of them, thus requiring a maximum of four stencil indices...

Let me say that again. Our stable CSM shadow is just a fixed-size, shifting window inside a infinite projection plane aligned with our light direction, with a near-far clip that corresponds to the min-max interesection between the projection window and the entire scene. Now we can imagine that we precompute the near-far values (we won't but, let's imagine) in a grid sized as the shadow window. Then, no matter how we shift the window in this infinite plane, it will intersect only four (maxium) near-far precomputed grid cells... Thus, four values in the stencil are all that we need to index cached z-values.

20 September, 2011

Mathematica and Spherical Harmonics

As my previous post about Mathematica seemed to be well-received, I've decided to dig some old code, add some comments and post it here. Unfortunately it's littered with \[symbol] tags as in Mathematica I used some symbols for variables and shortcuts (which you can enter either in that form or as esc-symbol-esc). You can also see a PDF version of the notebook here, with proper formatting. Enjoy!

A function and its SH approximation

(* Normalization part of spherical harmonics *)
shNormalizationCoeffs[l_, m_] := Sqrt[((2*l + 1)*(l - m)!)/(4*Pi*(l + m)!)]

(* Evaluates to a function of \[Theta],\[Phi] for a given degree l and order m, it's defined as three different cases for m=0, m<0, and m>0*)
shGetFn[l_, m_] := Simplify[Piecewise[{{shNormalizationCoeffs[l, 0]*LegendreP[l, 0, Cos[\[Theta]]], m == 0}, {Sqrt[2]*shNormalizationCoeffs[l, m]*Cos[m*\[Phi]]*LegendreP[l, m, Cos[\[Theta]]], m > 0}, {Sqrt[2]*shNormalizationCoeffs[l, -m]*Sin[-m*\[Phi]]*LegendreP[l, -m, Cos[\[Theta]]], m < 0}}]]

(* Indices for a SH of a given degree, applies a function which creates a range from -x to x to every element of the range 0...l, return a list of lists. Note that body& is one of Mathematica's way to express pure function, with parameters #1,#2... fn/@list is the shorthand for Map[fn,list] *)
shIndices[l_] := (Range[-#1, #1] &) /@ Range[0, l]

(* For each element of the shIndices list, it replaces the corresponding shGetFn *)
(* This is tricky. MapIndexed takes a function of two parameters: element of the list and index in the list. Our function is itself a function applied to a list, as our elements are lists (shIndices is a list of lists) *)
shFunctions[l_] := MapIndexed[{list, currLevel} \[Function] ((m \[Function] shGetFn[currLevel - 1, m]) /@ list), shIndices[l]]

(* Generates SH coefficients of a given function fn of \[Theta],\[Phi], it needs a list of SH bases obtained from shFunctions,it will perform spherical integration between fn and each of the SH functions *)
shGenCoeffs[shfns_, fn_] := Map[Integrate[#1*fn[\[Theta], \[Phi]]*Sin[\[Theta]], {\[Theta], 0, Pi}, {\[Phi], 0, 2*Pi}] &, shfns]

(* From SH coefficients and shFunctions it will generate a function of \[Theta],\[Phi] which is the SH representation of the given coefficients. Note the use of assumptions over \[Theta] and \[Phi] passed as options to Simplify to be able to reduce the function correctly, @@ is the shorthand of Apply[fn,params] *)
angleVarsDomain = {Element[\[Theta], Reals], Element[\[Phi], Reals], \[Theta] >= 0, \[Phi] >= 0, \[Theta] <=
Pi, \[Phi] >= 2*Pi};
shReconstruct[shfns_, shcoeffs_] := Simplify[Plus @@ (Flatten[shcoeffs]*Flatten[shfns]), Assumptions -> angleVarsDomain]

(* Let's test what we have so far *)
testNumLevels = 2;
shfns = shFunctions[testNumLevels]
testFn[\[Theta]_, \[Phi]_] := Cos[\[Theta]]^10*UnitStep[Cos[\[Theta]]] (* Simple, symmetric around the z-axis *)

(* generate coefficients and reconstructed SH function *)
testFnCoeffs = shGenCoeffs[shfns, testFn]
testFnSH = {\[Theta], \[Phi]} \[Function]Evaluate[shReconstruct[shfns, testFnCoeffs]]

(* plot original and reconstruction *)
SphericalPlot3D[{testFn[\[Theta], \[Phi]], testFnSH[\[Theta], \[Phi]]}, {\[Theta], 0,
Pi}, {\[Phi], 0, 2*Pi}, Mesh -> False, PlotRange -> Full]

(* Checks if a given set of coefficients corresponds to zonal harmonics *)
shIsZonal[shcoeffs_, l_] := Plus @@ (Flaten[shIndices[l]]*Flatten[shcoeffs]) == 0

(* Some utility functions *)
shSymConvolveNormCoeffs[l_] := MapIndexed[{list, currLevel} \[Function] Table[Sqrt[4*Pi/(2*currLevel + 1)], {Length[list]}], shIndices[l]]
shExtractSymCoeffs[shcoeffs_] := Table[#1[[Ceiling[Length[#1]/2]]], {Length[#1]}] & /@ shcoeffs

(* Convolution with a kernel expressed via zonal harmonics, symmetric around the z-axis *)
shSymConvolution[shcoeffs_, shsymkerncoeffs_, l_] := (Check[shIsZonal[shsymkerncoeffs], err];
shSymConvolveNormCoeffs[l]*shcoeffs*shExtractSymCoeffs[shsymkerncoeffs])

(* Another test *)
testFn2[\[Theta]_, \[Phi]_] := UnitStep[Cos[\[Theta]]*Sin[\[Phi]]] (* asymmetric *)
testFn2Coeffs = shGenCoeffs[shfns, testFn2]
testFn2SH = {\[Theta], \[Phi]} \[Function]Evaluate[shReconstruct[shfns, testFn2Coeffs]]
plotFn2 = SphericalPlot3D[testFn2[\[Theta], \[Phi]], {\[Theta], 0, Pi}, {\[Phi], 0, 2*Pi}, Mesh -> False, PlotRange -> Full]
plotFn2SH = SphericalPlot3D[testFn2SH[\[Theta], \[Phi]], {\[Theta], 0, Pi}, {\[Phi], 0, 2*Pi}, Mesh -> False, PlotRange -> Full]
Show[plotFn2, plotFn2SH]

(* Test convolution *)
shIsZonal[testFnCoeffs, testNumLevels]
testConvolvedCoeffs = shSymConvolution[testFn2Coeffs, testFnCoeffs, testNumLevels]
testFnConvolvedSH = {\[Theta], \[Phi]} \[Function] Evaluate[shReconstruct[shfns, testConvolvedCoeffs]]
plotConvolvedSH = SphericalPlot3D[testFnConvolvedSH[\[Theta], \[Phi]], {\[Theta], 0, Pi}, {\[Phi], 0, 2*Pi}, Mesh -> False, PlotRange -> Full]

19 September, 2011

Raytracing Myths

I've already blogged about raytracing and rasterization, a couple of times, and I guess I should stop. Still, duty calls when I read statements such as "rasterization cannot be parallelized nearly as effectively as ray tracing".

Let's go!

Algorithms

Let's sketch both algorithms in some sort of pseudo-accurate pseudo-code:

Raytracing (first-hit)

for_each pixel px on screen { for_each primitive pr in scene { intersect ray(px) with pr }}

Rasterization

for_each primitive pr in scene { for_each pixel px covered by pr { mark px on pr }}

Pretty similar right? The inner loops are different, one intersects a ray equation with the analytic description of a primitive, the other in some way walks on the primitive marking pixels covered by the primitive (i.e. for triangles, by scanline interpolation or by walking on the pixels of the bounding box checking the edge equations).

Visible Pixels: there is a fundamental difference here that we will disregard for the future, as it's written raytracing has to walk on all the pixels of the screen, while rasterization walks all the pixels covered by primitives, thus is only part of the screen is covered raytracing has a disadvantage. The same could be said of rasterization if primitives have large parts that off-screen. This is true but not interesting, rasterization could solve the problem by clipping (moving the complexity from pixels to objects at least) and raytracing could subdivide the screen and identify empty tiles, anyhow this does not change the problem in general and does not change the complexity.

Also, some of you might have noticed that the description I've made is overly simplistic. What I've sketched does not deal in any way with depth hiding. Both algorithms need to keep track of which primitive is the closest for every pixel. Fundamentally it's the same operation, but it has a different memory cost due to the ordering of the operations: raytracing can keep a single depth while scanning every primitive of a pixel and understand which primitive is the closest to the camera, rasterization needs to keep a value for each pixel as we will know which is the closest only when the outer loop is done (z-buffer!).

Raytracing (first-hit)

for_each pixel px on screen { for_each primitive pr in scene { intersect ray(px) with pr, keep nearest depth }}

Rasterization

zbuffer = list of depths

for_each primitive pr in scene { for_each pixel px covered by pr { mark px on pr, store nearest in zbuffer }}

Complexity

So memory-wise rasterization is worse than raytracing! We have to keep a z-buffer! Not so fast... Let's assume that we have a huge scene to render. Maybe we want even to stream stuff. What could we do with both algorithms? It looks like we would need to have around, in memory, the contents of the inner loop, while we could stream over what's in the outer loop. It's another interesting difference that originates from the same fundamental fact, the two loops are inverted. The z-buffer can be nifty because we can stream primitives through it, no matter how huge the scene is, we can go primitive by primitive and keeping only the z-buffer around we can render a scene. Raytracing on the other hand, has to keep primitives around.

Now we can of course imagine variants for both to be more efficient, z-buffer can be subdivided in tiles (or, for other reasons that are thought still about memory efficiency, we can use a hierarchy), primitives in raytracing can similarly partitioned and aggregated using bounding volumes and so on. It does not really change that we don't have a winner, we have the same computation expressed in a different order, which leds to different trade-offs.

What about the compute time? It's evident that from what I've sketched, it's identical. But raytracing is O(log(primitives)) right? At least using spatial subdivision structures and disregarding the complexity of building them (which is often more than linear, thus making from the beginning complexity arguments rather moot)...It might not be obvious how that is not true, so let's make a couple of examples, hopefully these will be enough to generalize to any subdivision structure.

A scene with two group of primitives, one on screen, one off screen. These are bound by a box each, and the two boxes are bound with another box, thus creating a box hierarchy.

What would the raytracer do? It would for every pixel on screen check the outer box, for each pixel intersecting it it would check then inner boxes, one will fail always and not trigger the check with the inner primitives, the other will not and thus we will end up checking only one of the groups.

Rasterization? We go through all the pixels of the outer box, we see that something is on screen so we go through all the pixels of the inner boxes, one will be on screen, the other won't and we will render only one group of inner primitives.

Other than the tradeoff between "for every pixel on screen" and "for every pixel covered by primitive" - the "visible pixels" problem that I already said that we would disregard, it's the same thing.

But that is about screen coverage in two of the dimensions. What about depth complexity?

A sparial subdivision structure with a raytracer provides a way of having guarantees on the depth ordering. For each ray, if we find an intersection in a given depth of the subdivision, we don't have to proceed any further, we know that all the primitives in lower depths are hidden for that ray. Or to put it in another way, for all the nodes that are fully covered by previous node's primitives, we don't do any work.

Doesn't that happen also for rasterization? Let's persuade ourselves of this fact. Assume that we have a depth-only hierarchy, our scene is split into different "chunks" along the depth, each chunk is defined by a splitting plane and contains a primitive. Now, let's say that the primitive of the first chunk covers the entire screen. Obviously in a raytracer we would trace that and stop, considering once all the pixels of the first plane, then once over all the pixels of the primitive, and then ending. In a rasterizer, we would rasterize the plane, then the primitive, then the following plane, find it all occluded and end. A bit more work, but constant, no matter how deep the chunk with the all-covering primitive is we stop always analyzing one more plane, so it's not a concern complexity wise.

What if the primitive does not cover the entire screen though? What if it leaves exactly one pixel not covered? The raytracer would then go and check the following chunks against that single pixel only until a primitive covers it. The rasterizer would rasterize all the pixels in the planes every time it need to check, until the primitive covers it. It seems that we have a fundamental difference here, we could invoke again the "visible pixels" principle, but we would like a way to make the occlusion test resolution independent for the rasterizer to have the same cost.

To be clear, I'm not advocating the use of BVHs and z-buffer pyramids in your game engine today. I'm just saying you could, so the big algorithmic complexity argument about the doom day when rasterization will not be viable anymore are wrong. By the way, it won't even be a very impractical method good only on the paper (like the constant-time raytracing stufff), it would probably be a good idea in some heavily occluded scenes...

Luckily depth complexity in practice is not even a huge deal for rasterizers, for example on a GPU. How comes? Overdraw should kill us, how do we avoid it? We can because a GPU does not need to rasterize fully covered primitives, if we have a front-to back ordering we can save most of the cost by checking our primitives against a coarse z-buffer, before shading. In a GPU shading is the big cost, and so we just do an early depth test and live with that, but in theory this works in general, we can make the occlusion cost logarithmic on the pixels by subdividing the z-buffer grid in a hierarchy and in fact if we're interested in occlusion culling and not on shading, this is a good idea (see HOMs).

In the real world, if we want to talk outside the theory and useless statements about big-O complexity, both techniques need to deal with primitives and occlusions and they are if not equally good, very close, with both being able to handle scenes of immense complexity and out-of-core rendering. Both can be linear, logN or whatever, it's only that some techniques make more sense in some situations, so in practice most raytracers are logN with N^2 or more subdivision structure build times, rasterizers are mostly linear, both live happily with that on equally complex scenes.

Ironically in some way, raytracing which is said to be less dependent on primitive counts, absolutely needs accelleration structures for all but the simplest scenes, no raytracer goes without one, that's because you need them not only for depth complexity, but also to deal with the screen-space one (I guess you could cull primitives, LOD and tile, but then it would be so very close to a rasterizer not to be really smart, so really no one does that).
Rasterization on the other hand performs great on a wider range of scenes and needs z-buffer hierarchies, level of detail and other complex occlusion strategies only when the shading costs are negligible and the scenes have a lot of depth complexity. Both can do amazing things.

Parallelism and Concurrency

Enter the second big topic, parallelism and concurrency. Raytracing is easier to parallize says the myth - because each ray is independent. Is that true?

I can honestly see where this myth comes from, it even has a bit of truth in it. I picture someone in a university, writing a raytracer for fun or for a course and finding that's easy to fork at the ray-generation lever or a parallel_for over some worker threads and it works, because the inner loop is indeed independent, does not rely on shared state. A rasterizer does, for the z-buffer, so it would require some form of syncronization. Not the end of the world, you could have a z-buffer per thread and merge, you could have some a lockless stuff per pixel, tiles or whatever, but still it requires some changes.

So is it easier? In that sense yes. But let's reason as a more experienced engineer (or scientist) and not as a freshmen. We see two nested loops, we are on a modern CPU architecture, what do we do? Instinctively (at least, I hope) we unroll and parallelize the inner (instruction parallel), and try to make concurrent the outer (data parallel). In this respect, how do the two algorithms fare?

These two, powerful forms of parallelism are absolutely fundamental to achieve great throughput on modern architectures as it provides them with a lot of coherent computation. Modern CPU execution times are dominated by latencies. Instruction latencies and dependencies and even more nowadays, memory latencies. And this won't change! Latency is hidden by unrolling, by having enough "predictable" computation (and memory access). If we have an operation A1 and a subsequent B1, with B1 dependent on A1 and A1 taking 10 cycles of latency, if we can find nine other instructions between A1 and B1 we're safe, we "hid" the latency. If we can unroll a cycle ten times over data elements 1...10, then we can trivially do A1,A2,A3,... then B1,B2... and we're done.

Now in theory, they are both good. A raytracer inner loop intersect a ray with a number of primitive, a rasterizer walks on a primitive looking for pixels. If the primitives are uniform, both are easy to unroll and play well with SIMD, we can intersect a ray with i.e. four primitives at a time, or in a similar way rasterize four pixels of a scanline (or a 2x2 block...) together.

The outer loop follows the same reasoning we did before, and indeed it's a bit easier on a raytracer. So again a win? Yes if we limit ourselves to a case that does not (as I wrote before) exist: a raytracer without a spatial acceleration structure and shading. And if we don't consider that accessing primitives in the inner loop is way more memory intensive than walking on pixels.

If we write things in "theory", or rather, with a theory based on an over-simplistic model raytracing wins... If we take a realistic renderer on the other hand, things change.

Spatial acceleration breaks uniformity. Shading of different primitives at the same time leads to incoherent memory access. Same for neighboring rays intersecting different groups of primitives. We can have a ray at level zero of the hierarchy and the neighboring one at level ten. A ray might need shading, while the next one might be still traversing. It's a mess!

And not an easy one to fight against at all, how do we deal with that? Well it turns out, it's not that simple, and not fully solved yet.

One of the first approaches was to collect rays into the spatial structure cells and then be parallel to all the rays in a given cell. It works if you can generate enough rays, if you have still some coherency, i.e. for a distributed raytracer, less well for a path tracer. And still it deals only with part of the problem.

For a while the solution seemed to be to organize rays in "packets" that could walk the structures together, hoping they won't diverge much or trying to reconstruct some coherence by splitting and merging them. Then we found out that this works again for the kind of coherent loads that (we will see) are not really _that_ interesting for a raytracer, it will give you decently fast first-hit, shadows, the sort of stuff a raytracer is still good at. You can get some perfect mirrors on a sphere or so, how cool is that.

More recently, the first approach evolved in more general ray reordering strategies, while the second was turned sideways and experimented with having parallelism on a single ray by growing the width of the acceleration structures.

All this while on GPUs rasterizers have been executing stuff in parallel, where it matters, over tens and hundreds of computational units, for years now (and Reyes have been shading in parallel since the '82).

Ease of use

Simple algorithms are easy to express in both, and writing a bare-bones triangle rasterizer or sphere raytracer (using the primitives that are easier to map in both) is trivial. Maybe the raytracer will take a few lines less, but hardly anything that we care about.

For more complex effects, raytracer starts to win. Adding shadows, reflections. An univeristy student can have a lot of fun with a few lines of code, feeling happy. And from here on, if we never consider resource usage, raytracing always wins. You can solve the entire rendering equation (or so) with a path tracer in a hundred lines of code.

But when we make programs, in practice and not on paper, no matter what we have to deal with resources, they enter the equation. Then it becomes a matter of performance and how much do we care about it.

When I started programming, on a commodore 64, I knew only Basic and Assembly, and that was all I knew even when I moved to the Amiga (amos) and the PC (QuickBasic, PowerBasic), until one of my older cousing started university and I grabbed his K&R book. I loved to write graphical effects, and I had to "optimize" code by either tweaking the code to suit better the (to me unknown) logic of the underlying interpreter, or add inline assembly directly via machine code. Hardly easy, even if you might argue that for some things Basic was easier to learn and use than C (and I really liked the QuickBasic PDS IDE at the time...)

I think the complexity argument is rather dumb. Raytracing is less coherent and struggles with performance. Rasterization is more coherent and struggles with algorithms that need more flexibility in the visibility queries. Raytracing needs spatial accelleration structures and complex strategies to gain ray coherency. Rasterization needs fancy algorithms and approximations when it has to deal with global effects and more general visibility queries.

Conclusion

In practice if you've shipped a game, you already knew this. What do GPU use to render all these incredibly complex scenes over thousands of processors? What do we do on the other hand if we have to execute random visibility queries, for example for A.I.? Sometimes there are reasons why things are the way they are. The biggest trade-off between the two is surely that raytracing starts with incoherent visibility queries and research is trying to "find" patterns in them in order to gain performance, rasterization starts with coherent primitive walking and research is trying to find algorithms to compute global, complex illumination out of these fast massive queries we can issue (this is a nice example).

I liken it to programming languages. Dynamic versus Static, Pure versus Imperative. They are all Turing-complete and in theory equally capable at expressing computation. In practice though they start from different endpoints and try to reach the other, pure languages start safe and strive to express side-effects and program order, imperative start with side effects and strive to achieve safety, and so on.
And if there is something to learn from that (and hopefully there is not :D) is that performance in computer science has been always a stronger drive in the success of a technology than other considerations (we program, unfortunately, in C++ not in Haskell), probably because it's more important to achieve a given amount of quality in practice, than having something that in theory produces something with a higher quality but in practice does not deliver.

Going back to the article that "spawned" all this. Is raytracing the future? Sure it is, it's surely "in" the future of rendering. But not because it's easier, or because it has a lower algorithmic complexity, or because it parallelizes better. None of these big global adjectives apply, that's all bullshit. It's in the future because it's a technique which has a set of trade-offs that are orthogonal to rasterization. It's useful, having more options, techniques, points of view is always useful. There will be areas in which rasterization will always make much more sense, and areas in which raytracing will, while most else will be happy to be able to pick and choose depending on the kind of problem they're facing.

I think you can always peek into the future by looking around, looking what other people in other fields with similar problems are doing... In these years, for CG, raytracing and rasterization are influencing each other more and more.

Offline renderers are mixing techniques from both realms, just see what Pixar's Photorealistic Renderman is capable of doing. It's a Reyes rasterizer, for the first hit. It's parallel, concurrent and distributed. It can do shadowmaps and raytracing. It can cache point clouds and do some sort of screen-space ambient occlusion and so on. Still we lived with rasterization for a long time even in the offline realm especially in some fields that are less concerned about absolute lighting precision.

Only recently we started to see GI and raytracing as useful and productive replacements for some algorithms in the offline realm. I expect everything to be more and more hybrid. I don't think rasterization will ever die. Also, from what I know and I see of the research around raytracing I don't think that we'll see it in realtime, in games for quite some time yet.

08 September, 2011

Mathematica and Skin Rendering

Introduction

Now, if you've been following my blog for a while, you might have noticed that I hate C++ almost as much as I love rapid prototyping, live-editing languages. Wolfram's Mathematica is one of the finest of such tools I've found so far.

It might sound surprising as Mathematica's mostly known as a computer algebra system, performing symbolic manipulations that are arguably not the bread and butter of the realtime rendering engineer, but deep down everything boils down to a Lisp-like language with a twist, as it allows different representations of the parse tree (i.e. plain code, mathematical notation, editable widgets and even inline graphics). All of this is coupled with one of the most powerful mathematical and numerical libraries available, with the symbolical math functionality "helping" the numerical one (automatic algorithm selection, generation of derivatives and so on), and topped with a very good visualization system and amazing documentation.

Sounds cool right? Let's start using it then!

Preintegrated Skin Shading

I've been following Eric Penner's Preintegrated Skin Shading for more than a year now. We were colleagues at EA Canada, but we never ended up on the same project and being both interested in graphics research, we ended up having independently discovering a few of the same ideas while working at our project.

I was pretty amazed when he showed me his "Shader Amortization using Pixel Quad Message Passing" (Now in GPUPro 2) as I had a similar idea in the back of my mind as well, and I guess he was pretty surprised as well when I presented internally to EA a prototype for Fight Night that sported "preblurred" subsurface scattering (at that time FN still did diffuse lighting with cubemaps, so the prototype was about cubemap biasing, normal preblurring and different normal fetches for each of the RGB channels as in "Rapid Acquisition of Specular and Diffuse Normal Maps from "). He combined that same intuition with gradient estimation (as in "Curvature-Dependent Local Illumination Approximation for Translucent Materials", "Curvature-Dependent Reflectance Function for Interactive Rendering of Subsurface Scattering", by Kubo et al and "Curvature-Based Shading of Translucent Materails, such as Human Skin” by Kolchin) and precise integration of Jensen's SSS model to create something fairly amazing, that he recently presented at Siggraph 2011.

While Eric was preparing his talk I was trying to help by finding a good analytic approximation for his integral. Unfortunately this work came too late and couldn't be included in the Siggraph talk (Eric lists it as "future development" possibility) so I'm writing this now to complement his work and as a primer to show Mathematica's usefulness in our line of business.

Mathematica's crash course

When you start up Mathematica's interface you'll get prompted to create a new notebook. A notebook is a handy document where you can mix Mathematica's expressions with normal text and graphics, a sort of "active" wordprocessor.

Everything you enter will go in what Mathematica calls a "cell", by default cells are of "input" type, that means ready to accept expressions. Other kind of cells are not "active" and won't be evaluated when asked to process the entire notebook.

When you want to evaluate the contents of a cell you can press shift-enter in it. That will cause its text to go through the Mathematica kernel, and any output generated by it will be shown underneath, in some output cells grouped together with the input one. Expressions terminated with a semicolon won't generate output (it will be suppressed) which is useful to avoid to spam the notebook with intermediate results which don't require checking.

A nifty thing is that expressions can be entered in various formats, Mathematica supports different notations for the same syntax trees. The cool thing is that you can even ask the interface to convert a given cell in one of the formats, so you can get all your expression in a "programmer's" notation and then have them converted into mathematical formulas ("TraditionalForm") without needing to use complex shortcuts to write everything in the mathematical notation.

Ok let's get started now. We'll need to use the skin diffusion profile of d'Eon and Luebke that is expressed as a sum of gaussians with some given coefficients. We start by defining a couple of lists with these coefficients. List literals in Mathematica are entered using curly braces, so we can write and evaluate the following:

ScatterCoeffs = {0.0064, 0.0484, 0.187, 0.567, 1.99, 7.41}*Sqrt[2]

ScatterColors = {{0.233, 0.455, 0.649}, {0.1, 0.336, 0.344}, {0.118, 0.198, 0}, {0.113, 0.007, 0.007}, {0.358, 0.004, 0}, {0.078, 0, 0}}

Note how the output of the first assignment should already have performed the multiply by 1.414. Using the equal operator in Mathematica forces the evaluation of the expression on the right before assigning it to the variable.

If you squinted you should have noticed already the Sqrt[2] in the previous expression. Function arguments in Mathematica are passed in square brackets, and all Mathematica's built-in functions start with an upper case.

As an exercise, let's now define a Gaussian function manually, instead of relying on Mathematica's internal NormalDistribution:

GaussianFn[var_, x_] := 1/Sqrt[2*Pi*var]*Exp[-(x*x)/(2*var)]

Now, as you evaluate this, you should notice a few things. First of all, it generates no output, the := operator is a delayed assignment, it will evaluate the expression only when needed, and not beforehand (similar to a list quote). Second, that function's input parameters are post-pended with a underscore.

In Mathematica, functions are fundamentally expressions with delayed evaluation, when they are invoked the parameter values are substituted into the expression and then the result is evaluated.

This concept is important, also because being a computer algebra system, Mathematica is all about expressions (see "FullForm" in the documentation), and in fact we could have literally defined the same GaussianFn by performing a delayed replacement (using ReplaceAll, or its shortcut /. note that in Mathematica most basic functional operators infix notations as well) into an expression, like this:

GaussianExpr = 1/Sqrt[2*Pi*var]*Exp[-(x*x)/(2*var)]

GaussianFn[p1_,p2_] := GaussianExpr /. {x->p1, var->p2}

GaussianFn[1,0.5] (* evaluate the function, to prove it works *)

Or as a function object, which won't require a delayed assignment as the function object itself will delay the evaluation of its expression:

GaussianFn[p1_,p2_] = Function[{p1,p2},GaussianExpr/.{x->p1, var->p2}]

GaussianFn[1,0.5]

Got that? Probably not, but all should be clear if you as we go, also try things out and have a look at the excellent online documentation that ships with the program.

Let's define now our last helper expression, and then we will delve into Eric's model and how to fit a simplified function to it.

This is the scattering function, it takes a variable between zero and one and as we don't want to deal with computations in the RGB space which would make visualization of all the functions quite more complex, we pass an index between one and three (lists are one-based in Mathematica) that will decide which channel we're using. We will then need to repeat the fitting for every channel. As I wrote, Mathematica is a lisp-like functional programming environment, so we'll often work with lists and list operations:

ScatterFn[x_, rgb_] := Apply[Plus, Table[GaussianFn[var, x], {var, ScatterCoeffs}]*ScatterColors][[rgb]]

In this case the Table function evaluates GaussianFn multiple times generating a list. Each time it will replace the variable "var" with one of the entries in the "ScatterCoeffs" list (note that the other variable, x, will be replaced as it's one of the arguments of ScatterFn, so no "free variable" is left and we should expect to get a list of numbers out of the evaluation). It then multiplies element-wise the list with the ScatterColors one and it applies to the list the "Plus" operator (note that we could have used the infix version "fn @@ expr" as well), thus summing all the elements. We'll then have a single, three-component result (as ScatterColors is a list of three-elements vectors), and we take the "rgb" indexed one out of it (double square brakets is the list indexing operator).

Integrating Skin Scattering

The key idea in Eric's work is that the subsurface scattering in the skin depends on a neighborhood of the shaded point, if we don't account for lighting discontinuities caused by shadows (which are handled separately). We can then approximate this neighborhood using the surface curvature, and pre-integrate the scattering on a sphere (circle, really) of the same curvature. It then generate a two-dimensional texture based on the curvature and normal dot light with the results of this precomputation. We are going to try to fit a simple expression that we can compute in realtime to avoid having to use the texture.

Let's now express this two-dimensional function in Mathematica, following the expression that you can find in Eric's paper:

(* This is the function defined with integrals, as in the paper *)

DiffuseLight[\[Theta]_,r_,rgb_]:=NIntegrate[Clip[Cos[\[Theta]+x], 0,1}]*ScatterFn[Abs[2*r*Sin[x/2]],rgb],{x,-Pi,Pi}]

DiffuseWeight[r_,rgb_]:=NIntegrate[ScatterFn[Abs[2*r*Sin[x/2]],rgb],{x,-Pi,Pi}]

DiffuseIntegral[\[Theta]_,r_,rgb_]:=DiffuseLight[\[Theta],r,rgb]/DiffuseWeight[r,rgb]

This is pretty simple stuff. We're going to generate samples out of these functions (like Eric's precomputed texture) to then compute a fit, so I directly used the numerical integration (NIntegrate) instead of entering it as a mathematical expression and then asking the numerical approximation of it (N[expr] in Mathematica).
Notice the \[Theta], these were entered as "esc" theta "esc" which is one of the ways of adding symbols in Mathematica's expressions (in this case, the Greek letter theta). These expressions are more readable if pasted in Mathematica and converted into StandardForm.
If you are on a slower computer you can tweak the precision parameters of the NIntegrate or replace it with a NSum, as Eric does in his texture-generator shader:

(* This is a bit faster to evaluate, it's a translation of the code used to generate the lookup table, using an evenly spaced sum instead of NIntegrate *)

DiffuseWeight2[r_,rgb_]:=NSum[ScatterFn[Abs[2*r*Sin[x/2]],rgb],{x,-Pi,Pi,2*Pi/20}]

DiffuseLight2[\[Theta]_,r_,rgb_]:=NSum[Clip[Cos[\[Theta]+x],{0,1}]*ScatterFn[Abs[2*r*Sin[x/2]],rgb],{x,-Pi,Pi,2*Pi/20}]

DiffuseIntegral2[\[Theta]_,r_,rgb_]:=DiffuseLight2[\[Theta],r,rgb]/DiffuseWeight2[r,rgb]

Sampling the function

Let's generate now a table of data to be used for the numerical fitting. That has to be in the form { {input0, input1...., output}, ... }. We will use the Table function again for that, the expression that will be evaluated at each element of the table is the list literal {angle, radius, DiffuseIntegral[...]} (we could have used the List[...] function as well, instead of the literal) but this time we'll need two ranges, one for the radius and one for the angle.
The Flatten call is needed because Table, when used with two ranges, generates a list of lists (a matrix), while we want a flat list of three-dimensional vectors.

channel = 1; (* 1=red 2=green 3=blur *)

dataThetaRVal = Flatten[Table[{\[Theta], r, DiffuseIntegral[\[Theta], r, channel]}, {\[Theta], 0, Pi, 2*Pi/20}, {r, 0.25, 6, 1/20}], 1];

Finding a fitting function

For the fitting function I also want to check how well it extrapolates. So I will compute the fit over the original sample range, but then plot it over an extended interval to verify how the function looks like outside it.

To do this we could use Mathematica's function plotting operators (and it might even be a smarter choice) but instead let's still work with a plots that operate on lists of samples, ListPlot3D.

To plot over an extended range we define an interval table, dataIntervalExt, that's because I want to do these extended plots various times so I "save" the sample inputs and then apply them to different functions.

The first time I use this is to generate a graph of how the lambert lighting model would look like. Using Mathematica's functional programming primitives I define an inline function that maps from x (that will be an element of dataIntervalExt, so a two-dimensional input vector) to a list {first of x, second of x, LambertFn[first of x - that is, the angle]} using the function symbol (in Mathematica it will look like "|->"), then I map this function to every element of dataIntervalExt using the inline /@ map.
Notice how the cosGraph plot is hidden by closing the expression with a semicolon, then the Show[...] function will combine the two plots into a single output.

dataIntervalExt = Flatten[Table[{\[Theta], r}, {\[Theta], 0, Pi, 2*Pi/20}, {r, 0, 12, 1/20}], 1];

dataGraph = ListPlot3D[dataThetaRVal, InterpolationOrder -> 3]

cosGraph = ListPlot3D[(x \[Function] {First[x], Last[x], Clip[Cos[First[x]], {0, 1}]}) /@ dataIntervalExt];

Show[dataGraph, cosGraph]

dataGraph and the combined 2d plots

In order to find a suitable fitting function, I start looking at a graph of some slices slice, fixing a reasonable radius. If the first argument of Plot is a list of functions instead of a single one, it will overlay the plots of all the functions in the list on the same graph. Show does something similar but with any unrelated plots.

plotIn2d = Plot[DiffuseIntegral[x, 3, 1], {x, 0, Pi}]
(* This will generate a graph with three slices and the Lambert's clamped cos *)
Plot[{DiffuseIntegral[x,1,1], DiffuseIntegral[x,3,1], DiffuseIntegral[x,6,1], Max[Cos[x],0]}, {x, 0, Pi}]

I then make some hypothesis and plot a test function against the slice to visually see if it might fit:

(* We have to find a function of cos\[Theta] which could look similar to the DiffuseIntegral*)
Show[Plot[{Max[Cos[x], 0], Max[Cos[x]*0.5 + 0.5, 0]^3}, {x, 0, Pi}], plotIn2d]

Optimizing fitting parameters
Now we can define a function with a number of free parameters, and ask Mathematica to find a fit, a set of parameters that minimizes the distance between the function and the sample set. NonlinearModelFit is used here, that is a slightly more controllable variant of FindFit. This will take quite some time as the function to fit is quite non linear (due to the Max and Clamps).

A word on the function I've selected here and why. I'm sure something better can be done, this is not much more than a simple test, but my reasoning here, after visually inspecting the function to approximate, is that the dependency on the radius is rather simple, while the shape that is dependent on the angle is quite complex.
That's why I started plotting the function in two dimensional graphs that are slices with a fixed radius. Once I've found a possible function that could approximate that with some free parameters, I've modeled the dependency on the radius of these parameters as simple linear equations.
Also I wanted the function to become Lambertian for large radiuses, and I did that by literally having part of the fitting function be a linear interpolation towards the cosine. This constraint is questionable probably for many applications you could just clamp the radius that you will use in a given range, and care only about finding a good fit in that range.

(* Let's find the fit now, first we have to define a parametric function, the fitting process will optimize the parameters to minimize the error *)

(* piecewise linear models... *)
(* fitFunction[\[Theta]_,r_] := Max[Cos[\[Theta]] * Min[(a0*r+a1),1] + Max[(a2*r+a3),0],0] *)
(* fitFunction[\[Theta]_,r_] := Max[Max[Cos[\[Theta]] * Min[(a0*r+a1),1] + Max[(a2*r+a3),0],0], Max[Cos[\[Theta]] * Min[(a4*r+a5),1] + Max[(a6*r+a7),0],0]] *)

(* nonlinear models... *)
(* fitFunction[\[Theta]_,r_] := (Cos[\[Theta]]*(a0*r+a1) + (a2*r+a3))^3 *)

fitFunction[\[Theta]_, r_] := (Cos[\[Theta]]*(a0*r + a1) + (a2*r + a3))^3 * Clip[a4*r + a5, {0, 1}] + Max[Cos[\[Theta]], 0] * (1 - Clip[a4*r + a5, {0, 1}])

fitting = NonlinearModelFit[dataThetaRVal, fitFunction[\[Theta], r], {a0, a1, a2, a3, a4, a5, a6, a7}, {\[Theta], r}, Method -> NMinimize (* We need to use NMinimize for non-smooth models *)]
fitParams = fitting["BestFitParameters"]
Plus @@ Abs /@ fitting["FitResiduals"] (* Print the sum of residuals *)

Original function and fitted one, plotted at an extended range to show extrapolation to Cos(theta)

We can now plot the fitted function and overlay it with the original one for a visual comparison. In order to use the fit we've found, we just replace the parameters we got from the NonlinearModelFit into the original fitFunction using the /. operator.

(* We plot the fit function over dataIntervalExt to check that it extrapolates well *)

dataFit = (x \[Function] {First[x], Last[x], fitFunction[First[x], Last[x]] /. fitParams}) /@ dataIntervalExt;

fitGraph = ListPlot3D[dataFit, InterpolationOrder -> 3, ColorFunction -> "SouthwestColors", Mesh -> None];
Show[fitGraph, cosGraph]
Show[fitGraph, dataGraph]

We can do much more, but this should be a good starting point. I've uploaded a PDF of the whole Mathematica notebook on ScribD here. Enjoy!

P.S. I'd bet everything here could be made ten times faster by applying these suggestions... Left to the reader :)

P.S. You might want to find some better example of Mathematica's goodness than my lame test code. You could start here or here. Also, for Italians only, I've found on Issuu an entire collection of McMicrocomputer magazines, by ADP! Incredible not only for its nostalgia value but also for all the Mathematica's articles by Francesco Romani that showed me more than ten years ago the power and beauty of this software and of maths in general.

What I've learned from shipping a deferred lighting renderer

You're going to draw every mesh three times on average (considering shadows), if you don't fuck up the GPU badly, chances are that you're going to be CPU bound.

Cache-misses in your rendering pipeline are going to kill you.

Use indirections VERY sparingly. Control where resources can be created and destroyed to be able to avoid to have to reference count everything. Prefer "patching" pointers in place to keeping references around.
Don't iterate multiple times on "drawable" objects collections to isolate which ones to draw in each pass. Store only the draw data needed for a given pass. Scene trees are dumb but by now everyone should be well aware of that.

Parallel rendering is really important.

Especially if you don't double buffer and add frames of latecy here and there (bad for gameplay and memory), you won't find much to do later in the command buffer generation to keep the cores busy otherwise.

Batching, "sort-by-material" is needed.
My best bet on how to handle all this is still the old design I've blogged about of having every "drawable" object correspond to a drawcall and represent it as fixed length sort-key whose bits are an encoding of all the state needed for the draw.

You'll either need to have a robust geometrical pipeline to split/aggregate meshes and generate LODs and occluders...

...or you'll have to schedule a sizable chunk of your artist's time for that. And you'd better not think that you can do these optimizations at the very last...
On the upside, software occlusion culling is a big win and not that hard!

Carmack's megatextures (clipmaps) are not (only) attractive to achieve a given art-style, but they have in a deferred setting the big plus of requiring very few material shaders (as you don't need to combine textures, which is most of what material shaders do in a deferred lighting renderer) and less objects (no need of static decals), thus requiring less work on the CPU and making easier to split/merge meshes.
A tiled lighting stage (as opposed to the stencil marked volumes) is a good idea

...especially if you need shader variations in the lighting stage to render some decent materials, consoles have a one bit hi-stencil that won't help you

Point, spot and directional lights are not enough. You'll need some kind of ambient lighting devices to create volume.

If your game can stream in chunks (regions), don't think about continuous streaming.
Edge-filtering AA can work pretty decently and extremely fast. Not limited only to the final framebuffer...

PS3 early-z is a bitch, PC Dx9 tools are non-existent

07 September, 2011

Back from vacations, Parameter Databases poll results.

Back after Siggraph and some well-deserved vacations, I have many posts to prepare. For now, here there are the results from my last poll.

Coast near my hometown

The poll was: "Parameter Database systems should use..." and here are the final results. Didn't get too many votes this time (32), maybe people were in vacation as well, or maybe the subject was boring, anyhow:

Reflection (53%)
Code-Generation (28%)
Handles/Dynamic Types (18%)
Immutable Data (in-game) (18%)
Data Inheritance (18%)
Schema Inheritance (18%)
Relational DB (15%)
Live-editing via tool (78%)
Live-editing in-game (40%)
Serialization via tool (53%)
Serialization in-game (34%)

Predictably I'd say, people prefer to keep editing and serialization in a tool instead of in-game, having the in-game client only to load/listen to the data changes.

As for the database type, relational systems do not seem to be favored, I guess most parameter systems are still key-values. Low scores are also assigned to dynamic typing, data and schema (class) inheritance.

What is maybe surprising that reflection wins over code-generation, even if in C++ there is no easy way to implement the former.

I've noticed how often we, as engineers, have a perverse aesthetic sense for which a C++ disgusting macro based reflection system is considered "prettier" than mixing different languages or crafting languages and tools.

To be fair, reflection could also be done more simply by parsing PDBs or using C++ parsers, but I doubt these are really widespread.

The good thing about reflection though is that code changes can "break" data, or require data changes, but the opposite is not true. With code-generation data changes can "break" code which is arguably worse.

Search this blog