
Showing posts with label Rendering tutorials. Show all posts

17 December, 2020

Hallucinations re: the rendering of Cyberpunk 2077

Introduction

Two curses befall rendering engineers. First, we lose the ability to look at reality without being constantly reminded of how fascinatingly hard it is to solve light transport and model materials.

Second, when we start playing any game, we cannot refrain from trying to reverse its rendering technology (which is particularly infuriating for multiplayer titles - stop shooting at me, I'm just here to look at how rocks cast shadows!).

So when I bought Cyberpunk 2077 I had to look at how it renders a frame. It's very simple to take RenderDoc captures of it, so I had really no excuse.

The following are speculations on its rendering techniques, observations made while skimming captures, and playing a few hours.

It's by no means a serious attempt at reverse engineering. For that, I lack both the time and the talent. I also rationalize doing a bad job at this by the following excuse: it's actually better this way. 

I think it's better to dream about how rendering (or anything really) could be, just with some degree of inspiration from external sources (in this case, RenderDoc captures), rather than exactly knowing what is going on.

If we know, we know, there's no mystery anymore. It's what we do not know that makes us think, and sometimes we exactly guess what's going on, but other times we do one better, we hallucinate something new... Isn't that wonderful?

The following is mostly a read-through of a single capture. I did open a second one to try to fill some blanks, but so far, that's all.

This is the frame we are going to look at.

I made the captures at high settings, without RTX or DLSS as RenderDoc does not allow these (yet?). I disabled motion blur and other uninteresting post-fx, and made sure I was moving in all captures, to be able to tell a bit better when passes access previous-frame data.

I am also not relying on insider information for this. Makes everything easier and more fun.

The basics

At a glance, it doesn't take long to describe the core of Cyberpunk 2077 rendering.

It's a classic deferred renderer, with a fairly vanilla g-buffer layout. We don't see the crazy amount of buffers of, say, Sucker Punch's PS4 launch Infamous: Second Son, nor complex bit-packing and re-interpretation of channels.

Immediately recognizable g-buffer layout
  • 10.10.10.2 Normals, with the 2-bit alpha reserved to mark hair
  • 10.10.10.2 Albedo. Not clear what the alpha is doing here, it seems to just be set to one for everything drawn, but it might be only the captures I got
  • 8.8.8.8 Metalness, Roughness, Translucency and Emissive, in this order (RGBA)
  • Z-buffer and Stencil. The latter seems to isolate object/material types. Moving objects are tagged. Skin. Cars. Vegetation. Hair. Roads. Hard to tell / would take time to identify the meaning of each bit, but you get the gist...
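To make the 10.10.10.2 layout concrete, here is a minimal sketch of how a unit normal and a 2-bit flag could be packed into 32 bits. The function names and the `HAIR_FLAG` value are mine, and the unsigned-normalized remapping is an assumption; I have not checked what encoding the game actually uses.

```python
def pack_normal_1010102(normal, flag=0):
    """Pack a unit normal (components in [-1, 1]) and a 2-bit flag into one 32-bit word."""
    def to_unorm10(v):
        # Remap [-1, 1] to [0, 1], then quantize to 10 bits (assumed encoding).
        clamped = max(-1.0, min(1.0, v))
        return int(round((clamped * 0.5 + 0.5) * 1023.0))

    x, y, z = (to_unorm10(c) for c in normal)
    return x | (y << 10) | (z << 20) | ((flag & 0x3) << 30)


def unpack_normal_1010102(packed):
    """Inverse of pack_normal_1010102: returns ((x, y, z), flag)."""
    def from_unorm10(bits):
        return (bits / 1023.0) * 2.0 - 1.0

    x = from_unorm10(packed & 0x3FF)
    y = from_unorm10((packed >> 10) & 0x3FF)
    z = from_unorm10((packed >> 20) & 0x3FF)
    return (x, y, z), (packed >> 30) & 0x3


HAIR_FLAG = 1  # hypothetical value, only for illustration
packed = pack_normal_1010102((0.0, 0.0, 1.0), HAIR_FLAG)
normal, flag = unpack_normal_1010102(packed)
```

With only 10 bits per component the round trip is lossy (a zero component comes back as roughly 0.001), which is one reason engines look at smarter encodings such as best-fit normals.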

If we look at the frame chronologically, it starts with a bunch of UI draws (that I didn't investigate further), a bunch of copies from a CPU buffer into VS constants, then a shadowmap update (more on this later), and finally a depth pre-pass.

Some stages of the depth pre-pass.

This depth pre-pass is partial (not drawing the entire scene) and is only used to reduce the overdraw in the subsequent g-buffer pass.

Basically, all the geometry draws are using instancing and some form of bindless textures. I'd imagine this was a big part of updating the engine from The Witcher 3 to contemporary hardware. 

Bindless also makes it quite annoying to look at the capture in RenderDoc, unfortunately - by spot-checking I could not see too many different shaders in the g-buffer pass - perhaps a sign of not having allowed artists to make shaders via visual graphs? 

Other wild guesses: I don't see any front-to-back sorting in the g-buffer, and the depth prepass renders all kinds of geometries, not just walls, so it would seem that there is no special authoring for these (brushes, forming a BSP) - nor have artists hand-tagged objects for the prepass, as some relatively "bad" occluders make the cut. I imagine that after culling, a list of objects is sorted by shader and from there instanced draws are dynamically formed on the CPU.

The opening credits do not mention Umbra (which was used in The Witcher 3) - so I guess CDPr rolled out their own visibility solution. Its effectiveness is really hard to gauge, as visibility is a GPU/CPU balance problem, but there seem to be quite a few draws that do not contribute to the image, for what it's worth. It also looks like at times the rendering can display "hidden" rooms, so it's probably not a cell and portal system - I am guessing that for such large worlds it's impractical to ask artists to do lots of manual work for visibility.

A different frame, with some of the pre-pass.
Looks like some non-visible rooms are drawn then covered by the floor - which might hint at culling done without old-school brushes/BSP/cell&portals?

Lastly, I didn't see any culling done GPU side, with depth pyramids and so on, no per-triangle or cluster culling or predicated draws, so I guess all frustum and occlusion culling is CPU-side.

Note: people are asking if "bad" culling is the reason for the current performance issues, I guess meaning on PS4/XB1. This inference cannot be made, nor can the visibility system be called "bad" - as I wrote already. FWIW - it seems mostly that consoles struggle with memory and streaming more than anything else. Who knows...

Let's keep going... After the main g-buffer pass (which seems to be always split in two - not sure if there's a rendering reason or perhaps these are two command buffers done on different threads), there are other passes for moving objects (which write motion vectors - the motion vector buffer is first initialized with camera motion).

This pass includes avatars, and the shaders for these objects do not use bindless (perhaps that's used only for world geometry) - so it's much easier to see what's going on there if one wants to.

Finally, we're done with the main g-buffer passes, depth-writes are turned off and there is a final pass for decals. Surprisingly these are pretty "vanilla" as well, most of them being mesh decals.

Mesh decals bind as inputs (a copy of) the normal buffer, which is interesting as one might imagine the 10.10.10 format was chosen to allow for easy hardware blending, but it seems that some custom blend math is used as well - something important enough to pay for the price of making a copy (on PC at least).

A mesh decal - note how it looks like the original mesh with the triangles that do not map to decal textures removed.

It looks like only triangles carrying decals are rendered, using special decal meshes, but other than that everything is remarkably simple. It's not bindless either (only the main static geometry g-buffer pass seems to be), so it's easier to see what's going on here.

At the end of the decal pass we see sometimes projected decals as well, I haven't investigated dynamic ones created by weapons, but the static ones on the levels are just applied with tight boxes around geometry, I guess hand-made, without any stencil-marking technique (which would probably not help in this case) to try to minimize the shaded pixels.

Projected decals do bind depth-stencil as input as well, obviously as they need the scene depth, to reconstruct world-space surface position and do the texture projection, but probably also to read stencil and avoid applying these decals on objects tagged as moving.
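The projection step described above can be sketched in a few lines. This assumes the world-space position has already been reconstructed from depth, that the decal box is given as a center, three orthonormal axes and half-extents, and that one stencil bit marks moving objects. The `MOVING_BIT` value and all names here are hypothetical, not taken from the capture.

```python
MOVING_BIT = 0x1  # hypothetical stencil bit for "moving object"


def project_decal(world_pos, stencil, center, axes, half_extents):
    """Return decal UV in [0, 1]^2 for a surface point, or None if the decal is rejected.

    axes: three orthonormal box axes in world space.
    half_extents: box half-sizes along those axes.
    """
    if stencil & MOVING_BIT:
        return None  # skip surfaces tagged as moving

    rel = [p - c for p, c in zip(world_pos, center)]
    local = []
    for axis, half in zip(axes, half_extents):
        # Coordinate along this axis, normalized to [-1, 1] inside the box.
        d = sum(r * a for r, a in zip(rel, axis)) / half
        if abs(d) > 1.0:
            return None  # outside the decal box
        local.append(d)

    # Project along the third axis: the first two coordinates become the UV.
    return (local[0] * 0.5 + 0.5, local[1] * 0.5 + 0.5)


identity_axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
uv = project_decal((1.0, 2.0, 0.1), 0, (1.0, 2.0, 0.0), identity_axes, (1.0, 1.0, 0.5))
```

A tight box matters because every pixel rasterized by the box pays for the depth read and the rejection test, even when it ends up outside the decal.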

A projected decal, on the leftmost wall (note the decal box in yellow)

As for the main g-buffer draws, many of the decals might end up not contributing at all to the image, and I don't see much evidence of decal culling (as some tiny ones are drawn) - but it also might depend on my chosen settings.

The g-buffer pass is quite heavy, but it has lots of detail, and it's of course the only pass that depends on scene geometry, and only a fraction of the overall frame time. E.g. look at the normals on the ground, pushed beyond the point of aliasing. At least on this PC capture, textures seem even biased towards aliasing, perhaps knowing that temporal will resolve them later (which it absolutely does in practice: rotating the camera often reveals texture aliasing that immediately gets resolved when stopped - not a bad idea, especially as noise during view rotation can be masked by motion blur).

1:1 crop of the final normal buffer

A note re:Deferred vs Forward+

Most state-of-the-art engines are deferred nowadays. Frostbite, Guerrilla's Decima, Call of Duty BO3/4/CW, Red Dead Redemption 2, Naughty Dog's Uncharted/TLOU and so on.

On the other hand, the amount of advanced trickery that Forward+ allows you is unparalleled, and it has been adopted by a few to do truly incredible rendering, see for example the latest Doom games or have a look at the mind-blowing tricks behind Call of Duty: Modern Warfare / Warzone (and the previous Infinite Warfare, which was the first time that COD line moved from being a crazy complex forward renderer to a crazy complex forward+).

I think the jury is still out on all this, and as with most things rendering (or well, coding!) we don't know anything about what's optimal, we just make/inherit choices and optimize around them. 

That said, I'd wager this was a great idea for CP2077 - and I'm not surprised at all to see this setup. As we'll see in the following, CP2077 does not seem to have baked lighting, relying instead on a few magic tricks, most of which operate in screen-space.

For these to work, you need before lighting to know material and normals, so you need to write a g-buffer anyways. Also you need temporal reprojection, so you want motion vectors and to compute lighting effects in separate passes (that you can then appropriately reproject, filter and composite).

I would venture to say also that this was done not because of the need for dynamic GI - there's very little from what I've seen in terms of moving lights and geometry is not destructible. I imagine instead, this is because the storage and runtime memory costs of baked lighting would be too big. Plus, it's easier to make lighting interactive for artists in such a system, rather than trying to write a realtime path-tracer that accurately simulates what your baking system results would be...

Lastly, as we're already speculating things, I'd imagine that CDPr wanted really to focus on artists and art. A deferred renderer can help there in two ways. First, its performance is less coupled with the number of objects and vertices on screen, as only the g-buffer pass depends on them, so artists can be a smidge less "careful" about these. 
Second, it's simpler, overall - and in an open-world game you already have to care about so many things, that having to carefully tune your gigantic forward+ shaders for occupancy is not a headache you want to have to deal with...

Lighting part 1: Analytic lights

Obviously, no deferred rendering analysis can stop at the g-buffer, we split shading in two, and we have now to look at the second half, how lighting is done.

Here things become a bit dicier, as in the modern age of compute shaders, everything gets packed into structures that we cannot easily see. Even textures can be hard to read when they do not carry continuous data but pack who-knows-what into integers.

Normal packing and depth pyramid passes.

Regardless, it's pretty clear that after all the depth/g-buffer work is said and done, an uber-summarization pass kicks in, taking care of a bunch of depth-related stuff.

RGBA8 packed normal (&roughness). Note the speckles that are a tell-tale of best-fit-normal encoding.
Also, note that this happens after hair rendering - which we didn't cover.

It first packs normal and roughness into a RGBA8 using Crytek's lookup-based best-fit normal encoding, then it creates a min-max mip pyramid of depth values.
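For reference, a min-max depth pyramid is simple to build: each texel of a mip stores the minimum and maximum depth of the 2x2 block below it. The sketch below is a plain CPU version under my own assumptions (square, power-of-two input, nested lists instead of textures); on the GPU this would be a chain of compute or pixel shader passes.

```python
def build_min_max_pyramid(depth):
    """Build a min-max mip chain from a square, power-of-two 2D list of depths.

    Returns a list of mips; each mip is a 2D list of (min_depth, max_depth) tuples.
    """
    level = [[(d, d) for d in row] for row in depth]
    mips = [level]
    while len(level) > 1:
        half = len(level) // 2
        next_level = []
        for y in range(half):
            row = []
            for x in range(half):
                # Gather the 2x2 block of the finer level.
                block = [level[2 * y + dy][2 * x + dx] for dy in (0, 1) for dx in (0, 1)]
                row.append((min(b[0] for b in block), max(b[1] for b in block)))
            next_level.append(row)
        level = next_level
        mips.append(level)
    return mips


depth = [
    [0.1, 0.2, 0.9, 0.8],
    [0.3, 0.4, 0.7, 0.6],
    [0.5, 0.5, 0.5, 0.5],
    [0.2, 0.2, 1.0, 1.0],
]
mips = build_min_max_pyramid(depth)
```

The coarse levels give conservative depth bounds per screen tile, which is exactly what a light clustering pass wants: it can skip clusters that lie entirely in front of or behind the geometry in a tile.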

The pyramid is then used to create what looks like a volumetric texture for clustered lighting.

A slice of what looks like the light cluster texture, and below one of the lighting buffers partially computed. Counting the pixels in the empty tiles, they seem to be 16x16 - while the clusters look like 32x32?

So - from what I can see it looks like a clustered deferred lighting system. 

The clusters seem to be 32x32 pixels in screen-space (froxels), with 64 z-slices. The lighting though seems to be done at a 16x16 tile granularity, all via compute shader indirect dispatches.
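As an illustration of the froxel addressing, here is how a pixel and its view-space depth could map to a cluster, using the 32x32 tile size and 64 slices guessed above. The exponential depth slicing and the near/far parameters are my assumptions (it is a common choice in clustered shading), not something I could read from the capture.

```python
import math

TILE_SIZE = 32   # pixels per cluster in x and y, as guessed from the capture
NUM_SLICES = 64  # depth slices, as guessed from the capture


def cluster_coords(px, py, view_z, near, far):
    """Map a pixel and its positive view-space depth to (tile_x, tile_y, z_slice)."""
    tile_x = px // TILE_SIZE
    tile_y = py // TILE_SIZE
    # Assumed exponential slicing: slice = N * log(z / near) / log(far / near).
    t = math.log(view_z / near) / math.log(far / near)
    z_slice = min(NUM_SLICES - 1, max(0, int(t * NUM_SLICES)))
    return tile_x, tile_y, z_slice


near_cluster = cluster_coords(0, 0, 0.1, 0.1, 1000.0)
far_cluster = cluster_coords(100, 70, 1000.0, 0.1, 1000.0)
```

Exponential slicing keeps clusters roughly cube-shaped in view space, which is why it tends to be preferred over linear slicing.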

I would venture this is because CS are specialized by both the materials and lights present in a tile, and then dispatched accordingly - a common setup in contemporary deferred rendering systems (e.g. see Call of Duty Black Ops 3 and Uncharted 4 presentations on the topic).
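The tile classification idea can be sketched like this: each 16x16 tile records which material and light features it contains, tiles are binned by that feature set, and each bin is then dispatched with a shader specialized for exactly those features. The feature names below are hypothetical, and the binning runs on the CPU here only for readability; in a real engine it would feed indirect dispatch arguments on the GPU.

```python
from collections import defaultdict


def classify_tiles(tile_features):
    """Group tiles by the set of features they contain.

    tile_features: dict mapping (tile_x, tile_y) -> set of feature names.
    Returns a dict mapping a shader variant key (frozenset) -> sorted list of tiles.
    """
    bins = defaultdict(list)
    for tile, features in sorted(tile_features.items()):
        if not features:
            continue  # empty tile (e.g. sky): no lighting work at all
        bins[frozenset(features)].append(tile)
    return dict(bins)


tiles = {
    (0, 0): {"skin", "point_light"},
    (1, 0): {"point_light"},
    (2, 0): set(),
    (3, 0): {"point_light"},
}
dispatches = classify_tiles(tiles)
```

Each key in `dispatches` stands for one specialized compute shader, and the length of its tile list is the thread-group count written into the indirect arguments.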

Analytic lighting pass outputs two RGBA16 buffers, which seems to be diffuse and specular contributions. Regarding the options for scene lights, I would not be surprised if all we have are spot/point/sphere lights and line/capsule lights. Most of Cyberpunk's lights are neons, so definitely line light support is a must.

You'll also notice that a lot of the lighting is unshadowed, and I don't think I ever noticed multiple shadows under a single object/avatar. I'm sure that the engine does not have limitations in that aspect, but all this points at lighting that is heavily "authored" with artists carefully placing shadow-casting lights. I would also not be surprised if the lights have manually assigned bounding volumes to avoid leaks.

Final lighting buffer (for analytic lights) - diffuse and specular contributions.

Lighting part 2: Shadows

But what we just saw does not mean that shadows are unsophisticated in Cyberpunk 2077, quite the contrary, there are definitely a number of tricks that have been employed, most of them not at all easy to reverse!

First of all, before the depth-prepass, there are always a bunch of draws into what looks like a shadowmap. I suspect this is a CSM, but in the capture I have looked at, I have never seen it used, only rendered into. This points to a system that updates shadowmaps over many frames, likely with only static objects?

Is this a shadowmap? Note that there are only a few events in this capture that write to it, none that reads - it's just used as a depth-stencil target, if RenderDoc is correct here...

These multi-frame effects are complicated to capture, so I can't say if there are further caching systems (e.g. see the quadtree compressed shadows of Black Ops 3) at play. 

One thing that looks interesting is that if you travel fast enough through a level (e.g. in a car) you can see that the shadows take some time to "catch up" and they fade in incrementally in a peculiar fashion. It almost appears like there is a depth offset applied from the sun point of view, that over time gets reduced. Interesting!

This is hard to capture in an image, but note how the shadow in time seems to crawl "up" towards the sun.

Sun shadows are pre-resolved into a screen-space buffer prior to the lighting compute pass, I guess to simplify compute shaders and achieve higher occupancy. This buffer is generated in a pass that binds quite a few textures, two of which look CSM-ish. One is clearly a CSM, with in my case five entries in a texture array, where slices 0 to 3 are different cascades, but the last slice appears to be the same cascade as slice 0 but from a slightly different perspective. 

There's surely a lot to reverse-engineer here if one was inclined to do the work!

The slices of the texture on the bottom (in red) are clearly CSM. The partially rendered slices in gray are a mystery. The yellow/green texture is, clearly, resolved screen-space sun shadows, I've never, so far, seen the green channel used in a capture.

All other shadows in the scene are some form of VSMs, computed again incrementally over time. I've seen 512x512 and 256x256 used, and in my captures, I can see five shadowmaps rendered per frame, but I'm guessing this depends on settings. Most of these seem only bound as render targets, so again it might be that it takes multiple frames to finish rendering them. One gets blurred (VSM) into a slice of a texture array - I've seen some with 10 slices and others with 20.
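If these are indeed variance shadow maps, the lookup would be the usual Chebyshev bound: the blurred map stores the mean depth and mean squared depth, and the shader turns those into an upper bound on the lit fraction. The sketch below is the textbook formula, not code recovered from the game, and the `min_variance` clamp is a typical but assumed safeguard.

```python
def vsm_visibility(mean, mean_sq, receiver_depth, min_variance=1e-5):
    """Chebyshev upper bound on visibility from variance shadow map moments.

    mean: E[d] from the filtered shadow map, mean_sq: E[d^2].
    """
    if receiver_depth <= mean:
        return 1.0  # receiver is in front of the average occluder: fully lit
    variance = max(mean_sq - mean * mean, min_variance)
    delta = receiver_depth - mean
    return variance / (variance + delta * delta)


lit = vsm_visibility(0.5, 0.25, 0.4)
penumbra = vsm_visibility(0.5, 0.26, 0.6)
```

Because the moments can be filtered linearly, the shadow map can be blurred and mipmapped like any other texture, which fits the blur pass into a texture array slice seen in the capture.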

A few of the VSM-ish shadowmaps on the left, and artefacts of the screen-space raymarched contact shadows on the right, e.g. under the left arm, the scissors and other objects in contact with the plane...

Finally, we have what the game settings call "contact shadows" - which are screen-space, short-range raymarched shadows. These seem to be computed by the lighting compute shaders themselves, which would make sense as these know about lights and their directions...

Overall, shadows are both simple and complex. The setup, with CSMs, VSMs, and optionally raymarching is not overly surprising, but I'm sure the devil is in the detail of how all these are generated and faded in. It's rare to see obvious artifacts, so the entire system has to be praised, especially in an open-world game!

Lighting part III: All the rest...

Since booting the game for the first time I had the distinct sense that most lighting is actually not in the form of analytic lights - and indeed looking at the captures this seems to not be unfounded. At the same time, there are no lightmaps, and I doubt there's anything pre-baked at all. This is perhaps one of the most fascinating parts of the rendering.

First pass highlighted is the bent-cone AO for this frame, remaining passes do smoothing and temporal reprojection.

First of all, there is a very good half-res SSAO pass. This is computed right after the uber-depth-summarization pass mentioned before, and it uses the packed RGBA8 normal-roughness instead of the g-buffer one. 

It looks like it's computing bent normals and aperture cones - impossible to tell the exact technique, but it's definitely doing a great job, probably something along the lines of HBAO-GTAO. First, depth, normal/roughness, and motion vectors are all downsampled to half-res. Then a pass computes current-frame AO, and subsequent ones do bilateral filtering and temporal reprojection. The dithering pattern is also quite regular if I had to guess, probably Jorge's Gradient noise?
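If the dithering is indeed Jorge Jimenez's interleaved gradient noise (which is just my guess), the pattern comes from a tiny formula evaluated per pixel. The constants below are the ones from his published presentation; using the result as a rotation angle for the AO sample directions is a common usage that I am assuming here.

```python
import math


def interleaved_gradient_noise(x, y):
    """Interleaved gradient noise for pixel (x, y); returns a value in [0, 1)."""
    def frac(v):
        return v - math.floor(v)

    return frac(52.9829189 * frac(0.06711056 * x + 0.00583715 * y))


def sample_rotation(x, y):
    """Per-pixel rotation angle in radians for the AO sampling directions (assumed usage)."""
    return 2.0 * math.pi * interleaved_gradient_noise(x, y)


noise_tile = [[interleaved_gradient_noise(x, y) for x in range(8)] for y in range(8)]
```

The appeal of this noise is that it is very regular, so a small spatial filter followed by temporal accumulation can remove it almost completely.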

It's easy to guess that the separate diffuse-specular emitted from the lighting pass is there to make it easier to occlude both more correctly with the cone information.

One of many specular probes that get updated in an array texture, generating blurred mips.

Second, we have to look at indirect lighting. After the light clustering pass there are a bunch of draws that update a texture array of what appear to be spherically (or dual paraboloid?) unwrapped probes. Again, this is distributed across frames, not all slices of this array are updated per frame. It's not hard to see in captures that some part of the probe array gets updated with new probes, generating on the fly mipmaps, presumably GGX-prefiltered. 

A mysterious cubemap. It looks like it's compositing sky (I guess that dynamically updates with time of day) with some geometry. Is the red channel an extremely thin g-buffer?

The source of the probe data is harder to find though, but in the main capture I'm using there seems to be something that looks like a specular cubemap relighting happening, it's not obvious to me if this is a different probe from the ones in the array or the source for the array data later on. 

Also, it's hard to say whether or not these probes are hand placed in the level, if the relighting assumption is true, then I'd imagine that the locations are fixed, and perhaps artist placed volumes or planes to define the influence area of each probe / avoid leaks.

A slice of the volumetric lighting texture, and some disocclusion artefacts and leaks in a couple of frames.

We have your "standard" volumetric lighting, computed in a 3d texture, with temporal reprojection. The raymarching is clamped using the scene depth, presumably to save performance, but this, in turn, can lead to leaks and reprojection artifacts at times. Not too evident though in most cases.

Screen-Space Reflections

Now, things get very interesting again. First, we have an amazing Screen-Space Reflection pass, which again uses the packed normal/roughness buffer and thus supports blurry reflections, and at least at my rendering settings, is done at full resolution. 

It uses previous-frame color data, before UI compositing, for the reflection (using motion vectors to reproject). And it's quite noisy, even if it employs a blue-noise texture for dithering!
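The reprojection step for reflections can be sketched as follows: once the ray march finds a hit pixel in the current frame, the motion vector at that pixel says where the same surface was in the previous frame, and the color is fetched from there. The motion vector convention (pixel offset from current to previous position) and the fallback behavior are assumptions of mine.

```python
def reproject_hit_color(prev_color, hit_px, motion_vectors):
    """Fetch the previous-frame color for an SSR hit found in the current frame.

    prev_color, motion_vectors: 2D lists indexed as [y][x].
    motion_vectors hold (dx, dy) pixel offsets from the current position to the
    previous-frame position (assumed convention).
    Returns None when the reprojected position falls off screen.
    """
    x, y = hit_px
    dx, dy = motion_vectors[y][x]
    prev_x, prev_y = int(round(x + dx)), int(round(y + dy))
    height, width = len(prev_color), len(prev_color[0])
    if 0 <= prev_x < width and 0 <= prev_y < height:
        return prev_color[prev_y][prev_x]
    return None  # caller falls back to something else, e.g. specular probes


prev_color = [[0.1, 0.2], [0.3, 0.4]]
motion = [[(1.0, 0.0), (5.0, 0.0)], [(0.0, 0.0), (-1.0, -1.0)]]
reflected = reproject_hit_color(prev_color, (0, 0), motion)
```

Reading last frame's lit color is what lets the reflections include all the lighting for free, at the price of one frame of latency.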

Diffuse/Ambient GI, reading a volumetric cube, which is not easy to decode...

Then, an indirect diffuse/ambient GI pass. It binds the g-buffer and a bunch of 64x64x64 volume textures that are hard to decode. From the inputs and outputs one can guess the volume is centered around the camera and contains indices to some sort of computed irradiance, maybe spherical harmonics or such. 
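To picture the camera-centered volume, here is how a world position could be mapped to a texel of a 64x64x64 grid. The world-space extent of the volume is a made-up parameter, and what the texel actually stores (an index into irradiance data, if my guess is right) is not modeled at all.

```python
VOLUME_RES = 64  # texels per side, from the captured textures


def volume_texel(world_pos, camera_pos, volume_extent):
    """Map a world position to an integer texel of a camera-centered cubic volume.

    volume_extent: world-space side length of the volume (hypothetical parameter).
    Returns (x, y, z), or None when the position is outside the volume.
    """
    texel = []
    for p, c in zip(world_pos, camera_pos):
        u = (p - c) / volume_extent + 0.5  # [0, 1) inside the volume
        if not 0.0 <= u < 1.0:
            return None
        texel.append(int(u * VOLUME_RES))
    return tuple(texel)


center_texel = volume_texel((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 128.0)
```

Since any point in space can be looked up this way, static and moving objects get the same treatment, which matches the uniformity of the result.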

The lighting is very soft/low-frequency and indirect shadows are not really visible in this pass. This might even be dynamic GI!

Certainly is volumetric, which has the advantage of being "uniform" across all objects, moving or not, and this coherence shows in the final game.

Final lighting composite, diffuse plus specular, and specular-only.

And finally, everything gets composited together: specular probes, SSR, SSAO, diffuse GI, analytic lighting. This pass emits again two buffers, one which seems to be final lighting, and a second with what appears to be only the specular parts.

And here is where we can see what I said at the beginning. Most lighting is not from analytic lights! We don't see the usual tricks of the trade, with a lot of "fill" lights added by artists (albeit the light design is definitely very careful), instead indirect lighting is what makes most of the scene. This indirect lighting is not as "precise" as engines that rely more heavily on GI bakes and complicated encodings, but it is very uniform and regains high-frequency effects via the two very high-quality screen-space passes, the AO and reflection ones.


The screen-space passes are quite noisy, which in turn makes temporal reprojection really fundamental, and this is another extremely interesting aspect of this engine. Traditional wisdom says that reprojection does not work in games that have lots of transparent surfaces. The sci-fi worlds of Cyberpunk definitely qualify for this, but the engineers here did not get the news and made things work anyway!

And yes, sometimes it's possible to see reprojection artifacts, and the entire shading can have a bit of "swimming" in motion, but in general, it's solid and coherent, qualities that even many engines using lightmaps cannot claim to have. Light leaks are not common, silhouettes are usually well shaded, properly occluded.

All the rest

There are lots of other effects in the engine we won't cover - for brevity and to keep my sanity. Hair is very interesting, appearing to render multiple depth slices and inject itself partially in the g-buffer with some pre-lighting and weird normal (fake anisotropic?) effect. Translucency/skin shading is surely another important effect I won't dissect.

Looks like charts caching lighting...

Before the frame is over though, we have to mention transparencies - as more magic is going on here for sure. First, there is a pass that seems to compute a light chart, I think for all transparencies, not just particles.

Glass can blur whatever is behind it, and this is done with a specialized pass, first rendering transparent geometry in a buffer that accumulates the blur amount, then a series of compute shaders end up creating three mips of the screen, and finally everything is composited back in the scene.


After the "glass blur", transparencies are rendered again, together with particles, using the lighting information computed in the chart. At least at my rendering settings, everything here is done at full resolution.

Scene after glass blur (in the inset) and with the actual glass rendered on top (big image)

Finally, the all-mighty temporal reprojection. I would really like to see the game without this, the difference before and after the temporal reprojection is quite amazing. There is some sort of dilated mask magic going on, but to be honest, I can't see anything too bizarre going on, it's astonishing how well it works. 
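For readers who have not seen one, the core of a generic temporal resolve fits in a few lines: reproject the history, clamp it to the range of the current frame's local neighborhood to reject stale data, then blend in a little of the new frame. This is the textbook recipe, not what I found in the capture; the scalar colors and the `blend` value are simplifications of mine.

```python
def temporal_resolve(current, history, neighborhood, blend=0.1):
    """One pixel of a generic temporal accumulation step (scalar color for brevity).

    current: this frame's value, history: reprojected accumulated value,
    neighborhood: current-frame values around the pixel (e.g. a 3x3 window),
    blend: weight given to the new frame.
    """
    lo, hi = min(neighborhood), max(neighborhood)
    clamped_history = max(lo, min(hi, history))  # reject history outside the local range
    return clamped_history + (current - clamped_history) * blend


stable = temporal_resolve(0.5, 0.5, [0.4, 0.5, 0.6])
rejected = temporal_resolve(0.5, 2.0, [0.4, 0.5, 0.6])
```

A mask like the dilated one mentioned above would typically modulate either the clamp or the blend weight per pixel, but how the game uses it is beyond what I could tell.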

Perhaps there are some very complicated secret recipes lurking somewhere in the shaders or beyond my ability to understand the capture.

On the left, current and previous frame, on the right, final image after temporal reprojection.

This is from a different frame, a mask that is used for the TAA pass later on...

I wrote "finally" because I won't look further, i.e. the details of the post-effect stack, things here are not too surprising. Bloom is a big part of it, of course, almost adding another layer of indirect lighting, and it's top-notch as expected, stable, and wide. 

Depth of field, of course, tone-mapping and auto-exposure... There are of course all the image-degradation fixings you'd expect and probably want to disable: film grain, lens flares, motion blur, chromatic aberration... Even the UI compositing is non-trivial, all done in compute, but who has the time... Now that I got all this off my chest, I can finally try to go and enjoy the game! Bye!

25 November, 2020

Baking a Realistic Renderer from Scratch and other resources for Beginners in Computer Graphics

Dump of a few things I got that can be useful for beginners in 3D Computer Graphics programming.

  • Download a snapshot of my "3D Computer Graphics for Beginners" curated collection of projects and resources. I know all the cool kids do this in GitHub and would call it "awesome something" - but I'm lazy and a contrarian so what you get is an ugly PDF made from a google docs page :)

If you're in the States, maybe you can find here something to tinker with during this self-isolated Thanksgiving. Enjoy your holidays!

11 May, 2017

Where do GPUs come from.

A slide deck for an introduction to CG class.



PPTX - PDF (smaller)

Note: this is not really a tutorial in this form, there are no presenter's notes. But if you want to use this scheme to teach something similar, feel free. The CPU->GPU trajectory is heavily inspired by the brilliant work Kayvon Fatahalian did.

06 August, 2016

The real-time rendering continuum: a taxonomy

What is forward? What is deferred? Deferred shading? Lighting? Inferred? Texture-space? Forward "+"? When to use what? The taxonomy of real-time rendering pipelines is becoming quite complex, and understanding what can be an "optimal" choice is increasingly hard.

- Forward

So, let's start simple. What do we need to do, in a contemporary real-time rendering system, to draw a mesh? Let's say, something along these lines:


This diagram illustrates schematically what could be going on in a "forward" rendering shader. "Forward" here really just means that most of the computation that goes from geometry to final pixel color happens in a single vertex/pixel shader pair. 
We might update in separate steps some resources the shader uses, like shadow maps, reflection maps and so on, but the main steps, from attribute interpolation to texturing, to shading with analytic lights, happen in a single shader.

From there on, the various flavors of forward rendering only deal with different ways of culling and specializing computation, but the shading pipeline remains the same!

- Culling

Classical multi-pass forward binds lights to meshes one at a time, drawing a mesh multiple times to accumulate pixel radiance on the screen. Lights are bound to a pass as shader constants, and as you typically have only a few light types, you can generate ad-hoc shaders that efficiently deal with each. Specialization is easy, but you pay a price for the multiple passes, especially if you have a lot of overlapping lights and decals.

Single-pass forward is an improvement that foregoes the waste of multi-pass shading (bandwidth, repeated computations between passes and multiple draws) by either using a dynamic branching "uber-shader" capable of handling all the possible lights assigned to an object, or by generating static shader permutations to handle exactly what a given object needs.

The latter can easily lead to an explosion in the number of shaders needed, as now we don't need just one per light type, but per permutation of types and number of lights.
The advantage is that it can be much more efficient, especially if one is willing to split a mesh to exactly divide the triangles which need a specific technique (e.g. triangles with one light from ones that need two or more, triangles that need to blend texture layers, to perform other special effects).
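To get a feeling for how quickly static permutations explode, here is a back-of-the-envelope count. It assumes a shader is identified by the multiset of light types it handles (order does not matter) times a set of independent on/off material features. The numbers in the example are invented, not taken from any engine.

```python
from math import comb


def count_permutations(light_types, max_lights, feature_flags):
    """Count static shader permutations for single-pass forward.

    Light configurations: multisets of up to max_lights lights drawn from
    light_types kinds. Material features: feature_flags independent toggles.
    """
    light_sets = sum(comb(k + light_types - 1, k) for k in range(max_lights + 1))
    return light_sets * (2 ** feature_flags)


# 3 light types, up to 4 lights per object, 5 material toggles.
total = count_permutations(3, 4, 5)
```

Here 35 light configurations times 32 feature combinations already gives 1120 shaders, and every extra light type or toggle multiplies that further.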

This is Advanced Warfare: ~20k shaders per level and aggressive mesh splitting generating tons of draw calls

Forward+ is nothing more than a change in the way some of the data is passed to a dynamic branching style single-pass forward renderer: instead of binding lights per mesh (draw) as shader constants, they are stored in some kind of spatial subdivision structure that the shader can easily access. Typically, screen tiles or frustum voxels ("clustered"), but other structures can be employed as well.
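A toy version of the screen-tile variant shows the data flow: a culling step fills a per-tile list of light indices, and the "shader" for a pixel loops over the list of its own tile instead of over constants bound per draw. Lights are simplified to screen-space circles with a made-up falloff, and all names are mine.

```python
def build_tile_light_lists(lights, width, height, tile=16):
    """Assign lights to screen tiles.

    lights: list of (center_x, center_y, radius) in pixels (screen-space circles).
    Returns a dict mapping (tile_x, tile_y) -> list of light indices.
    """
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    lists = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, (cx, cy, radius) in enumerate(lights):
        # Conservative bounds: every tile touched by the light's bounding square.
        x0 = max(0, int((cx - radius) // tile))
        x1 = min(tiles_x - 1, int((cx + radius) // tile))
        y0 = max(0, int((cy - radius) // tile))
        y1 = min(tiles_y - 1, int((cy + radius) // tile))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                lists[(tx, ty)].append(index)
    return lists


def shade_pixel(px, py, lists, lights, tile=16):
    """Stand-in for the pixel shader: loop over the lights of this pixel's tile."""
    total = 0.0
    for index in lists[(px // tile, py // tile)]:
        cx, cy, radius = lights[index]
        dist_sq = (px - cx) ** 2 + (py - cy) ** 2
        total += max(0.0, 1.0 - dist_sq / (radius * radius))  # toy falloff
    return total


lights = [(8, 8, 4), (40, 8, 4)]
lists = build_tile_light_lists(lights, width=64, height=16)
```

Note that the loop bound in `shade_pixel` now depends on the pixel's position, which is the non-constant, potentially divergent branching this kind of renderer has to live with.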

At first, it might sound like a terrible idea. It has all the drawbacks of a dynamic branching uber shader (lots of complexity, no ability to specialize shaders over lights, register usage bound by the most expensive path in the shader) but with the added penalty of divergent branches (as the lights are not constant in the shader). So, why would you do it?

Light culling in a conventional forward pipeline can be quite effective for static lights, or lights that follow a prescribed path, as we can carve geometry influenced by each and specialize. But what if we have lots of dynamic lights? Or lots of small lights? 
At a given point, carving geometry becomes either inefficient (too many small draws) or impossible. In these situations, Forward+ starts to become attractive, especially if one is able to avoid branch divergence by processing lights one at a time.

In the end, though, it's just culling and specialization. How to assign lights to rendering entities. How to avoid having dynamic branching, generic shaders that create inefficiencies.

Once one thinks in these terms, it's easy to see that other configurations could be possible, for example, one might think of assigning lights to mesh chunks and dynamically grouping them into draws, following the ideas of Ubisoft's and Graham Wihlidal's mesh processing pipelines. Or one could assign lights to a per-object grid, or a world space BSP, and so on.

- Splitting the pipeline

Let's look again at the diagram I drew:


Quite literally we can take this "forward shading" pipeline and cut it at an arbitrary point, creating two shader passes from it. This is a "deferred" rendering system: some of the computation is deferred to a second pass, and albeit the most employed system (deferred shading) splits material data from lighting/BRDF evaluation, today we have a deferred technique for almost any reasonable choice of splitting point.

Of course, after we do the split, we'll need the two resulting passes to communicate. The pass that is attached to the geometry (object) needs to communicate some data to the pass that is attached to the pixel output. This data is stored by the first pass in a geometry buffer (g-buffer!) and read in the second. 
Typically, we store g-buffers in screen-space, but other choices are possible.
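The split can be shown with a deliberately tiny example: the same two steps (evaluate material, then shade) are run either fused in one function or as two passes that communicate through a screen-space g-buffer. The scene representation (one visible surface per pixel, scalar albedo, a single directional light) is a simplification I made up for the sketch.

```python
def evaluate_material(surface):
    """Geometry-side work: fetch the material attributes of a visible surface."""
    return surface["albedo"], surface["normal"]


def shade(albedo, normal, light_dir):
    """Pixel-side work: simple Lambert term with one directional light."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return albedo * n_dot_l


def forward_render(visible_surfaces, light_dir):
    """Forward: material evaluation and shading happen in one pass."""
    return {px: shade(*evaluate_material(s), light_dir) for px, s in visible_surfaces.items()}


def geometry_pass(visible_surfaces):
    """Deferred, pass 1: write material attributes to a screen-space g-buffer."""
    return {px: evaluate_material(s) for px, s in visible_surfaces.items()}


def lighting_pass(gbuffer, light_dir):
    """Deferred, pass 2: read the g-buffer back and shade."""
    return {px: shade(albedo, normal, light_dir) for px, (albedo, normal) in gbuffer.items()}


scene = {
    (0, 0): {"albedo": 0.8, "normal": (0.0, 0.0, 1.0)},
    (1, 0): {"albedo": 0.5, "normal": (0.0, 1.0, 0.0)},
}
light_dir = (0.0, 0.0, 1.0)
forward_image = forward_render(scene, light_dir)
deferred_image = lighting_pass(geometry_pass(scene), light_dir)
```

Both paths produce the same image; what changes is that the deferred one has an intermediate buffer that other passes can read, modify, or use to re-organize the second half of the work.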

So, why would we want to do such a split? At first, it seems very odd. Instead of having a single pass that does all computation in registers, locally and fast, we force some of the data to be written all the way out to GPU memory, uncompressed, and then read again from memory in the second pass. Why?

Well, the reasons are exactly the same as -every time- we have to decide whether or not to split any GPU computation, be it a post-effect, a linear algebra routine or, in our case, mesh rendering. The potential advantages are always the same:
  1. Specialization. We might be able to avoid a dynamic branching uber-shader by stopping the computation at a point and launching a number of specialized routines for the second part.
  2. Inter-thread data access. We might need to reuse the data we're writing out. Or access it in patterns that are not possible with the very limited inter-thread communication the GPU allows (and pixel shaders don't/can't give control over what gets packed in a wave, nor have the concept of thread groups! *)
  3. Modifying data. We might want to inject other computation that changes some of the data before launching the second pass.
  4. Re-packing computation. We might want to launch the second pass using a different topology for our waves.
* Note: it would be interesting to think how a "deferred" system could take advantage of hardware tile-based rendering architectures if one could program passes to operate on each tile... Ironically today on tile-based deferred GPUs, deferred shading is usually not employed, because the deal with tile architectures is to avoid reads/writes to a "slow" main memory, so deferred, going out to memory, would negate that. Also one of the issues that deferred can ameliorate in a traditional GPU is overshading, but on TBDR that doesn't matter because by design you don't overshade there even in forward...

- Decision tree

Adding a split point to our pipeline choices makes things incredibly complex, I'd say out of the reach of rendering engineers trying to make optimal choices by hand. 
We're not dealing anymore just with dynamic versus static lights, or culling granularity, but with how to balance a GPU between ALU, memory, shader resources and different organizations of computation. 

It's very hard to evaluate all these choices in parallel also because typically prototypes won't be really as optimized as possible for any given one, and optimization can change the performance landscape radically. 
Also, these choices are not local: they can change how you pack and access data in the entire rendering system. What effects you can easily support, how much material variation you can easily support, how to bake precomputed data, what space you have to inject async computation and so on.

Since we started working on "next-gen" consoles, with a heavy emphasis on compute, I've been interested in automatic tuning, something that is quite common in scientific computing, but not at all yet for real-time rendering.

But even autotuning can only realistically be applied when the problem specification is quite rigid and it's unlikely to be successful when we can change the way we structure all the data and effects in a rendering system, to fit a given choice of pipeline (which doesn't mean we can't do better in terms of our abilities to explore pipeline choices...).

- Deferred versus Forward?

So how can we decide what to use when? Well, some rules of thumb are possible to devise, looking at the data, the computation we wish to perform, and making sure we don't do anything too unreasonable for a given GPU architecture.

The first bound to consider is just the data bandwidth. How much can I read and write, without being bound by reading and writing? Or to be more precise, how much computation do I have to have in order for the memory operations to not be a big bottleneck? For the latency to be well-hidden? 

As an example, right now, on PS4, it's entirely reasonable to do a deferred shading system writing the typical attributes for GGX shading, at 1080p **, with a typical texture layer compositing system and having the g-buffer pass be mostly ALU-bound. 
The same might not be true for a different system at a different resolution, but right now it works, and some titles shipped with some fairly crazy "fat" g-buffers without problems.
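A back-of-the-envelope estimate helps to see why this can work. The sketch below only counts raw g-buffer traffic; the 20 bytes per pixel (say, four 32-bit render targets plus 32-bit depth), the 60 fps target, and the overdraw and read-count factors are all assumptions for the example, not measurements from any title.

```python
def gbuffer_traffic_gb_per_s(width, height, bytes_per_pixel, fps,
                             overdraw=1.0, reads=1.0):
    """Rough g-buffer bandwidth estimate in GB/s (1 GB = 1e9 bytes).

    overdraw: average number of times each pixel is written.
    reads:    average number of times each pixel's g-buffer is read back.
    Ignores compression, caches and everything that is not the g-buffer.
    """
    pixels = width * height
    written = pixels * bytes_per_pixel * overdraw
    read = pixels * bytes_per_pixel * reads
    return (written + read) * fps / 1e9


# Assumed layout: 20 bytes per pixel at 1080p and 60 fps, no overdraw.
traffic_1080p = gbuffer_traffic_gb_per_s(1920, 1080, 20, 60)
# The same assumed layout at 3840x2160 costs four times as much.
traffic_2160p = gbuffer_traffic_gb_per_s(3840, 2160, 20, 60)
```

With these assumed numbers the 1080p case comes out at roughly 5 GB/s, which is the kind of figure to compare against the memory bandwidth of the target hardware and against how much ALU work each pixel does.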

Black Ops 3 is a tiled deferred renderer

** Note: without MSAA. In my view, MSAA for geometry antialiasing is not fundamental anymore; it's still a great technique for supersampling/subsampling, but we need temporal antialiasing (Filmic SMAA is great, and ideally you could do both) not only because it can be faster for comparable quality, but because we want to temporally filter all kinds of shading effects! 
I'm also not addressing in this the problem of transparencies for a deferred renderer because it's easy to deal with them in F+, sharing the same light lists and most of the shader (just by "connecting" the ends that were cut in the deferred ones)

The second thing to consider is data access. Do you -need to- access lots of data that is parametrized on the surface (especially, vertices)? E.g. The Order's "fat" lightmaps? Then probably decompressing it and pushing it through screen-space buffers is not the best idea. 
Black Ops 3 for example bakes lighting in volume textures and static occlusion in a compressed shadow-map, while Advanced Warfare uses classic uv-mapped lightmaps and occlusion maps.

On the other hand, do you need to access surface data in screen-space effects? Ambient and specular occlusion, reflections (note for example that The Order doesn't do any of these screen-space effects)... Or modify surface data in screen-space, e.g. via mesh-based decals ***? Then you have to write a g-buffer anyways, the only question is when!

*** Note: Nowadays projected or "volumetric" decals are quite popular, and these can be culled in tiles/clusters just like lights, so they work in -any- rendering pipeline. They have their drawbacks though as they can't just precisely follow a surface. Maybe an idea could be to use small volume textures to map projected decals UVs and to mask their area of influence?

The Order 1886 uses F+ and very advanced lightmapping,
foregoing any screen-space shading technique

- Deferred splits and computation

Often, either memory bandwidth makes the choice "easy" for a given platform, or the preference for certain rendering features does (complex lightmaps, mesh decals, screen space effects...). But if they don't then we're left with performance: how to best structure computation.

One big advantage of deferred shading is just in the ability to dispatch specialized shaders per screen region.

The choice of what to specialize and how many passes to do for a tile is entirely non-trivial, but at least it is possible and does not result in an incredible number of permutations, like in single-pass forward. This is both because we resolved all the material layering in the g-buffer pass, thus we don't need to specialize both over lights and material features, and because doing multiple passes over a tile is cheaper than doing them over a mesh.
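One common way to drive per-region specialization is to classify tiles by the features their pixels need. The sketch below is a CPU-side illustration only: it assumes each pixel carries a small bit mask of required features (the bit meanings, the function name `classify_tiles` and the tile size are hypothetical), ORs the masks per tile, and buckets tiles by the result so that each bucket could be dispatched with its own specialized shader.

```python
from collections import defaultdict


def classify_tiles(feature_bits, width, height, tile=8):
    """Group screen tiles by the union of their pixels' feature bits.

    feature_bits: row-major list of per-pixel bit masks.
    Returns {mask: [(tile_x, tile_y), ...]}, one bucket per shader variant.
    """
    buckets = defaultdict(list)
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            mask = 0
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    mask |= feature_bits[y * width + x]
            buckets[mask].append((tx // tile, ty // tile))
    return dict(buckets)


# Hypothetical bits: 1 = shadowed sun, 2 = skin, 4 = sky.
bits = [1, 1, 0, 0,
        1, 3, 0, 4]
variants = classify_tiles(bits, width=4, height=2, tile=2)
```

Note how a single pixel with the `2` bit forces its whole tile into the more expensive variant; that is the "worse culling" cost of specializing in screen space rather than per draw.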

Note that in F+ we can trivially specialize over material features of a given draw, but not at all over lights, and it's even best to make the various lighting paths very uniform (e.g. use the same filtering for shadows) to avoid dynamic branching issues. 
In deferred shading, on the other hand, we can specialize over lights, over texture layer combiners (in the g-buffer pass) and over materials (albeit with worse culling than forward & we have to store bits in the g-buffer). 

It is true that typically we're more constrained on the material model as the input data is mostly fixed via the g-buffer encoding, but one can use bit flags to specify what is stored into the MRTs, and with PBR rendering we've seen a sharp decrease in the number of material models needed anyways.

The other advantage is wave efficiency. In a deferred system, only the g-buffer pass uses the rasterizer, and thus only that pass is subject to rasterizer inefficiencies: partial quads on triangle edges, overdraw, partial waves due to small draws.
This is though very hard to quantify in practice, as there are lots of ways to balance computation on a GPU. 

For example, a forward system with very heavy shaders might suffer a lot from overdraw, and require spending time in a full depth pre-pass to avoid having any, but the pre-pass might overlap with some async compute, making it virtually free.

- Cutting the pipeline "early"

Recently there have been lots of deferred systems that cut the pipeline "high", near the geometry, before texturing, by writing only the data that the vertices carry, or even just enough to be able to fetch the vertex data manually (e.g. triangle index and barycentric coordinates, the latter can even be reconstructed from vertices and world position). These approaches create so-called "visibility buffers" instead of g-buffers.
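A minimal sketch of what such a buffer might store and how the shading pass gets back to vertex data is given below. The 12/20 bit split between draw and triangle ids is an arbitrary assumption, and the barycentrics are computed in 2D screen space without perspective correction, which a real implementation would need to add.

```python
TRI_BITS = 20  # assumed split of a 32-bit texel: 12 bits draw id, 20 bits triangle id


def pack_visibility(draw_id, tri_id):
    """Pack draw and triangle ids into one 32-bit visibility-buffer value."""
    if not (0 <= tri_id < (1 << TRI_BITS) and 0 <= draw_id < (1 << (32 - TRI_BITS))):
        raise ValueError("id out of range for the assumed bit layout")
    return (draw_id << TRI_BITS) | tri_id


def unpack_visibility(value):
    """Inverse of pack_visibility: returns (draw_id, tri_id)."""
    return value >> TRI_BITS, value & ((1 << TRI_BITS) - 1)


def barycentrics(p, a, b, c):
    """Screen-space barycentric weights of point p in triangle (a, b, c)."""
    def edge(u, v, w):
        # Twice the signed area of triangle (u, v, w).
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
    area = edge(a, b, c)
    w0 = edge(b, c, p) / area  # weight of vertex a
    w1 = edge(c, a, p) / area  # weight of vertex b
    return w0, w1, 1.0 - w0 - w1


texel = pack_visibility(draw_id=5, tri_id=12345)
weights = barycentrics((1.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
```

The shading pass would unpack `texel`, fetch the three vertices of that triangle from the draw's buffers, and use `weights` to interpolate whatever attributes the material needs.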


Eidos R&D tested a g-buffer that is used only to improve wave occupancy
and avoid overdraw, not to implement deferred rendering features


These techniques are not aimed at implementing rendering techniques that are different from what maps well to forward, as they still do most of the computation in a single pass. 
What they try to do instead is to minimize the work done in pixel shading, to restructure computation so most of the work is done without the constraints imposed by the rasterizer.

The aim for most of these techniques is:
  1. To write thin g-buffers while still supporting arbitrary material data
  2. To avoid partial quad, partial wave and overdraw penalties
  3. Some also focus on analyzing the geometric data to perform shading at sub-sampled rates
In theory, nothing prevents these techniques from working with more than one split: after the geometry pass a material g-buffer could be created, replacing the tile data with the data after texturing.

Compared to forward methods, the main difference is that we reorder computation in a "screen-space" centric way, all the shading is done in CS tiles instead of PS waves of quads.

It avoids partial waves, but at the cost of worse "culling": you have to shade considering all the features needed in a tile, regardless of how many pixels a feature uses, you can't specialize shaders over materials (unless you store some extra bits in the visibility buffer and summarize them per tile).
You also "get rid" of a lot of fixed-function hardware: you can't rely on optimized paths to load and interpolate vertex data, to compute derivatives/differentials (which become a real, hard problem! most of these systems just rely on analytic differentials, which don't work for dependent texture reads), or on the post-transform cache (albeit it would be possible to write from the VS back into the vertex buffer, if really needed).

Vertex and object data access becomes less coherent (as now we access based on screen-space patterns instead of over surfaces), supporting multiple vertex formats also becomes a bit harder (might not matter) and tessellation might or might not be possible (depending on what data you store).

Compared to deferred shading, we have trade-offs similar to those of standard forward or forward+ versus deferred: we don't have screen-space material data for effects that need it, and we do all the shading in a single pass, thus statically specializing a shader needs to take care of more permutations, but we save on g-buffer space.

Note though that how "thin" the g-buffer is per-se is misleading in terms of bandwidth, because the shading pass uses the g-buffer only as an indirection, the real data is per vertex and per draw, these fetches still need to happen, and might be less coherent than other methods.
And we still have a bit of bandwidth "waste" in the method (similar to how g-buffers waste bandwidth reading/writing data that the PS already had) as the index buffers and vertex position data are read twice (through indirection!), and depending on the triangle to pixel ratio, that cost might not be insignificant.

- Beyond screen-space...

And last, to complete our taxonomy, there have recently been some renderers that decided to split computation by storing information in uv-space textures, instead of screen-space. 

These ideas are similar to the early idea of "surface caching" employed by Quake and might follow quite "naturally" if one has already a unique parametrization everywhere in the world.

These systems are very attractive for subsampling computation, both spatially and temporally, as the texture data is not linked to a specific frame and rasterized samples.

If the texture layering is cached, then the scheme is similar to a g-buffer deferred system, just storing the g-buffer in texture space instead of screen space, and it can be coupled with F+ or other deferred schemes that "split early" to reduce the complexity of the shading pass (as the texture layering has already been done in specialized shaders).

If the final shaded results are stored, the decoupled shading rate can also be used as a means of improving shading stability: even without supersampling, aliasing doesn't produce shimmering as the samples never move, and texture sampling naturally "blurs" the results a bit.


Decoupling visibility from shading rate. A good idea.

Caching computation is always very attractive, so these techniques are certainly promising, and the tradeoffs are easy to understand (even if they might not be easy to quantify!). How much of the cache is invalidated at any given time? At which granularity does it need to be computed (and how much waste there is due to it)? How much memory does the cache need?
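Those questions map directly onto the knobs of even the simplest cache. The toy class below is only meant to make them tangible: the texel-tile key, the `max_age` refresh policy and the `shaded` counter are all assumptions for this sketch, not a description of how any shipped renderer does it.

```python
class TexelShadingCache:
    """Toy texture-space shading cache keyed by texel tile."""

    def __init__(self, shade_fn):
        self.shade_fn = shade_fn  # expensive shading function, called on a miss
        self.entries = {}         # tile -> (frame last shaded, shaded value)
        self.shaded = 0           # how many tiles we actually had to shade

    def lookup(self, tile, frame, max_age=4):
        # Reshade when the tile is missing or older than max_age frames
        # (temporal subsampling: otherwise the old result is reused).
        entry = self.entries.get(tile)
        if entry is None or frame - entry[0] > max_age:
            entry = (frame, self.shade_fn(tile))
            self.entries[tile] = entry
            self.shaded += 1
        return entry[1]

    def invalidate(self, tiles):
        # E.g. a light moved and touched these tiles.
        for tile in tiles:
            self.entries.pop(tile, None)


cache = TexelShadingCache(lambda tile: tile[0] + tile[1])  # stand-in "shading"
first = cache.lookup((1, 2), frame=0)    # miss: shaded
second = cache.lookup((1, 2), frame=1)   # hit: reused
third = cache.lookup((1, 2), frame=6)    # too old: shaded again
cache.invalidate([(1, 2)])
fourth = cache.lookup((1, 2), frame=6)   # invalidated: shaded again
```

The tile size sets the granularity (and the waste when only a few texels of a tile are visible), `max_age` and `invalidate` determine how much of the cache is recomputed per frame, and the size of `entries` is the memory cost.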

Fight Night Champion computed diffuse lighting in texture space,
all the fine skin details come only from the specular layer.

- Conclusions???

As I said, it's hard to make predictions and it's hard to say that one method is absolutely better than another, even in quite specific scenarios.

But if I had to go out on a limb, I'd say that right now, for this generation of consoles the following applies:
  • "Vanilla" deferred shading works fine and supports lots of nice rendering features. 
    • In theory, it's not the most efficient rendering technique, simply from the standpoint that it spends lots of energy pushing data in and out of memory...
    • But for now it works, and it will likely scale well to 2k and probably even 4k or near 4k resolutions, using reasonably thin buffers.
  • Deferred shading executes well enough in the following important aspects, that probably need to be addressed by any shading technique:
    • The ability to specialize shaders, even if we have architectures with good dynamic branching capabilities (and moving data from vgpr to sgpr on GCN), is quite important and saves a lot of headaches (of trying to fit every feature needed in a single, fast ubershader).
    • Separating and possibly caching or precomputing (most of) texture compositing is important. Very high frequency tiled detail layers will still need to be composited in screen-space.
    • Ameliorating issues with overdraw and small triangles/draws.
  • On top of these, deferred supports well a number of screen-space rendering features that are popular nowadays.
  • Forward+ can be made fast and it works best when lots of surface data (vertex & texture...) is needed.
    • Different material models are probably not a huge concern (LODs might be more attractive, actually), and deferred shading can be specialized over materials as well, with some effort (and worse culling).
    • Forward undeniably will scale better with resolution, but might have a slower "baseline" (e.g. 1080p)
    • Mapping data to surfaces (e.g. lightmaps, occlusion cones...) allows for cheap and high-quality bakes, but it doesn't work on moving objects, particles and so on, so it's usually a compromise: it has better quality for static meshes, but it lacks the uniformity of volumetric bakes.
  • Single pass forward, when done properly, can still be very, very fast!
    • Especially in games that don't have too many small triangles and don't have many small or moving lights.
    • That's still a fairly large proportion of games! Lots of games are in daylight, or anyhow in settings where there aren't many overlapping lights! It is not simple to optimize though.
  • Volumetric data structures are here and going to stay, we'll probably see them evolve into something more adaptive than the simple voxel grids that we use today.
  • Caching is certainly interesting, especially when it comes to flattening texture layers (which is quite common, especially for terrain). 
    • Caching shading is a "natural" extension, the tradeoffs there are still unproven, but once one has the option of working in texture-space it's hard to imagine that there isn't anything in the shading computations that could be meaningfully cached there...
  • Visibility buffers
    • If g-buffer passes are not bandwidth or ROP/export bound (writing the data), the benefit of "earlier" splits is questionable. But these techniques are -very- interesting, and might even be used in hybrid g-buffer/attribute buffer renderers.
    • The general idea of using deferred methods to cluster pixels via similarity and subsample shading is very interesting... 
    • The same applies to trying to pack waves without resorting to predetermined screen-space tiles (e.g. via stream compaction, which the "old" stencil volume deferred methods did automatically via the early-stencil hardware). None of these have been proven in production so far.
  • It would be great to see more research on hybrid renderers in general
    • Shaders can be written in a "unified" fashion, the splits can be largely automatic
    • Deferred shading and F+ share the same lighting representation!
    • A rendering engine could draw using different techniques based on heuristics
  • On the other hand, there has recently been lots of work on "GPU driven pipelines", where most of the draw dispatch work (and draw culling) is done on the GPU.
    • These pipelines favor very uniform draws (no per-draw shader specialization)
    • This might be though entirely a limitation of current APIs...