Search this blog

28 August, 2016

A retrospective on Call of Duty rendering

In my last post I did a quick recap of the research Activision published at Siggraph 2016. As I already broke my long-standing tradition of never mentioning the company I work for on my blog, I guess it wouldn't hurt, for completeness sake, to do a short "retrospective" of sorts, recapping some of the research published about the past few Call of Duty titles. 

I'll to remark, my opinion on the blog is as always personal, and my understanding of COD is very partial as I don't sit in production for any of the games.

I think COD doesn't often do a lot of "marketing" of its technology compared to other games, and I guess it makes sense for a title that sells so many copies to focus its marketing elsewhere and not pander to us hardcore technical nerds, I love the work Activision does on the trailers in general (if you haven't seen the live-action ones, you're missing out), but still it's a shame that very few people consider what tricks come into play when you have a 60hz first person shooter on ps3.

Certainly, -I- didn't! But my relationship with COD is also kinda odd, I was fairly unaware/uninterested in it till someone lent me the DVD I think of MW2 for 360 a long time after release, and from there I binge-played MW1 and Black Ops... Strictly single-player, mostly for their -great- cinematic atmosphere (COD MW2 is probably my favorite in that regard of previous-gen, together with Red Dead Redemption), and never really trying to dissect their rendering.
By the way, if you missed on COD single player, I think this critique of the campaign over the various titles is outstanding, worth your time.

Thing is, most games near the end of the 360/ps3 era went on to adopt new deferred rendering systems, often without, in my opinion, having solid reasons to do so. 

COD instead stayed on a "simple" single-pass forward rendering, mostly with a single analytic light per object. 
Not much to talk about at Siggraph there (with some exceptions), but with a great focus on mastering that rendering system: aggressive mesh splitting per lights and materials (texture layer groups), an engine -very- optimized to emit lots of drawcalls, and lightmapped GI (which manages to be quite a bit better than most of no-GI many-light deferred engines of the time).

Modern Warfare 2

IMHO a lesson on picking the right battles and mastering a craft, before trying to add a lot of kitchen-sink features without much reasoning, which is always very difficult to balance in rendering (we all want to push more "stuff" in our engines, even when it's actually detrimental, as all unneeded complexity is).

Call of Duty: Ghosts

The way I see Ghosts is as a transition title. It's the first COD to ship on next-gen consoles, but it still had to be strong on current-gen, which is to be expected for a franchise that needs a large install base to be able to hit the numbers it hits. 

Developers now had consoles with lots more power, but still had to take care of asset production for the "previous" generation while figuring out what to do with all the newfound computational resources.

For Ghosts, Infinity Ward pushed a lot on the geometrical detail rather than doing much more computation per pixel and that makes sense as it's easier to "scale" geometry than it is to fit expensive rendering systems on previous-gen consoles. This came with two main innovations: hardware displacement mapping and hardware subdivision (Catmull-Clark) surfaces.

Both technologies were a considerable R&D endeavor and were presented at Siggraph, GDC and in GPU Pro 7.

Albeit both are quite well known and researched, neither was widely deployed on console titles, and the current design of the hardware tessellator on GPUs is of limited utility, especially for displacement, as it's impossible to create subdivision patterns that match well the frequency detail of the heightmaps (this is currently, afaik, the state-of-the-art, but doesn't map to tiled and layered displacement for world detail).

Wade Brainerd recently presented with Tim Foley et al. some quite substantial improvements for hardware Catmull-Clark surfaces and made a proposal for a better hardware tessellation scheme at this years' Open Problems in Computer Graphics.

A level in COD:Ghosts, showcasing displacement mapping

For the rest, Ghosts is still based on mesh-splitting single-pass forward shading, and non-physically based models (Phong, that the artists were very familiar with).

Personally, I have to say that IW's artists did pull the look together and I Ghosts can look very pretty, but, in general, the very "hand-painted" and color-graded look is not the art style I prefer.

Call of Duty: Advanced Warfare

Advanced Warfare is the first COD to have its production completely unrestricted by "previous-gen" consoles, and compared to Ghosts it takes an approach that is almost a complete opposite, spending much more resources per pixel, with a completely new lighting/shading/rendering system.

At its core is still a forward renderer but now capable of doing many lights per surface, with physically based shaders. It does even more "mesh splits" (thus more drawcalls) and generates a huge amount of shader permutations to specialize rendering exactly for what's needed by a given piece of geometry (shadows, lighting, texturing and so on...).

We did quite a bit of research on the fundamental math of PBR for it and it employs an entirely new lightmap baking pipeline, but where it really makes a huge difference, in my opinion, is not on the rendering engine per se, but on its keen dedication to perceptual realism.

It's a physically based renderer done "right": doing PBR math right is relatively "easy", teaching PBR authoring to artists is much harder (arguably, not something that can be done over a single product, even) but making sure that everything makes (perceptual) sense is where the real deal is, and Sledgehammer's attitude during the project was just perfect.

Math and technology of course matter, and checking the math against ground-truth simulations is very important, but you can do a perfectly realistic game with empirical math and proper understanding of how to validate (or fit) your art against real-world reference, and you can do on the other hand a completely "wrong" rendering out of perfectly accurate math...

"Squint-worthy" is how Sledgehammers's rendering lead Danny Chan calls AW's lighting quality.
Things make sense, they are overall in the right brightness ratios

Advanced Warfare did also many other innovations, Jorge Jimenez authored a wonderful post-effect pipeline (he really doesn't do anything if it's not better than state of the art!) and AW also took a new version of our perpetually improving performance capture (and face rendering) technology, developed in collaboration with ICT

But I don't think that any single piece of technology mattered for AW more than the studio's focus on perceptual realism as a rendering goal. And I love it!

Call of Duty: Black Ops 3 

I won't talk much about Black Ops 3, also because I already did a post on Siggraph 2016 where we presented lots of rendering innovations done during the previous few years. 
The only presentation I'd like to add here, which is a bit unrelated to rendering, is this one: showing how to fight latency by tweaking animations.

Personally I think that if Sledgehammer's COD is great in its laser focus on going very deep exploring a single objective, Black Ops 3 is just crazy. Treyarch is crazy! Never before I've seen so many rendering improvements pushed on such a large, important franchise. Thus I wouldn't know really where to start... 

BO3 notably switched from forward to deferred, but I'd say that most of what's on screen is new, both in terms of being coded from scratch and often times also in terms of being novel research, and even behind the scenes lots of things changed, tools, editors, even the way it bakes GI is completely unique and novel.

If I had to pick I guess I can narrow BO3 rendering philosophy down to unification and productivity. Everything is lit uniformly, from particles to volumetrics to meshes, there are a lot less "rendering paths" and most of the systems are easier to author for.

All this sums up to a very coherent rendering quality across the screen, it's impossible to tell dynamic objects from static ones for example, and there are very few light leaks, even on specular lighting (which is quite hard to occlude).

Stylistically I'd say, to my eyes, it's somewhere in between AW and Ghosts. It's not quite as arbitrarily painted as Ghosts, and it is a PBR renderer done paying attention to its accuracy, but the data is quite more liberally art directed, and the final rendered frames lean more towards a filmic depiction than close adherence to perceptual realism.

Call of Duty: Infinite Warfare not out yet, and I won't talk about its rendering at all, of course!

It's quite impressive though to see how much space each studio has to completely tailor their rendering solutions to each game, year after year. COD is no Dreams, but compared to titles of similar scope, I'd say it's very agile.

So rest assured, it's yet again quite radically different than what was done before, and it packs quite a few cute tricks... I can't imagine that most of them won't appear at a Siggraph or GDC, till then, you can try to guess what it's trying to accomplish, and how, from the trailers!

Three years, three titles, three different rendering systems, crafted for specific visual goals and specific production needs. The Call of Duty engines don't even have a name (not even internally!), but a bunch of very talented, pragmatic people with very few artificial constraints put on what they can change and how.

13 August, 2016

Activision @ Siggraph 2016

Siggraph 2016 recently ended, as always it was inspiring and lots of great discussions were had. Activision presence was quite strong this year after we shipped Black Ops 3.

If you missed any of our presentations, this post might help. I'll try to keep this updated with links relative to our presentations, as soon as they come online.

Note: for convenience, I've written a summary of the techniques in the text below but keep in mind that, as always, this blog is truly made of personal opinions and observations - which come from my limited, R&D centric point of view.

Call of Duty has always been a lightmapped title, but lightmaps come with a quite large set of issues: large baking times, complexity in representing runtime materials and effects, lighting discontinuities with dynamic objects and the inability of "kit-bash" geometric detail (compenetrating meshes).

Static and dynamic objects, particles, volumetrics:
all lit with the same illumination data!

It's incredible that after all this time, there still isn't in real-time rendering much research on ways of solving the problem of baked irradiance. Naive lightmaps are not enough to generate an image that does look coherent, without leaks and discontinuities.

This new development at Treyarch tackles all these issues at once with a new runtime representation of baked irradiance that works seamlessly with deferred shading and is tightly coupled with prefiltered irradiance cubemaps and new, state of the art, heuristics for parallax-corrected reflections and a baking system that cut down iteration times substantially for artists without having to resort to renderfarms.

Baking via prefiltered cubemaps

Now all our art production happens in a fully WYSIWYG editor, and all the lighting is unified, regardless of the object type (dynamic, static, skinned, particles, volumetrics...).
This is achieved by employing hardware-filtered volumetric textures as the only representation of baked irradiance.

WYSIWYG editor with tools to quickly place volumes in the scene
Artists place irradiance volumes quickly in a level, and author clipping convex volumes to avoid light leaks (note that these have to be authored anyways for reflection probes, so it's not really extra work, and our editor supports very fast workflows for placing volumes and planes), and the authored volumes are usually robust to iterations in the level geometry.

As part of this system, Josiah Manson also further developed his fast GGX pre-integration scheme, now presented at EGSR 2016.

Kevin Myers presented in the "dark hides the light" session our new compression technology for baked shadowmaps. 
Call of Duty had a system for caching shadowmaps since Ghosts, as the new generation of consoles saw a bigger increase in memory than in computation, caching techniques have seen a resurgence. 

This work is a quite radical improvement of our shadowmap caching technology, and allows to fully pre-bake a shadowmaps for entire levels, achieving thousand-fold compression ratios.

In comparison with precomputed voxelized shadows this technique allows for easily recovering the shadowmap depth, which makes it easier to integrate with other effects (i.e. volumetric lighting). It's also fast to traverse. The closest relative to SSTs are the Compressed Multiresolution Hierarchies of Scandolo et al., but our solution was developed independently in parallel, so the data encoding employed is not the same.

To my knowledge the first use of compressed shadowmaps in game production. It also enabled savings in the g-buffer, as we can use the SST to quickly shadow far objects, instead of having to bake a lightmap space occlusion map.

This is the continuation of the subdivision surface research started by Wade on Call of Duty Ghosts, (which to my knowledge is the first videogame to make extensive use of Catmull-Clark subdivision surfaces in real-time).

It's a quite neat technique, which sidesteps a limitation in current hardware tessellation pipelines by passing a variable number of control points to the hull shader via L2 read/writes.

During the development of this technique, Wade also made a nifty thread tracer for NVidia GPUs, which helped debugging issues with work being (erroneously) serialized on the GPU.

This is the only presentation at Siggraph 2016 that is not directly linked with a shipped Call of Duty title.

Natasha and Wade gave a compelling presentation at the open problems in computer graphics course, surveying artists in many companies, even going outside the videogame companies circle we work in.

This is one of my favorite courses at Siggraph, and the only presentation from Activision that I haven't reviewed beforehand, so it was truly interesting for me to watch it live.

Jorge Jimenez presented a new version of his SMAA antialiasing technique, greatly improving both performance and quality (sharpness and stability) with a plethora of tweaks and innovations. The slides also summarize relevant previous published techniques and their image quality tradeoffs.

One of MANY improvements presented

In my opinion, with Jorge's Filmic SMAA, the quest for antialiasing techniques is largely over, and we can say that temporal reprojection has "won".

Not only temporal reprojection often achieves better quality than MSAA when using comparable performance budgets, but it's being easier to integrate into deferred renderers and nowadays it's becoming unavoidable anyways because so many other effects can benefit from the ability to perform temporal supersampling (e.g. shadows, reflections, ambient occlusion...).

Hybrid techniques are still very interesting, and there are probably still improvements possible (and probably Jorge will come yet again on stage in the future to show something amazing in that space), but in general the interest on MSAA reconstruction techniques for edge antialiasing has diminished from being a must-have and a pain point for deferred rendering, to a much less important solution.

MSAA is still a great feature to have in hardware though as it allows to subsample or supersample certain effects and render passes (e.g. particles), but it should be more thought as a way to do mixed resolution than strictly for edge antialiasing.

Jorge, Xian, Adrian and myself worked on extending rendering ambient occlusion state of the art by crafting techniques that are based on modeling the (ray traced) ground truth solution for diffuse and specular occlusion.

Fast, accurate ambient occlusion...

We derived closed-form analytic solutions when possible, and when such models could not be found for ampler generalizations of the problems at hand, we extended the analytic solutions by fitting "residual" functions to the ground truth data, or by employing look-up tables.

...and specular occlusion.

GTAO has already been used in production, as a drop-in replacement of the previous state-of-the-art technique we were employing (HemiAO), yielding better image quality (actually, even better than HBAO, which is a very popular high-quality solution) in the same performance budget (0.5ms).

06 August, 2016

The real-time rendering continuum: a taxonomy

What is forward? What is deferred? Deferred shading? Lighting? Inferred? Texture-space? Forward "+"? When to use what? The taxonomy of real-time rendering pipelines is becoming quite complex, and understanding what can be an "optimal" choice is increasingly hard.

- Forward

So, let's start simple. What do we need to do, in a contemporary real-time rendering system, to draw a mesh? Let's say, something along these lines:

This diagram illustrates schematically what could be going on in a "forward" rendering shader. "Forward" here really just means that most of the computation that goes from geometry to final pixel color happens in a single vertex/pixel shader pair. 
We might update in separate steps some resources the shader uses, like shadow maps, reflection maps and so on, but the main steps, from attribute interpolation to texturing, to shading with analytic lights, happen in a single shader.

From there on, the various flavors of forward rendering only deal with different ways of culling and specializing computation, but the shading pipeline remains the same!

- Culling

Classical multi-pass forward binds lights to meshes one at a time, drawing a mesh multiple times to accumulate pixel radiance on the screen. Lights are bound to a pass as shader constants, and as you typically have only a few light types, you can generate ad-hoc shaders that efficiently deal with each. Specialization is easy, but you pay a price to the multiple passes, especially if you have a lot of overlapping lights and decals.

Single-pass forward is an improvement that foregoes the waste of multi-pass shading (bandwidth, repeated computations between passes and multiple draws) by either using a dynamic branching "uber-shader" capable of handling all the possible lights assigned to an object, or by generating static shader permutations to handle exactly what a given object needs.

The latter can easily lead to an explosion in the number of shaders needed, as now we don't need just one per light type, but per permutation of types and number of lights.
The advantage is that it can be much more efficient, especially if one is willing to split a mesh to exactly divide the triangles which need a specific technique (e.g. triangles with one light from ones that need two or more, triangles that need to blend texture layers, to perform other special effects).

This is Advanced Warfare: ~20k shaders per levels and
aggressive mesh splitting generating tons of draw calls
Forward+ is nothing more than a change in the way some of the data is passed to a dynamic branching style single-pass forward renderer: instead of binding lights per mesh (draw) as shader constants, they are stored in some kind of spatial subdivision structure that the shader can easily access. Typically, screen tiles or frustum voxels ("clustered"), but other structures can be employed as well.

At first, it might sound like a terrible idea. It has all the drawbacks of a dynamic branching uber shader (lots of complexity, no ability to specialize shaders over lights, register usage bound by the most expensive path in the shader) but with the added penalty of divergent branches (as the lights are not constant in the shader). So, why would you do it?

Light culling in a conventional forward pipeline can be quite effective for static lights, or lights that follow prescribed path, as we can carve geometry influenced by each and specialize. But what if we have lots of dynamic lights? Or lots of small lights? 
At a given point, carving geometry becomes either inefficient (too many small draws) or impossible. In these situations, Forward+ starts to become attractive, especially if one is able to avoid branch divergence by processing lights one at a time.

In the end, though, it's just culling and specialization. How to assign lights to rendering entities. How to avoid having dynamic branching, generic shaders that create inefficiencies.

Once one thinks in these terms, it's easy to see that other configurations could be possible, for example, one might think of assigning lights to mesh chunks and dynamically grouping them into draws, following the ideas of Ubisoft's and Graham Wihlidal's mesh processing pipelines. Or one could assign lights to a per-object grid, or a world space BSP, and so on.

- Splitting the pipeline

Let's look again at the diagram I drew:

Quite literally we can take this "forward shading" pipeline and cut it an arbitrary point, creating two shader passes from it. This is a "deferred" rendering system, some of the computation is deferred to a second pass, and albeit the most employed system (deferred shading) splits material data from lighting/BRDF evaluation, we almost have today a deferred technique for any reasonable choice of splitting point.

Of course, after we do the split, we'll need the two resulting passes to communicate. The pass that is attached to the geometry (object) needs to communicate some data to the pass that is attached to the pixel output. This data is stored by the first pass in a geometry buffer (g-buffer!) and read in the second. 
Typically, we store g-buffers in screen-space, but other choices are possible.

So, why would we want to do such a split? At first, it seems very odd. Instead of having a single pass that does all computation in registers, locally and fast, we force some of the data to be written all the way out to GPU memory, uncompressed, and then read again from memory in the second pass. Why?

Well, the reasons are exactly the same as -every time- we have to decide if to split or not any GPU computation, be it a post-effect, a linear algebra routine or in our case, mesh rendering, the potential advantages are always the same:
  1. Specialization. We might be able to avoid a dynamic branching uber-shader by stopping the computation at a point and launching a number of specialized routines for the second part.
  2. Inter-thread data access. We might need to reuse the data we're writing out. Or access it in patterns that are not possible with the very limited inter-thread communication the GPU allows (and pixel shaders don't/can't give control over what gets packed in a wave, nor have the concept of thread groups! *)
  3. Modifying data. We might want to inject other computation that changes some of the data before launching the second pass.
  4. Re-packing computation. We might want to launch the second pass using a different topology for our waves.
* Note: it would be interesting to think how a "deferred" system could take advantage of hardware tile-based rendering architectures if one could program passes to operate on each tile... Ironically today on tile-based deferred GPUs, deferred shading is usually not employed, because the deal with tile architectures is to avoid reads/writes to a "slow" main memory, so deferred, going out to memory, would negate that. Also one of the issues that deferred can ameliorate in a traditional GPU is overshading, but on TBDR that doesn't matter because by design you don't overshade there even in forward...

- Decision tree

Adding a split point in our pipeline choices makes things incredibly complex, I'd say out of the reach of rendering engineers just manually doing optimal choices. 
We're not dealing anymore just with dynamic versus static lights, or culling granularity, but on how to balance a GPU between ALU, memory, shader resources and different organizations of computation. 

It's very hard to evaluate all these choices in parallel also because typically prototypes won't be really as optimized as possible for any given one, and optimization can change the performance landscape radically. 
Also, these choices are not local, but the can change how you pack and access data in the entire rendering system. What effects you can easily support, how much material variation you can easily support, how to bake precomputed data, what space you have to inject async computation and so on.

Since we started working on "next-gen" consoles, with a heavy emphasis on compute, I've been interested in automatic tuning, something that is quite common in scientific computing, but not at all yet for real-time rendering.

But even autotuning can only realistically be applied when the problem specification is quite rigid and it's unlikely to be successful when we can change the way we structure all the data and effects in a rendering system, to fit a given choice of pipeline (which doesn't mean we can't do better in terms of our abilities to explore pipeline choices...).

- Deferred versus Forward?

So how can we decide what to use when? Well, some rule of thumbs are possible to devise, looking at the data, the computation we wish to operate, and making sure we don't do anything too unreasonable for a given GPU architecture.

The first bound to consider is just the data bandwidth. How much can I read and write, without being bound by reading and writing? Or to be more precise, how much computation do I have to have in order for the memory operations to not be a big bottleneck? For the latency to be well-hidden? 

As an example, right now, on ps4, it's entirely reasonable to do a deferred shading system writing the typical attributes for GGX shading, at 1080p **, with a typical texture layer compositing system and having the g-buffer pass be mostly ALU-bound. 
The same might not be true for a different system at a different resolution, but right now it works, and some titles shipped with some fairly crazy "fat" g-buffers without problems.

Black Ops 3 is a tiled deferred renderer

** Note: without MSAA. In my view, MSAA for geometry antialiasing is not fundamental anymore; It's still a great technique for supersampling/subsampling, but we need temporal antialiasing (Filmic SMAA is great, and ideally you could do both) not only because it can be faster for comparable quality, but because we want to temporally filter all kind of shading effects! 
I'm also not addressing in this the problem of transparencies for a deferred renderer because it's easy to deal with them in F+, sharing the same light lists and most of the shader (just by "connecting" the ends that were cut in the deferred ones)

The second thing to consider is data access. Do you -need to- access lots of data that is parametrized on the surface (especially, vertices)? E.g. The Order's "fat" lightmaps? Then probably decompressing it and pushing it through screen-space buffers is not the best idea. 
Black Ops 3 for example bakes lighting in volume textures and static occlusion in a compressed shadow-map, while Advanced Warfare uses classic uv-mapped lightmaps and occlusion maps.

On the other hand, do you need to access surface data in screen-space effects? Ambient and specular occlusion, reflections (note for example that The Order doesn't do any of these screen-space effects)... Or modify surface data in screen-space, e.g. via mesh-based decals ***? Then you have to write a g-buffer anyways, the only question is when!

*** Note: Nowadays projected or "volumetric" decals are quite popular, and these can be culled in tiles/clusters just like lights, so they work in -any- rendering pipeline. They have their drawbacks though as they can't just precisely follow a surface. Maybe an idea could be to use small volume textures to map projected decals UVs and to mask their area of influence?

The Order 1886 uses F+ and very advanced lightmapping,
foregoing any screen-space shading technique

- Deferred splits and computation

Often, either memory bandwidth makes the choice "easy" for a given platform, or the preference for certain rendering features do (complex lightmaps, mesh decals, screen space effects...). But if they don't then we're left with performance: how to best structure computation.

One big advantage of deferred shading is just in the ability to dispatch specialized shaders per screen region.

The choice of what to specialize and how many passes to for a tile is entirely non-trivial, but at least is possible and does not result in an incredible number of permutations, like in single-pass forward, both because we resolved all the material layering in the g-buffer pass, thus we don't need to specialize both over lights and material features, and because doing multiple passes over a tile is cheaper that doing them over a mesh.

Note that in F+ we can trivially specialize over material features of a given draw, but not at all over lights, and it's even best to make the various lighting paths very uniform (e.g. use the same filtering for shadows) to avoid dynamic branching issues. 
In deferred shading, on the other hand, we can specialize over lights, over texture layer combiners (in the g-buffer pass) and over materials (albeit with worse culling than forward & we have to store bits in the g-buffer). 

It is true that typically we're more constrained on the material model as the input data is mostly fixed via the g-buffer encoding, but one can use bit flags to specify what is stored into the MRTs, and with PBR rendering we've seen a sharp decrease in the number of material models needed anyways.

The other advantage is of wave efficiency. In a deferred system, only the g-buffer pass uses the rasterizer, and thus is subject to rasterizer inefficiencies: partial quads on triangle edges, overdraw, partial waves due to small draws.
This is though very hard to quantify in practice, as there are lots of ways to balance computation on a GPU. 

For example, a forward system with very heavy shaders might suffer a lot from overdraw, and require spending time in full depth pre-pass to avoid having any, but the pre-pass might overlap with some async compute, making it virtually free.

- Cutting the pipeline "early"

Recently there have been lots of deferred systems that cut the pipeline "high", near the geometry, before texturing, by writing only the data that that the vertices carry, or even just enough to be able to fetch the vertex data manually (e.g. triangle index and barycentric coordinates, the latter can even be reconstructed from vertices and world position). These approaches create so-called "visibility buffers" instead of g-buffers.

Eidos R&D tested a g-buffer that is used only to improve wave occupancy
and avoid overdraw, not to implement deferred rendering features

These techniques are not aimed at implementing rendering techniques that are different from what maps well to forward, as they still do most of the computation in a single pass. 
What they try to do instead is to minimize the work done in pixel shading, to restructure computation so most of the work is done without the constraints imposed by the rasterizer.

The aim for most of these techniques is:
  1. To write thin g-buffers while still supporting arbitrary material data
  2. To avoid partial quad, partial wave and overdraw penalties
  3. Some also focus on analyzing the geometric data to perform shading at sub-sampled rates
In theory, nobody prevents these techniques to work with more than one split: after the geometry pass a material g-buffer could be created replacing the tile data with the data after texturing.

Compared to forward methods, the main difference is that we reorder computation in a "screen-space" centric way, all the shading is done in CS tiles instead of PS waves of quads.

It avoids partial waves, but at the cost of worse "culling": you have to shade considering all the features needed in a tile, regardless of how many pixels a feature uses, you can't specialize shaders over materials (unless you store some extra bits in the visibility buffer and summarize them per tile).
You also "get rid" of a lot of fixed-function hardware, you can't rely on optimized paths to load vertex and interpolate vertex data, compute derivatives/differentials (which become a real, hard problem! most of these systems just rely on analytic differentials, which don't work for dependent texture reads) and post-transform cache (albeit it would be possible to write from the VS back into the vertex buffer, if really needed)

Vertex and object data access becomes less coherent (as now we access based on screen-space patterns instead of over surfaces), supporting multiple vertex formats also becomes a bit harder (might not matter) and tessellation might or might not be possible (depending on what data you store).

Compared to deferred shading, we have similar trade-offs that we have with standard forward or foward+ versus deferred: we don't have screen-space material data for effects that need it, and we do all the shading in a single pass, thus statically specializing a shader needs to take care of more permutations, but we save on g-buffer space.

Note though that how "thin" the g-buffer is per-se is misleading in terms of bandwidth, because the shading pass uses the g-buffer only as an indirection, the real data is per vertex and per draw, these fetches still need to happen, and might be less coherent than other methods.
And we still have a bit of bandwidth "waste" in the method (similar to how g-buffers do waste reading/writing data that the PS already had) as the index buffers and vertex position data is read twice (through indirection!), and depending on the triangle to pixel ratio, that might be even not insignificant.

- Beyond screen-space...

And last, to complete our taxonomy, there has been recently some renderers that decided to split computation storing information in uv-space textures, instead of screen-space. 

These ideas are similar to the early idea of "surface caching" employed by Quake and might follow quite "naturally" if one has already a unique parametrization everywhere in the world.

These systems are very attractive for subsampling computation, both spatially and temporally, as the texture data is not linked to a specific frame and rasterized samples.

If the texture layering is cached, then the scheme is similar to a g-buffer deferred system, just storing the g-buffer in texture space instead of screen space, and it can be coupled with F+ or other deferred schemes that "split early" to reduce the complexity of the shading pass (as the texture layering has already been done in specialized shaders).

If the final shaded results are stored, the decoupled shading rate can also be used as a mean of improving shading stability: even without supersampling aliasing doesn't produce shimmering as the samples never move, and texture sampling naturally "blurs" a bit the results.

Decoupling visibility from shading rate. A good idea.

Caching computation is always very attractive, so these techniques are certainly promising, and the tradeoffs are easy to understand (even if they might not be easy to quantify!). How much of the cache is invalidated at any given time? At which granularity does it need to be computed (and how much waste there is due to it)? How much memory does the cache need?

Fight Night Champion computed diffuse lighting in texture space,
all the fine skin details come only from the specular layer.

- Conclusions???

As I said, it's hard to make predictions and it's hard to say that one method is absolutely better than another, even in quite specific scenarios.

But if I had to go out on a limb, I'd say that right now, for this generation of consoles the following applies:
  • "Vanilla" deferred shading works fine and supports lots of nice rendering features. 
    • In theory, it's not the most efficient rendering technique, simply from the standpoint that it spends lots of energy pushing data in and out memory...
    • But for now it works, and it will likely scale well to 2k and probably even 4k or near 4k resolutions, using reasonably thin buffers.
  • Deferred shading executes well enough in the following important aspects, that probably need to be addressed by any shading technique:
    • The ability to specialize shaders, even if we have architectures with good dynamic branching capabilities (and moving data from vgpr to sgpr on GCN), is quite important and saves a lot of headaches (of trying to fit every feature needed in a single, fast ubershader).
    • Separating and possibly caching or precomputing (most of) texture compositing is important. Very high frequency tiled detail layers will still need to be composited in screen-space.
    • Ameliorating issues with overdraw and small triangles/draws.
  • On top of these, deferred supports well a number of screen-space rendering features that are popular nowadays.
  • Forward+ can be made fast and it works best when lots of surface data (vertex & texture...) is needed.
    • Different material models are probably not a huge concern (LODs might be more attractive, actually), and deferred shading can be specialized over materials as well, with some effort (and worse culling).
    • Forward undeniably will scale better with resolution, but might have a slower "baseline" (e.g. 1080p)
    • Mapping data to surfaces (e.g. lightmaps, occlusion cones...) allows for cheap and high-quality bakes, but it doesn't work on moving objects, particles and so on, so it's usually a compromise: it has better quality for static meshes, but it lacks the uniformity of volumetric bakes.
  • Single pass forward, when done properly, can still very, very fast!
    • Especially in games that don't have too many small triangles and don't have many small or moving lights.
    • That's still a fairly large proportion of games! Lots of games are in daylight, or anyhow in settings where there aren't many overlapping lights! It is not simple to optimize though.
  • Volumetric data structures are here and going to stay, we'll probably see them evolve in something more adaptive than the simple voxel grids that we use today.
  • Caching is certainly interesting, especially when it comes to flattening texture layers (which is quite common, especially for terrain). 
    • Caching shading is a "natural" extension, the tradeoffs there are still unproven, but once one has the option of working in texture-space it's hard not to imagine that there isn't anything of the shading computations that could be meaningfully cached there...
  • Visibility buffers
    • If g-buffers passes are not bandwidth or ROP/export bound (writing the data), the benefit of "earlier" splits is questionable. But these techniques are -very- interesting, and might even be used in hybrid g-buffer/attribute buffer renderers.
    • The general idea of using deferred methods to cluster pixels via similarity and subsample shading is very interesting... 
    • The same applies to trying to pack waves without resorting to predetermined screen-space tiles (e.g. via stream compaction, which the "old" stencil volume deferred methods did automatically via the early-stencil hardware). None of these have been proven in production so far.
  • It would be great to see more research on hybrid renderers in general
    • Shaders can be written in a "unified" fashion, the splits can be largely automatic
    • Deferred shading and F+ share the same lighting representation!
    • A rendering engine could draw using different techniques based on heuristics
  • On the other hand, there has been recently lots of work on "GPU driven pipelines", where most of the draw dispatch work (and draw culling) is done on the GPU.
    • These pipelines favor very uniform draws (no per-draw shader specialization)
    • This might be though entirely a limitation of current APIs...