If you're old-school enough, you started learning about code optimization in terms of pure cycle counts. When I started, caches were already king, but you still counted cycles and how they scheduled in the U and V pipes of the first Pentium processor.
Nowadays you reason in terms of pipelines, latency and stalls.
Both on the GPU and CPU the strategy is the same (even if at different granularities). You identify your bottleneck and try to solve it.
But what if you can't solve it? If you can't beat them, join them!
Do you have to draw a big transparent polygon all over the screen that's stalling your blending pipeline? See it as a whole lot of free pixel shader cycles, and even more vertex shader ones!
Do you have some SIMD code, let's say a frustum/primitive test, that's stalling on a memory instruction? Nice: now you can replace that cheap bounding-sphere test with a bounding box, or maybe use multiple spheres! Or maybe you can interleave some other computation, or keep another pipeline busy (the integer one, or the floating-point one...).
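To make that concrete, here's a minimal C++ sketch (not from this post; the types, the plain scalar math and the compiler-specific prefetch hint are all illustrative stand-ins for real SIMD culling code) of spending the memory stall on a more accurate test instead of just waiting it out:

```cpp
// Illustrative sketch: hide the latency of fetching the next primitive's
// bounds behind extra ALU work on the current one. All types here are
// made up for the example; frustum planes are assumed to point inward.
#include <cstddef>

struct Plane  { float nx, ny, nz, d; };          // plane: dot(n, p) + d
struct Sphere { float x, y, z, r; };
struct AABB   { float cx, cy, cz, ex, ey, ez; }; // center + extents

struct Primitive { Sphere sphere; AABB box; };

static bool SphereOutside(const Plane& p, const Sphere& s)
{
    // Entirely on the negative side of an inward-facing plane: culled.
    return p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d < -s.r;
}

static bool BoxOutside(const Plane& p, const AABB& b)
{
    // Projected radius of the box onto the plane normal.
    float r = b.ex * (p.nx < 0 ? -p.nx : p.nx) +
              b.ey * (p.ny < 0 ? -p.ny : p.ny) +
              b.ez * (p.nz < 0 ? -p.nz : p.nz);
    return p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d < -r;
}

void Cull(const Plane planes[6], const Primitive* prims, bool* visible, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        // Start pulling the next primitive into cache; its latency
        // overlaps with the math below (GCC/Clang hint, purely optional).
        if (i + 1 < count)
            __builtin_prefetch(&prims[i + 1]);

        bool out = false;
        for (int p = 0; p < 6 && !out; ++p)
        {
            // The cheap sphere test alone would leave the FPU idle while
            // waiting on memory, so the stall gets "spent" on the more
            // accurate box test for the borderline cases.
            if (SphereOutside(planes[p], prims[i].sphere))
                out = true;
            else if (BoxOutside(planes[p], prims[i].box))
                out = true;
        }
        visible[i] = !out;
    }
}
```

The point isn't the specific intrinsic: it's that the extra box math is essentially free, because the core would otherwise just sit on the load.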
On modern CPUs you'll have plenty of such situations, especially if you're coding on PowerPC-based ones (Xbox 360, PS3), which are in-order (they don't reschedule instructions to keep the pipes busy; that's done only statically, by the compiler or by you), have lots of registers and very long pipes. Sidenote: if you're basing most of your math on the vector unit on those architectures, think twice! They're very different from Intel desktop processors, which were built with a lot of fancy decoding stages so you didn't have to care much about keeping the pipelines happy. The vector pipeline is so long that you'll seldom have enough dependency-free data to use it fully; in most cases your best bet is to use the FPU for most code, and the VPU only in some number-crunching (unrolled and branchless) loops!
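As a toy example of the "feed the long pipes" point (scalar C++ rather than VMX, and entirely illustrative, not from the original post): a reduction written as a single dependency chain stalls on every add, while an unrolled version with independent accumulators gives an in-order core something to issue every cycle.

```cpp
// Toy illustration: the naive reduction has one long dependency chain, so
// each add waits for the previous one to leave the (deep) FP pipeline. The
// unrolled version keeps four independent chains in flight.
#include <cstddef>

float DotNaive(const float* a, const float* b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];      // every iteration depends on the last one
    return acc;
}

float DotUnrolled(const float* a, const float* b, size_t n)
{
    // Four independent accumulators: an in-order core can keep issuing
    // into the pipe instead of stalling on a single 'acc'.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)           // leftover elements handled outside the hot loop
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The same reasoning applies, amplified, to the vector unit: if you can't produce several independent chains of work, the deep pipeline mostly sits empty.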
The worst thing that can happen is that now you have fancier/more accurate effects! But chances are that you can fill those starving stages with other effects that you needed, or that fancier effects can substitute for multiple simple passes (e.g. consider the balance between lots of simple particles versus a few volumetric ones or a single raymarched volume...), or that more accurate computations can save you time in other stages (e.g. better culling code)!
3 comments:
So true.
But nothing hurts more than the LHS. That's got to be the dumbest thing with the in-order PPC cores.
Yep, indeed load-hit-stores are one of the worst performance bottlenecks when working on a PPC, second only to cache misses. More often than not it's not possible to use all the computational power of the PPC; the pipelines, especially the vector one, are very long. But the power is there, and as I wrote, it's nice, after you've done all the possible optimizations around the implementation of a given algorithm, to fill up the holes in the pipeline with extra features...
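For anyone who hasn't run into the term: a load-hit-store is when you store to an address and load it back before the store has drained from the store queue, which on these in-order cores costs dozens of cycles. The classic offender is anything that bounces a value between register files through memory; a purely illustrative C++ sketch (not from any particular codebase):

```cpp
// Illustrative load-hit-store pattern on the in-order PowerPC cores: there
// is no direct move between the float and integer register files, so the
// value has to bounce through memory, and the reload "hits" a store that
// is still sitting in the store queue.
int PackColor(float r, float g, float b)
{
    // Each cast is typically: convert in the FPU, store to the stack, then
    // reload into an integer register, so each one can stall on the reload.
    int ri = (int)(r * 255.0f);
    int gi = (int)(g * 255.0f);
    int bi = (int)(b * 255.0f);
    return (ri << 16) | (gi << 8) | bi;
}
```

The usual cures are to keep values in a single register file, batch such conversions so the stalls overlap, or restructure the data so you never read back what you just wrote.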
I certainly agree with you there, it sure is nice to stuff extra "free" work into various loops. I have done this on a number of occasions. This works well on both CPUs and GPUs. Yay!
I mention the PPC LHS because it seems to get in the way of everything, and of course sometimes eliminating LHS means keeping things in the same register sets and therefore sometimes in the longest pipes.
Balancing this is tricky, and it's been bugging me a lot recently :-)