Since I have no cool new visuals to show, I thought I'd blog about a random tech topic. So let's talk about vectorization in shaders. Vectorizing your HLSL code by hand can improve the compiled code significantly. For example, a naive implementation of my bilateral box filter looks something like this:
With the ps_2_b profile, this compiles to:
This is an example of bad vectorization. The shader compiles to 151 instructions, of which 124 are arithmetic. The problem is that many instructions (like pow and abs) are not used to their full width. Many instructions are SIMD instructions and can work on several components in parallel, but here they mostly process one value at a time.
We can rewrite this shader to perform some of the arithmetic in parallel, like this:
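To make the pattern concrete, here is a minimal sketch of what scalar weight code tends to look like. It is not the shader from this post: the helper name `SumWeightsScalar`, the `Sharpness` constant, and the depth-difference weight formula are all assumptions chosen for illustration.

```hlsl
// Hypothetical illustration only, not the original bilateral filter.
// Assumed weight formula: pow(saturate(1 - |d - dc|), Sharpness).
static const float Sharpness = 8.0; // assumed falloff exponent

float SumWeightsScalar(float dc, float d0, float d1, float d2, float d3)
{
    // Every statement works on a single float, so each abs/pow
    // handles one value where it could have handled several.
    float w0 = pow(saturate(1.0 - abs(d0 - dc)), Sharpness);
    float w1 = pow(saturate(1.0 - abs(d1 - dc)), Sharpness);
    float w2 = pow(saturate(1.0 - abs(d2 - dc)), Sharpness);
    float w3 = pow(saturate(1.0 - abs(d3 - dc)), Sharpness);

    // Summing the weights one by one costs a chain of scalar adds.
    return w0 + w1 + w2 + w3;
}
```

Written this way, the compiler has to discover the parallelism on its own, and it often does not.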
This compiles to:
This time there are 108 instructions, of which 81 are arithmetic.
In this shader, more work is done per instruction when calculating the weights. For example, the pow instruction is used only three times instead of eight. The summation of the weights is done with two adds and one dp4 instead of nine adds.
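The idea can be sketched as follows. Again, this is an illustrative sketch and not the shader from this post: `SumWeightsVector`, `Sharpness`, the weight formula, and the assumption that the centre sample has weight 1 are all hypothetical. The eight neighbour depths are assumed to be packed into two float4 values beforehand.

```hlsl
// Hypothetical illustration only, not the original bilateral filter.
// Assumed weight formula: pow(saturate(1 - |d - dc|), Sharpness).
static const float Sharpness = 8.0; // assumed falloff exponent

float SumWeightsVector(float dc, float4 dA, float4 dB)
{
    // Each statement now computes four weights at once.
    float4 wA = pow(saturate(1.0 - abs(dA - dc)), Sharpness);
    float4 wB = pow(saturate(1.0 - abs(dB - dc)), Sharpness);

    // One vector add folds the two groups together...
    float4 w = wA + wB;

    // ...and a dot product with (1,1,1,1) sums the four lanes, which
    // maps naturally to dp4. The final add accounts for the centre
    // sample, assumed here to have weight 1.
    return dot(w, float4(1.0, 1.0, 1.0, 1.0)) + 1.0;
}
```

The exact instruction count you get from code like this depends on the compiler version and the target profile, so it is worth checking the generated assembly rather than trusting the source.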
Note that we could have vectorized the texture coordinate calculations as well, but then we would need SM 3.0 (arbitrary swizzles) and would have to change how the offset constant is set up.
With a small effort we got rid of 43 ALU instructions! It should be noted that this won't necessarily lead to a speed-up, since instruction count is just part of the story. The ALU/texture instruction ratio, the texture cache, etc. come into play as well. On my Radeon X1600 this shader is limited by texture lookups, so the win is not that big.
Finally, I would like to point out that I'm by no means a shader optimization guru, just a guy trying to fill his blog ;)