Shader Optimizations

A short list of basic and sometimes overlooked shader optimizations. These are very small gains, but they add up, and maybe they will free enough time for an additional point light or a better shadow filter?

Full screen quad vs full screen triangle

Post-processing effects and color/z downsampling are usually rendered using fullscreen quads. Hardware shades pixels in groups of at least 2×2. Pixel group size goes up with time (just like expected game resolution). For example, NVIDIA Fermi has 4×2 pixel groups, while the older NVIDIA G80-G92 use 2×2 quads. This means that rendering a fullscreen quad as two triangles creates some overlapping pixel groups along the diagonal, and those are shaded twice. At 1000×1000 pixel resolution with 2×2 pixel quads, there are 500×500 = 250,000 quads, and the 500 on the diagonal are shaded two times. Ignoring cache misses, that is 500 / 250,000 = 0.2% of additional work. Besides, with a single fullscreen triangle there is one vertex less to push to the GPU :). In my synthetic test (on a crappy GeForce 240) the difference was around 0.2% to 0.3%.
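A minimal sketch of a vertex shader that emits a single fullscreen triangle. It assumes Shader Model 4 or newer, where SV_VertexID is available, and a draw call with 3 vertices and no vertex buffer. The struct and function names are made up for illustration. On D3D9-class hardware a tiny 3-vertex buffer with the same positions would play the same role.

```hlsl
// Hypothetical output struct, for illustration only.
struct SFullscreenOut
{
    float4 m_pos : SV_Position;
    float2 m_uv  : TEXCOORD0;
};

// Assumes Draw( 3, 0 ) with no vertex buffer bound (SM4+).
SFullscreenOut vsFullscreenTriangle( uint vertexId : SV_VertexID )
{
    SFullscreenOut Out;

    // vertexId 0, 1, 2 -> uv (0,0), (2,0), (0,2)
    Out.m_uv = float2( ( vertexId << 1 ) & 2, vertexId & 2 );

    // uv -> clip space (-1,1), (3,1), (-1,-3); the triangle covers the
    // whole [-1,1] x [-1,1] viewport and the rest is clipped away
    Out.m_pos = float4( Out.m_uv * float2( 2.0, -2.0 ) + float2( -1.0, 1.0 ), 0.0, 1.0 );

    return Out;
}
```

Inside the visible part of the triangle the interpolated UV still runs from (0,0) in the top-left corner to (1,1) in the bottom-right, so the pixel shader does not need to change.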

Direct3D shader compiler (FXC) on PC

Instruction counts displayed by FXC on PC don’t mean much nowadays. FXC just translates HLSL to asm, which the driver later translates to a hardware-specific IL. It’s quite possible to decrease the instruction count displayed by FXC and slow down the shader at the same time. Instead of relying on FXC numbers it’s better to check real performance (FPS/ms) and/or check the numbers generated by vendor tools (ShaderAnalyzer and ShaderPerf). They also display the GPR count, which is quite important as it shows how much work can run in parallel.

Hardware instructions

  • MAD (multiply-add) is a hardware instruction on GPUs. Convert code like “( x - a ) * b” to “x * c + d”, where c = b and d = -a * b are precomputed outside the shader. This can save 1 ALU instruction.
  • Saturate, negation and abs are instruction modifiers and are free. Yes, there is a free lunch :). Sometimes equations can be changed to use saturate instead of clamp/min/max. Negation and abs can help to decrease the number of used constant registers.
  • Some instructions are executed on the transcendental units. Transcendental units compute everything as scalars and there is roughly one transcendental unit per 2-8 ALUs. It’s a good idea to avoid excessive usage of instructions like sin, cos, log, sqrt and pow (very bad: calculated using three instructions, typically log2, mul and exp2).
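The first two points can be sketched with a distance fade. All names here (gFadeStart, gFadeEnd, gFadeMulAdd, FadeSlow, FadeFast) are hypothetical, and the sketch assumes the application precomputes the multiply/add pair on the CPU. Whether the driver really emits a single MAD with a saturate modifier depends on the hardware, so treat this as an illustration and measure.

```hlsl
// Hypothetical constants set by the application.
float gFadeStart;
float gFadeEnd;

// Assumed to be precomputed on the CPU:
//   x =  1.0 / ( gFadeEnd - gFadeStart )
//   y = -gFadeStart / ( gFadeEnd - gFadeStart )
float2 gFadeMulAdd;

// Straightforward version: subtract, divide, then clamp.
float FadeSlow( float dist )
{
    return clamp( ( dist - gFadeStart ) / ( gFadeEnd - gFadeStart ), 0.0, 1.0 );
}

// Same result written as "x * c + d" wrapped in saturate,
// which maps to one MAD plus a free output modifier on most GPUs.
float FadeFast( float dist )
{
    return saturate( dist * gFadeMulAdd.x + gFadeMulAdd.y );
}
```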

Vectorize with care

Some GPUs have vector ALUs (AMD/ATI cards and NVIDIA cards older than G80) and some have scalar ones (NVIDIA G80, G92, Fermi). On a vector ALU a scalar instruction takes the same time as a vector one, which computes 4 components at once. People usually try to vectorize everything in shaders, which can add extra computations and actually result in a slower shader on scalar ALU hardware. It’s a good idea to mask vector computations. For example, in a blur shader there is no need to calculate the alpha channel, so just use float3 for accumulation. We could go further and even write two shader versions: one for vector ALUs and one for scalar ones. There is no point in vectorizing instructions computed on the transcendental units (sin, cos, log, pow…), as they are always scalar.
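A minimal sketch of the blur case, assuming a 4-tap filter. The sampler, offset and weight names are hypothetical. Accumulating in float3 means a scalar ALU does three multiply-adds per tap instead of four, while a vector ALU loses nothing.

```hlsl
// Hypothetical resources and constants, for illustration only.
sampler2D gSourceTex;
float2    gBlurOffsets[ 4 ];
float     gBlurWeights[ 4 ];

float4 psBlur( float2 uv : TEXCOORD0 ) : COLOR0
{
    // Alpha is not needed, so accumulate only RGB.
    float3 color = 0.0;
    for ( int i = 0; i < 4; ++i )
    {
        color += tex2D( gSourceTex, uv + gBlurOffsets[ i ] ).rgb * gBlurWeights[ i ];
    }
    return float4( color, 1.0 );
}
```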

Clip/texkill

Consider adding clip/texkill (or alpha test, which can be faster on old hardware) when alpha blending is enabled. Think deferred lights, particles, volumetric light shafts. This can remove some work from the ROP units if you don’t have uber tight geometry.
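A minimal sketch for an alpha-blended particle, assuming that pixels with alpha below 1/255 contribute nothing visible. The sampler name and the threshold are assumptions. clip discards the pixel when its argument is negative, so fully transparent pixels never reach the blend stage.

```hlsl
// Hypothetical particle texture.
sampler2D gParticleTex;

float4 psParticle( float2 uv : TEXCOORD0, float4 vertexColor : COLOR0 ) : COLOR0
{
    float4 color = tex2D( gParticleTex, uv ) * vertexColor;

    // Discard (nearly) invisible pixels; the 1/255 threshold is an assumption.
    clip( color.a - 1.0 / 255.0 );

    return color;
}
```

Keep in mind that clip can disable early-z optimizations on some hardware, so it is worth measuring both variants.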

Interpolators

Shader bottlenecks aren’t only about ALU, GPR and texture fetches. On rare occasions (or on some hardware) interpolators can become a bottleneck. With short pixel shaders it is sometimes better to move computations from the vertex shader to the pixel shader if that decreases the interpolator count.

// 8 interpolators and minimal ALU in pixel shader
// assumed vertex shader output layout
struct SVshOut
{
    float4 m_uv[ 8 ] : TEXCOORD0;
};
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv[ i ];
    }
    return color;
}

// one interpolator and some ALU in pixel shader
// assumed vertex shader output layout
struct SVshOut
{
    float4 m_uv : TEXCOORD0;
};
float4 gSomeValMul[ 8 ];
float4 gSomeValAdd[ 8 ];
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv * gSomeValMul[ i ] + gSomeValAdd[ i ];
    }
    return color;
}

The 8-interpolator version runs at 28.17 ms (100 runs on a GeForce 240). The 1-interpolator version with some extra ALU runs at 21.44 ms (the same as an empty pixel shader). This is of course a very specific case. Still, it’s a good idea to watch out and pack interpolators.
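Packing can be as simple as putting two float2 UV sets into one float4 interpolator. A minimal sketch, with hypothetical sampler and struct names, assuming the vertex shader writes the diffuse UV to xy and the detail UV to zw:

```hlsl
// Hypothetical samplers, for illustration only.
sampler2D gDiffuseTex;
sampler2D gDetailTex;

// Packed layout: two float2 UV sets share one float4 interpolator.
struct SVshOutPacked
{
    float4 m_uv01 : TEXCOORD0; // xy = diffuse UV, zw = detail UV
};

float4 psPacked( SVshOutPacked In ) : COLOR0
{
    float4 diffuse = tex2D( gDiffuseTex, In.m_uv01.xy );
    float4 detail  = tex2D( gDetailTex,  In.m_uv01.zw );
    return diffuse * detail;
}
```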
