Krzysztof Narkowicz

Software occlusion culling


Today's CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn't a new idea (HOM). Basically it's filling a software z-buffer and testing some objects against it (usually their screen-space bounding boxes). Rasterization is usually done at a small resolution (DICE uses 256×114). Testing can also be done against a hierarchical z-buffer (a min-depth or min/max-depth hierarchy).
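The bounding-box test itself can be sketched as below. This is my own minimal version, not code from any of the mentioned implementations; the name `isOccluded` and the depth convention (larger stored value = closer, as when storing a 1/z'-style quantity) are assumptions:

```cpp
#include <vector>

// Sketch: conservative occlusion test of a screen-space bounding box
// against a software z-buffer. Convention assumed here: larger stored
// value = closer. The object is occluded only if every covered texel
// is at least as close as the object's nearest point.
bool isOccluded(const std::vector<float>& zbuf, int width,
                int minX, int maxX, int minY, int maxY, float objNearestD)
{
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            if (zbuf[y * width + x] < objNearestD)
                return false; // object is closer than the buffer here
    return true;
}
```

A hierarchical z-buffer just lets this loop terminate after touching far fewer texels.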

How to write one? Step one – transformation pipeline. It can be a bottleneck if it isn't done properly. Step two – clipper. Clipper code quality isn't so important – just remember to clamp coordinates, or clip the x and y coordinates after the projection divide. Step three – a scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on SPUs with many threads and interpolants. This time I prototyped software occlusion culling for a "min-spec" PC (1–2 core CPU), so there is only one thread, one interpolant and the resolution is quite small. In this case scanlines were about 2–3 times faster than half-spaces.
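For reference, the half-space approach boils down to testing each pixel against three edge functions. Here is a minimal scalar sketch of my own (the function name, coverage counting instead of depth writes, and the top-left-less inclusive fill rule are all simplifications; a real version would step the edge values incrementally and process pixels in SIMD groups):

```cpp
#include <algorithm>

// Sketch of a half-space (edge function) rasterizer, scalar version.
// A pixel is inside the triangle when all three edge functions are
// non-negative (counter-clockwise winding assumed). Returns covered
// pixel count; a depth-only rasterizer would write depth here instead.
int rasterizeHalfSpace(int x0, int y0, int x1, int y1, int x2, int y2,
                       int width, int height)
{
    // Edge function: 2D cross product of (edge vector, pixel - vertex).
    auto edge = [](int ax, int ay, int bx, int by, int px, int py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    };
    // Clamp the triangle's bounding box to the screen.
    int minX = std::max(0, std::min({x0, x1, x2}));
    int maxX = std::min(width - 1, std::max({x0, x1, x2}));
    int minY = std::max(0, std::min({y0, y1, y2}));
    int maxY = std::min(height - 1, std::max({y0, y1, y2}));

    int covered = 0;
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            if (edge(x0, y0, x1, y1, x, y) >= 0 &&
                edge(x1, y1, x2, y2, x, y) >= 0 &&
                edge(x2, y2, x0, y0, x, y) >= 0)
                ++covered; // inside all three half-spaces
    return covered;
}
```

The appeal for SIMD is that the edge functions are linear, so four or eight pixels can be tested with the same adds and compares in parallel.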

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use floats for positions). For depth-only rendering perspective interpolation is very easy – it's enough to interpolate 1/z' (z' = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z'. There are some differences between the hi-res and the small-res z-buffers. To fix this, the pixel center should be shifted using dzdx and dzdy. In practice it's enough to add some epsilon when testing objects.
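The resulting scanline inner loop really is just an add and a compare per pixel. A sketch under my own naming (the "larger = closer" convention for the stored 1/z'-style value is an assumption, as is the span-based interface):

```cpp
#include <algorithm>
#include <vector>

// Sketch of the depth-only scanline inner loop. The interpolated
// quantity d (the post's 1/z') is linear in screen space, so it
// advances by a constant step per pixel: one add, one compare,
// no division or multiplication inside the loop.
// Convention assumed: larger d = closer, so we keep the max.
void fillSpan(std::vector<float>& zbuf, int width, int y,
              int xStart, int xEnd, float dStart, float dStepX)
{
    float d = dStart;
    for (int x = xStart; x <= xEnd; ++x) {
        float& dst = zbuf[y * width + x];
        dst = std::max(dst, d); // keep the nearest depth
        d += dStepX;            // incremental interpolation
    }
}
```

For directional shadows the stored value can be plain linear depth and the same loop applies unchanged.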

Some rasterization performance results. Rasterization with the full transformation pipeline and clipping, optimized with some SSE intrinsics. 500 randomly placed quads (each consisting of 2 triangles), no special optimizations for quads, all fully visible. 256×128 resolution, 1 thread. CPU / quad pixel screen size:

                             256×128   61×61    21×21    fillrate     vertex rate
i7 860 (2.8 GHz)             6.56 ms   1.75 ms  0.53 ms  2.50 GPix/s  0.025 GV/s
Core 2 Quad Q8200 (2.33 GHz) 9.20 ms   2.30 ms  0.67 ms  1.76 GPix/s  0.019 GV/s

This shows the true power of the i7 – almost 1 pixel filled per cycle :). In a real test case there would be something like 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels each), so it looks like 1–2 ms is enough for filling the software z-buffer. It could be optimized for big triangles by writing code for quick rejection of empty tiles and for filling fully covered tiles (just like Larrabee does). This dramatically increases performance for large triangles.
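The Larrabee-style tile trick relies on the edge functions being linear: evaluating an edge at just two tile corners (picked by the signs of its coefficients) bounds its value over the whole tile. A sketch of one-edge classification, with names and the scalar form being my own simplification:

```cpp
enum TileClass { TileOutside, TileInside, TilePartial };

// Sketch: classify a square tile against one edge E(x,y) = A*x + B*y + C,
// where "inside" means E >= 0. Because E is linear, its extremes over the
// tile occur at corners chosen by the signs of A and B, so two evaluations
// bound the whole tile. A tile outside any edge is skipped entirely; a tile
// inside all three edges can be filled without per-pixel tests.
TileClass classifyTileEdge(int A, int B, int C,
                           int tileX, int tileY, int tileSize)
{
    int x0 = tileX, x1 = tileX + tileSize - 1;
    int y0 = tileY, y1 = tileY + tileSize - 1;
    int maxE = A * (A >= 0 ? x1 : x0) + B * (B >= 0 ? y1 : y0) + C;
    int minE = A * (A >= 0 ? x0 : x1) + B * (B >= 0 ? y0 : y1) + C;
    if (maxE < 0)  return TileOutside; // whole tile fails this edge
    if (minE >= 0) return TileInside;  // whole tile passes this edge
    return TilePartial;                // edge crosses the tile
}
```

Per-pixel work then only happens in tiles that are partial for at least one edge.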

Some object testing performance results. Transformation time is not included – it should already be done for frustum culling and it's quite small (0.33 ms on the i7 and 0.48 ms on the Core 2 Quad). With clipping, optimized with some SSE intrinsics. 3k randomly placed quads (each fully visible). Worst case – no early out (cleared z-buffer). 256×128 resolution, 1 thread. CPU / quad pixel screen size:

                             120×120   30×30    10×10
i7 860 (2.8 GHz)             2.26 ms   0.07 ms  0.02 ms
Core 2 Quad Q8200 (2.33 GHz) 3.30 ms   0.09 ms  0.03 ms

This also looks reasonably fast, and in a real test case the numbers should be around 1–2 ms. It could be further optimized by using some kind of depth hierarchy (downscaling the z-buffer is very fast – something like 0.05 ms for the full mip-map chain).
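Building one level of such a hierarchy is a 2×2 reduction. A sketch under the same assumed convention as before (larger stored value = closer): the conservative value for a block is its minimum, since an object whose nearest point is still not closer than that minimum is hidden everywhere in the block.

```cpp
#include <algorithm>
#include <vector>

// Sketch: build one mip level of a hierarchical z-buffer by taking the
// conservative depth of each 2x2 block. With "larger = closer" stored
// values, the conservative (farthest) value is the MIN of the block.
// Width and height are assumed even.
std::vector<float> downsampleDepth(const std::vector<float>& src, int w, int h)
{
    std::vector<float> dst((w / 2) * (h / 2));
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x) {
            const float* p = &src[(2 * y) * w + 2 * x]; // top-left of 2x2 block
            dst[y * (w / 2) + x] =
                std::min(std::min(p[0], p[1]), std::min(p[w], p[w + 1]));
        }
    return dst;
}
```

Applied repeatedly this yields the full mip chain, and at 256×128 the total work is tiny, which matches the timing above.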

Software occlusion culling is quite cool – you can have skinned occluders :). It's easy to write and easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 with a single thread software occlusion culling probably won't be faster than beam trees, but IMHO on consoles it can be faster (no tree data structure traversal) and it's certainly easier to parallelize. Maybe one day I'll try to add it to our engine at work and see how it handles real test cases.
