Software occlusion culling


CPUs today are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from the gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn’t a new idea (HOM). Basically it means filling a software z-buffer and testing some objects against it (usually their screen-space bounding boxes). Rasterization is usually done at a small resolution (DICE uses 256×114). Testing can also be done against a hierarchical z-buffer (a min-depth or min/max-depth hierarchy).
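
The testing half of the idea is tiny. Here is a minimal sketch (struct layout and all names are my own, assuming the convention that smaller depth means closer) of a conservative screen-space bounding-box test against such a software z-buffer:

```cpp
// Sketch of a conservative screen-space bounding-box occlusion test
// against a software z-buffer (smaller depth = closer).
#include <algorithm>
#include <vector>

struct ZBuffer
{
    int width, height;
    std::vector<float> depth; // width * height entries, cleared to 1.0f (far)
};

// Returns true if the rect [x0,x1)x[y0,y1) with conservative (closest)
// depth minZ is fully hidden behind already-rasterized occluders.
bool IsOccluded(const ZBuffer& zb, int x0, int y0, int x1, int y1, float minZ)
{
    x0 = std::max(x0, 0);        y0 = std::max(y0, 0);
    x1 = std::min(x1, zb.width); y1 = std::min(y1, zb.height);
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            if (minZ < zb.depth[y * zb.width + x])
                return false; // object is closer than the occluder here -> visible
    return true;
}
```

Using the object's closest depth for the whole rect is what makes the test conservative: it can report "visible" for a hidden object, but never the other way round.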

How to write one? Step one – the transformation pipeline. It can be a bottleneck if it isn’t done properly. Step two – the clipper. Clipper code quality isn’t that important; just remember to clamp coordinates, or clip the x and y coordinates after the projection divide. Step three – a scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on the SPU with many threads and interpolants. Here I prototyped software occlusion culling for a “min-spec” PC (1–2 core CPU), so there is only one thread, one interpolant and the resolution is quite small. In this case scanlines were about 2–3 times faster than half-spaces.

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use float for positions). For depth-only rendering perspective interpolation is very easy – it’s enough to interpolate 1/z’ (z’ = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z’. There are some differences between a hi-res and a small-res z-buffer. Strictly, the pixel center should be shifted using dzdx and dzdy; in practice it’s enough to add some epsilon when testing objects.
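
Since the stored depth value is linear in screen space, the inner loop really is free of divides and multiplies – one add per pixel fills a span. A per-span sketch (function and names are mine; keeping the nearest depth per pixel, smaller = closer):

```cpp
// Depth-only scanline inner loop: the interpolated depth is linear in
// screen space, so filling a span costs one add per pixel.
#include <algorithm>

// Fill one span [x0, x1) of a z-buffer row, starting at depth z and
// stepping by dzdx, keeping the nearest depth per pixel.
void FillDepthSpan(float* row, int x0, int x1, float z, float dzdx)
{
    for (int x = x0; x < x1; ++x, z += dzdx)
        row[x] = std::min(row[x], z);
}
```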

Some rasterization performance results: full transformation pipeline and clipping, optimized with some SSE intrinsics; 500 randomly placed quads (each built from 2 triangles), no special optimizations for quads, all fully visible; 256×128 resolution, 1 thread. Timings per quad screen size in pixels, plus fill and vertex rates:

                             256×128    61×61      21×21      fill rate     vertex rate
i7 860 (2.8 GHz)             6.56 ms    1.75 ms    0.53 ms    2.50 GPix/s   0.025 GV/s
Core 2 Quad Q8200 (2.33 GHz) 9.20 ms    2.30 ms    0.67 ms    1.76 GPix/s   0.019 GV/s

This shows the true power of the i7 – almost 1 pixel filled per cycle :). In a real test case there should be something like 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels each), so it looks like 1–2 ms is enough for filling the software z-buffer. It could be optimized further for big triangles by quickly rejecting empty tiles and filling fully covered tiles with dedicated code (just like Larrabee does), which dramatically increases performance for large triangles.

Some object-testing performance results. Transformation time is not included – it should already be done for frustum culling and it’s quite small (0.33 ms on the i7 and 0.48 ms on the Core 2 Quad). Clipping included, optimized with some SSE intrinsics. 3k randomly placed quads (each fully visible). Worst case – no early out (cleared z-buffer). 256×128 resolution, 1 thread. Timings per quad screen size in pixels:

                             120×120    30×30      10×10
i7 860 (2.8 GHz)             2.26 ms    0.07 ms    0.02 ms
Core 2 Quad Q8200 (2.33 GHz) 3.30 ms    0.09 ms    0.03 ms

This also looks reasonably fast, and in a real test case the numbers should be around 1–2 ms. It could be further optimized with some kind of depth hierarchy (downscaling the z-buffer is very fast – something like 0.05 ms for the full mip-map chain).
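
The depth hierarchy is just repeated 2×2 downsampling. A sketch of one step (names are mine, same smaller-is-closer convention as before): taking the max per 2×2 block yields a conservative “farthest occluder” mip, so an object whose nearest depth is behind even that single value is hidden for the whole tile.

```cpp
// One downscale step of a depth hierarchy: each output texel holds the
// max (farthest) depth of its 2x2 source block, for conservative
// hierarchical occlusion tests.
#include <algorithm>
#include <vector>

std::vector<float> DownscaleMax(const std::vector<float>& src, int w, int h)
{
    std::vector<float> dst((w / 2) * (h / 2));
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x)
        {
            const float* p = &src[(2 * y) * w + 2 * x];
            dst[y * (w / 2) + x] =
                std::max(std::max(p[0], p[1]), std::max(p[w], p[w + 1]));
        }
    return dst;
}
```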

Software occlusion culling is quite cool – you can have skinned occluders :). It’s easy to write and easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 with a single thread software occlusion culling probably won’t be faster than beam trees, but IMHO on consoles it can be (no tree data structure to traverse) and it’s certainly easier to parallelize. Maybe one day I’ll add it to our engine at work and see how it handles real test cases.


Aggregated deferred lighting

A random idea about a new way to do deferred lighting: decouple lighting from geometry normals. To do that, lighting information is stored as aggregated lights (direction + color).

  1. 1st pass – z-prepass (render depth only)
  2. 2nd pass – render light geometry / quads / tiles…. Output an aggregated virtual directional light for every pixel: the weighted average of light directions and the weighted sum of light colors.
  3. 3rd pass – render geometry and shade it using the buffer with aggregated directional lights (and maybe add a standard forward directional light)

2nd pass render target layout:

RT0: aggregated light color RGB
RT1: aggregated light direction XYZ

We want to achieve this:

AggregatedLightColor = 0
AggregatedLightDir = 0

for every light
{
    AggregatedLightColor += LightColor * LightAttenuation
    AggregatedLightDir   += LightDir * intensity( LightColor * LightAttenuation )
}

In order to do this, we need:

  1. Init RT0 and RT1 with 0x00000000
  2. Set up additive blending states
  3. Output from the light pixel shader:
ColorRT0 = LightColor * LightAttenuation
ColorRT1 = LightDirection * dot( ColorRT0, ToGrayscaleVec )
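
In the 3rd pass, shading reduces to treating the aggregated buffers as one directional light per pixel. A plain-C++ sketch of that math, assuming a simple Lambert model as the example lighting model (the float3 helper and all names are mine):

```cpp
// 3rd-pass shading sketch: normalize the accumulated direction and
// apply the summed color as a single directional (Lambert) light.
#include <cmath>

struct float3 { float x, y, z; };

float3 Shade(float3 aggColor, float3 aggDir, float3 normal)
{
    float len = std::sqrt(aggDir.x * aggDir.x + aggDir.y * aggDir.y +
                          aggDir.z * aggDir.z);
    if (len < 1e-6f)
        return float3{0.f, 0.f, 0.f}; // no light reached this pixel

    float nDotL = (aggDir.x * normal.x + aggDir.y * normal.y +
                   aggDir.z * normal.z) / len;
    if (nDotL < 0.f) nDotL = 0.f;

    // The aggregated color stands in for the intensity of one light.
    return float3{aggColor.x * nDotL, aggColor.y * nDotL, aggColor.z * nDotL};
}
```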

Cons?

  • Light aggregation as a virtual directional light per pixel is an approximation. Moreover, we can’t properly blend directions using their arithmetic average, so with many lights per pixel (with opposing directions) it won’t be very accurate (but the error shouldn’t be too visible).

Benefits?

  • Flexibility – you can use almost any lighting model
  • You can render lighting at a lower resolution, since high-frequency normal map detail is added later. There will be artifacts at depth discontinuities, but maybe for some types of content (think desaturated and gray like Gears of War or Killzone 2 :)) they won’t be too visible
  • Less bandwidth and memory usage (compared to deferred lighting and shading, which stores the full specular color, not just its intensity)
  • A z-prepass is faster than rendering a G-buffer or normals + exponent
  • A bit simpler calculations – no need for encoding/decoding material properties (normal, exponent, …)

Now it’s time to find some free time and code a demo to compare it against deferred lighting/shading in a real application :).

P.S. Decoupling can also be done by storing lighting as spherical harmonics or cubemaps: link1 link2 link3 (thanks Hodgman from the gd.net forums). The downside of that method is the lack of proper specular, because the lighting data is low frequency, and it will be slower.

P.S. 2 It looks like it would be better to store directions as angles (RT1.xy – 2 weighted angles, RT1.z – sum of weights). That would ensure proper interpolation of the aggregated light direction.

UPDATE: I prototyped this method and it doesn’t work too well :). Comparison screenshot of a hard case for the idea – two point lights with very different colors influencing the same area. Left – normal lighting, right – aggregated direction and color:

[image: aggr]


Rendering Light Geometry for Deferred Shading/Lighting

An interesting idea from Call of Juarez 2 about rendering deferred light geometry. When the light geometry intersects the camera’s near plane, you need to flip culling and turn off the z-test. In COJ2, instead of testing the intersection on the CPU and switching states, they just push the light geometry vertices out in front of the near plane:

// vertex shader
float3 posCS = mul( in.pos, worldToCamera ).xyz;
posCS.z = max( posCS.z, nearPlaneZ + offset );      // clamp in front of the near plane
out.pos = mul( float4( posCS, 1. ), cameraToScreen );

Could be a win if you are CPU bound.


VC++ and multiple inheritance

Today at work we were optimizing memory usage. At some point we found out that the size of our basic data structures was x bytes bigger than the summed size of their members. Every basic data structure was written following Alexandrescu’s policy-based design – inheriting from some templated empty classes. Let’s look at a simple example:

#include <cstdio>

class A { };
class B { };
class C : public A, public B
{
    int test;
};

int main()
{
    printf( "%d\n", (int)sizeof( C ) );
    return 0;
}

The compiler uses 4-byte alignment. Will this program print 4? That depends. Compiled with GCC it prints 4, but compiled with VC++ (2005–2010) it prints 8.

Every class in C++ has to be at least 1 byte in size in order to have a valid memory address. With multiple inheritance in VC++, sizeof(C) = sizeof(A) + sizeof(B) + sizeof(members) + some alignment, because the empty base class optimization isn’t applied. So VC++’s behavior is correct, but not optimal. It’s strange that this was reported to MS in 2005 and they still haven’t fixed it.
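
A sketch of a possible workaround (class names are mine): chain the empty policy classes instead of inheriting from several of them at once, so the empty base class optimization applies along a single inheritance path. The 8-byte result for the multiple-inheritance case is specific to VC++ 2005–2010, so the sketch only asserts what holds on every compiler:

```cpp
// Empty base class optimization and multiple inheritance.
class A { };
class B { };

// Multiple inheritance from empty bases: VC++ 2005-2010 reportedly
// pads this to 8 bytes; GCC keeps it at 4.
class CMulti : public A, public B { public: int test; };

// Single inheritance chain: the empty bases collapse to zero size
// under the empty base class optimization on every major compiler.
class BChained : public A { };
class CChained : public BChained { public: int test; };
```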


CELL SDK installation on Fedora 13 x86_64

I just installed the IBM CELL SDK, CELL simulator and CELLIDE on my PC box running the newest Fedora. It was a quite painful process and I couldn’t find a full installation instruction anywhere. It’s a pity that IBM releases such interesting technology without proper support. So here is the full installation guide.

First, get all the needed RPMs and ISOs from IBM or another website:


systemsim-cell-3.1-25.f9.x86_64.rpm
sysroot_image-3.1-1.noarch.rpm
cell-install-3.1.0-0.0.noarch.rpm
CellSDK-Extras-Fedora_3.1.0.0.0.iso
CellSDK-Devel-Fedora_3.1.0.0.0.iso

Now let’s install the SDK and the simulator.


yum install tk rsync sed tcl wget
rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
cd /opt/cell
cellsdk --iso /root/cell/cellsdk/ install
cellsdk_sync_simulator install

Open ~/.bashrc with your favourite text editor and modify PATH there:


export PATH=$PATH:/opt/ibm/systemsim-cell/bin:/opt/cell/toolchain/bin

To run simulator from console use:


systemsim -g

Remember to use the fast simulator mode – it’s very useful even on the newest i7 :). Now let’s set up cellide (you don’t need to install Fedora Eclipse).


yum install cellide cell-spu-timing alf-ide-template fdpr-launcher ibm-java2-i386-jre

Time to download the fix pack 3.1-SDKMA-Linux-x86_64-IF01, “intended only for RHEL” :), so that Eclipse will detect the local cell simulator. Install it:


rpm -Uvh cellide-3.1.0-7.i386.rpm

Finally, run Eclipse with:


./eclipse -vm /opt/ibm/java2-i386-50/jre/bin


Color grading and tonemapping using photoshop

There is a very simple and nice idea in the UDK documentation about adding color grading to an engine the easy way, without writing any special tools. Color grading in UDK uses a LUT (a 3D texture mapping RGB to RGB) – pretty standard stuff. The interesting part is that this texture is authored in Photoshop. The LUT slices are added as Photoshop layers, and a game screenshot is set as the main layer. When you tweak the screenshot, the changes are automatically propagated to the layers with the LUT slices. After tweaking the screenshot, you just need to import the authored LUT slices and use them in game. So there is no need to duplicate Photoshop functionality and force game artists to work with unfamiliar custom tools.

Moreover, we could go further and use an HDR source screenshot and an HDR LUT. This way we could add tone mapping with color grading, contrast, saturation… in 30 minutes, without writing any tool code.
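
For clarity, the runtime side of a LUT is trivial – a toy CPU version of the lookup (the resolution and all names are my choices, not UDK’s): color grading is nothing more than indexing a 3D table with the source RGB.

```cpp
// Nearest-neighbor 3D LUT lookup; a real implementation (and the GPU)
// would filter trilinearly between the 8 surrounding LUT entries.
#include <array>

constexpr int N = 16; // LUT resolution per axis (assumed)

struct RGB { float r, g, b; };

using Lut3D = std::array<RGB, N * N * N>;

RGB ApplyLut(const Lut3D& lut, RGB c)
{
    auto idx = [](float v) {
        int i = static_cast<int>(v * (N - 1) + 0.5f);
        if (i < 0) i = 0;
        if (i > N - 1) i = N - 1;
        return i;
    };
    return lut[(idx(c.b) * N + idx(c.g)) * N + idx(c.r)];
}
```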


C++ compiler VS human

There is a very nice article on Gamasutra about writing a fast math library. It shows, for example, that using operator overloading generates slower code than using functions. A lot of people believe that the compiler will generate better code than a programmer can. They just happily write code and never check what their compiler generates.

Let’s test how well VC++ copes with simple algebra and OOP crap :). I decided not to be too harsh, so I omitted topics like SIMD, the FPU or aliasing. I chose VC++ 2008, because it’s the most popular compiler in the gamedev industry (and I think it will keep that position until the VC++ 2010 SP1 release 🙂).

Default release build settings – /O2 etc. – and very simple test cases: just int main(), scanf and printf.

// x/x = 1
int y = x / x;

00401018 mov ecx,dword ptr [esp+8]
0040101C mov eax,ecx
0040101E cdq
0040101F idiv eax,ecx

mov and idiv? FAIL.

// 0/x = 0
int y = 0 / x;

00401018 xor eax,eax
0040101A cdq
0040101B idiv eax,dword ptr [esp+8]

Another idiv, but at least the compiler uses xor instead of loading 0 from memory.

// z = x * x
// w = z * z
// y = w * w
int y = x * x * x * x * x * x * x * x;

00401018 mov eax,dword ptr [esp+8]
0040101C mov ecx,eax
0040101E imul ecx,eax
00401021 imul ecx,eax
00401024 imul ecx,eax
00401027 imul ecx,eax
0040102A imul ecx,eax
0040102D imul ecx,eax
00401030 imul ecx,eax

It was also too difficult for the compiler.
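
For reference, the strength reduction the comment above describes – which the compiler failed to find – written out by hand:

```cpp
// x^8 via three squarings instead of seven multiplies.
int Pow8(int x)
{
    int z = x * x; // x^2
    int w = z * z; // x^4
    return w * w;  // x^8
}
```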

Ok, so maybe let’s try some OO code?

#include <cstdio>

class Object0
{
public:
    virtual void Print() { printf( "a" ); }
};

int main()
{
    Object0 *obj = new Object0;
    obj->Print();
    return 0;
}

0040101E mov dword ptr [eax],offset Object0::`vftable' (402104h)
00401024 mov edx,dword ptr [eax]
00401026 mov ecx,eax
00401028 mov eax,dword ptr [edx]
0040102A call eax

A vftable lookup? Another failure. The generated code can be fixed by writing obj->Object0::Print(); or by removing the virtual keyword.

Remember to hit Alt+8 next time to open the disassembly window :).


Particle data structure

Particles are usually simple structs stored in an array, and every frame the vertex buffer is filled from that array. You really don’t want to sort them (at least not all particles – and sometimes sorted particles don’t even look good), but you do need to retain insertion order to prevent particle popping. So the simple trick of replacing a deleted element with the last one isn’t a good idea.

The straightforward solution is a circular buffer. This has some issues – if particles don’t have a uniform lifetime, it needs extra checks to detect dead particles when filling the vertex buffer. There will also be holes in the circular buffer, so memory won’t be used effectively.

Another solution is an array that is defragmented every frame. This means no conditional statements, but it can lead to a lot of memory copying (depending on particle lifetimes).

The interesting thing is that the above methods can be combined. The third method uses the observation that in a defragmented array you usually remove particles from the beginning and add to the end. So why not just have a large array and two moving pointers marking the begin and end of the particle data? When you add something – advance the end pointer. When you remove the first element – advance the begin pointer. On removal from the middle of the data – defragment by shifting data from the left or the right. When the end pointer can’t be advanced, defragment by copying the data back to the beginning of the array. This doesn’t change the worst case – removal from the middle – but should greatly speed up the average case.
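
A sketch of the combined scheme (all names are mine): one large array plus moving begin/end indices. Adds bump the end index, removals shift the shorter side, and when the end hits the capacity the live range is packed back to the front.

```cpp
// Particle storage: array with moving begin/end indices, defragmented
// lazily, preserving insertion order.
#include <cstring>
#include <vector>

struct Particle { float life; };

struct ParticlePool
{
    std::vector<Particle> data;
    size_t begin = 0, end = 0; // live particles occupy [begin, end)

    explicit ParticlePool(size_t capacity) : data(capacity) {}

    bool Add(const Particle& p)
    {
        if (end == data.size())
        {
            // No room on the right: pack the live range to the front.
            std::memmove(data.data(), data.data() + begin,
                         (end - begin) * sizeof(Particle));
            end -= begin;
            begin = 0;
            if (end == data.size())
                return false; // truly full
        }
        data[end++] = p;
        return true;
    }

    void Remove(size_t i) // i in [begin, end)
    {
        if (i - begin < end - i - 1)
        {
            // Left side is shorter: shift it right by one element.
            std::memmove(data.data() + begin + 1, data.data() + begin,
                         (i - begin) * sizeof(Particle));
            ++begin;
        }
        else
        {
            // Right side is shorter: shift it left by one element.
            std::memmove(data.data() + i, data.data() + i + 1,
                         (end - i - 1) * sizeof(Particle));
            --end;
        }
    }
};
```

memmove is fine here because the particle struct is trivially copyable; shifting whichever side is shorter is what keeps the average cost low.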


Premultiplied alpha

If you don’t know what premultiplied alpha is and why it solves all the world’s problems, just read Tom Forsyth on premultiplied alpha or Shawn Hargreaves on premultiplied alpha.

Pros:

  1. Better quality for textures with sharp alpha cutouts.
  2. Removes some blending state changes (which can be important if you are coding for a PC).
  3. You can mix additive blending with alpha blending without a state/shader/texture change.

What are the hidden cons, which aren’t mentioned by Tom or Shawn?

  1. Fixed pipeline fog doesn’t work with it.
  2. Worse quality for smooth alpha with DXT5 compression.

Why doesn’t premultiplied alpha work well with DXT5 for textures with smooth alpha gradients? In DXT5 the texture is divided into 4×4 pixel blocks. For every block, the colors (and alphas) are approximated by equidistant points on a line between two endpoints (using an index table). The color endpoints are quantized to 5:6:5 bits, while the alpha endpoints are stored with 8-bit precision. This means that for most textures we get better precision in the alpha channel than in the color channels.

Furthermore, compressing values that span a broader range preserves more information. For example, if the alpha channel is filled with 1/255 and the RGB values are in the range [0; 1], then the premultiplied texture’s RGB channels will contain only two different numbers – 0 or 1/255. This means that with standard alpha blending we keep better precision in case we want to multiply RGB by a factor greater than 1 or tone map the final result.
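
This can be checked in a couple of lines (the helper is mine; it mimics how a 5-bit DXT color endpoint quantizes a [0, 1] value): 1/255 quantizes to exactly 0, so the premultiplied value is lost, while the un-premultiplied 1.0 survives and the 8-bit alpha carries the 1/255 factor instead.

```cpp
// Uniform quantization of a [0,1] value to n bits, as for a DXT color
// endpoint (5 bits) or an alpha endpoint (8 bits).
#include <cmath>

float Quantize(float v, int bits)
{
    float levels = float((1 << bits) - 1);
    return std::round(v * levels) / levels;
}
```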

Standard alpha blending:

srcColor = Compress_color( src.rgb ) * Compress_alpha( src.a )

Premultiplied alpha blending:

srcColor = Compress_color( src.rgb * src.a )

Let’s see how it looks in practice. I created a sample texture with a “lighting” gradient in the RGB channels and a smoke puff in alpha (from left to right: RGB, alpha and RGB premultiplied by alpha):

[image: srcTile]

Now let’s alpha blend two DXT5-compressed textures, zoom in and compare the results (left – standard alpha blending, right – premultiplied alpha blending):

[image: cmp2]

The difference looks small here, but it can be quite visible in a real game – especially when smoke particles are big, their color is multiplied by a factor greater than 1, or some kind of tone mapping is used.

BTW there is an interesting feature/bug in the NVIDIA Photoshop texture tools: you can’t save a DXT5 with a fully black alpha channel (it just creates a DDS without an alpha channel).


SPU programming on a retail “fat” PS3 with Linux

Larrabee is approaching, so it’s a good time to learn more about coding a modern multi-core software renderer. Thanks to Sony, everyone with a retail “fat” PS3 can install a Linux distro and have fun with the SPUs. Not as much fun as with a real devkit – no Visual Studio with ProDG, no access to the RSX and no close-to-the-metal SPU libs. You can’t even manually assign SPU tasks to physical SPUs.

It took me some time to install and configure Linux (YDL 6.1). A small hint – if you have to use a window manager, use Fluxbox. It’s much faster on the retail PS3 than GNOME (slow), KDE (very slow) or Enlightenment (very slow). You can also work remotely using ssh/putty/the Eclipse IDE (Linux only).

For people without a retail PS3 there is a Cell simulator on the IBM site (again Linux only). Currently only the toolchain is ported to Windows (the Windows Cell SDK), so you can compile SPU code on Windows, but you can’t run it there.

Be prepared for some posts about software rendering and low-level SPU coding :).