Virtual memory on PC

There is an excellent post about virtual memory. It’s written mainly from a console developer’s perspective. On consoles most memory issues are TLB misses and the physical memory limit. I’ll try to write more about how (bad) it looks on PC (Windows) with 32-bit programs, especially nowadays when games require more and more data.

First, half of the program’s 4 GB virtual address space is taken by the kernel. This means the pointer’s top bit is unused and can be used for some evil trickery :). Moreover, the first and last 64 KB of user space are reserved by the kernel.

The program’s image and heap have to be loaded somewhere. When compiling with VC++ the default image base is 0x00400000. Then a bunch of DLLs are loaded at scattered virtual memory addresses. You can check which DLLs are loaded, at which addresses, and see their sizes using Dependency Walker. Use the start profiling feature to see the real virtual memory address of a given DLL. The DLLs and the program usually aren’t loaded into one contiguous address range. At this point we haven’t called new/malloc even once and virtual memory is already fragmented.
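To get a feel for the damage, here is a hypothetical sketch that computes the largest hole left in the 2 GB user address space after the image and some DLLs are mapped (on a real Windows build you would walk the space with VirtualQuery instead of a hardcoded list; the numbers below are made up):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<unsigned, unsigned> Region; // base address, size

// Returns the size of the largest free block between 64 KB and 2 GB - 64 KB,
// given the regions already reserved (image, DLLs, heaps...).
unsigned LargestFreeBlock( std::vector<Region> used )
{
    const unsigned first = 0x00010000u;  // first 64 KB is reserved
    const unsigned last  = 0x7FFF0000u;  // last 64 KB below 2 GB is reserved
    std::sort( used.begin(), used.end() );
    unsigned cursor = first, best = 0;
    for ( size_t i = 0; i < used.size(); ++i )
    {
        if ( used[ i ].first > cursor )
            best = std::max( best, used[ i ].first - cursor );
        cursor = std::max( cursor, used[ i ].first + used[ i ].second );
    }
    return std::max( best, last - cursor );
}
```

With just the exe at 0x00400000 (16 MB) and one DLL mapped at 0x10000000, the largest hole already shrinks to about 1.75 GB, and every extra DLL that lands in the middle of the range cuts it down further.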

Next comes the video driver. It will use precious virtual memory for managed resources, the command buffer and temporarily for locking non-managed resources. Creating/locking non-managed resources is especially misleading, as DirectX returns “out of video memory” instead of “out of virtual memory”. It’s very tempting to put all static level geometry into one 100 MB non-managed vertex buffer. When creating/filling this VB the video driver will try to allocate a contiguous 100 MB chunk of virtual memory. This will likely result in a program crash after some time.

Windows uses 4 KB pages, so doing smaller allocations will lead to internal fragmentation. I guess everyone is already using some kind of custom memory allocator, so it isn’t a big problem.

There is the /LARGEADDRESSAWARE linker flag, which allows using an additional 1 GB of virtual memory. On a 32-bit OS it requires the user to change boot parameters and usually doesn’t work well in practice (system stability issues etc.). It’s also possible to compile a 64-bit program, but according to the Steam HW survey half of gamers use a 32-bit OS. It’s really annoying that MS is still shipping 32-bit systems, because current min-spec PC game CPUs are Core 2 or similar, all with 64-bit support.

Summarizing: in theory memory shouldn’t be a problem on PC, but in practice it’s a precious and fragile resource.

Posted in PC | 1 Comment

Shader Optimizations

A small list of basic and sometimes overlooked shader optimization possibilities. These are very small gains, but they can add up, and maybe there will be some time left for an additional point light or a better shadow filter?

Full screen quad vs full screen triangle

Post-processing effects and color/z downsampling are usually rendered using a fullscreen quad. Hardware works on groups of at least 2×2 pixels, and pixel group size goes up with time (just like expected game resolution). For example NVIDIA Fermi has 4×2 pixel groups and the older NVIDIA G80–G92 use 2×2 quads. This means that rendering two fullscreen triangles creates some overlapping quads along the diagonal. At 1000×1000 resolution with 2×2 pixel quads, 500 quads will be shaded twice. Ignoring cache misses, that’s 0.2% of additional work. Besides, with a single fullscreen triangle there is one vertex less to push to the GPU :). In my synthetic test (on a crappy GeForce 240) the difference was around 0.2%–0.3%.
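As a sketch, on D3D10-class hardware the fullscreen triangle doesn’t even need a vertex buffer – it can be generated from SV_VertexID (the struct and function names here are made up):

```
struct SVsOut
{
    float4 m_pos : SV_Position;
    float2 m_uv  : TEXCOORD0;
};

// One oversized triangle covering the whole viewport:
// ids 0,1,2 map to clip-space positions (-1,1), (3,1), (-1,-3).
SVsOut vsFullscreenTriangle( uint id : SV_VertexID )
{
    SVsOut result;
    float2 uv = float2( ( id << 1 ) & 2, id & 2 );  // (0,0) (2,0) (0,2)
    result.m_pos = float4( uv * float2( 2., -2. ) + float2( -1., 1. ), 0., 1. );
    result.m_uv = uv;
    return result;
}
```

On D3D9 you can get the same effect by just putting the three vertices in a tiny static vertex buffer.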

Direct3D shader compiler (FXC) on PC

Instruction counts displayed by FXC on PC don’t mean much nowadays. It just translates HLSL to D3D asm, which is later translated by the driver to the hardware’s own instruction set. It’s quite possible to decrease the instruction count displayed by FXC and slow the shader down at the same time. Instead of relying on FXC numbers it’s better to check real performance (FPS/ms) and/or check the numbers generated by vendor tools (AMD GPU ShaderAnalyzer and NVIDIA ShaderPerf). They also display the GPR count, which is quite important, as it shows how much work can run in parallel.

Hardware instructions

  • MADD (multiply-add) is a single hardware instruction on GPUs. Convert code like “( x – a ) * b” to the form “x * c + d” (with precomputed constants c = b and d = -a * b). This can save 1 ALU instruction.
  • Saturate, negation and abs are instruction modifiers and are free. Yes, there is a free lunch :). Sometimes equations can be rearranged to use saturate instead of clamp/min/max. Negation and abs can also help to decrease the number of constant registers used.
  • Some instructions execute on the transcendental units. Transcendental units compute everything as scalars and there is roughly one of them per 2–8 ALUs. It’s a good idea to avoid excessive use of instructions like sin, cos, log, sqrt and pow (very bad – calculated using three instructions: log2, mul and exp2).
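For example, a small integer power is cheaper as plain multiplies than as pow (a sketch; exact codegen depends on the compiler and hardware):

```
// pow( x, 4. ) typically expands to log2 + mul + exp2 on the
// transcendental unit; squaring twice uses only regular ALU muls
float x2 = x * x;
float x4 = x2 * x2;
```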

Vectorize with care

Some GPUs have vector ALU units (AMD/ATI cards and NVIDIA cards older than G80) and some have scalar ones (NVIDIA G80, G92, Fermi). A vector ALU means that a scalar instruction takes the same time as a vector one computing 4 components at once. Usually people try to vectorize everything in shaders, which can add some extra computations and actually result in a slower shader on scalar ALU hardware. It’s a good idea to mask vector computations. For example in a blur shader there is no need to calculate the alpha channel, so just use float3 for accumulation. We could go further and even write two shader versions: one for vector ALUs and one for scalar ones. There is no point in vectorizing instructions computed on the transcendental units (sin, cos, log, pow…) – they are always scalar.

Clip/texkill

Consider adding clip/texkill (or alpha test, which can be faster on old hardware) when alpha blending is enabled. Think deferred lights, particles, volumetric light shafts. This can remove some work from the ROP units if you don’t have uber tight geometry.
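A minimal sketch of the idea (the sampler name and threshold are made up):

```
sampler2D sDiffuse;

float4 psParticle( float2 uv : TEXCOORD0 ) : COLOR0
{
    float4 color = tex2D( sDiffuse, uv );
    // kill fully (or nearly) transparent pixels before they reach blending
    clip( color.a - 1.0 / 255.0 );
    return color;
}
```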

Interpolators

Shader bottlenecks aren’t only about ALU, GPR and texture fetches. On rare occasions (or on some hardware) interpolators can become a bottleneck too. Sometimes with short pixel shaders it’s better to move computations from the vertex shader to the pixel shader if it helps to decrease the interpolator count.

// 8 interpolators and minimal ALU in pixel shader
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv[ i ];
    }
    return color;
}

// one interpolator and some ALU in pixel shader
float4 gSomeValMul[ 8 ];
float4 gSomeValAdd[ 8 ];
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv * gSomeValMul[ i ] + gSomeValAdd[ i ];
    }
    return color;
}

The 8-interpolator version runs in 28.17 ms (100 runs on a GeForce 240). The 1-interpolator + some ALU version runs in 21.44 ms (same as an empty pixel shader). This is of course a very specific case, but it’s still a good idea to watch out and pack interpolators.

Posted in Graphics | Leave a comment

Software occlusion culling

[image: rast]

Today CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from the gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn’t a new idea (HOM). Basically it’s filling a software z-buffer and testing some objects against it (usually their screen space bounding boxes). Rasterization is usually done at a small resolution (DICE uses 256×114). Testing can also be done using a hierarchical z-buffer (a min depth or min/max depth hierarchy).

How to write one? Step one – the transformation pipeline. It can be a bottleneck if it isn’t done properly. Step two – the clipper. Clipper code quality isn’t so important. Just remember to clamp coordinates or clip the x and y coordinates after the projection divide. Step three – a scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on the SPU with many threads and interpolants. This time I prototyped software occlusion culling for a “min-spec” PC (1–2 core CPU), so there is only one thread, one interpolant and the resolution is quite small. In this case scanlines were about 2–3 times faster than half-spaces.

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use float for positions). For depth-only rendering perspective interpolation is very easy – it’s enough to interpolate 1/z’ (z’ = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z’ at all. There are some differences between the hi-res and the small-res z-buffer. To fix them the pixel center depth should be shifted using dzdx and dzdy. In practice it’s enough to add some epsilon when testing objects.
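A hypothetical, deliberately naive half-space version of such a depth-only rasterizer (a production one would step the edge functions and depth incrementally, use integer coordinates, and process 2×2 blocks with SSE):

```cpp
#include <algorithm>
#include <vector>

struct Vtx { float x, y, z; };  // x,y in pixels, z holds 1/z'

static float Edge( const Vtx& a, const Vtx& b, float x, float y )
{
    return ( b.x - a.x ) * ( y - a.y ) - ( b.y - a.y ) * ( x - a.x );
}

// Depth-only half-space rasterizer: stores max(1/z') per pixel (bigger = closer).
void RasterizeDepth( const Vtx& a, const Vtx& b, const Vtx& c,
                     std::vector<float>& zbuf, int w, int h )
{
    float area = Edge( a, b, c.x, c.y );
    if ( area <= 0.f )
        return;  // backfacing or degenerate
    // 1/z' is linear in screen space, so normalizing once per triangle
    // leaves only multiply-adds (no divisions) per pixel
    float za = a.z / area, zb = b.z / area, zc = c.z / area;
    int x0 = std::max( 0, (int)std::min( { a.x, b.x, c.x } ) );
    int y0 = std::max( 0, (int)std::min( { a.y, b.y, c.y } ) );
    int x1 = std::min( w - 1, (int)std::max( { a.x, b.x, c.x } ) );
    int y1 = std::min( h - 1, (int)std::max( { a.y, b.y, c.y } ) );
    for ( int y = y0; y <= y1; ++y )
        for ( int x = x0; x <= x1; ++x )
        {
            float px = x + 0.5f, py = y + 0.5f;
            float w0 = Edge( b, c, px, py );
            float w1 = Edge( c, a, px, py );
            float w2 = Edge( a, b, px, py );
            if ( w0 < 0.f || w1 < 0.f || w2 < 0.f )
                continue;  // pixel center outside the triangle
            float z = w0 * za + w1 * zb + w2 * zc;
            float& dst = zbuf[ y * w + x ];
            dst = std::max( dst, z );
        }
}
```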

Some rasterization performance results: full transformation pipeline and clipping, optimized with some SSE intrinsics. 500 randomly placed quads (each consisting of 2 triangles), no special optimizations for quads, all fully visible. 256×128 resolution, 1 thread. CPU / quad pixel screen size:

                              256×128   61×61     21×21     fillrate      vertex rate
i7 860 (2.8 GHz)              6.56 ms   1.75 ms   0.53 ms   2.50 GPix/s   0.025 GV/s
Core 2 Quad Q8200 (2.33 GHz)  9.20 ms   2.30 ms   0.67 ms   1.76 GPix/s   0.019 GV/s

This shows the true power of the i7 – almost 1 pixel filled per cycle :). In a real test case there should be something like 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels), so it looks like 1–2 ms is enough to fill the software z-buffer. It could be optimized for big triangles by quickly rejecting empty tiles and filling fully covered tiles with special-cased code (just like Larrabee does). This dramatically increases performance for large triangles.

Some object testing performance results. Transformation time is not included – it should already be done for frustum culling and it’s quite small (0.33 ms on the i7 and 0.48 ms on the Core 2 Quad). With clipping, optimized with some SSE intrinsics. 3k randomly placed quads (each fully visible). Worst case – no early out (cleared z-buffer). 256×128 resolution, 1 thread. CPU / quad pixel screen size:

                              120×120   30×30     10×10
i7 860 (2.8 GHz)              2.26 ms   0.07 ms   0.02 ms
Core 2 Quad Q8200 (2.33 GHz)  3.30 ms   0.09 ms   0.03 ms

This also looks reasonably fast; in a real test case the numbers should be around 1–2 ms. It could be further optimized by using some kind of depth hierarchy (downscaling the z-buffer is very fast – something like 0.05 ms for a full mip-map chain).
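Building one level of that hierarchy is just a conservative 2×2 reduction. Since the buffer stores 1/z’, keeping the minimum keeps the farthest depth per block (a sketch assuming even dimensions):

```cpp
#include <algorithm>
#include <vector>

// One mip level of the 1/z' buffer: the min over each 2x2 block keeps the
// farthest depth, so "the object's nearest point is behind the tile's
// farthest pixel" can reject a whole tile with a single compare.
std::vector<float> DownsampleDepth( const std::vector<float>& src, int w, int h )
{
    std::vector<float> dst( ( w / 2 ) * ( h / 2 ) );
    for ( int y = 0; y < h / 2; ++y )
        for ( int x = 0; x < w / 2; ++x )
        {
            const float* row0 = &src[ ( 2 * y ) * w + 2 * x ];
            const float* row1 = row0 + w;
            dst[ y * ( w / 2 ) + x ] =
                std::min( std::min( row0[ 0 ], row0[ 1 ] ),
                          std::min( row1[ 0 ], row1[ 1 ] ) );
        }
    return dst;
}
```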

Software occlusion culling is quite cool – you can have skinned occluders :). It’s easy to write and easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 with a single thread software occlusion culling probably won’t be faster than beam trees, but IMHO on consoles it can be faster (no tree data structure traversal) and it’s certainly easier to parallelize. Maybe one day I’ll try to add it to our engine at work and see how it handles real test cases.

Posted in Graphics | 4 Comments

Aggregated deferred lighting

A random idea for a new way to do deferred lighting. The idea is to decouple lighting from geometry normals. In order to do that, the lighting information is stored as one aggregated light per pixel (direction + color).

  1. 1st pass – z-prepass ( just render depth )
  2. 2nd pass – render light geometry / quads / tiles… Output an aggregated virtual directional light for every pixel: a weighted average of light directions and a weighted sum of light colors.
  3. 3rd pass – render geometry and shade using buffer with aggregated directional lights (and maybe add standard forward directional light)

2nd pass render target layout:

RT0: aggregated light color RGB
RT1: aggregated light direction XYZ

We want to achieve this:

AggregatedLightColor = 0.
AggregatedLightDir = 0.

for every light
{
    AggregatedLightColor += LightColor * LightAttenuation
    AggregatedLightDir += LightDir * intensity(LightColor * LightAttenuation)
}

In order to do this, we need:

  1. Init RT0 and RT1 with 0x00000000
  2. Setup additive blending states
  3. Output from light pixel shader:
ColorRT0 = LightColor * LightAttenuation
ColorRT1 = LightDirection * dot( ColorRT0, ToGrayscaleVec )
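The 3rd pass then reads both targets and applies any lighting model. A minimal diffuse-only sketch (sampler and function names are made up):

```
sampler2D sAggColor;    // RT0 from the 2nd pass
sampler2D sAggDir;      // RT1 from the 2nd pass

float3 ShadePixel( float2 uv, float3 normal, float3 albedo )
{
    float3 lightColor = tex2D( sAggColor, uv ).rgb;
    float3 lightDir   = normalize( tex2D( sAggDir, uv ).xyz );
    // one virtual directional light per pixel; high frequency
    // normal map detail is applied here, not in the light pass
    return albedo * lightColor * saturate( dot( normal, lightDir ) );
}
```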

Cons?

  • Light aggregation into a virtual directional light per pixel is an approximation. Moreover, we can’t properly blend directions by simple arithmetic averaging. It means that with many lights per pixel (with opposing directions) it won’t be too accurate (but it shouldn’t be too visible).

Benefits?

  • Flexibility. You can use almost any lighting model
  • You can render lighting at a lower resolution, as high frequency normal map details are added later. There will be artifacts at depth discontinuities, but maybe for some types of content (think desaturated and gray like Gears of War or Killzone 2 :)) they won’t be too visible
  • Less bandwidth and memory usage (compared to deferred lighting and shading, which stores the full specular color, not just its intensity).
  • Z prepass is faster than rendering GBuffer or normals + exponent
  • A bit simpler calculations. No need for encoding / decoding material properties (normal, exponent,…).

Now it’s time to find some free time and code a demo in order to compare it to deferred lighting/shading in real application :).

P.S. Decoupling can also be done by storing lighting as spherical harmonics or cubemaps: link1 link2 link3 (thanks Hodgman from the gd.net forums). The downside of that method is the lack of proper specular, because of the low-frequency lighting data, and it will be slower.

P.S. 2: It looks like it would be better to store directions as angles (RT1.xy – two weighted angles, RT1.z – sum of weights). That would ensure proper aggregated light direction interpolation.

UPDATE: I prototyped this method and it doesn’t work too well :). Comparison screenshot with a hard case for the idea – two point lights with very different colors influencing the same area. Left – normal lighting, right – aggregated to one direction and color:

[image: aggr]

Posted in Graphics, Lighting | 8 Comments

Rendering Light Geometry for Deferred Shading/Lighting

An interesting idea from Call of Juarez 2 about rendering deferred light geometry. When the light geometry intersects the camera’s near plane you need to switch culling and turn off the z-buffer test. In COJ2, instead of testing the intersection on the CPU and switching states, they just push the light geometry vertices out in the vertex shader:

// vertex shader
float3 posCS = mul( in.pos, worldToCamera ).xyz;
// clamp camera-space z, so the geometry never crosses the near plane
posCS.z = max( posCS.z, nearPlaneZ + offset );
out.pos = mul( float4( posCS, 1. ), cameraToScreen );

Could be a win if you are CPU bound.

Posted in Graphics | 3 Comments

Siggraph 2010 Papers

A small list of Siggraph 2010 papers (I’ll try to keep it up to date):

Posted in Conference | Leave a comment

VC++ and multiple inheritance

Today at work we were optimizing memory usage. At some point we found out that the size (on the stack) of our basic data structures was x bytes bigger than the summed size of their members. Every basic data structure was written following Alexandrescu’s policy-based design – using inheritance from some templated empty classes. Let’s look at a simple example:

#include <cstdio>

class A { };
class B { };
class C : public A, public B
{
    int test;
};

int main()
{
    printf( "%d\n", (int)sizeof( C ) );
    return 0;
}

The compiler uses 4-byte alignment. Will this program print 4? That depends. Compiled with GCC it prints 4, but compiled with VC++ (2005–2010) it prints 8.

Every object in C++ has to be at least 1 byte in size in order to have a valid memory address. The empty base class optimization should remove this overhead for empty bases, but VC++ applies it only to the first one – with multiple empty bases sizeof(C) becomes sizeof(int) + 1 byte for the extra base + alignment padding. So VC++’s behavior is correct, but not optimal. It’s strange that this was reported to MS in 2005 and they still haven’t fixed it.
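One possible workaround (a sketch; whether it helps depends on the compiler version) is to chain the empty policies instead of inheriting them side by side, so the empty base optimization can apply along the whole chain:

```cpp
#include <cstddef>

class A { };
class B : public A { };   // chain the empty bases...
class C : public B        // ...so C has a single (empty) base
{
    int test;
};

// with the chain, GCC keeps sizeof( C ) equal to sizeof( int ),
// and VC++ should no longer pad for a second empty base
```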

Posted in C++ | 4 Comments

CELL SDK installation on Fedora 13 x86_64

I just installed the IBM CELL SDK, CELL simulator and CELL IDE on my PC box with the newest Fedora. It was quite a painful process and I couldn’t find complete installation instructions anywhere. It’s a pity that IBM releases such interesting technology without proper support. So here is the full installation guide.

First get all the needed rpm’s and iso’s from IBM or other website:


systemsim-cell-3.1-25.f9.x86_64.rpm
sysroot_image-3.1-1.noarch.rpm
cell-install-3.1.0-0.0.noarch.rpm
CellSDK-Extras-Fedora_3.1.0.0.0.iso
CellSDK-Devel-Fedora_3.1.0.0.0.iso


Now let’s install sdk and simulator.


yum install tk rsync sed tcl wget
rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
cd /opt/cell
cellsdk --iso /root/cell/cellsdk/ install
cellsdk_sync_simulator install


Open ~/.bashrc with your favourite text editor and modify PATH there:


export PATH=$PATH:/opt/ibm/systemsim-cell/bin:/opt/cell/toolchain/bin

To run simulator from console use:


systemsim -g

Remember to use the fast simulator mode; it’s very useful even on the newest i7 :). Now let’s set up the CELL IDE (you don’t need to install Fedora Eclipse).


yum install cellide cell-spu-timing alf-ide-template fdpr-launcher ibm-java2-i386-jre

Time to download the fix pack 3.1-SDKMA-Linux-x86_64-IF01 (“intended only for RHEL” :)), so Eclipse will detect the local CELL simulator. Install it:


rpm -Uvh cellide-3.1.0-7.i386.rpm

Finally run eclipse with:


./eclipse -vm /opt/ibm/java2-i386-50/jre/bin

Posted in CELL | 5 Comments

Color grading and tonemapping using photoshop

There is a very simple and nice idea in the UDK documentation about adding color grading to an engine in an easy way, without writing any special tools. Color grading in UDK uses a LUT (a 3D texture mapping RGB to RGB). Pretty standard stuff. The interesting part is that this texture is authored in Photoshop. The LUT slices are added as Photoshop layers and a game screenshot is set as the main layer. When you tweak the screenshot, the changes are automatically propagated to the layers with the LUT slices. After tweaking the screenshot you just need to import the authored LUT slices and use them in game. So there is no need to duplicate Photoshop functionality and force game artists to work with unfamiliar custom tools.

Moreover, we could go further and use an HDR source screenshot and an HDR LUT. This way we could add tonemapping with color grading, contrast, saturation… in 30 minutes, without writing any tool code.
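Applying such a LUT at the end of the post-process chain is a single volume texture fetch. A sketch assuming a 16×16×16 LUT (the sampler and function names are made up):

```
sampler3D sColorGradingLUT;

float3 ApplyLUT( float3 color )
{
    // rescale so 0 and 1 land on the centers of the border texels
    const float size = 16.;
    float3 uvw = color * ( ( size - 1. ) / size ) + ( 0.5 / size );
    return tex3D( sColorGradingLUT, uvw ).rgb;
}
```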

Posted in Graphics, Post Processing | Leave a comment

C++ compiler VS human

There is a very nice article on Gamasutra about writing a fast math library. It shows, for example, that using operator overloading can generate slower code than using plain functions. A lot of people believe that the compiler will generate better code than a programmer can. They just happily write code and don’t check what their compiler generates.

Let’s test how good VC++ is against simple algebra and OOP crap :). I decided not to be too harsh, so I omitted topics like SIMD, the FPU and aliasing. I chose VC++ 2008 because it’s the most popular compiler in the gamedev industry (and I think it will maintain its position until the VC++ 2010 SP1 release 🙂).

Default release build settings – /O2 etc. + very simple test cases, just int main(), scanf and printf.

// x/x = 1
int y = x / x;

00401018 mov ecx,dword ptr [esp+8]
0040101C mov eax,ecx
0040101E cdq
0040101F idiv eax,ecx

mov and idiv? FAIL.

// 0/x = 0
int y = 0 / x;

00401018 xor eax,eax
0040101A cdq
0040101B idiv eax,dword ptr [esp+8]

Another idiv, but at least the compiler uses xor instead of loading 0 from memory.

// z = x * x
// w = z * z
// y = w * w
int y = x * x * x * x * x * x * x * x;

00401018 mov eax,dword ptr [esp+8]
0040101C mov ecx,eax
0040101E imul ecx,eax
00401021 imul ecx,eax
00401024 imul ecx,eax
00401027 imul ecx,eax
0040102A imul ecx,eax
0040102D imul ecx,eax
00401030 imul ecx,eax

It was also too difficult for the compiler.
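The squaring chain from the comment, written out by hand, gets the expected three multiplies:

```cpp
// three imuls instead of seven: x^8 = ((x^2)^2)^2
int Pow8( int x )
{
    int z = x * x;   // x^2
    int w = z * z;   // x^4
    return w * w;    // x^8
}
```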

Ok, so maybe let’s try some OO code?

class Object0
{
public:
    void virtual Print() { printf( "a" ); }
};

int main()
{
    Object0 *obj = new Object0;
    obj->Print();
    return 0;
}

0040101E mov dword ptr [eax],offset Object0::`vftable' (402104h)
00401024 mov edx,dword ptr [eax]
00401026 mov ecx,eax
00401028 mov eax,dword ptr [edx]
0040102A call eax

vftable? Another failure. The generated code can be fixed by writing obj->Object0::Print(); or by removing the virtual keyword.

Remember to hit alt+8 next time to open the disasm window :).

Posted in C++ | Leave a comment