ACES Filmic Tone Mapping Curve

Careful mapping of HDR values to LDR is an important part of a modern game rendering pipeline. One of the goals of our new renderer was to replace Reinhard‘s tone mapping curve with some kind of a filmic tone mapping curve. We tried one from Ucharted 2 and tried rolling our own, but weren’t happy with either of this solutions. Finally, we settled on the one from ACES, which is currently a default tone mapping curve in Unreal Engine 4.

ACES color encoding system was designed for seamless working with color images regardless of input or output color space. It also features a carefully crafted filmic curve for displaying HDR images on LDR output devices. Full ACES integration is a bit of overkill for games, but we can just sample ODT( RRT( x ) ) transform and fit a simple curve to this data. We don’t even need to run any ACES code at all, as ACES provides reference images for all transforms. Although there is no linear RGB D65 ODT transform, but we can just use REC709 D65 and remove 2.4 gamma from it.

Curve was manually fitted (max fit error: 0.0138) to be more precise in the blacks – after all we will be applying some kind gamma afterwards. Additionally, data was pre-exposed, so 1 on input maps to ~0.8 on output and resulting image’s brightness is more consistent with the one without any tone mapping curve at all. For the original ACES curve just multiply input (x) by 0.6.

Fitted curve’s HLSL source code (free to use under public domain CC0 or MIT license):

float3 ACESFilm(float3 x)
    float a = 2.51f;
    float b = 0.03f;
    float c = 2.43f;
    float d = 0.59f;
    float e = 0.14f;
    return saturate((x*(a*x+b))/(x*(c*x+d)+e));

Fitted curve plotted against source data’s sample points:


UPDATE: This is a very simple luminance only fit, which over saturates brights. This was actually something consistent with our art direction, but for a more realistic rendering you may want a more complex fit like this one from Stephen Hill.

Posted in Graphics, Lighting | 33 Comments

Analytical DFG Term for IBL

Image-based lighting is an important part of a physically based rendering. Unfortunately straightforward IBL implementation for more complicated lighting models than Phong requires a huge lookup table and isn’t practical for real time. Current state of the art approach is split sum approximation [Kar13], which decomposes IBL integral into two terms: LD and DFG. LD is stored in standard cube map and DFG is stored in one global 2D LUT texture. This texture is usually 128×128 R16G16F, contains scale and bias for specular color and is indexed by roughness/gloss and ndotv. DFG LUT is quite regular and looks like it could be efficiently approximated by some kind of low order polynomial.

My main motivation was to create a custom 3ds Max shader, so artists could see how their work will look in our engine. Of course 3ds Max supports custom textures, but it’s not very user friendly and error prone when artists need to assign some strange LUT texture. It’s better to hide such internal details. Furthermore it can be beneficial for performance, as you can replace memory lookup with ALU. Especially on bandwidth constrained platforms like mobile devices.

Surface fitting

There are many surface fitting tools, which given some data points and equation, automatically find best coefficients. It’s also possible to transform curve fitting problem into a nonlinear optimization problem and use tool designed for solving them. I prefer to work with Matlab, so of course I used Matlab’s cftool. It’s a separate application with GUI. You just enter an equation and it automatically fits functions, plots surface against data points and computes error metrics like SSE or RMSE. Furthermore you can compare side by side with previous approximations. Popular Mathematica can also easily fit surfaces (FindFit), but it requires more work, as you need to write some code for plotting and calculating error metrics.

Usually curve fitting is used for smoothing data, so most literature and tools focus on linear functions like polynomial and Gaussian curves. For real-time rendering polynomial curves are most cost efficient on modern scalar architectures like GCN. Polynomial curves avoid costly transcendentals (exp2, log2 etc.), which are quarter rate on GCN. For extra quality add freebies like saturate or abs to constrain function output. In some specific cases it’s worth to add other full rate instructions like min, max or cndmask.

Most fitting is done with non linear functions, where fitting tools often are stuck in a local solution. In order to find a global one you can either write a script which fits for different starting points and compares results or just try a few points by hand until plotted function will look good. For more complicated cases there are smarter tools for finding global minimum like Matlab’s MultiStart or GlobalSearch.

Last thing is not only to try polynomial of some order, but also play with all it’s variables. Usually I first search for order of polynomial which properly approximates given data and then try to remove higher order variables and compare results. This step could be automatized to check all variable combinations. I never did it, as higher order are impractical for real time rendering, so there aren’t too many combinations.


First I tried to generate LUT inside Matlab, but it was too slow compute, so I switched to C++ and loaded that LUT as CSV. Full C++ source for LUT generation is on Github. It uses popular GGX distribution, Smith geometry term and Schlick’s Fresnel approximation. Additionally I use roughness remap gloss=(1-roughness)^{4} which results results in similar distribution to Blinn-Phong 2^{gloss*16}. This remap is also similar to gloss=(1-roughness*0.7)^{6}, which was used by Crytek in Ryse [Sch14].

for ( unsigned y = 0; y < LUT_HEIGHT; ++y )
    float const ndotv = ( y + 0.5f ) / LUT_WIDTH;

    for ( unsigned x = 0; x < LUT_WIDTH; ++x )
        float const gloss = ( x + 0.5f ) / LUT_HEIGHT;
        float const roughness = powf( 1.0f - gloss, 4.0f );

        float const vx = sqrtf( 1.0f - ndotv * ndotv );
        float const vy = 0.0f;
        float const vz = ndotv;

        float scale = 0.0f;
        float bias = 0.0f;

        for ( unsigned i = 0; i < sampleNum; ++i )
            float const e1 = (float) i / sampleNum;
            float const e2 = (float) ( (double) ReverseBits( i ) / (double) 0x100000000LL );

            float const phi = 2.0f * MATH_PI * e1;
            float const cosPhi = cosf( phi );
            float const sinPhi = sinf( phi );
            float const cosTheta = sqrtf( ( 1.0f - e2 ) / ( 1.0f + ( roughness * roughness - 1.0f ) * e2 ) );
            float const sinTheta = sqrtf( 1.0f - cosTheta * cosTheta );

            float const hx = sinTheta * cosf( phi );
            float const hy = sinTheta * sinf( phi );
            float const hz = cosTheta;

            float const vdh = vx * hx + vy * hy + vz * hz;
            float const lx = 2.0f * vdh * hx - vx;
            float const ly = 2.0f * vdh * hy - vy;
            float const lz = 2.0f * vdh * hz - vz;

            float const ndotl = std::max( lz, 0.0f );
            float const ndoth = std::max( hz, 0.0f );
            float const vdoth = std::max( vdh, 0.0f );

            if ( ndotl > 0.0f )
                float const gsmith = GSmith( roughness, ndotv, ndotl );
                float const ndotlVisPDF = ndotl * gsmith * ( 4.0f * vdoth / ndoth );
                float const fc = powf( 1.0f - vdoth, 5.0f );

                scale += ndotlVisPDF * ( 1.0f - fc );
                bias += ndotlVisPDF * fc;

            scale /= sampleNum;
            bias /= sampleNum;

Code above outputs texture like this:



[Laz13] presented an analytical solution to DFG term. He used Blinn-Phong distribution, so first I fitted his approximation for GGX and my roughness remap. Instead of storing scale directly, delta is used (scale = delta – bias). It simplifies fitting as delta is a simpler surface than scale. Additionally to get a tighter fit I added saturate for bias and delta values.

float3 EnvDFGLazarov( float3 specularColor, float gloss, float ndotv )
    float4 p0 = float4( 0.5745, 1.548, -0.02397, 1.301 );
    float4 p1 = float4( 0.5753, -0.2511, -0.02066, 0.4755 );

    float4 t = gloss * p0 + p1;

    float bias = saturate( t.x * min( t.y, exp2( -7.672 * ndotv ) ) + t.z );
    float delta = saturate( t.w );
    float scale = delta - bias;

    bias *= saturate( 50.0 * specularColor.y );
    return specularColor * scale + bias;



Then I tried to find a better approximation. I focused on simple instructions in order to avoid transcendentals like exp, which are quarter rate on GCN. I tried many ideas for bias fitting – from simple polynomials to expensive Gaussians. Finally settled on two polynomials oriented to axes and combined with min. One depends only on x and second only on y. Fitting delta was easy – 2nd order polynomial with some additional term did the job.

float3 EnvDFGPolynomial( float3 specularColor, float gloss, float ndotv )
    float x = gloss;
    float y = ndotv;

    float b1 = -0.1688;
    float b2 = 1.895;
    float b3 = 0.9903;
    float b4 = -4.853;
    float b5 = 8.404;
    float b6 = -5.069;
    float bias = saturate( min( b1 * x + b2 * x * x, b3 + b4 * y + b5 * y * y + b6 * y * y * y ) );

    float d0 = 0.6045;
    float d1 = 1.699;
    float d2 = -0.5228;
    float d3 = -3.603;
    float d4 = 1.404;
    float d5 = 0.1939;
    float d6 = 2.661;
    float delta = saturate( d0 + d1 * x + d2 * y + d3 * x * x + d4 * x * y + d5 * y * y + d6 * x * x * x );
    float scale = delta - bias;

    bias *= saturate( 50.0 * specularColor.y );
    return specularColor * scale + bias;



Some screenshots comparing reference and two approximations:


Instruction histograms on GCN architecture:

Lazarov Polynomial fit
v_exp_f32 1
v_mac_f32 3 3
v_min_f32 1 1
v_mov_b32 4 2
v_mul_f32 1 5
v_add_f32 2
v_madmk_f32 4
v_mad_f32 2 1
v_subrev_f32 1 1
total cycles: 16 19


To sum up I presented here a simple analytical function for DFG approximation. In practice it’s hard to distinguish this approximation from reference and it uses a moderate amount of ALU.


[Kar13] B. Karis – “Real Shading in Unreal Engine 4”, Siggraph 2013
[Laz13] D. Lazarov – “Getting More Physical in Call of Duty: Black Ops II”, Siggraph 2013
[Sch14] N. Schulz – “The Rendering Technology of Ryse”, GDC 2014

Posted in Graphics, Lighting | 13 Comments

Lightmapping in Anomaly 2 mobile


In 2013 Anomaly 2 mobile version (iOS/Android/Blackberry) by a small indie studio 11 Bit Studios was released. It was an interesting project as we needed to run heavy content from PC version on much weaker mobile platforms. I’d like to write about rendering technology behind it and share my experiences about implementing a lightmap baker. Especially because there isn’t too much information about lightmapping and rendering on mobile devices.

Anomaly 2 mobile goal was to try to reach PC version quality and reuse as many assets as possible. PC version had tons of dynamic lights, dynamic shadows, SSAO and similar effects. There was no way to run it on mobile directly. Programming graphics for mobile feels like going 5-10 years back in time. Comparing to PC or consoles mobile GPUs are slow, games are rendered in high resolutions and there is a solid amount of memory available. A perfect fit for some kind of precomputed lighting.

Lightmapper overview

Anomaly 1 mobile version had simple lightmaps, which I hacked in a few hours. Baking was done in two steps. First step rendered PC real-time directional light with shadows. Second step added ambient occlusion by rendering manually placed darkening quads. Lightmap was applied only to the terrain, which consisted of a single textured 2D plane.

For Anomaly 2 mobile we decided to write a proper baked lighting solution. Unfortunately it had to be based on DX9. At that time our game editor was based on DX9 as we had no time for implementing DX11 support. DX11 enables new possibilities for optimizations. Namely compute shaders and smaller draw call overhead. Both are very important, as GPU lightmapper is often bottlenecked by CPU and some operations aren’t a good fit for pixel shaders.

There are many lightmap format flavors – plain accumulated diffuse lighting, directional normal maps (radiosity normal maps), spherical harmonics or dominant light direction (ambient and directional light per texel). For a good overview check out presentation by Illuminate Labs (Beast creators) [Lar10].

Graphics artists wanted high lightmap density for sharp shadows and detailed lighting. Additionally changing difficulty levels changed objects placement, so some parts of lightmap needed to be stored multiple times per difficulty level. There was no way we could use any format other than accumulated diffuse because of memory requirements. Storing accumulated diffuse is also the fastest method and every millisecond counts on mobile. Unfortunately selected format doesn’t support normal mapping, so I had to resort to a hack to add normal maps.

For baking I went with a classical approach. Render a hemicube from a POV of every lightmap texel to gather irradiance at that point and later integrate it to compute radiance [Eli00]. Similar to solution used in The Witness, which is described by Ignacio Castaño in a series of great posts on their blog [Cas10] [Cas10].

Object chart UV

First step is to generate some unique UV for every static object in the game which has lightmaps. It’s important to place seams manually in places where lighting is discontinuous. Those seems will prevent lighting from leaking across hard edges. Theoretically hard edges information should be read from mesh source file. For our case it was easier. XSI by default creates hard edges for angles greater than 60 degrees, so it was enough to place a seam in places where angle between normals was higher than mentioned value.


For chart UV generation D3DX UVAtlas was used. It’s based on research made 10 years ago (“Iso-Charts” [ZSGS04] and “Signal Parametrization” [SGSH02]), so actually you can find better algorithms nowadays. Especially interesting are surface quadrangulation algorithms like [BZK09] or [ZHLB10], which are compatible with “invisible seam” UV algorithm [RNLL10]. Pretty intense stuff, but fixes seam issues forever.

Chart UV are generated so it has appropriate density (texel per square meter). I used two metrics – average density around target and min density no less than 90% of target.

Level creation pipeline was heavily based on prefabs (object instances). There was no point in using methods like signal based parametrization [SGSH02]. Theoretically It could be used in order to place more detail on roofs of buildings and less on the sides. In practice there was no way to find how that building was used. Every instance could have custom rotation and non-uniform scale.

UV chart packing per object

Most charts have rectangular shapes, so first all charts are rotated to their best fit rectangle. Then charts are sorted by area and max side. Finally charts are packed using a brute force chart packer. It introduces one chart at time, testing all possible locations and all combinations of multiple of 90 angle rotations and mirror transforms. Best location is chosen using extent metrics. There was one additional constraint in order to minimize texture compression artifacts – max 3 charts per 4×4 texel block.

Most papers use tetris like packing schemes. Nowadays there is plenty power for a bruteforce solution, which achieves much better pack ratios. For faster packing even GPU could be used:

1. Rasterize all possible chart combinations or blit prerasterized sprites using additive blending.
2. Test for chart overlap and other constraint violations using pixel or compute shader.

For us an optimized CPU packer was fast enough, as we were packing charts only once per prefab (object template) during mesh import. When packing charts for an entire level into one atlas a GPU approach can help a lot. In that case there is one additional trick to achieve better pack ratios. That is to leverage GPU wrap texture address mode and allow charts to wrap around the borders of the atlas [NS11].

Most our levels were city landscapes with a lot of hard edges. This means a lot of small UV charts which require a lot of padding. I have used some tricks to reduce those borders:

1. 1×1 texel charts were collapsed and snapped to texel center.
2. 1xN / Nx1 texel charts were similarly collapsed to a line and snapped to texel centers.
3. NxN charts were resized and aligned to texel centers.


Atlas packing

After placing objects on the level their lightmaps were packed into multiple 2kx2k atlases. Classic binpacking using a binary tree was used [Sco03]. First all objects(rectangles) were sorted by their area and max side. Then objects were inserted into binary tree one at a time. If couldn’t insert into first 2kx2k atlas, then tried to insert into second etc. Every object was surrounded by a proper border (1 texel width). Additionally every object was resized and aligned to full 4×4 texel block in order to minimize texture compression artifacts.

After packing unique scale and bias lightmap UV parameters were assigned to every instance. At runtime scale and bias were applied to lightmap UV in vertex shader. It allowed to reuse lightmap UV per object, so all mesh instances could share the same vertex buffer. Additionally every mesh has unique UV mapping, which can be used for other purposes.

Direct lighting baking

In order to bake lightmap we first need to calculate appropriate world position and normal per lightmap texel. Most straightforward way is to rasterize the geometry in the lightmap UV space, writing position and normal. It’s best to use conservative rasterization with analytical antialiasing. Conservative rasterization ensures that every texel covered by geometry will be included. Not only those where texel center is covered by geometry. Analytical antialiasing ensures that the best centroid will be taken.

Actual rasterization was done using half-space rasterization with a bit tweaked fill rules and line equations in order to make rasterization conservative. For each rasterized texel, source triangle was clipped to that texel’s bounding box and its area and centroid was calculated. In a case of multiple overlapping triangles per texel the one with highest texel area coverage was picked. Finally centroid was used to calculate world position and normal. Very similar rasterization implementation can be found in NVIDIA mesh processing tool sources.

Direct lighting baking was done by looping through lights and drawing quads which covered affected objects in lightmap. Almost the same pixel shader was used as for real time lights used in PC version, just with disabled specular etc. In order to sample shadows, objects were batched according to location, shadow map was focused on that batch and entire batch was rendered. This step outputs a set of FP16 lightmaps with accumulated direct diffuse lighting.

Indirect lighting baking

To bake indirect (bounced) lighting and ambient occlusion hemicubes were used. Apart from already calculated world position and normal, hemicube rendering requires some arbitrary “up direction”. It can be either constant direction resulting in banding or a random one – resulting in noise. For us random “up direction” worked best. Just needed to make sure that random seed is dependent on world position.Without it some lightmaps which were generated per difficulty level wouldn’t fit with the rest of lightmaps, which were generated once.

Now we have, per lightmap texel, a hemicube position (world position), hemicube direction (world normal) and hemicube up direction. Straightforward hemicube approach requires rendering scene to 5 different viewports per every lightmap texel. For better performance I decided to use a single plane with a ~126.87 degrees FOV (double side length of image plane compared to hemicube rending). In our case it looked almost as good, but turned out to be a few times faster. There is also a question how to treat missing samples. According to [Gau08] the best approach is to replicate edge texels – increase weights on edges. In our case simple weight renormalization looked best.

When baking sometimes camera’s near plane intersects with nearby surfaces. A simple solution would be to move near plane closer to the eye until that overlap disappears. Unfortunately moving that plane too much introduces depth buffer artifacts. To fix it pixel shader tests for back faces using VFACE (SV_IsFrontFace). When back faces are encountered ambient occlusion and bounced light is set to zero. Theoretically this is not always correct, but in practice looked good and fixed all visual issues.

Finally, gathered radiance needs to be converted to irradiance. It requires weighting samples by cosinus factor and solid angle. Texels located at cube map corners have smaller solid angle and should influence the result less. Solid angle can be calculated by projecting texel onto unit sphere and calculating its area on the sphere [Dri12].

I’ve precomputed weights and stored them in a lookup texture, so they could be directly used by pixel shader:

float ElementArea( float x, float y )
    return atan2f( x * y, sqrtf( x * x + y * y + 1.0f ) );

// calculate weights
float weightSum = 0.0f;
for ( unsigned y = 0; y < IL_PROBE_SIZE; ++y )
    for ( unsigned x = 0; x < IL_PROBE_SIZE; ++x )
        float const tu = ( 4.0f * ( x + 0.5f ) / IL_PROBE_SIZE ) - 2.0f;
        float const tv = ( 4.0f * ( y + 0.5f ) / IL_PROBE_SIZE ) - 2.0f;

        // cosFactor = |v1( tu, tv, 1 )| DOT v2( 0, 0, 1 ) = |v1|.z
        float const cosFactor = 1.0f / sqrtf( tu * tu + tv * tv + 1.0f );

        // solid angle projection
        float const texelStep = 2.0f / IL_PROBE_SIZE;
        float const x0 = tu - texelStep;
        float const y0 = tv - texelStep;
        float const x1 = tu + texelStep;
        float const y1 = tv + texelStep;
        float const solidAngle = ElementArea( x0, y0 ) - ElementArea( x0, y1 ) - ElementArea( x1, y0 ) + ElementArea( x1, y1 );

        float const weight = cosFactor * solidAngle;
        weightSum += weight;

        ILProbeWeights[ x + y * IL_PROBE_SIZE ] = weight;

// normalize weights
for ( unsigned y = 0; y < IL_PROBE_SIZE; ++y )
    for ( unsigned x = 0; x < IL_PROBE_SIZE; ++x )
        ILProbeWeights[ x + y * IL_PROBE_SIZE ] /= weightSum;

For achieving best performance all mentioned steps were done exclusively on the GPU. Only final lightmaps were copied to main memory in order to store them on disk.

1. Render multiple IL probes
2. Integrate using precomputed cubemap and downscale to 1×1 texel (watch out for F16 precision issues)
3. copy texel to final position in lightmap
Last steps were batched in order to optimally use GPU.

For fast preview mode and generally for faster baking irradiance caching [Cas11] [Cas14] [Dri09] could be used. It works by smartly picking sample positions and filling missing places using interpolation. Sample placement is estimated by analyzing calculated radiance in existing sample locations. It leads to workloads, which are much harder to batch and are less GPU friendly than bruteforce approach. It’s especially bad, when you can’t use compute shaders. Due to this and due to time constraints I didn’t implement it, but results from other people look very promising. It’s something I’d like to implement in future if I ever write another baker.

Terrain lightmap

Terrain consisted mostly of multiple flat tiles and decals placed on top of them. Tile were small and wasted a lot of lightmap space on padding. Decals didn’t reuse lightmap and lighting values were needlessly duplicated. Borders between those tiles were also problematic because of seam artifacts created by UV discontinuity and aggressive PVR compression. In order to solve those issues a special terrain lightmap was introduced. Basically it was a big 2D plane, placed on a specific height, for which lighting was baked. Artist could mark which objects and decals should use this lightmap instead of having it’s own. Artist also used this lightmap for small 3d objects on terrain (eg. small debris). This solved terrain seam issues and resulted in more efficient lightmap texture usage.

Lightmap real-time composition

Baking results were stored as two FP16 textures with linear values. First texture contained direct lighting, second – bounced lighting and ambient occlusion. All inputs could be mixed in real-time in editor, just like layers in Photoshop. Everything was controlled by curves. Artists could tweak ambient occlusion strength, colorize lighting, increase bounced lighting strength etc. Everything was handled by a single pixel shader and was really fast. Not a physically correct approach, but it enabled fast iterations for final light tweaks, without requiring lengthy lighting rebake.

From a technical side this step merged high precision lighting components and outputted a single lightmap in RGBX8_SRGB format. Lighting values were rescaled to [0;2] range for some extra lighting range. It’s possible to dynamically select lighting range per object for better encoding. However in our case lightmap textures were low precision and were heavily compressed (PVR 2bpp), so it would result in visible discontinuities between two objects with different lighting scale.

Apart from composition this step also fills unused lightmap texels and tries to weld lightmap UV discontinuities. Parts of mesh that are connected in 3D space can be disjoint in lightmap UV space. Interpolation along this disjoint edge causes visible seams. Invisible seam algorithm [RNLL10] could be used to fix it. Unfortunately invisible seam algorithm requires complicated quadrangulation algorithms like [BZK09] or [ZHLB10]. It also imposes additional restrictions on UV charts, so UV space usage is less efficient. Compression is an another source of seams along those discontinuities. Compression seam removal requires introducing additional constraints and reduces UV space usage efficiency even more.

This was an overkill for us, as there are approximate methods, which work with any UV parametrization. Those methods search for a “best” border texel value. Either by evaluating a few points and using least squares [Iwa13] or trying to solve analytically bilinear filtering equation [Yan06].

I went with a simple and rough solution – average values across seams during lightmap composition pass. Additionally in order to reduce texture compression artifacts final composition pass was to flood fill lightmap values, so unused texels would have similar values as neighbors.

Improving usability

Iteration times are very important for graphics artists. That’s why I added selective baking and minimal rebake. Selective baking allows to bake only selected objects. Minimal rebake allows to bake only modified objects and their appropriate surroundings. Over night all levels were automatically rebaked and changes pushed to SVN. So usually only a small part of scene needed to be rebaked during normal workflow and iteration times were manageable.

Static object lighting

Static object lighting was based on baked diffuse in lightmaps. For better quality and lighting resolution normal maps were used. Due to memory constraints I couldn’t do proper normal mapping with lightmaps and had resort to a hack. Normals were combined with dominant light’s direction (usually sun) and were used for perturbing lightmap values:

float diffuseMult = saturate( dot( normalTS, lightDirTS ) ) * FLDScale + FLDBias;
float3 diffuse = diffuse * diffuseMult;

This hack worked out quite nice in practice adding extra detail and helping to reduce compression artifacts. This was very important as we were using hardcore PVR 2bpp compression. Normal maps were also used for real time specular (calculated for a single light) and for envmaps.

Dynamic object lighting

Dynamic objects were primary lit using diffuse stored in light probes (captured irradiance for all possible orientations at a single point of space). Specular and normal maps were added just like in case of static object’s.

There are many ways of implementing light probes. Again because of performance I had to choose the simplest method – “Valve ambient cube” [MMG06]. Which is almost like a 1×1 texel face cubemap, which stores lighting values per face direction. For general usage it has a lot issues – for example it’s not rotationally invariant, so lighting error depends on light’s angle. For our case it was a good fit. Game had almost always top down view and those basis allowed to harness that fact. The top cube face was the most important one.

A single ambient cube was stored as a single chrominance value with 6 intensity values per each face. This allowed to drive down light probe memory usage and reduce computations a bit. In our case the top face was the most important, so it had highest weight and bottom face had the lowest weight when computing a single chrominance value for captured lighting.

Light probes were stored as a few layers of dense 2D grids. There are much better schemes like tetrahedral tessellation [Cup12]. In our case 4 regular 2D grids (4 height layers) were enough, so memory wasn’t here an issue. Besides regular grid speeds up light probe lookup and simplifies debugging. With grid I could just save results as a 2D texture for simple visualization.

At runtime a single light probe per object was calculated. It was done by selecting one cell in grid (cube consisting of 8 probes in the corners) and using trilinear interpolation to compute light probe value at object’s center.

When light probes are arranged in a regular grid, some of them are placed inside of geometry. The result is that dynamic objects are too dark near obstacles. To solve this issue, probes for which most of gathered environment consists of back faces were marked as invalid. At runtime those invalid probes were discarded and were excluded from interpolation by setting their weights to zero and renormalizing other weights.
Single light probe per entire object lighting is only correct for the center of that object, so no shadow transitions are possible on its surface. In order to fix it irradiance gradients [Tat05] were used. Again used the top down view to reduce computations, so irradiance gradient was computed and applied only for top face. For fast evaluation a simple linear gradient was used. Gradient computation was quite straightforward. Per axis two additional light probes were calculated (located at center -/+ 70% of bbox half extent). Then gradient value which minimizes error was taken. At runtime, per vertex, a position offset was calculated and multiplied by that value:

float3 probeL;
probeL.x = LightProbeLuminance[ ( nrmWS.x >= 0.0 ? 1 : 0 ) + 0 ];
probeL.y = LightProbeLuminance[ ( nrmWS.y >= 0.0 ? 1 : 0 ) + 2 ];
probeL.z = LightProbeLuminance[ ( nrmWS.z >= 0.0 ? 1 : 0 ) + 4 ];

// apply irradiance gradient to top face
float3 offsetWS = posWS - LightProbeCenterWS;
probeL.y += * LightProbeGradientX;
probeL.y += offsetWS.yyy * LightProbeGradientY;
probeL.y += offsetWS.zzz * LightProbeGradientZ;

float3 sqNormalWS = normalWS * normalWS;
float3 diffuse = dot( sqNormalWS, probeL ) * LightProbeRGB;

Of course it doesn’t look no way as good as real shadows. On the other hand it enabled some shadow transitions at a cost of a few vertex shader instructions.

I tried to render more than 6 directions per ambient cube and then compute final probe values using least squares, but it didn’t noticeably improve quality.

When baking light probes dynamic objects can’t be included, so they can’t occlude sun or other light coming from above. Resulting lighting at the bottom of light probe is too strong and objects placed near terrain tend to have unnatural bright lighting from below. To fix it ambient cube’s bottom face was darkened a bit.

There were many redundant light probes, so for storage a simple dictionary coder was used. Per grid layer a dictionary of light probes was maintained. Light probes were stored in grid as indices to specific entry in the dictionary. For extra compression a kd-tree could be used for storing those indices [Ani09], but in our case dictionary coder was enough to reach storage requirements.

Dynamic object shadows

Dynamic shadows are always a big challenge when using baked lighting. Moreover in our case rendering budget was very tight. In multiplayer matches players could place really a lot of dynamic objects, so there was no way we could use shadow maps.

For Anomaly 1 mobile Bartosz Brzostek build an interesting system of dynamic shadows for units. Every dynamic object had prebaked projective shadows. In other words shadow sprites were projected on the terrain. Those sprites were attached to selected attachment points (bones). I extended this system with new lightmap support and reused it for Anomaly 2 mobile. For example a tank has one sprite for chassis and one sprite for turret.


Shadows were gathered in screen space using a small offscreen (x16 times downscaled) buffer. First this buffer was cleared to white. Later, shadow sprites were projected on a 2D ground plane and rendered using min blend mode to prevent double shadowing artifacts. Finally this offscreen target was combined with lightmap during normal rendering pass. This constrained shadow receivers only to flat terrain. Additionally combining lightmap, which accumulates lighting from multiple shadowing light sources, with dynamic shadows is wrong on many levels. On the other hand it allowed cheap multiple dynamic shadows.

It enabled one additional cool trick – artist could prebake shadow penumbra. Parts near the ground had dark and sharp shadows and parts placed high were brighter and more blurred.


Lightmapper allowed graphics artists to move heavy content from the PC versions. It also turned out quite fast. 20-30 min for full quality level rebake on standard PC. In order to reach this level of performance I’ve written a special lightweight render path just for baking. Still baking was mostly bottlenecked by draw calls (driver time).

Finally I’d like to thank the entire team at 11 Bit Studios for making this game, Wojciech Sterna for proofreading and Michał Iwanicki for an interesting discussion about lightmaps.


[Lar10] David Larsson – “The Devil is in the Details: Nuances of Light Mapping”, Gamefest 2010
[ZSGS04] – K. Zhou, J. Snyder, B. Guo, H-Y Shum – “Iso-charts: Stretch-driven Mesh Parameterization using Spectral Analysis”, Eurographics 2004
[SGSH02] – P.V. Sander, S.J. Gortler J. Snyder, H. Hoppe – “Signal-Specialized Parametrization”, Eurographics 2002
[RNLL10] N. Ray, V. Nivoliers, S. Lefebvre, B. Lévy – “Invisible Seams”, Eurographics 2010
[BZK09] D.Bommes H.Zimmer L.Kobbelt – “Mixed-Integer Quadrangulation”, Siggraph 2009
[ZHLB10] M. Zhang, J. Huang, X. Liu, H. Bao – “A Wave-based Anisotropic Quadrangulation Method”, Siggraph 2010
[Sco03] Jim Scott – “Packing Lightmaps”, 2003
[Dri12] Rory Driscoll – “Cubemap Texel Solid Angle”, 2012
[Yan06] Yann L –“Radiosity on curved surfaces?”, forum post 2006
[Iwa13] Michał Iwanicki – “Lighting Technology Of “The Last Of Us”, Siggraph 2013
[NS11] T.Nöll, D.Stricker – “Efficient Packing of Arbitrary Shaped Charts for Automatic Texture Atlas Generation“, Eurographics 2011
[Eli00] Hugo Elias – “Radiosity”, 2000
[Cas10] Ignacio Castaño – “Hemicube Rendering and Integration”, 2010
[Cas10] Ignacio Castaño – “Lightmap Parameterization”, 2010
[Cas11] Ignacio Castaño – “Irradiance Caching – Part 1”, 2011
[Cas14] Ignacio Castaño – “Irradiance Caching – Continued”, 2014
[Dri09] Rory Driscoll – Irradiance Caching: Part 1, 2009
[Gau08] Pascal Gautron – “Practical Global Illumination With Irradiance Caching”, Siggraph 2008 class notes
[Cup12] Robert Cupisz – “Light probe interpolation using tetrahedral tessellations”, GDC 2012
[MMG06] J. Mitchell, G. McTaggart, C. Green – “Shading in Valve’s Source Engine”, Siggraph 2006
[Tat05] – Natalya Tatarchuk – “Irradiance Volumes for Games”, GDC 2005
[Ani09] S. Anichini – “A Production Irradiance Volume Implementation Described”, 2009

Posted in Graphics, Lighting | 3 Comments

Digital Dragons 2014 Programming Track

Digital Dragons is a game industry conference held in Kraków (Poland). This year’s programming track was outstanding and mainly focused on graphics. Allpresentations are really worth reading.

Posted in Conference | 4 Comments

Octahedron normal vector encoding

Many rendering techniques benefit from encoding normal (unit) vectors. For example in deferred shading G-buffer space is a limited resource. Additionally it’s nice to be able to encode world space normals with uniform precision. Some encoding techniques work only for view space normals, because they use variable precision depending on normal direction.

World space normals have some nice properties – they don’t depend on camera. This means that on static objects specular and reflections won’t wobble when camera moves (imagine FPS game with slight camera movement on idle). Besides their precision doesn’t depend on camera. This is important because sometimes we need to deal with normals pointing away from camera. For example because of normals map and perspective correction or because of calculating lighting for back side (subsurface scattering).

Octahedron-normal vectors [MSS*10] are a simple and clever extension of octahedron environment maps [ED08]. The idea is to encode normals by projecting then on a octahedron, folding it and placing on one square. This gives some nice properties like quite uniform value distribution and low encoding and decoding cost.

I compared octahedron to storing 3 components (XYZ) and spherical coordinates. Not a very scientific approach – just rendered some shiny reflective spheres. Normals were stored in world space in a R8G8B8A8 render target. Post contains also complete source code (which unfortunately isn’t provided in original paper), so you can paste into your engine and see yourself how this compression looks in practice.


float3 Encode( float3 n )
    return n * 0.5 + 0.5;

float3 Decode( float3 f )
    return f * 2.0 - 1.0;



Spherical coordinates

float2 Encode( float3 n )
    float2 f;
    f.x = atan2( n.y, n.x ) * MATH_INV_PI;
    f.y = n.z;

    f = f * 0.5 + 0.5;
    return f;

float3 Decode( float2 f )
    float2 ang = f * 2.0 - 1.0;

    float2 scth;
    sincos( ang.x * MATH_PI, scth.x, scth.y );
    float2 scphi = float2( sqrt( 1.0 - ang.y * ang.y ), ang.y );

    float3 n;
    n.x = scth.y * scphi.x;
    n.y = scth.x * scphi.x;
    n.z = scphi.y;
    return n;



Octahedron-normal vectors

float2 OctWrap( float2 v )
    return ( 1.0 - abs( v.yx ) ) * ( v.xy >= 0.0 ? 1.0 : -1.0 );

float2 Encode( float3 n )
    n /= ( abs( n.x ) + abs( n.y ) + abs( n.z ) );
    n.xy = n.z >= 0.0 ? n.xy : OctWrap( n.xy );
    n.xy = n.xy * 0.5 + 0.5;
    return n.xy;

float3 Decode( float2 f )
    f = f * 2.0 - 1.0;

    float3 n = float3( f.x, f.y, 1.0 - abs( f.x ) - abs( f.y ) );
    float t = saturate( -n.z );
    n.xy += n.xy >= 0.0 ? -t : t;
    return normalize( n );




Spherical coordinates have bad value distribution and bad performance. Distribution can be fixed by using some kind of spiral [SPS12]. Unfortunately it still requires costly trigonometry and quality is only marginally better than octahedron encoding.

One other method worth mentioning is Crytek’s best fit normals [Kap10]. It provides extreme precision. On the other hand it won’t save any space in G-Buffer as it requires 3 components. Also encoding uses a 512×512 lookup texture, so it’s quite expensive.

Octahedron encoding uses a low number of instructions and there are only two non-full rate instruction (calculated on “transcendental unit”). One rcp during encoding and one rcp during decoding. In addition quality is quite good. Concluding octahedron-normal vectors have great quality to performance ratio and blow out of water old methods like spherical coordinates.

UPDATE: As pointed by Alex in the comments, interesting normal encoding technique survey was just released [CDE*14]. It includes detailed octahedron normal comparison with other techniques.

UPDATE 2: Added optimized Octahedron decoding by Rune Stubbe.


[MSS*10] Q. Meyer, J. Sübmuth, G. Subner, M. Stamminger, G. Greiner – “On Floating-Point Normal Vectors”,  Computer Graphics Forum 2010
[ED08] T. Engelhardt, C. Dachsbacher – “Octahedron Environment Maps”, VMW 2008
[Kap10] A. Kaplanyan – “CryENGINE 3: Reaching the speed of light”, Siggraph 2010
[SPS12] J. Smith, G. Petrova, S. Schaefer – “Encoding Normal Vectors using Optimized Spherical Coordinates”, Computer and Graphics 2012
[CDE*14] – Z. H. Cigolle, S. Donow, D. Evangelakos, M. Mara, M. McGuire, Q. Meyer – “A Survey of Efficient Representations for Independent Unit Vectors”, JCGT 2014

Posted in Graphics | 15 Comments

Simple GPUView custom event markers

GPUView is a powerful tool for GPU/CPU interaction profiling for Windows. It’s interface isn’t very user friendly, but it gets job done. I used it for optimizing in-house GPU lightmapper and spend some time trying to find a way to add custom event markers.

Most solutions on web are quite complicated – involving writing strange DLLs, manifests, using ECManGen.exe… Thankfully there is a much simpler solution.

First register an event handler using custom GUID:

REGHANDLE gEventHandle;
GUID guid;
UuidFromString( (RPC_CSTR) "a9744ea3-e5ac-4f2f-be6a-42aad08a9c6f", &guid );
EventRegister( &guid, nullptr, nullptr, &gEventHandle );

Then just call EventWriteString with custom text:

EventWriteString( gEventHandle, 0, 0, L"Render" );

Final step is to modify log.cmd in order to add this custom GUID for tracing (same one, which was passed to EventRegister). Just pass it as new Xperf parameter (see TRACE_DSHOW or TRACE_DX variables for reference).

During next GPUView profiling session open “Event Listing” dialog and locate custom event by GUID:


I guess it should also work for XPerf and other tools which use windows event tracing. For better integration look up Writing anifest-based eventsWriting an Instrumentation Manifest and ECManGen.exe on MSDN.

Posted in Graphics | 3 Comments

LA Noire

LA Noire has some amazing tech for face animations. Basically actors are filmed from multiple cameras and resulting movies are converted to a keyframed animation and animated textures. All the textures are captured in neutral lighting conditions, so usually lighting doesn’t fit in game environment. Looks like that those textures are animated at around 3 frames per second. Eyes are animated separately and at higher rate. This approach has also some interesting “artifacts”, as it’s impossible to capture everything during one day. For example you can see as hair shifts and changes when blending between two performances captured at different days:


More info:
LA Noire face tech animation trailer
LA Noir tech description by IQGamer
MotionScan website

Posted in Graphics | Leave a comment

Unreal Engine 4 gaussian specular normalization

Recently Epic did a nice presentation about their new tech: “The technology behind Unreal 4 Elemental demo“. Among a lot of impressive stuff they showed their gaussian specular aproximation. Here is a BRDF with U4 specular for Disney’s BRDF explorer:


::begin parameters
float n 1 512 100
bool normalized 1
::end parameters

::begin shader

vec3 BRDF( vec3 L, vec3 V, vec3 N, vec3 X, vec3 Y )
    vec3 H = normalize( L + V );
    float Dot = clamp( dot( N, H ), 0, 1 );
    float Threshold = 0.04;
    float CosAngle = pow( Threshold, 1 / n );
    float NormAngle = ( Dot - 1 ) / ( CosAngle - 1 );
    float D = exp( -NormAngle * NormAngle );

    if ( normalized )
        D *= 0.17287429 + 0.01388682 * n;

    return vec3( D );
::end shader

This aproximation was tweaked to have less aliasing than the standard Blinn-Phong specular (it has smoother falloff):






Mentioned presentation doesn’t include a normalization factor for it. It was a nice excuse for spending some time with Mathematica and try to derive it myself.

Basic idea of normalization factor is that lighting needs to be energy conserving (outgoing energy can’t be greater than incoming energy). This means that integral of BRDF times cos(theta) over upper hemisphere can’t exceed 1 or more specifically in our case we want it to be equal 1:


The highest values will be when light direction equals normal (L=N). This means that we can replace dot(N,H) with cos(theta/2), as now angle between H (halfway vector) and N equals to half of angle between L and N. This greatly simplifies the integral. Now we can replace the f(l,v) with U4 gaussian aproximation:


Unfortunately neither I nor Mathematica could solve it analytically. So I had to calculate values numerically and try to fit various simple functions over range [1;512]. The best aproximation which I could find was: 0.17287429 + 0.01388682 * n. Where n is Blinn-Phong specular power.



As you can see it isn’t accurate for small specular power values, but on the other hand it’s very fast and specular power below 16 aren’t used often.

Posted in Graphics, Lighting | 1 Comment

Visual C++ linker timestamp

Nice trick to get build’s timestamp at runtime (Visual C++ only):



IMAGE_NT_HEADERS const* ntHeader
= (IMAGE_NT_HEADERS*) ( (char*) &__ImageBase + __ImageBase.e_lfanew );
DWORD const timeStamp = ntHeader->FileHeader.TimeDateStamp;

It’s not very portable, but it doesn’t require any additional recompilation (as __DATE__ __TIME__ macros do).

Posted in C++ | 1 Comment

PixelJunk Eden data extractor

Recently Q-Games ported one of their games – PixelJunk Eden to Windows. Actually it’s their first Windows release. From tech perspective it’s not as impressive as PixelJunk Shooter series, but still it was quite interesting to poke around this game’s files. Unfortunately game data was stored in custom format and encrypted. I wonder why people waste time to encrypt game data.

I tried to reverse engineer data files and wrote a simple extractor program. You can download sources here. The rest of the post contains information about encryption and file formats.

Files are encrypted using a homemade xor encryption scheme:

int seed = fileSize + 0x006FD37D;
for ( unsigned i = 0; i < fileSize; ++i )
    int const xorKey = seed * ( seed * seed * 0x73 - 0x1B ) + 0x0D;
    fileByteArr[ i ] ^= xorKey;

Game data is stored in lump_x_x.pak files with description in the lump.idx file. Lump.idx consits of 1 header, multiple lump_x_x.pak descriptors and multiple packed file descriptors. All files (*.idx and *.pak) are encrypted using the mentioned above xor scheme.

lump.idx header:

struct IndexHeader
    char m_magic[ 4 ]; // "PACK"
    unsigned m_unknown0;
    unsigned m_unknown1;
    unsigned m_packedFileMaxID; // packed file num - 1
    unsigned m_lumpFileMaxSize;
    unsigned m_lumpFileNum;
    char m_align[ 230 ];

lump.idx file descriptors of lump_x_x.pak files:

struct IndexLumpDesc
    unsigned m_unknown;
    unsigned m_lumpSize;
    unsigned char m_lumpPartID;
    unsigned char m_lumpID;
    char m_align[ 2 ];

lump.idx file descriptors of packed files:

struct IndexFileDesc
    char m_filename[ 120 ];
    unsigned m_offset;
    unsigned m_size;

lump_x_x.pak files contain packed files at specified offsets. Every packed file is stored at offset aligned to 128 bytes.

Posted in C++ | Leave a comment