Stage3D “extended” and optimizations

Hi! UAFPUG and FlashGAMM were awesome. First, I want to talk about the new “extended” profile in Stage3D, because I’m really excited about the new features. You can download the new Flash Player 11.6 and AIR 3.6 betas with the Stage3D “extended” profile; see the release notes.

Multiple render targets (MRT)
MRTs allow a fragment shader to write data to up to four different targets in one pass. This can remove the need to render objects multiple times, once for each effect, although the savings only pay off in some situations. For example, with Stage3D we can now implement deferred shading, which computes and stores several values simultaneously: surface normals and the diffuse and specular material properties each go into a separate texture. These properties are then used to light the scene with multiple lights in subsequent passes. All render targets must have the same width, height and bit depth.
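
To make the two-stage idea concrete, here is a minimal CPU sketch in Python (not ActionScript or AGAL, and not Stage3D API code). It assumes a toy scene of hand-written fragments; `make_target`, `geometry_pass` and `lighting_pass` are hypothetical names chosen for illustration. The geometry pass writes two targets at once, and the lighting pass then reads them once per light.

```python
# Toy CPU model of deferred shading with multiple render targets.
# All names here are hypothetical; this is not Stage3D API code.

WIDTH, HEIGHT = 4, 2  # every target shares the same size, as MRT requires


def make_target(fill):
    """Allocate one render target as a HEIGHT x WIDTH grid."""
    return [[fill for _ in range(WIDTH)] for _ in range(HEIGHT)]


def geometry_pass(fragments):
    """Visit each fragment once and write several outputs (the MRT step)."""
    normals = make_target((0.0, 0.0, 0.0))
    diffuse = make_target((0.0, 0.0, 0.0))
    for x, y, normal, albedo in fragments:
        normals[y][x] = normal   # target 0: surface normal
        diffuse[y][x] = albedo   # target 1: diffuse color
    return normals, diffuse


def lighting_pass(normals, diffuse, lights):
    """Shade every pixel from the stored targets, once per light."""
    out = make_target((0.0, 0.0, 0.0))
    for light_dir, light_color in lights:
        for y in range(HEIGHT):
            for x in range(WIDTH):
                n = normals[y][x]
                # Lambert term: clamp the dot product at zero.
                n_dot_l = max(0.0, sum(a * b for a, b in zip(n, light_dir)))
                out[y][x] = tuple(
                    o + d * c * n_dot_l
                    for o, d, c in zip(out[y][x], diffuse[y][x], light_color)
                )
    return out
```

Note that the geometry is only visited once no matter how many lights there are; each additional light costs one more loop over the stored pixels.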

GPU Derivatives
Complex filtering depends on knowing just how much of the texture (or shading) we need to filter, and Stage3D now provides partial derivative functions to help us. For any value used in shading, we can ask the GPU: “How much does this value change from pixel to pixel, in either the screen-x or the screen-y direction?” These functions are ddx and ddy. Although they are rarely used, they can be very helpful for filtering and antialiasing. They give us the information we need to do procedural filtering or to adjust the filtering built into most texture sampling.
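
As a rough illustration of what a screen-space derivative means, here is a small Python sketch that approximates ddx and ddy with finite differences inside 2x2 pixel blocks. The block-based differencing and the helper `filter_width` are assumptions made for this sketch, not a description of how any particular GPU or Stage3D implements these opcodes. It expects a grid with even width and height.

```python
def ddx(values, x, y):
    """Approximate d(value)/d(screen x) inside the 2x2 block containing (x, y)."""
    x0 = x - (x % 2)  # left column of the block
    return values[y][x0 + 1] - values[y][x0]


def ddy(values, x, y):
    """Approximate d(value)/d(screen y) inside the 2x2 block containing (x, y)."""
    y0 = y - (y % 2)  # top row of the block
    return values[y0 + 1][x] - values[y0][x]


def filter_width(values, x, y):
    """Estimate how much `values` changes across one pixel (hypothetical helper)."""
    return abs(ddx(values, x, y)) + abs(ddy(values, x, y))


# A value that grows by 0.25 per pixel in x and 0.5 per pixel in y.
grid = [
    [0.0, 0.25, 0.5, 0.75],
    [0.5, 0.75, 1.0, 1.25],
]
```

A large `filter_width` means the value is changing quickly on screen, so a procedural pattern built from it should be blurred more at that pixel.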

If you want to dig deeper into MRTs and GPU derivatives, please visit:
Chapter 19. Deferred Shading in Tabula Rasa
Chapter 24. High-Quality Filtering
Deferred Shading
Deferred Rendering in Killzone 2
ShaderX2 : Shader Programming Tips & Tricks with DirectX 9
Per-Pixel Displacement Mapping with Distance Functions

How does Stage3D render?

Molehill always renders in a separate render thread.

A drawTriangles call takes all the state set for that draw call, copies it into a queue and then returns immediately. The render thread then pulls from the queue, validates the state and sends it directly to the GPU. If you want to get error messages from that validation, set enableErrorChecking = true: this moves the rendering pipeline onto the main thread, so the errors can be reported to you.
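
The queueing behaviour can be modelled with a small producer/consumer sketch in Python. Everything here is hypothetical: `ToyContext`, its methods and the single validation rule only mimic the behaviour described above and are not the real Context3D implementation. In particular, silently dropping invalid calls when error checking is off is an assumption of this toy model.

```python
import queue
import threading


class ToyContext:
    """Toy model of a render thread fed by a command queue (hypothetical)."""

    def __init__(self, enable_error_checking=False):
        self.enable_error_checking = enable_error_checking
        self.submitted = []  # what the pretend GPU received
        self._commands = queue.Queue()
        self._worker = threading.Thread(target=self._render_loop, daemon=True)
        self._worker.start()

    @staticmethod
    def _validate(state):
        if state.get("program") is None:
            raise ValueError("no shader program set")

    def draw_triangles(self, state):
        snapshot = dict(state)  # copy, so later state changes do not leak in
        if self.enable_error_checking:
            self._validate(snapshot)  # synchronous: the caller sees the error
            self.submitted.append(snapshot)
            return
        self._commands.put(snapshot)  # asynchronous: return immediately

    def _render_loop(self):
        while True:
            snapshot = self._commands.get()
            try:
                self._validate(snapshot)
                self.submitted.append(snapshot)
            except ValueError:
                pass  # in this toy model the error never reaches the caller
            finally:
                self._commands.task_done()

    def present(self):
        self._commands.join()  # main thread waits for the render thread
```

With error checking off, `draw_triangles` never raises and an invalid call simply never shows up in `submitted`; with it on, the same call raises right at the call site, at the cost of doing the work on the calling thread.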

After the present() call, the main and render threads must synchronize. Keep in mind that either the CPU (physics computation, etc.) or the GPU can be the bottleneck that drops your FPS, because one thread ends up waiting for the other.

GPU bottlenecks:
1) Vsync
2) You’ve asked the GPU to do too much in a frame and the CPU is waiting for it to finish


Vsync is an optional GPU driver setting that synchronizes the frame rate with the monitor’s vertical blanking interval, so that the swap from back buffer to front buffer is smooth (we draw to the back buffer first, and then it is swapped to the front). Without vsync you can get artifacts such as tearing, where one part of the image shows the previous frame and another part shows the current frame. Vsync adds a small latency penalty, because the program has to wait until the video controller has finished transmitting the image to the display before continuing. Triple buffering reduces this latency significantly, but Stage3D has no API for managing buffers, so we have to find other ways to optimize.

The GPU only draws frames on a vsync (every 1/60th of a second). When Flash asks it to draw, it waits for the next vsync. This wait can be long for two reasons:

  • The CPU work per frame is less than 1/60th of a second. In this case the GPU swap time represents “idle” time while we wait for a vsync. This is harmless. This is probably the case for DisplayList apps.
  • Flash has its own, separate timer that drives the frame rate, and it can get offset from vsync. Imagine two lights blinking at different frequencies. At first they blink simultaneously, then they gradually drift apart, then they come together again, and then drift apart again. During some phases of this cycle, GPU swap will be small: Flash asks the GPU to draw just before a vsync. During other phases, GPU swap will be big: Flash asks the GPU to draw just after a vsync, and the GPU waits for the next one. This results in periodic dropped frames. Try changing your framerate to 20, 30 and 60.

(thanks to Adam Cath for the explanation)
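
The “two blinking lights” picture can be checked with a few lines of Python. This is a simplified model with assumptions of its own: a 60 Hz display, a Flash timer that fires at perfectly regular intervals, and a wait that does not delay the next timer tick. `swap_waits` is a hypothetical helper, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

VSYNC = Fraction(1, 60)  # assumed display refresh interval in seconds


def swap_waits(frame_rate, offset, frames):
    """Return how long each draw request waits for the next vsync."""
    period = Fraction(1) / Fraction(frame_rate)
    waits = []
    for i in range(frames):
        request = offset + i * period      # when Flash asks the GPU to draw
        waits.append((-request) % VSYNC)   # time left until the next vsync
    return waits


steady = swap_waits(30, Fraction(1, 240), 5)  # timer divides the vsync rate
drifting = swap_waits(59, Fraction(0), 5)     # timer slightly off from vsync
```

In this model `steady` holds the same wait for every frame, while the values in `drifting` change from frame to frame, which is the drift described above.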

You’ve asked the GPU to do too much

I recommend this book from Nvidia and Sebastian Marketsmueller’s talk “Deep Dive into Molehill”. They cover everything you need to know about these optimizations.

Common optimizations:

  • Cull early on the CPU (use FVC, PVS, occluders, etc.)
  • Merge geometry and textures. For example, you can store an index per vertex that says which matrix to use for that vertex. Example implementation
  • Use ATF textures for 3D
  • Minimize buffer allocations and uploads
  • Precompute whatever is not dynamic
  • Do as much work as possible in the vertex shader and as little as possible in the fragment shader; simplify your lighting models
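
The geometry-merging tip can be sketched on the CPU like this, in Python. It is only an illustration under assumptions: `merge_meshes` and `transform` are hypothetical names, the matrices are plain 3x4 row lists, and in a real shader the lookup would happen per vertex on the GPU.

```python
def merge_meshes(meshes):
    """Concatenate meshes into one buffer, tagging each vertex with a matrix index."""
    vertices, indices = [], []
    for matrix_index, (mesh_vertices, mesh_indices) in enumerate(meshes):
        base = len(vertices)  # offset for this mesh's indices in the merged buffer
        vertices += [(x, y, z, matrix_index) for x, y, z in mesh_vertices]
        indices += [base + i for i in mesh_indices]
    return vertices, indices


def transform(vertex, matrices):
    """Apply the matrix selected by the vertex's own index (the shader's job)."""
    x, y, z, matrix_index = vertex
    rows = matrices[matrix_index]  # three rows of four: linear part plus translation
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in rows)


triangle = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
vertices, indices = merge_meshes([triangle, triangle])
matrices = [
    [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)],   # identity for mesh 0
    [(1, 0, 0, 10), (0, 1, 0, 0), (0, 0, 1, 0)],  # mesh 1 moved by 10 along x
]
```

Both triangles now live in one buffer and can go out in a single draw call, with each vertex picking its own transform.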

Try to avoid alpha test (the kil opcode)

PowerVR SGX is the graphics processor in the iPhone 3GS, iPhone 4, third-generation iPod touch, and the iPad, and is designed to support OpenGL ES 2.0. The PowerVR SGX uses a technique known as tile-based deferred rendering (TBDR). When you call OpenGL ES functions to submit rendering commands to the hardware, those commands are buffered until a large list of commands has accumulated. These commands are then rendered by the hardware as a single operation. To render the image, the framebuffer is divided into tiles, and the commands are drawn once for each tile, with each tile rendering only the primitives that are visible within it. The key advantage of a deferred renderer is that it accesses memory very efficiently. Partitioning rendering into tiles allows the GPU to cache the pixel values from the framebuffer more effectively, making depth testing and blending more efficient.
Another advantage of deferred rendering is that it allows the GPU to perform hidden surface removal before fragments are processed. Pixels that are not visible are discarded without sampling textures or performing fragment processing, significantly reducing the calculations that the GPU must perform to render the tile. To gain the most benefit from this feature, draw as much of the frame with opaque content as possible and minimize use of blending and the kil instruction in AGAL shaders. Because the hardware performs hidden surface removal, it is not necessary for your application to sort primitives from front to back.
If your application uses the kil opcode in the fragment shader, these hardware depth-buffer optimizations must be disabled. An alternative to using alpha test or kil is to use alpha blending with alpha forced to zero. This effectively eliminates any contribution to the framebuffer color while retaining the Z-buffer optimizations. This does change the value stored in the depth buffer and so may require back-to-front sorting of the transparent primitives.
For example, the Unreal Engine Flash demo uses blending, but its sorting algorithm is not correct.

If you need to use a kil opcode, draw these objects separately in the scene after processing any primitives that do not require it.
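
One way to picture that ordering is a small Python sketch that groups draw calls before submitting them. The `kind` and `depth` fields and the function `order_draw_calls` are hypothetical. Putting blended objects last and sorting them back to front is an assumption of this sketch, based on the sorting note above, and not something the text prescribes.

```python
def order_draw_calls(calls):
    """Group draw calls: opaque first, then kil users, then blended back to front."""
    opaque = [c for c in calls if c["kind"] == "opaque"]
    uses_kil = [c for c in calls if c["kind"] == "kil"]
    blended = sorted(
        (c for c in calls if c["kind"] == "blended"),
        key=lambda c: c["depth"],
        reverse=True,  # larger depth means farther away, so it is drawn first
    )
    return opaque + uses_kil + blended


calls = [
    {"name": "glass", "kind": "blended", "depth": 2.0},
    {"name": "fence", "kind": "kil", "depth": 5.0},
    {"name": "wall", "kind": "opaque", "depth": 9.0},
    {"name": "smoke", "kind": "blended", "depth": 7.0},
]
ordered_names = [c["name"] for c in order_draw_calls(calls)]
```

Here the wall is drawn first, the fence that needs kil comes after it, and the two blended objects follow with the farther one (smoke) before the nearer one (glass).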


1) Adobe Scout is the best way to detect bottlenecks
2) Use Intel GPA for debugging 3D

Thank you! Please join our Stage3D Facebook group.

Rob Bateman, Anton Volkov, Thibault Imbert and me 🙂


4 Responses to “Stage3D “extended” and optimizations”

  1. Awesome post. You have covered everything I’ve missed.

  2. Arne Neugebauer says:

    Hi, very interesting article. It sheds a little light on my darkness about the new features in Stage3D. Does this mean that I get multiple render targets and so on only in desktop applications? Is there a time frame for bringing these features to web applications without AIR?

  3. Gonchar says:

    Yep, extended mode works on AIR and on the web too, but it doesn’t work on mobile platforms.

  4. […] talks about in the video. ddx, ddy, mrt … Here is another blog article that summarizes a few of these things: Stage3D “extended” and optimizations | Sergey Gonchar. It will probably be a while before I need this … […]
