
Multisampled Anti-aliasing For Almost Free: On Tile-Based Rendering Hardware | by Shahbaz Youssefi | Android Developers


Anti-aliasing (AA) is an important technique to improve the quality of rendered graphics. Numerous algorithms have been developed over the years:

  • Some rely on post-processing aliased images (such as FXAA): These techniques are fast, but produce low quality images
  • Some rely on shading multiple samples per pixel (SSAA): These techniques are expensive due to the high number of fragment shader invocations
  • More recent techniques (such as TAA) spread the cost of SSAA over multiple frames, reducing the cost to that of single-sampled rendering at the expense of code complexity

Anti-aliasing in Action. Left: Aliased scene. Right: Anti-aliased scene.

While TAA and the like are gaining popularity, MSAA has for a long time been the compromise between performance and complexity. In this method, fragment shaders are run once per pixel, but coverage tests, depth tests, etc. are performed per sample. This method can still be expensive due to the higher amount of memory and bandwidth consumed by the multisampled images on Immediate-Mode Rendering (IMR) architectures.

However, GPUs with a Tile-Based Rendering (TBR) architecture do so well with MSAA that it can be nearly free if done right. This article describes how that can be achieved. Analysis of top OpenGL ES games on Android shows MSAA is rarely used, and when it is, its usage is suboptimal. Visuals in Android games can be dramatically improved by following the advice in this blog post, and almost for free!

The first section below demonstrates how to do this at the hardware level. The sections that follow point out the necessary API pieces in Vulkan and OpenGL ES to achieve this.

Without going into too much detail, TBR hardware operates on the concept of “render passes”. Each render pass is a set of draw calls to the same “framebuffer” with no interruptions. For example, say a render pass in the application issues 1000 draw calls.

TBR hardware takes these 1000 draw calls, runs the pre-fragment shaders and figures out where each triangle falls in the framebuffer. It then divides the framebuffer into small regions (called tiles) and redraws the same 1000 draw calls in each of them separately (or rather, only the triangles that actually hit that tile).
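
As a rough illustration of the binning step described above, the following C sketch counts how many tiles a triangle's screen-space bounding box overlaps. The function name and the bounding-box simplification are assumptions made for illustration only; real hardware tests actual triangle coverage and uses its own, hardware-specific tile dimensions.

```c
#include <stdint.h>

// Clamp v to the inclusive range [lo, hi].
static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Hypothetical helper: counts the square tiles overlapped by a pixel-space
// bounding box (inclusive on all sides) in a framebuffer of the given size.
// Returns 0 if the box is empty or lies entirely outside the framebuffer.
uint32_t count_tiles_touched(int32_t min_x, int32_t min_y,
                             int32_t max_x, int32_t max_y,
                             int32_t fb_width, int32_t fb_height,
                             int32_t tile_size)
{
    if (fb_width <= 0 || fb_height <= 0 || tile_size <= 0)
        return 0;
    if (min_x > max_x || min_y > max_y)
        return 0;
    if (max_x < 0 || max_y < 0 || min_x >= fb_width || min_y >= fb_height)
        return 0;

    // Clip the box to the framebuffer, then convert pixels to tile indices.
    const int32_t tx0 = clamp_i32(min_x, 0, fb_width - 1) / tile_size;
    const int32_t tx1 = clamp_i32(max_x, 0, fb_width - 1) / tile_size;
    const int32_t ty0 = clamp_i32(min_y, 0, fb_height - 1) / tile_size;
    const int32_t ty1 = clamp_i32(max_y, 0, fb_height - 1) / tile_size;

    return (uint32_t)(tx1 - tx0 + 1) * (uint32_t)(ty1 - ty0 + 1);
}
```

For a 64x64 framebuffer with an assumed 16-pixel tile size, a box spanning (10, 10) to (20, 20) touches 4 of the 16 tiles, so the corresponding triangle would only be replayed in those four.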

The tile memory is effectively a cache that you can’t get unlucky with. Unlike CPU and many other caches, where bad access patterns can cause thrashing, the tile memory is a cache that is loaded and stored at most once per render pass. As such, it is highly efficient.

So, let’s put one tile into focus.

Memory accesses between RAM, Tile Memory and shader cores. The Tile Memory is a form of fast cache that is (optionally) loaded or cleared on render pass start and (optionally) stored at render pass end. The shader cores only access this memory for framebuffer attachment output and input (through input attachments, otherwise known as framebuffer fetch).

In the above diagram, there are a number of operations, each with a cost:

  • Fragment shader invocation: This is the real cost of the application’s draw calls. The fragment shader may also access RAM for texture sampling etc., not shown in the diagram. While this cost is significant, it is irrelevant to this discussion.
  • Fragment shader attachment access: Color, depth and stencil data is found on the tile memory, access to which is lightning fast. This cost is also irrelevant to this discussion.
  • Tile memory load: This costs time and energy, as accessing RAM is slow. Fortunately, TBR hardware has ways to avoid this cost:
    – Skip the load and leave the contents of the framebuffer on the tile memory undefined (for example because they are going to be completely overwritten)
    – Skip the load and clear the contents of the framebuffer on the tile memory directly
  • Tile memory store: This is at least as costly as load. TBR hardware has ways to avoid this cost too:
    – Skip the store and drop the contents of the framebuffer on the tile memory (for example because that data is no longer needed)
    – Skip the store because the render pass did not modify the values that were previously loaded

The most important takeaway from the above is:

  • Avoid load at all costs
  • Avoid store at all costs

With that in mind, here is how MSAA is done at the hardware level with almost the same cost as single-sampled rendering:

  • Allocate space for MSAA data only on the tile memory
  • Do NOT load MSAA data
  • Render into MSAA framebuffer on the tile memory
  • “Resolve” the MSAA data into single-sampled data on the tile memory
  • Do NOT store MSAA data
  • Store only the resolved single-sampled data

For comparison, the equivalent single-sampled rendering would be:

  • Do NOT load data
  • Render into framebuffer on the tile memory
  • Store data

Looking more closely, the following can be observed:

  • MSAA data never leaves the tile memory. There is no RAM access cost for MSAA data.
  • MSAA data does not take up space in RAM
  • No data is loaded on tile memory
  • The same amount of data is stored in RAM in both cases

Basically then, the only extra cost of MSAA is on-tile coverage tests, depth tests etc., which is dwarfed in comparison with everything else.
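
To put rough numbers on the observations above, here is a minimal C sketch that estimates how many bytes of color data a render pass writes to RAM. It is a simplified model under stated assumptions: the function name is hypothetical, only one color attachment is counted, and framebuffer compression and depth/stencil data are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical estimate of the color bytes written to RAM at the end of a
// render pass. The resolved (single-sampled) image is always stored. If
// store_msaa is true, the multisampled image is written out as well, which
// is the case this article recommends avoiding.
uint64_t color_bytes_stored_to_ram(uint32_t width, uint32_t height,
                                   uint32_t bytes_per_pixel,
                                   uint32_t samples, bool store_msaa)
{
    const uint64_t resolved = (uint64_t)width * height * bytes_per_pixel;
    const uint64_t multisampled = store_msaa ? resolved * samples : 0;
    return resolved + multisampled;
}
```

For a 1080x2400 RGBA8 target, this model gives 10,368,000 bytes for both the single-sampled pass and the 4x pass that only stores the resolved data, versus 51,840,000 bytes if the 4x multisampled data is also stored.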

If you can implement that in your program, you should be able to get MSAA rendering at no memory cost and almost no GPU time and energy cost. For once, you can have your cake and eat it too! Just don’t go overboard with the sample count; the tile memory is still limited. 4xMSAA is the best choice on today’s hardware.
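
The limit on sample count can be pictured with a small back-of-the-envelope calculation. The C sketch below divides a tile memory budget by the per-pixel storage at a given sample count. The function name and every number in the usage note are hypothetical; actual tile memory sizes and how a GPU reacts to running out of it are hardware-specific and not stated in this article.

```c
#include <stdint.h>

// Hypothetical helper: how many pixels fit in a tile memory budget when each
// sample needs bytes_per_sample bytes across all attachments.
// Returns 0 for degenerate inputs.
uint32_t pixels_per_tile(uint32_t tile_memory_bytes,
                         uint32_t bytes_per_sample,
                         uint32_t samples)
{
    if (bytes_per_sample == 0 || samples == 0)
        return 0;
    const uint64_t bytes_per_pixel = (uint64_t)bytes_per_sample * samples;
    return (uint32_t)(tile_memory_bytes / bytes_per_pixel);
}
```

With an assumed 64 KiB budget and 8 bytes per sample (for instance a 32-bit color plus a 32-bit depth/stencil attachment), this gives 8192 pixels per tile when single-sampled, 2048 at 4 samples and 512 at 16 samples, which illustrates why higher sample counts put pressure on the tile memory.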

Read more about Render Passes without MSAA here.

Vulkan makes it very easy to make the above happen, as it’s practically structured with the above mode of rendering in mind. All you need is:

  • Allocate your MSAA image with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT, on memory that has VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT
    – The image will not be allocated in RAM if no load or store is ever done to it
  • Do NOT use VK_ATTACHMENT_LOAD_OP_LOAD for MSAA attachments
  • Do NOT use VK_ATTACHMENT_STORE_OP_STORE for MSAA attachments
  • Use a resolve attachment for any MSAA attachment for which you need the data after the render pass
    – Use VK_ATTACHMENT_LOAD_OP_DONT_CARE and VK_ATTACHMENT_STORE_OP_STORE for this attachment

The above directly translates to the free MSAA rendering recipe outlined in the previous section.

This can be done even more easily with the VK_EXT_multisampled_render_to_single_sampled extension where supported, which allows multisampled rendering to be done on a single-sampled attachment, with the driver taking care of all the above details.

For reference, please see this modification to the “hello-vk” sample: https://github.com/android/ndk-samples/pull/995. In particular, this commit shows how a single-sampled application can be quickly turned into a multisampled one using the VK_EXT_multisampled_render_to_single_sampled extension, and this commit shows the same with resolve attachments.

In terms of numbers, with locked GPU clocks on a Pixel 6 with a recent ARM GPU driver, the render passes in different modes take roughly 650us when single-sampled and 800us when multisampled with either implementation (so, not entirely free). GPU memory usage is identical in both cases. For comparison, when using resolve attachments, if the store op of the multisampled color attachments is VK_ATTACHMENT_STORE_OP_STORE, the render pass takes roughly 4300us and GPU memory usage is significantly increased. That’s more than a 5x slowdown from using the wrong store op!

In contrast with Vulkan, OpenGL ES doesn’t make it clear how to best utilize TBR hardware. As a result, numerous applications are riddled with inefficiencies. With the knowledge of the ideal render pass in the sections above, however, an OpenGL ES application can also perform efficient rendering.

Before getting into the details, you should know about the GL_EXT_multisampled_render_to_texture extension, which allows multisampled rendering to a single-sampled texture and lets the driver do all the above automatically. If this extension is available, it is the best way to get MSAA rendering for nearly free. It is enough to use glRenderbufferStorageMultisampleEXT() or glFramebufferTexture2DMultisampleEXT() with this extension to turn single-sampling into MSAA.
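
Checking whether the extension is available could look like the following C sketch. It assumes the application has already obtained the space-separated extension list, for example from glGetString(GL_EXTENSIONS) with a current context; the helper name is hypothetical. Whole-token matching matters here, because a plain substring search for GL_EXT_multisampled_render_to_texture would also match GL_EXT_multisampled_render_to_texture2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: returns true if `name` appears as a whole token in the
// space-separated extension list `extensions`.
bool has_extension(const char *extensions, const char *name)
{
    if (extensions == NULL || name == NULL || *name == '\0')
        return false;

    const size_t name_len = strlen(name);
    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        // A real match starts at the beginning or after a space, and ends at
        // a space or at the end of the list.
        const bool starts_token = (p == extensions) || (p[-1] == ' ');
        const char end = p[name_len];
        const bool ends_token = (end == ' ') || (end == '\0');
        if (starts_token && ends_token)
            return true;
        p += name_len;
    }
    return false;
}
```

An application might call this once at startup and choose between the extension path and the manual resolve path described in the following sections.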

Now, let’s see what OpenGL ES API calls can be used to create the ideal render pass without that extension.

Single Render Pass

The most important thing is to make sure the render pass is not split into many. Avoiding render pass splits is important even for single-sampled rendering. This is actually quite tricky with OpenGL ES, and drivers do their best to reorder the application’s calls to keep the number of render passes to a minimum.

However, applications can help by having the render pass contain nothing but:

  • Bind programs, textures, other resources (not framebuffers)
  • Set rendering state
  • Draw

Changing framebuffers or their attachments, sync primitives, glReadPixels, glFlush, glFinish, glMemoryBarrier, resource write-after-read, read-after-write or write-after-write, glGenerateMipmap, glCopyTexSubImage2D, glBlitFramebuffer, etc. are examples of things that can cause a render pass to finish prematurely.

Load

To avoid loading data from RAM onto the tile memory, the application can either clear the contents (with glClear()) or let the driver know the contents of the attachment are not needed. The latter is a very important function for TBR hardware that is unfortunately severely underutilized:

const GLenum discards[N] = {GL_COLOR_ATTACHMENT0, …};
glInvalidateFramebuffer(GL_DRAW_FRAMEBUFFER, N, discards);

The above must be done before the render pass starts (i.e. before the first draw of the render pass) if the framebuffer is not otherwise cleared and old data doesn’t need to be retained. This is also useful for single-sampled rendering.

Store

The key to avoiding storing data to RAM is also glInvalidateFramebuffer(). Even without MSAA rendering, this can be used for example to discard the contents of the depth/stencil attachment after the last pass that uses it.

const GLenum discards[N] = {GL_COLOR_ATTACHMENT0, …};
glInvalidateFramebuffer(GL_DRAW_FRAMEBUFFER, N, discards);

It is important to note that this must be done right after the render pass is finished. If it’s done any later, it may be too late for the driver to be able to modify the render pass’s store operation accordingly.

Resolve

Invalidating the contents of the MSAA color attachments alone is not useful; all rendered data will be lost! Before that happens, any data that needs to be kept must be resolved into a single-sampled attachment. In OpenGL ES, this is done with glBlitFramebuffer():

glBindFramebuffer(GL_READ_FRAMEBUFFER, msaaFramebuffer);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

Note that because glBlitFramebuffer() broadcasts the color data into every color attachment of the draw framebuffer, there should be only one color buffer in each framebuffer used for resolve. To resolve multiple attachments, use multiple framebuffers. Depth/stencil data can be resolved similarly with GL_DEPTH_BUFFER_BIT and GL_STENCIL_BUFFER_BIT.

The Full Picture

Here is all the above in action:

// MSAA framebuffer setup
glBindRenderbuffer(GL_RENDERBUFFER, msaaColor0);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8,
                                 width, height);
glBindRenderbuffer(GL_RENDERBUFFER, msaaColor1);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8,
                                 width, height);

glBindFramebuffer(GL_FRAMEBUFFER, msaaFramebuffer);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                          GL_RENDERBUFFER, msaaColor0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                          GL_RENDERBUFFER, msaaColor1);

// Resolve framebuffers setup
glBindTexture(GL_TEXTURE_2D, resolveColor0);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
glBindFramebuffer(GL_FRAMEBUFFER, resolveFramebuffer0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, resolveColor0, 0);

glBindTexture(GL_TEXTURE_2D, resolveColor1);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
glBindFramebuffer(GL_FRAMEBUFFER, resolveFramebuffer1);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, resolveColor1, 0);

// Start with no load. Alternatively, you can clear the framebuffer.
const GLenum discards[] = {GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1};
glBindFramebuffer(GL_FRAMEBUFFER, msaaFramebuffer);
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discards);

// Draw after draw after draw ...

// Resolve the first attachment (if needed)
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer0);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

// Resolve the second attachment (if needed)
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer1);
glReadBuffer(GL_COLOR_ATTACHMENT1);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

// Invalidate the MSAA contents (still accessible as the read framebuffer)
glInvalidateFramebuffer(GL_READ_FRAMEBUFFER, 2, discards);

Note again that it is of utmost importance not to perform the resolve and invalidate operations too late; they must be done right after the render pass is finished.

It is also worth noting that if rendering to a multisampled window surface, the driver does the above automatically as well, but only on swap. Usage of a multisampled window surface can be limiting in this way.

For reference, please see this modification to the “hello-gl2” sample: https://github.com/android/ndk-samples/pull/996. In particular, this commit shows how a single-sampled application can be quickly turned into a multisampled one using the GL_EXT_multisampled_render_to_texture extension, and this commit shows the same with glBlitFramebuffer().

With locked GPU clocks on a Pixel 6 with a recent ARM GPU driver, performance and memory usage are similar between single-sampled rendering and GL_EXT_multisampled_render_to_texture. However, using real multisampled images, glBlitFramebuffer() and glInvalidateFramebuffer(), performance is as slow as if the glInvalidateFramebuffer() call was never made. This shows that optimizing this pattern is tricky for some GL drivers, and so GL_EXT_multisampled_render_to_texture remains the best way to do multisampling. With ANGLE as the OpenGL ES driver (which translates to Vulkan), the performance of the above demo is comparable to GL_EXT_multisampled_render_to_texture.

In this article, we’ve seen one area where TBR hardware particularly shines. When done right, multisampling can add very little overhead on such hardware. Luckily, the cost of multisampling is so high when done wrong that it is very easy to spot. So, don’t fear multisampling on TBR hardware, just avoid the pitfalls!

I hope that with the above knowledge we can see higher quality rendering in mobile games without sacrificing FPS or battery life.
