GPU Talk

Tag Archives: GPU compute

Vivante Vega GPU Geometry and Tessellation Shader Overview

GPU Technology Overview

GPU hardware has gone through an extensive overhaul over the past decade, with the industry moving from first generation fixed function graphics accelerators (the precursor to the GPU) all the way to the current generation of general purpose “shader” pipelines that can be configured for graphics, parallel compute, image/vision, and video workloads. The latest generation of Vega GPU products, designed to stay at least one step ahead of industry trends, is highlighted below, including the addition of geometry shaders (GS) and tessellation shaders (TS) that bring highly detailed visual rendering to the GPU pipeline. The new features allow developers to create photorealistic images and customized effects in their programs, and give consumers PC-level graphics on mobile, home, and embedded products for a seamless experience across any screen.


The images above showcase some of the major differences in visual quality and processing capabilities between successive generations of GPU hardware based on industry standard application programming interfaces (APIs) like OpenGL® ES and Microsoft® DirectX®. Graphics APIs are a common interface that provides a hardware abstraction layer for application developers to access GPU hardware through programming calls to the operating system (OS). With APIs, developers can concentrate on the high level details of their graphics application, such as maximizing performance, visuals, and UI quality, without being concerned with the low level programming details of the underlying GPU hardware and architecture. A simplified process is as follows. When an application wants to render an object onscreen, the application uses standard API function calls. The API calls then go to the OS, which invokes the GPU driver and tells the GPU hardware to draw the corresponding object and display it on the device screen.

APIs are just a starting point and guideline for GPU IP designers to implement their designs. The true differentiator that gives the Vega architecture its advantage comes down to the careful analysis and design of every nut and bolt in the GPU. This secret sauce is continuous optimization of the entire GPU micro-architecture and its algorithms to get the highest performance and a complete feature set in the smallest die area and power budget, achieving silicon PPA (power, performance, area) leadership built around Vivante’s motto of Smaller-Faster-Cooler. The Vega design analysis also goes a step further by diving deep into the entire user experience, from gaming, CAD, productivity apps, and innovative user interfaces to the underlying system level optimizations between the GPU, CPU, VPU, ISP, SoC fabric, memory, and display subsystems. In addition, bringing GS and TS into the Vega GPU pipeline provides further system level enhancements and power reduction, which are discussed below.

GPU Technology Evolution

1st Generation (pre-2002):

These products were based on fixed function graphics hardware that used transform and lighting (T&L) engines and were designed specifically for graphics use. In many cases, the hardware was hard-coded with certain rendering algorithms to speed up performance. Basic rendering features included 3D geometry transformations, rasterization, fixed function lighting calculations, dot products, and texture mapping/filtering. These GPU cores were not able to support any form of programmability beyond basic graphics tasks, and many relied heavily on the CPU to aid in the rendering process. Application developers could create basic characters and animation using multi-pass rendering and multi-textures with some simple in-game artificial intelligence (AI) to make their games more realistic, but they were limited by what the hardware supported.


2nd Generation:

The next revolution in graphics hardware introduced the concept of dynamic graphics programmability using separate vertex (VS) and pixel (PS) shaders. The VS replaced the T&L engine and the PS calculated pixel color and textures to allow high quality details on an object.  This new model provided an additional programmable API layer that developers could tap into by writing shader assembly language to control the graphics pipeline and give them freedom to start customizing their applications. Knowledge of the GPU pipeline was necessary for developer modifications, so fancy visual effects were mostly found in AAA game titles.

Leading GPUs in this generation, like the early versions of Vivante GC Cores, allowed even more programmability through longer and more complex shader programs and high precision (32-bit) rendering. These cores also lowered CPU load by performing all vertex calculations inside the GPU, instead of offloading vertex calculations to the CPU like other designs. New effects in this generation include dynamic lighting, increased character count, increased realism, rigid bodies, and dynamic shading, which made game environments come to life.


3rd Generation:

The next iteration brought even more programmability to mobile GPUs, with Vivante leading the way with the first unified shaders and GPUs with compute capabilities (OpenCL, DirectCompute, and Renderscript). Initial GC Core unified shader designs combined the VS and PS into a single cohesive unit, enabling each shader block to perform vertex and pixel operations. Unified shaders allow each unit to be maximally utilized and load-balanced depending on workload, which minimizes bottlenecks for vertex or pixel bound operations. From a developer perspective, they could view the shader as a single unit instead of separate VS/PS blocks, and scheduling of VS or PS instructions is transparent to them and handled automatically by the GPU driver and hardware. New features introduced included basic game physics (explosions, water ripples, object collisions, etc.), game AI, procedural generation, and custom rendering and lighting, to add another level of realism to applications.
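The load-balancing benefit of unified shaders can be sketched with a toy model. This is only an illustration under stated assumptions: the function names, the unit counts, and the per-frame workloads are hypothetical, and the model ignores scheduling overhead and is not a description of how the GC Core scheduler works.

```python
import math

def cycles_split(vertex_work, pixel_work, vs_units, ps_units):
    """Cycles for a fixed VS/PS split: each unit type only runs its own work."""
    # The frame is done when the slower of the two dedicated groups finishes.
    return max(math.ceil(vertex_work / vs_units),
               math.ceil(pixel_work / ps_units))

def cycles_unified(vertex_work, pixel_work, units):
    """Cycles for a unified pool: any unit can run vertex or pixel work."""
    return math.ceil((vertex_work + pixel_work) / units)

# Hypothetical workloads (work items per frame) on 8 shader units in total.
pixel_bound = (100, 900)
vertex_bound = (700, 300)

for vertex_work, pixel_work in (pixel_bound, vertex_bound):
    split = cycles_split(vertex_work, pixel_work, vs_units=4, ps_units=4)
    unified = cycles_unified(vertex_work, pixel_work, units=8)
    print(f"vertex={vertex_work} pixel={pixel_work}: "
          f"split={split} cycles, unified={unified} cycles")
```

In this model the fixed split leaves half of the units idle whenever the workload is lopsided, while the unified pool takes the same number of cycles for both workloads because only the total amount of work matters.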

The next step forward in the Vivante line-up is the latest Vega GPU cores that go beyond the VS/PS unified shaders. The Vega version of the unified shader architecture builds on the success of the initial design by adding the geometry shader (GS) into the unified pipeline. The GS is an advanced feature which works alongside the VS to allow the shader unit to modify, create, or destroy primitive vertex data (line, point, triangle) without CPU intervention. As a comparison, earlier hardware without a GS only allowed the VS to process one vertex at a time, and the VS could not create or destroy a vertex. With that older method, any modification to vertex data required CPU-GPU coordination (overhead), resource intensive state changes, and creation of a new vertex stream.
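The key difference from a vertex shader is that a geometry stage maps one input primitive to zero, one, or many output primitives. A minimal CPU-side sketch of that idea follows; the function names and the area thresholds are hypothetical and chosen only for illustration, and real geometry shaders are written in a shading language rather than Python.

```python
def triangle_area(tri):
    """Area of a 2D triangle given as three (x, y) tuples."""
    (ax, ay), (bx, by), (cx, cy) = tri
    return abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2.0

def midpoint(p, q):
    """Point halfway between p and q."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def geometry_stage(tri, min_area=0.01, split_area=1.0):
    """Return 0, 1, or 4 triangles for one input triangle."""
    area = triangle_area(tri)
    if area < min_area:
        return []            # destroy: too small to matter
    if area <= split_area:
        return [tri]         # pass through unchanged
    # create: split a large triangle into four by joining edge midpoints
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

mesh = [
    ((0, 0), (0.1, 0), (0, 0.1)),  # tiny triangle, will be dropped
    ((0, 0), (1, 0), (0, 1)),      # medium triangle, kept as is
    ((0, 0), (4, 0), (0, 4)),      # large triangle, split into four
]
output = [out for tri in mesh for out in geometry_stage(tri)]
print(len(mesh), "triangles in,", len(output), "triangles out")
```

A vertex shader could not express this, since it sees one vertex at a time and must emit exactly one vertex for each input.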

The GS also allows the graphics pipeline to access adjacent primitives so they can be manipulated as a closely knit group to create realistic effects where neighboring vertices interact with each other to create effects like smooth flowing motion (hair, clothes, etc.). The GS/VS/PS combination allows more autonomous operation of the GPU to handle state changes internally (minimize CPU-GPU interaction) by adding arithmetic and dynamic flow control logic to offload operations that were previously done on the CPU. The GPU also can support high level programming languages like C/C++, Java, and others to make it more CPU-like in terms of general programmability and branching.

Another important feature is Stream Out, where the VS/GS can output data directly to memory and the data can be accessed automatically and repeatedly by the shader unit or any other GPU block without CPU intervention. Stream Out is useful for recursive rendering (data re-use) on objects that require multiple passes, such as morphing of object surfaces and detailed displacement mapping. The Vega design also adds flexibility so that any stage of the rendering pipeline can output multi-format data and arrays directly to memory (multi-way pipeline) to avoid wasting processing power on intermediate vertices or pixels. Previously, primitive data needed to go through the entire pipeline and exit the PS before being written to memory, which wasted valuable clock cycles if the data was not used. Features using Stream Out go beyond the rendering and programming capabilities of the first unified shaders to include better physics AI where continuous calculations are performed to generate and destroy primitives for realistic effects that simulate waves/ripples, smoke blowing in the wind, blooms, and intense explosions.
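The data re-use pattern behind Stream Out can be sketched as a loop in which the output buffer of one pass becomes the input of the next, with no rebuild of the vertex stream in between. The smoothing rule and the function names below are hypothetical stand-ins for a real multi-pass effect such as surface morphing.

```python
def smooth_pass(heights):
    """One pass: replace each interior vertex height by a 3-point average."""
    out = list(heights)
    for i in range(1, len(heights) - 1):
        out[i] = (heights[i - 1] + heights[i] + heights[i + 1]) / 3.0
    return out

def run_passes(heights, passes):
    """Feed the buffer written by pass N straight back in as pass N+1 input."""
    buffer = list(heights)
    for _ in range(passes):
        buffer = smooth_pass(buffer)
    return buffer

spike = [0, 0, 9, 0, 0]
print(run_passes(spike, 1))
print(run_passes(spike, 2))
```

On hardware with Stream Out the intermediate buffer stays in GPU-accessible memory between passes, which is the part this CPU-side sketch cannot show.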

Other improvements in the Vivante design include improved support for multicore systems to take advantage of multiple threads and multiple processing units for higher performance. Overall system efficiency also improves with this new design, which brings less API call overhead, minimal state changes, better runtime efficiency, and fewer CPU-intensive rendering calls (e.g., reflections and refractions are handled on the GPU).


And Now…Introducing Vivante’s Latest Generation Vega Cores

4th Generation:

We showed the progression of GPU technologies and specifically Vivante GPU cores that include unified VS/GS/PS shader blocks. The latest licensable cores from the Vega series include the addition of the tessellation shader (TS) made up of the corresponding pipeline datapaths including the Hull Shader (known as the Tessellation Control Shader in OpenGL), fixed function Tessellator (Tessellation Primitive Generator in OpenGL), and Domain Shader (Tessellation Evaluation Shader in OpenGL), collectively referred to as the TS for simplicity.

The basic idea of tessellation is to take a polygon mesh or patch and recursively subdivide it to create very fine grained details without requiring a large amount of memory (or bandwidth) to store all the photorealistic details. The GPU receives data at a coarse/low resolution (small memory footprint, low bandwidth) and renders at a high resolution (e.g., 4K games) based on a tessellation factor and LOD (level-of-detail), as shown below from Unigine. The automatic subdivision is considered watertight (no “holes” appear as more vertices are added with tessellation) and everything is performed inside the GPU on the lower resolution model, without CPU intervention. Working on a low resolution model also reduces calculation requirements and significantly cuts power by allowing the GPU to complete the task either by running at a lower frequency over the task execution time, or by initially running at a higher frequency to complete the task faster and then immediately powering down to keep average power low.
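A quick back-of-the-envelope calculation shows why sending a coarse mesh saves memory and bandwidth. The sketch assumes each subdivision level splits a triangle into four and that a vertex occupies 32 bytes; both numbers, and the function names, are illustrative assumptions rather than Vega specifications.

```python
def tessellated_triangles(base_triangles, levels):
    """Triangle count after `levels` rounds of 1-to-4 subdivision."""
    return base_triangles * 4 ** levels

def vertex_bytes(triangles, bytes_per_vertex=32):
    """Upper bound on vertex storage: three unshared vertices per triangle."""
    return triangles * 3 * bytes_per_vertex

coarse = 1000                               # triangles sent to the GPU
fine = tessellated_triangles(coarse, 4)     # triangles actually rendered

print("rendered triangles:", fine)
print("coarse mesh bytes: ", vertex_bytes(coarse))
print("fine mesh bytes:   ", vertex_bytes(fine))
print("savings factor:    ", vertex_bytes(fine) // vertex_bytes(coarse))
```

Under these assumptions four levels of subdivision render 256 times more triangles than were ever stored in or fetched from memory.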


There is also a new primitive type called a “patch” that is only supported with Vivante TS enabled GPUs. A patch has no implied topology and can have between 1 and 32 control points which the TS blocks use to manipulate and detail an object surface. The TS consists of the following blocks:

  • Hull Shader (HS) is a programmable shader that produces a geometry (surface) patch from a base input patch (quad, triangle, or line) and calculates control point data that is used to manipulate the surface. The HS also calculates the adaptive tessellation factor which is passed to the tessellator so it knows how to subdivide the surface attributes.
  • Tessellator is a fixed function (but configurable) stage that subdivides a patch into smaller objects (triangles, lines or points) based on the tessellation factor from the HS.
  • Domain Shader (DS) is a programmable shader that evaluates the surface and calculates new vertex position for each subdivided point in the output patch, which is sent to the GS for additional processing.
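The three blocks can be sketched in software for a single four-corner quad patch. This is a simplified illustration, not the hardware algorithm: the distance-based LOD rule, the uniform grid, the bilinear surface, and all function names are assumptions, and a real hull shader also outputs control point data.

```python
def hull_shader(distance, max_factor=8):
    """Pick a tessellation factor: closer patches get more subdivisions."""
    factor = int(max_factor / max(distance, 1.0))
    return max(1, min(max_factor, factor))

def tessellator(factor):
    """Fixed-function step: emit (u, v) points on a regular grid in [0, 1]."""
    steps = [i / factor for i in range(factor + 1)]
    return [(u, v) for v in steps for u in steps]

def domain_shader(corners, u, v):
    """Evaluate a bilinear quad patch at (u, v) to get a vertex position."""
    p00, p10, p01, p11 = corners
    return tuple(
        (1 - u) * (1 - v) * a + u * (1 - v) * b + (1 - u) * v * c + u * v * d
        for a, b, c, d in zip(p00, p10, p01, p11)
    )

def tessellate_patch(corners, distance):
    """Run HS, Tessellator, and DS in order for one patch."""
    factor = hull_shader(distance)
    return [domain_shader(corners, u, v) for u, v in tessellator(factor)]

quad = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0))
for distance in (1.0, 4.0, 100.0):
    vertices = tessellate_patch(quad, distance)
    print(f"distance={distance}: factor={hull_shader(distance)}, "
          f"{len(vertices)} vertices")
```

The same four input corners produce 81, 9, or 4 vertices depending on distance, which is the adaptive behavior the tessellation factor is meant to provide.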


Vivante’s GPU design also adds improved multithreaded rendering support that includes asynchronous resource loading/creation and parallel render list creation to optimize resource usage and increase performance to take advantage of simultaneous foreground/background processing and prevent rendering bottlenecks. Other optimizations, as shown in the image above, include dynamic realism through physics, soft bodies, high quality details, ray tracing, lighting, shadows, multimedia processing (to tie into video and ISP image processing pipelines), and much more. These special effects can leverage the GPU pipeline or the compute capabilities of the Vega GPU to substantially improve visual quality, lower power, reduce bandwidth, and cut system resource loading to give users an optimized and immersive gaming experience.


From the early days of Vivante in 2004, the company had the foresight to know the next major market for GPUs would be in mobile and embedded products driven by insatiable consumer appetite for the latest features, performance, and HDR rendering quality, similar to what occurred in the PC graphics card market. With this in mind, the initial 2004 Vivante GC architecture was built around the leading API of the time, DirectX 9 (SM 3.0), and was forward looking enough to support OpenGL ES 3.0 years before the ES 3.0 specification was publicly released in 2012! The goal was to bring the latest desktop GPU quality, scalability, and features into ultra-low power mobile products constrained by battery power, thermals, and tiny form factors. So far in the mobile GPU market, Vivante has reached its milestones and continues to innovate and stay one step ahead of the industry to bring the best overall experience to consumers.

Introducing Vega…the latest, most advanced GPUs from Vivante

By Benson Tao

Breaking News…

One of the latest headlines coming out of IDF 2013 in San Francisco today is the unveiling of a next generation GPU product line from Vivante. This technology continues to break through the limits of size, performance, and power to help customers deliver unique products quickly and cost-effectively. The first generation solutions were introduced in 2007 (Generation 1) and upgraded again in 2010 (Generation 2) with new enhancements that were shipped in tens of millions of products. Gen 2 solutions already exceeded PC and console quality graphics rendering, which is the standard other GPU IP vendors strive to reach today. The next version (Gen 3) successfully hit key industry milestones by becoming the first GPU IP product line to pass OpenCL™ 1.1 conformance (CTS) and the first IP to be successfully designed into real time mission critical Compute applications for automotive (ADAS), computer vision, and security/surveillance. The early Gen 3 cores, designed and completed before the OpenGL ES 3.0 standard was fully ratified, were forward looking designs that have already passed OpenGL ES 3.0 conformance (CTS) and application testing. Many of the latest visually stunning games can be unleashed on the latest Gen 3 hardware found in leading devices like the Samsung Galaxy Tab 3 (7″), Huawei Ascend P6, Google Chromecast, GoogleTV 2.0/3.0, and other 4K TVs.

With the unveiling of Vivante’s fourth generation (codenamed “Vega”) ScalarMorphic architecture, the latest designs provide a foundation for Vivante’s newest series of low-power, high-performance, silicon-efficient GPU cores. Vivante engineering continues to respond quickly to industry developments and needs, and continuously refines and enhances its hardware specifications in order to remain at the top of the industry through partnerships with ecosystem vendors.

Sample of Vivante Powered Products

What is Vega?

Vega is the latest, most advanced mobile GPU architecture from Vivante. Leveraging over seven years of architectural refinements and more than 100 successful mass market SOC designs, Vega is the culmination of knowledge that blends high performance, full featured API support, ultra low power, and programmability into a single, well defined product that changes the industry dynamics. SOC vendors can now double graphics performance and support the latest API standards like OpenGL ES 3.0 in the same silicon footprint as the previous generation OpenGL ES 2.0 products. Silicon vendors can also leverage the Vega design to achieve performance equivalent to a leading edge silicon process in a cost effective mainstream process. This effectively means that given the same SOC characteristics, a TSMC 40nm LP device can compete with a TSMC 28nm HPM version at a more affordable cost, which opens up the market to mainstream silicon vendors that were initially shut out of leading edge process fabrication due to its high initial costs.

Vega is also optimized for Google™ Android and Chrome products (but also supports Windows, BB OS and others), and fast forwards innovation by bringing tomorrow’s 3D and GPU Compute standards into today’s mass market products. Silicon proven to have the smallest die area footprint, graphics performance boost, and scalability across the entire product line, Vega cores extend Vivante’s current leadership in bringing all the latest standards to consumer electronics in the smallest silicon area. Vega 3D cores are adaptable to a wide variety of platforms from IoT (Internet-of-Things) and wearables, to smartphones, tablets, TV dongles, and 4K/8K TVs.

Whether you are looking for a tiny single shader stand-alone 3D core or a powerhouse multi-core multi-shader GPU that can deliver high performance 3D and GPGPU functionality, Vivante has a market-proven solution ready to use. There are several options available when it comes to 3D GPU selection: 3D only cores, 3D cores designed with an integrated Composition Processing engine, and 3D cores with full GPGPU functionality that blend real-life graphics with GPU Compute. Vivante is already noted in the industry as the IP provider with the smallest, full-featured licensable cores in every GPU class.

Now let’s dive into some of the Vega listed features to see what they mean…

Hardware Features

  • ScalarMorphic™ architecture
    • Optimized for multi-GPU scalability and multi-threaded, multi-core heterogeneous platforms. This makes the GPU and GPU Compute cores as independent or cohesive as needed, flexible and developer friendly as new applications built on graphics + compute come online.
    • The same premium core architecture as previous generations is still intact, but it has been improved over time to remove inefficiencies. This also allows the same unified driver architecture to work with Vega cores and previous GC cores, so there is no waste of previous developer resources to re-code or overhaul apps for each successive Vivante GPU core.
    • Advanced scheduler and command dispatch unit for optimized shader load balancing and resource allocation.
    • Dynamic branching and non-constant varying indexing.
  • Ultra-threaded, unified shaders
    • Maximize graphics throughput, process millions of threads in parallel, and minimize latency.
    • The GPU scheduler and cores can process other threads while waiting for data to return from system memory, hiding latency and ensuring the cores are being used efficiently with minimal downtime. Context switching between threads is done automatically in hardware which costs zero cycles.
    • These shaders are more than just single way pipelines: added features make the GPU more general purpose, with multi-way pipelines that benefit the various types of processing required for graphics and compute.
  • Patented math units that work in the Logarithmic space
    • In graphics there are different methods to calculate math and get the correct results. With this method Vega cores can reduce area, power, and bandwidth, which speeds up overall system performance.
  • Fast, immediate hidden surface removal (HSR)
    • Reduces render processing time by an average of 30%, since a more advanced method to remove back-facing or obscured surfaces is implemented on the fly so minimal or no pre-processing time is wasted. This also goes beyond past versions where the GPU was automatically removing individual pixels (e.g., early Z, HZ, etc.).
  • Power savings
    • Saves power up to 65% over previous GC Cores using intelligent DVFS and incremental low power architectural enhancements.
  • Proprietary Vega lossless compression
    • Reduces on-chip bandwidth by an average of 3.2:1 and streamlines the graphics subsystem including the GPU, composition co-processor (CPC), interconnect, and memory and display subsystems. This is important to make sure the entire visual pipeline, from when an app makes an API call to the output on the screen, is smooth and crisp at optimal frame rates, with no artifacts or tearing regardless of the GPU loading.
  • Built-In Visual Intelligence
    • ClearView image quality – Life-like rendering with high definition detail, MSAA, and high dynamic range (HDR) color processing. This improves image quality, clarity, and matches real life colors that are not oversaturated.
    • Large display rendering – Up to 4K/8K screen resolution including multi-screen support that makes sure the GPU pipelines are balanced.
    • New additions using color correction can be implemented to correct color, increase color space using shaders (or OpenCL/RS-FS) or FRC.
    • NUIs can also take advantage of visual processing for motion and gesture.
  • Industry’s smallest graphics driver memory footprint
    • For the first time, smaller embedded or low end consumer devices and DDR-cost constrained systems can now support the latest graphics and various compute applications that fit those segments. With a smaller footprint you don’t need to increase system BOM cost by adding another memory chip, which is crucial in the cost sensitive markets.
    • There are also Vivante options that support DDR-less MCU/MPUs in the Vega series where no external DDR system memory exists.
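The general idea behind math units that work in logarithmic space can be sketched as follows: in the log domain a multiply becomes an add, a divide becomes a subtract, and a power becomes a multiply, which are cheaper operations to build in hardware. The source does not describe how the patented Vega units are implemented, so the functions below are a hypothetical illustration of the principle only, restricted to positive inputs.

```python
import math

def log_mul(a, b):
    """Multiply two positive numbers by adding their base-2 logarithms."""
    return 2.0 ** (math.log2(a) + math.log2(b))

def log_div(a, b):
    """Divide two positive numbers by subtracting their base-2 logarithms."""
    return 2.0 ** (math.log2(a) - math.log2(b))

def log_pow(a, exponent):
    """Raise a positive number to a power by scaling its base-2 logarithm."""
    return 2.0 ** (exponent * math.log2(a))

print("3 * 5    ~", log_mul(3.0, 5.0))
print("1 / 4    ~", log_div(1.0, 4.0))
print("sqrt(9)  ~", log_pow(9.0, 0.5))
```

Results match ordinary arithmetic up to floating point rounding, which is why this approach suits graphics workloads that tolerate small approximation errors.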

More About the Shaders

  • Dynamic, reconfigurable shaders
    • Pipelined FP/INT double (64-bit), single/high (32-bit) and half precision/medium (16-bit) precision IEEE formats for GPU Compute and HDR graphics.
    • Multi-format support for flexibility when running compute in a heterogeneous architecture where coherency exists between CPU-GPU, high precision graphics, medium precision graphics, computational photography, and fast, approximate calculations (for example, some image processing algorithms only need approximate results, trading accuracy for speed). With these options, the GPU has full flexibility to target multiple applications.
    • High precision pipeline with support for long instructions.
  • Gigahertz Shaders
    • Updated pipeline enables shaders to run over 1 GHz, while lowering overall power consumption.
    • The high speed along with intelligent power management allows tasks to finish sooner and keep the GPU in a power savings state longer, so average power is reduced.
    • Cores scalable from tens of GFLOPS to over 1 TFLOP in various multi-core GPU versions.
  • Stream-Out Geometry Shaders
    • Increases on-chip GPU processing for realistic, HDR rendering with stream-out and multi-way pipelines.
    • The GPU is more independent when using GS since it can process, create and destroy vertices (and perform state changes) without taking CPU cycles. Previous versions required the CPU to pre-process and load states when creating vertices.
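The trade-off between the double, single, and half precision formats listed above can be seen by storing the same value in each IEEE 754 format and reading it back. The sketch uses Python's standard struct module; the helper name and the choice of 0.1 as the test value are arbitrary.

```python
import struct

def round_trip(value, fmt):
    """Store `value` in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

FORMATS = (
    ("half / medium (16-bit)", "<e"),
    ("single / high (32-bit)", "<f"),
    ("double (64-bit)", "<d"),
)

value = 0.1
for name, fmt in FORMATS:
    stored = round_trip(value, fmt)
    print(f"{name:24s} stored={stored!r} error={abs(stored - value):.3e}")
```

Half precision uses a quarter of the storage and bandwidth of double precision at the cost of a much larger rounding error, which is acceptable for many color and image processing calculations but not for all compute workloads.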

Application Programming Interface (API) Overview

Some of the APIs supported by Vega are listed below. This is not an exhaustive list, but it includes the key APIs in the industry and shows the flexibility of the product line.

  • Full featured, native graphics API support includes:
    • Khronos OpenGL ES 3.0/2.0, OpenGL 3.x/2.x, OpenVG 1.1, WebGL
    • Microsoft DirectX 11 (SM 3.0, Profile 9_3)
  • Full Featured, native Compute APIs and support:
    • Khronos OpenCL 1.2/1.1 Full Profile
    • Google Renderscript/Filterscript
    • Heterogeneous System Architecture (HSA)

Product Line Overview

Please visit the Vivante homepage to find more information on the Vega product line.

  • GC400L – Smallest OpenGL ES 2.0 Core – 0.8 mm² in 28nm
  • GC880 – Smallest OpenGL ES 3.0 Core – 2.0 mm² in 28nm

| Series | GC400 | GC800 | GC1000 | GC2000 | GC3000 | GC4000 | GC5000 | GC6000 | GC7000 |
| Core Clock in 28HPM (WC-125), MHz | 400 | 400 | 800 | 800 | 800 | 800 | 800 | 800 | 800 |
| Shader Clock in 28HPM (WC-125), MHz | 400 | 800 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Pixel Rate (MPixel/sec, no overdraw) | 200 | 400 | 800 | 1600 | 1600 | 1600 | 1600 | 3200 | 6400 |
| Triangle Rate (M tri/sec) | 40 | 80 | 123 | 267 | 267 | 267 | 267 | 533 | 1067 |
| Vertex Rate (M vtx/sec) | 100 | 200 | 500 | 1000 | 1000 | 2000 | 2000 | 4000 | 8000 |
| Shader Cores (Vec 4), High/Medium Precision | 1 | 1 | 2 | 4 | 4/8 | 8 | 8/16 | 16/32 | 32/64 |
| Shader GFLOPS, High/Medium Precision | 3.2 | 6.4 | 16 | 32 | 32/64 | 64 | 64/128 | 128/256 | 256/512 |

Vega designations within these series: Vega-Lite, Vega 1X, Vega 2X, Vega 4X, Vega 8X

API Support:

  • OpenGL ES 1.1/2.0
  • OpenGL ES 3.0 (Optional on two series)
  • OpenGL 2.x Desktop
  • OpenVG 1.1
  • OpenCL 1.2 (Optional on two series)
  • DirectX 11 (9_3) SM 3.0 (Optional on two series)

Key: ✓  (Supported)   – (Not supported)
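
The throughput figures in the table (3.2, 6.4, 16, 32, ...) are consistent with peak GFLOPS computed from the shader core count and shader clock. The formula below is an inference from the table values, not something the source states: it assumes 4 lanes per Vec 4 core and 2 floating point operations per lane per clock (one multiply-add), and the function name is hypothetical.

```python
def peak_gflops(vec4_cores, shader_clock_mhz, lanes=4, flops_per_lane=2):
    """Peak GFLOPS assuming one multiply-add per lane per shader clock."""
    return vec4_cores * lanes * flops_per_lane * shader_clock_mhz / 1000.0

# (series, Vec 4 cores at high precision, shader clock in MHz) from the table
rows = [
    ("GC400", 1, 400),
    ("GC800", 1, 800),
    ("GC1000", 2, 1000),
    ("GC2000", 4, 1000),
    ("GC7000", 32, 1000),
]
for series, cores, clock in rows:
    print(f"{series}: {peak_gflops(cores, clock)} GFLOPS")
```

Doubling the core count for medium precision doubles the result, which matches the paired entries such as 256/512 for the GC7000 Series.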

Exciting Updates @ SIGGRAPH 2013 HSA BoF

Please join us at the Heterogeneous System Architecture (HSA) Foundation’s BoF (Birds of a Feather) talk at SIGGRAPH 2013. Phil Rogers, President of HSA and AMD Fellow, will give the keynote speech and update us on the exciting progress they have made to push the standard and technology forward. The BoF session will also have a Q&A section where you can get answers to some of your toughest questions.

Please look for us when you are there to ask us how we are innovating in this area, or you can just say “Hello” to us.

Event: HSA Foundation BoF
Date: July 24th
Time: 1 pm
Location: Anaheim Convention Center (Room 202 B)