By Benson Tao (Vivante Corporation)
The rise of Huawei is a corporate success story (and MBA case study) of determination and the will to do whatever it takes to make a difference in people’s lives. This goal is achieved by creating the best and most innovative products possible without cutting corners or taking shortcuts. This perseverance has turned Huawei from a tiny company that started out selling PBX (private branch exchange, or telephone switch) equipment into a global behemoth that recently took the crown as the world’s largest telecoms equipment maker, surpassing Ericsson last year. The rapid rise of Huawei has also made it one of the top global brands and a household name in some parts of the world. For Fast Company’s fifth most innovative company in the world in 2010, a natural extension of the product line was to develop cutting edge smartphones and tablets to complement its existing user and communications infrastructure base. In the past year, the fruits of that labor have helped push the mobile market forward with a few leading (and surprising) innovations:
- Fastest Quad core smartphone (Ascend Quad D)
- One of the first 1080p smartphones (Ascend Quad D2)
- World’s fastest LTE smartphone (Ascend P2)
- …AND the just announced thinnest smartphone in the world, measuring only 6.18 mm thick (Ascend P6)
Ascend P6 (Source: Huawei)
What do all these leading Huawei products have in common, other than all being branded under the Ascend name? At the heart of each product is one of the best architected Quad Core CPU and GPU combinations the market has seen, all packaged in a Hisilicon K3V2 SoC. Hisilicon is the semiconductor division of Huawei and is one of the suppliers of products to their systems division that defines or builds final products. Huawei can source product from Hisilicon, Qualcomm, and others to fit product requirements (cost, performance, low power, etc.).
One reason the K3V2 was chosen to power their flagship phones was the capability, scalability, extreme low power, and performance of the Vivante GC4000 (Graphics and Compute) GPU. Extensive due diligence to qualify the Vivante architecture was done under a microscope to make sure the GPU could meet all relevant claims. In the end, the GC4000 met or exceeded the demanding test criteria for 3D graphics performance to play the most intense and detailed mobile 3D games, and the Vivante CPC (Composition Processing Core) helped accelerate their intuitive Emotion UI user interface for responsive, smooth, and fluid feedback.
The Ascend P6 is the best and most beautiful smartphone Huawei has designed, with intricate details going into its amazing design (look), high quality materials, metallic body (feel), and ease of use. For more information about the Ascend P6, please visit their minisite.
By Benson Tao (Vivante Corporation)
The world is very visual and thanks to the popularity of media processors (GPUs and VPUs), multimedia has become a ubiquitous part of life. Most of the things we interact with have some sort of screen, ranging from the usual consumer products to washing machines, thermostats, and any other product or gadget that needs to be upsold. Stick a screen on something, maybe add a camera, install Android, and you have a device that could be part of your Internet of Things (IoT) which you can probably sell for a premium…before the next product comes out the next day.
With the rise of multimedia as a standard feature across all screened devices has come explosive growth in media exchange, including pictures, images, and videos. YouTube has 100 hours of video content uploaded every minute! Facebook has 250 million photos uploaded daily and over 100 billion (as of 2011) photos on its servers! Numbers with this many zeroes have never been seen before, and each day they get larger. The rise of visual media content is great: you can find clips of cats playing pianos, live courtside highlights from the NBA playoffs, and pretty much anything that catches your attention. The flipside of the story is: how do you search for content without getting lost in the noise? Going through links to find what you want can take a few seconds for simple, straightforward queries, or several minutes to hours for more involved searches. That time is rising with the massive amounts of data, and getting relevant search results fast is what counts. As more searches involve multimedia, text hyperlinks will not be enough, so a new generation of visual search (VS), including mobile visual search (MVS), will become standard as the new hyperlink. QR codes tried to be different, but they have not been successful.
From a high level, VS/MVS is a way of interacting with the world around you. Those interactions can take the form of video, graphics, or audio. Bits of these technologies are already available; it’s just a matter of packaging them together in the most efficient way. As an example, there are apps that can “listen” to a song and tell you what it is (Shazam is one example). Google Goggles is also capable of MVS. In general, a device takes a picture and the image (ex. JPEG) is sent over the network to a server that compares it against a database and tries to return an answer. Different methods have been used to reduce transmission and response time, such as pre-processing on the smartphone, which narrows the search window to a smaller subset and possibly eliminates server interaction for some use cases.
GPUs in VS/MVS
This is where visual search or mobile visual search comes into play, and where the GPU inside a smartphone/tablet or TV applications processor has a role in the next generation of search. We have seen PC GPUs play an important role in identifying faces in photos (visual search) using OpenCL, with software from Cyberlink or Corel that can identify and sort photos based on different queries. You can search for all photos with John Doe and the GPU searches through the image database and returns all photos with John Doe (assuming he doesn’t have a hat and sunglasses on). The results are pretty good, not perfect, but nothing that has this much complexity is perfect in the real world.
Fast forward to today…mobile GPUs pack so much horsepower into die area constrained designs that, if left unused, it just sits idle and the efficiency (and potential) of the technology is wasted. Many groups, like those at Vivante, are researching how to add mobile visual search optimizations to GPUs, and the push forward will improve implementations in augmented reality (AR), GPU accelerated MVS, vision processing (can include automotive, consumer electronics, etc.) and many other applications.
The idea of using the GPU has a few areas of consideration:
- Bandwidth efficiency – Compression or pre-computations on the device to minimize server-client transmissions. Part of this also involves network latency and possibly a lower quality image if there are transmission limitations.
- Lower Power / Thermals on Server Side – Servers need to compare an image vs. a database, which is a heavy computational workload.
- Lower Power / Thermals on Client Side – GPUs are fast and efficient when dealing with parallel workloads. Many search algorithms use parallel computations which are best suited for the GPU.
- Higher Quality VS/MVS Results – Search results can scale with the GPU with increased accuracy (since GPUs are good at image/pixel processing).
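The bandwidth point can be made concrete with a small sketch. It assumes a simple "average hash" as the on-device descriptor, which is a stand-in chosen for brevity and not the descriptor any particular VS/MVS product or standard uses; the function names and the synthetic 64×64 test images are hypothetical. The device reduces the image to a 64-bit value and sends only that, and the server compares descriptors by Hamming distance.

```python
def average_hash(gray, hash_size=8):
    """Reduce a grayscale image (list of rows, values 0-255) to a
    hash_size*hash_size-bit integer. Assumes the image is at least
    hash_size pixels in each dimension."""
    h, w = len(gray), len(gray[0])
    cells = []
    # Downsample: average each block of pixels into a single cell.
    for by in range(hash_size):
        y0, y1 = by * h // hash_size, (by + 1) * h // hash_size
        for bx in range(hash_size):
            x0, x1 = bx * w // hash_size, (bx + 1) * w // hash_size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: is the cell brighter than the overall mean?
    bits = 0
    for c in cells:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


# Synthetic test images (assumptions for illustration only).
horizontal = [[4 * x for x in range(64)] for _ in range(64)]    # left-to-right ramp
brighter = [[4 * x + 3 for x in range(64)] for _ in range(64)]  # same scene, brighter
vertical = [[4 * y for _ in range(64)] for y in range(64)]      # a different scene

raw_bytes = 64 * 64        # 8-bit grayscale payload if the image were sent as-is
descriptor_bytes = 64 // 8 # payload if only the 64-bit descriptor is sent

same_scene_distance = hamming(average_hash(horizontal), average_hash(brighter))
other_scene_distance = hamming(average_hash(horizontal), average_hash(vertical))
```

The per-block averaging is independent for every cell, which is the kind of work that maps naturally onto GPU shaders; here it is a plain loop for readability.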
MPEG Group Visual Search Development
I recently saw a new specification being developed by the folks at MPEG (Moving Picture Experts Group) related to visual search, Compact Descriptors for Visual Search, and this is an exciting topic where GPU technology can make it shine. If you have not heard of MPEG, they are responsible for creating the video/audio standards used by pretty much whatever video content you watch. Things like MP3 (audio), MPEG-2, MPEG-4, H.264, and HEVC have all come from that consortium. Since they are the 800 lb gorilla when it comes to video, they want to create a standard VS/MVS solution that is efficient (bandwidth, algorithm computational overhead, etc.) and interoperable across any device and platform. This is great news since fragmentation can break a superb idea, and Khronos is looking at how it can cooperate with MPEG and what role the standard can play in future APIs. This should open the door to exciting new applications on any device.
By Benson Tao (Vivante Corporation)
People spend a significant amount of time in their cars, whether commuting to work, going to the mall with the kids, or taking a road trip with loved ones. The car has evolved into an extension of our lives outside the home that blends driving fun with full featured electronics that give people a consumer device interface. In-car electronics have also moved beyond entertainment and fancy HMI (human-machine interface) displays to include intelligent safety monitoring and occupant protection systems. The automotive OEMs that build compelling consumer centric HMI / entertainment IVI (in-vehicle infotainment) and advanced safety features will be the ones driving higher automobile sales, and the best sellers will be the ones that create the most immersive in-car living room experience in a safe environment where the vehicle is “aware” of its surroundings. New GPU technologies enable automotive OEMs to realize both types of technologies, with 3D graphics for visual eye candy and GPGPU (general purpose computing on the GPU) using OpenCL for safety applications. Upcoming vehicles will ship with a predefined set of functions available on the HMI or IVI, but the car owner will be able to install different apps to customize the car interface. GENIVI (www.genivi.org) is an example of an open platform industry consortium taking IVI to the next level.
3D graphics have been heavily used in the mobile market with the rapid expansion of Android and iOS. Each successive release of a mobile operating system or hardware technology pushes the visual envelope in terms of UI, 3D game play and captivating visual content. Since consumers are familiar with the look and feel of their existing mobile devices, the automotive market has taken note of this and started looking at in-vehicle platforms that display information in a similar manner. The first generation automobiles with embedded GPUs had basic graphics functionality which was limited in performance and capabilities since graphics was not an important requirement during the time that pre-dated the first iPhone shipments. Once the iPhone took off, adoption of GPU IP into system-on-chips (SOC) really took off and brought graphics into the spotlight where it could make-or-break a product. With this new paradigm shift, graphics proliferated into many important markets including the automotive industry where some SOC vendor designs are awarded based on the GPU inside.
Leveraging the mass market use of 3D graphics in mobile devices and building on the existing ecosystem around 3D graphics, dynamic and fancy UIs, and apps, automotive OEMs are using these building blocks to transform themselves from car makers to a new breed of consumer-focused automotive manufacturers that have the HW (car) and SW (apps, app store) to turn the driving experience upside down. One step to create this transition to their new business model is to bring the familiar graphical interfaces and user experiences found on tablets and TVs and transform the car into an entertainment hub powered through the IVI system and driver HMIs to add eye candy to console data displays. These displays need to scale to higher resolutions with higher DPI on HD screens that are crisp, clear, vivid, and responsive. The migration towards a visual-centric automobile console shows the importance of the GPU and how it has changed from a nice-to-have feature to a must-have requirement that sways technology decisions in the automotive ecosystem. The technology is available to put the pieces together in terms of hardware, software, middleware, and operating systems – it just comes down to putting the pieces together to make the final product and bringing the next generation graphics-centric solutions to a dealer near you, that goes beyond what is available today.
Safety is another major feature that influences purchase decisions. The term ADAS or Advanced Driver Assistance Systems describes the latest electronic technologies found in vehicles that focus on increasing safety for occupants, pedestrians, and surrounding vehicles. Features included in ADAS that monitor, predict, and try to prevent accidents include active safety monitoring, collision avoidance systems (CAS), object/pedestrian recognition, lane departure warning, adaptive cruise control, and more. Current solutions use a combination of DSPs, CPUs, and in some cases FPGAs with built-in computational units to perform safety monitoring. These solutions use hand written code for specific products, making them harder to port to new platforms or when changing components like DSPs or CPUs. With the GPU you can overcome these limitations by writing algorithms in OpenCL (described in more detail later) with some GPU based OpenCV libraries, and the code can be re-used across various platforms since it is cross platform compatible. In the near future, the compiler will be able to partition code to be executed on the most efficient compute element (GPU, CPU, DSP) in a platform to give the best overall performance. Parallel workloads will go to the GPU and serial workloads can go to the DSP or CPU.
Some automakers are looking at harnessing the massively parallel processing power of GPUs to reduce parallel algorithm execution times and speed-up real-time response in ADAS. Since the GPU is inherently fast at image and pixel processing, incoming pixel data from camera sensors and other sensors that are parallel in nature can be sent through the GPU to be processed. In addition GPGPU APIs like OpenCL can help process parallel data streams (sensor fusion) from cameras, GPS/WiFi positioning data, accelerometers, radar, and LIDAR to guide vehicles safely. Current solutions focus on computer vision (CV) as a first step, but moving forward data from other sensors can be sent to the GPU to offload other computational resources in a system. Autonomous cars like Google’s driverless car and those in DARPA competitions have already demonstrated what the future of ADAS will evolve into.
Entertainment and safety can be met with the latest semiconductor technologies like those found in the Freescale i.MX 6 automotive grade applications processors to enhance 3D/2D/VG graphics (HMI rendering, games, and user interface composition) and OpenCL (ADAS, computer vision, and gesture). So far the i.MX 6 is the only product that targets automotive with advanced graphics like OpenGL ES 3.0 and GPU compute with OpenCL 1.1.
Source: Mercedes Benz
The Evolution of GPUs from Graphics to General Purpose Computation Cores (GPGPU)
The GPU was originally designed for 3D graphics applications and image rendering during the rasterization process. Over time the computational resources of modern graphics processing units became suitable for certain general parallel computations due to the massively parallel processing capabilities native to GPU architecture. Graphics is one of the best cases of parallel processing where the GPU needs to execute on billions of pixels or hundreds of millions of triangle vertices in parallel.
GPU architectures process independent vertices, primitives and fragments in great numbers using a large number of graphics shaders, which are also known as arithmetic logic units (ALUs) in the CPU world. Each primitive is processed the same way, using the same program or kernel. Many computational problems like image processing, analytics, mathematical calculations and others map well to this single-instruction-multiple-data (SIMD) architecture. The calculation speed-up shown and proven on SIMD processors was quickly recognized by researchers and developers and another area of high performance computing built on the vast processing power of GPUs was born. Today and in the near future, the fastest supercomputers and processing units use or will use GPU technology for the highest compute performance, calculation density, time savings, and overall system speed-up. The GPU has morphed from a graphics processor into a general purpose co-processor that sits alongside the CPU in today’s platforms.
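As a rough illustration of that data-parallel model, the sketch below applies one small "kernel" to every pixel of a tiny image. It is plain Python standing in for a GPU, and the names `luma_kernel` and `run_data_parallel` are hypothetical. The point is only that each element is processed by the same program with no dependence on its neighbors, so the work could be executed in any order, or all at once.

```python
def luma_kernel(pixel):
    """The per-element program: convert one (R, G, B) pixel to 8-bit luma.
    Uses an integer approximation of the common 0.299/0.587/0.114 weights."""
    r, g, b = pixel
    return (77 * r + 150 * g + 29 * b) >> 8


def run_data_parallel(kernel, data):
    """Apply the same kernel to every element independently.
    A GPU would spread these calls across its shader units; this
    emulation simply loops over them."""
    return [kernel(item) for item in data]


pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
gray = run_data_parallel(luma_kernel, pixels)

# Because no element depends on another, processing order does not matter.
gray_reversed_order = run_data_parallel(luma_kernel, pixels[::-1])[::-1]
```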
The Penalties That Come With Less Than Optimal Graphical Processing
When selecting a GPU, there are certain requirements that need to be met when it comes to performance, power, and capabilities. Performance not only includes graphics benchmark results and 3D games, but also testing different applications that mirror real world use cases so the applications processor and GPU give the best overall user experience. As screen resolutions increase in both mobile devices and in-vehicle screens, the pixel count and triangle count (3D complexity) go up, leading to higher demand on the GPU as more objects need to be rendered onscreen. An underpowered GPU will lead to low performance (dropped frames, low FPS, image artifacts, incorrect rendering) and pretty much a non-usable device, as evidenced by some of the first generation tablets that shipped but were never used extensively. To get to the latest consumer electronics product levels, the GPUs in cars need to be upgraded from OpenGL ES 1.1 graphics to ES 2.0 and ES 3.0 capable cores with added shader performance to create eye catching visuals. i.MX 6 was one of the first SoCs where graphics was specifically defined at the product planning stages, as Freescale had a vision of the car as a node in the Internet of Things (IoT) and graphics as the interface that couples man and machine. Content for cars (streamed media, social, games, apps) is also increasing as they become digitally connected with the rest of the consumer ecosystem. i.MX 6 is currently the only automotive SoC to support the latest APIs including OpenGL ES 3.0 and OpenCL 1.1. Other SoCs from Texas Instruments and Renesas, as well as FPGA based solutions, have graphics capabilities but rely on other solutions for OpenCL.
The evolution of in-vehicle graphics went from an afterthought to a must have feature, migrating from simple onscreen text (that used either the CPU or a simple 2D engine), to 2D graphics and then basic 3D. Today, there is another transition to advanced GPU rendering as seen on consumer devices: detailed 3D models of your car in the console that highlight parts of the car in an easier to see format, 3D maps with street and building details, customizable/configurable HMI consoles similar to personalizing an Android smartphone, and much more. The initial solutions were underpowered, but over time consumer expectations have grown to match their mobile devices, going from a UI that was either scaled down or limited by specific hardware (ex. fewer icons, fewer menu layers, and basic 3D graphics) to products that blur the line between consumer and auto.
According to Richard Robinson, principal analyst for automotive infotainment at iSuppli, “Infotainment hardware has undergone a rapid evolution during the last 13 years, moving from the traditional approach of dedicated hardware blocks, to the advent of bus-connected distributed architecture systems in the 2000 time frame, to the highly-integrated navigation-centric systems of 2006, to the new user-defined systems of today.”1
“The traditional boundaries between home, mobile and automotive infotainment systems are quickly going away. Consumers are now expecting the same features and equal access to their data across all these platforms,” said Jim Trent, VP and GM at NEC Electronics America2.
An Overview of OpenCL and Its Benefits
OpenCL (Open Computing Language) is an open industry standard application programming interface (API) used to program multiple devices including GPUs, CPUs, as well as other devices organized as part of a single computational platform. The standard targets a wide range of devices from consumer electronics (smartphone, tablets, TVs) to embedded applications like automotive ADAS and computer vision (CV). Applications that already take advantage of the OpenCL performance speedup include medical imaging, video/image processing, high performance computing (HPC), robotics, surveillance, “Big Data” analytics, augmented reality, and gesture (motion, NUI). We will focus on the GPU aspect of OpenCL below.
The evolution of GPU computing has gone through a few major milestones. Pre-OpenCL, a program would be specifically written for and executed on a target device. This limited the features, performance, and calculation throughput to the device characteristics and there was not much flexibility beyond the hardware’s capabilities. The next step forward was the introduction of OpenCL where a hardware abstraction layer is created that separates the application from what is “under-the-hood” for ease of use and cross-platform portability. The abstraction layer queries all computational resources in a platform and uses them in the best way as a single cohesive unit to leverage as much computing horsepower as possible. Moving forward as we progress from OpenCL 1.1/1.2 to 2.0, advanced API features will be added along with making the solution even easier for general purpose programming.
At a high level, OpenCL provides both a programming language and a framework to enable parallel programming. The programming language is based on ISO C99 with math accuracy based on the IEEE 754 standard. OpenCL also includes libraries and a runtime system to assist and support software development. A developer can write general purpose OpenCL programs that execute directly on a GPU without needing to know 3D graphics or 3D APIs like OpenGL or DirectX. OpenCL also provides a low-level hardware abstraction layer as well as a framework that exposes many details of the underlying hardware layer, allowing the programmer to take full advantage of it.
OpenCL uses the parallel execution SIMD (single instruction, multiple data) engines to enhance data computation density by performing massively parallel data processing on multiple data items, across multiple compute engines. Each compute unit has its own ALUs, including pipelined floating point (FP) and integer (INT) units, and a special function unit (SFU) that can perform computations as well as transcendental operations. The parallel computation and its associated series of operations is called a kernel, and the Vivante cores can execute millions of parallel kernels at any given time.
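To show how a kernel is spread over many data items, here is a minimal Python emulation of a one-dimensional launch. It is a sketch under stated assumptions, not OpenCL code and not a description of Vivante hardware scheduling: `enqueue_1d` and `saxpy_kernel` are hypothetical names, and the loops run serially where a GPU would run work-items concurrently. What it does mirror is the OpenCL indexing rule that, with no global offset, a work-item's global ID equals its group ID times the local size plus its local ID.

```python
def saxpy_kernel(gid, a, x, y, out):
    """Kernel body: each work-item computes exactly one output element."""
    out[gid] = a * x[gid] + y[gid]


def enqueue_1d(kernel, global_size, local_size, *args):
    """Emulate a 1-D launch of global_size work-items in groups of local_size."""
    # OpenCL 1.x likewise expects the global size to divide evenly into groups.
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):  # groups map to compute units
        for local_id in range(local_size):             # work-items within a group
            gid = group_id * local_size + local_id     # global ID of this work-item
            kernel(gid, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0] * 8
out = [0.0] * 8
enqueue_1d(saxpy_kernel, 8, 4, 2.0, x, y, out)  # 2 groups of 4 work-items
```

Since every work-item writes a different element of `out`, the result does not depend on which group or work-item runs first, which is what lets the hardware schedule them freely.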
A Deeper Discussion of Graphics and OpenCL Benefits using Freescale’s i.MX 6 As An Example
Freescale uses GPU technology from Vivante, a leading GPU IP provider based in Sunnyvale, California, to provide the 3D graphics, OpenVG, and OpenCL compute capability in their automotive grade i.MX 6 product line3. The i.MX 6 applications processor is the industry’s first scalable, multicore ARM Cortex-A9 product line that spans single, dual, and quad core CPU architectures that are pin and software compatible. Integrated into the i.MX 6 are the GC2000 3D and OpenCL GPU, the GC355 for fast hardware OpenVG acceleration, and the GC320 composition processing core (CPC) to compose the screen content which the user sees. The applications processor also integrates the image processing unit (IPU) that accepts multiple camera input streams into the i.MX 6 for processing (ex. 360 degree view, rear view camera, and blind-spot detection).
The 3D graphics core provides 200 million triangles per second of rendering horsepower, which rivals the performance of some of the latest tablets and smartphones, enabling the i.MX 6 to render ultra-realistic graphics and connect to app stores to play the latest games and display 3D UIs4. With this built-in capability and performance, ecosystem partners like QNX, Green Hills, Adeneo, Mentor Graphics, Rightware, Elektrobit, and others are optimizing their operating systems, middleware, and applications to efficiently run the full feature set of the i.MX 6 GPU. The Freescale development platforms also have BSPs (board support packages) for Android and Linux to aid in the development of platforms in similar markets.
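To put the 200 million triangles per second figure in perspective, it can be turned into a per-frame budget. This is back-of-the-envelope arithmetic only: the frame rates are assumptions, and a peak rate like this is an upper bound that real scenes with shading, texturing, and overdraw will not sustain.

```python
PEAK_TRIANGLES_PER_SECOND = 200_000_000  # figure quoted in the text


def triangles_per_frame(fps, peak=PEAK_TRIANGLES_PER_SECOND):
    """Upper bound on triangles available to each frame at a given frame rate."""
    return peak // fps


budget_60 = triangles_per_frame(60)  # assumed 60 FPS HMI target
budget_30 = triangles_per_frame(30)  # assumed 30 FPS target
```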
The OpenCL support currently focuses on accelerating Embedded (Computer) Vision applications that rely on camera inputs for ADAS. Some example applications where OEMs are analyzing GPU OpenCL performance are:
- Feature Extraction – this is vital to many vision algorithms since image “interest points” and descriptors need to be created so the GPU knows what to process. SURF (Speeded Up Robust Features) and SIFT are examples of algorithms that can be parallelized effectively on the GPU. Object recognition and sign recognition are forms of this application.
- Image filtering with different kernel sizes to enhance images.
- Integral image for image acquisition can be spread across multiple i.MX 6 GPU shaders to cut down calculation time and parallelize execution.
- Resampling – the GPU can use texture sampling to perform bilinear or bicubic filtering.
- Point Cloud Processing – includes feature extraction to create 3D images to detect shapes and segment objects in a cluttered image. Uses could include adding augmented reality to street view maps.
- Line detection – uses Sobel or Canny edge detection to create edge maps of the input image, followed by a Hough Transform to detect lines in those edge maps. This can be used for lane detection.
- Pedestrian Detection – uses Histogram of Oriented Gradients (HOG) to detect a person and automatically brake the car if the driver does not react in time.
- Face recognition – goes through face landmark localization (ex. Haar feature classifiers), face feature extraction, and face feature classification. Another use could be eye recognition to detect drowsiness and keep the vehicle within its lane.
- Hand gesture recognition – separates hand from background (ex. color space conversion to the HSV color space) and then performs structural analysis on the hand to process motion.
- Camera image de-warping – GPU performs matrix multiplications to map wide-angle camera inputs onto a flat screen so images are corrected. OEMs can use different camera vendors and to de-warp images they would only need to use different camera coefficients making the GPU easy to program.
- Blind-spot detection – cameras can be used for blind spot detection using OpenCL to process stereo images. In this case, two cameras are needed per blind spot to detect depth so the GPU knows how far/close the other car is.
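Two items from the list above, the integral image and filtering with different kernel sizes, fit together and can be sketched in a few lines. This is plain Python for clarity, not OpenCL, and the function names are hypothetical. Once the integral image is built, the sum over any rectangle, and therefore a box filter of any size, costs four lookups. On a GPU the row and column prefix sums would typically be computed in parallel; the loops here are serial.

```python
def integral_image(img):
    """Return a summed-area table with one extra row and column of zeros,
    so that ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]                       # running sum along this row
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum  # add the column total above
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, from four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
ii = integral_image(img)
center_2x2 = box_sum(ii, 1, 1, 3, 3)  # 6 + 7 + 10 + 11
whole = box_sum(ii, 0, 0, 4, 4)       # every pixel
```

Dividing `box_sum` by the window area gives a mean filter whose cost per pixel is the same for a 3×3 window as for a 31×31 one.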
The applications listed above are examples of where OEMs are looking at using OpenCL on the GPU to speed up ADAS. There are many exciting
Background information: GPU vs. CPU for processing OpenCL
The best approach is to use a hybrid/heterogeneous platform (ex. HSA) to accelerate applications: the CPU for task parallelism and serial computations, and the GPU for data parallel processing.
By Benson Tao (Vivante Corporation)
The Heterogeneous System Architecture (HSA) Foundation is a not-for-profit consortium that brings together some of the best minds (and companies) across the mobile, PC, consumer, HPC, Compute/Vision industries, along with leading academic institutions and anyone that wants to join in on the fun. The goal of HSA is to create a single architecture specification and standard programming interface (API) that developers can easily adopt to optimize distributed workloads across the GPU, CPU, DSP, and any other compute fabric element on the platform. From a high level view, the platform or system (with all the different components) can be viewed as one large, unified processor that executes a given workload. The main goal is to get the biggest bang for the buck or operational efficiency that includes the highest computational throughput (performance) at the lowest power and thermal envelope. Industry participants in HSA include SoC vendors, IP providers, OEMs, OSVs, and a full range of ISVs and application developers that want to make the best use of platform capabilities.
Vivante Contributes to Platform Innovation
Vivante joined HSA Foundation with the intention of pushing forward a defined specification that advances GPU Compute technologies in mobile, embedded, and consumer platforms. Many of our new and existing customers look to us for guidance on ways to improve their existing platforms and problems they are “stuck” on. Improvements can be as minor as performance gains, reduced BOM (or silicon) costs, and power savings, to re-architecting their designs (through GPU programmability) to fit new use cases and applications so they can extend product lifecycles without incurring major financial costs to replace/upgrade the existing infrastructure. These are some of the ways Vivante looks at defining solutions and future-proofing GPU/GPGPU IP cores to help its customers.
Vivante has multiple products targeting hybrid platforms, from mass market cores that have the smallest silicon footprint with OpenGL ES 3.0 and OpenCL 1.1/RS-FS, to mid range and high performance multi-cluster configurable cores. The GPUs work directly with the CPU through a unified memory system, ACE-Lite™ cache coherency, or a native stream interface that connects directly to various compute fabrics. The Vivante HSA design, like the OpenGL ES graphics stack, supports a unified software and hardware package that provides a single architecture spanning multiple operating systems, platforms, and GPU cores. Vivante HSA software will also be backwards compatible with all existing compute-enabled products and built around HSA APIs and tools that complement our current OpenCL™ and Google Renderscript™/Filterscript support. By simplifying the lives of application developers targeting heterogeneous architectures, programmers can create breakthrough use cases that take advantage of the new paradigm shift to hybrid computing. Real world applications that are already being accelerated by Vivante cores include computer vision, image processing, augmented reality, sensor fusion, and motion processing, with some examples being in the automotive ADAS sector (Advanced Driver Assistance Systems).
HSA Releases Ver. 0.95 of the Highly Anticipated Programmer’s Reference Manual (PRM)
The hard labor of many technical discussions and architecture meetings over the last year, since the consortium’s founding in June 2012, has finally borne fruit with the release of version 0.95 of the PRM. This manual is a major milestone and lays the foundation for HSA to successfully move forward as it continues defining the platform of the future. The PRM also gives developers an early start as ecosystem partners create amazing applications, tools, libraries, and middleware programs that work best on HSA certified products.
Some features highlighted in the specification include:
1) Shared Coherent (Virtual) Memory Models
3) User Mode and GPU Queuing
4) Zero Copy
5) Low Latency Dispatch
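The idea behind zero copy can be illustrated with an analogy in plain Python. This is not HSA code and uses no HSA API; it only contrasts handing a consumer a copy of a buffer with handing it a view onto the same memory, using the standard `bytearray` and `memoryview` types. The `frame` buffer is a hypothetical stand-in for data shared between a CPU and a GPU.

```python
frame = bytearray(range(16))   # stands in for a shared camera/frame buffer

copied = bytes(frame[4:8])     # slicing then converting makes a private copy
view = memoryview(frame)[4:8]  # a view refers to the same underlying memory

frame[4] = 255                 # the producer updates the buffer in place

copy_sees_update = copied[0] == 255  # False: the copy is stale
view_sees_update = view[0] == 255    # True: nothing had to be re-sent

view[1] = 7                    # writes through the view land in the buffer too
```

With a copy, every update has to be transferred again before the other side can see it; with shared memory, both sides work on the same bytes, which is the cost that a zero copy design aims to remove.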
The specification also includes HSAIL (HSA Intermediate Language), which abstracts away from the native instruction set of the hardware and can be compiled automatically, in real-time, to the native ISA of the underlying hardware without any developer involvement. The same OpenCL and Renderscript/Filterscript programs can be abstracted and run on HSA platforms also.
Link to HSA Foundation website: http://hsafoundation.com/
Link to HSA Foundation press release: http://hsafoundation.com/hsa-foundation-announces-first-specification/