By Benson Tao (Vivante Corporation)
The world is very visual and thanks to the popularity of media processors (GPUs and VPUs), multimedia has become a ubiquitous part of life. Most of the things we interact with have some sort of screen, ranging from the usual consumer products to washing machines, thermostats, and any other product or gadget that needs to be upsold. Stick a screen on something, maybe add a camera, install Android, and you have a device that could be part of your Internet of Things (IoT) which you can probably sell for a premium…before the next product comes out the next day.
As multimedia has become standard across all screened devices, the exchange of media (pictures, images, and videos) has grown explosively. YouTube has 100 hours of video content uploaded every minute! Facebook has 250 million photos uploaded daily and over 100 billion (as of 2011) photos on its servers! Numbers with this many zeroes have never been seen before, and they get larger every day. The rise of visual media content is great: you can find clips of cats playing pianos, live courtside highlights from the NBA playoffs, and pretty much anything else that catches your attention. The flipside of the story is: how do you search for content without getting lost in the noise? Going through links to find what you want can take a few seconds for simple, straightforward queries, or several minutes to hours for more involved searches. That search time is rising with the massive amounts of data, and getting relevant results fast is what counts. As more searches involve multimedia, text hyperlinks will not be enough, so a new generation of visual search (as the new hyperlink) will become standard. QR codes tried to fill that role, but they have not been successful.
From a high level, visual search (VS) and mobile visual search (MVS) are a way of interacting with the world around you. Those interactions can take the form of video, graphics, or audio. Bits of these technologies are already available; it's just a matter of packaging them together in the most efficient way. For example, there are apps that can "listen" to a song and tell you what it is (Shazam is one example). Google Goggles is also capable of MVS. In general, a device takes a picture and the image (e.g., a JPEG) is sent over the network to a server, which compares it against a database and tries to return an answer. Different methods have been used to reduce transmission and response time. One is to do pre-processing on the smartphone, which narrows the search window to a smaller subset and possibly eliminates server interaction for some use cases.
GPUs in VS/MVS
This is where visual search or mobile visual search comes into play, and where the GPU inside a smartphone, tablet, or TV applications processor has a role in the next generation of search. We have seen PC GPUs play an important role in identifying faces in photos (visual search) using OpenCL, with software from Cyberlink or Corel that can identify and sort photos based on different queries. You can search for all photos with John Doe, and the GPU searches through the image database and returns all photos with John Doe (assuming he doesn't have a hat and sunglasses on). The results are pretty good, though not perfect, but nothing with this much complexity is perfect in the real world.
Fast forward to today…mobile GPUs pack lots of horsepower into die-area-constrained designs, and if that horsepower is not used, it just sits idle and you lose the efficiency (and potential) of what the technology can do. Many groups are researching how to add mobile visual search optimizations to GPUs like those at Vivante, and the push forward will improve implementations in augmented reality (AR), GPU-accelerated MVS, vision processing (which can include automotive, consumer electronics, etc.), and many other applications.
The idea of using the GPU has a few areas of consideration:
- Bandwidth Efficiency – Compression or pre-computations on the device to minimize server-client transmissions. Part of this also involves network latency, and possibly sending a lower quality image if there are transmission limitations.
- Lower Power / Thermals on Server Side – Servers need to compare an image against a database, which is a heavy computational workload.
- Lower Power / Thermals on Client Side – GPUs are fast and efficient when dealing with parallel workloads. Many search algorithms use parallel computations which are best suited for the GPU.
- Higher Quality VS/MVS Results – Search results can scale with the GPU with increased accuracy (since GPUs are good at image/pixel processing).
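To make the bandwidth and parallelism points above a bit more concrete, here is a minimal sketch of the "pre-compute on the device, send something small" idea. It is not the method used by any product or standard mentioned in this article: it uses a toy average-hash descriptor as a stand-in for a real compact descriptor, and every name in it (`average_hash`, `hamming`, `best_match`, the synthetic gradient images) is hypothetical and chosen only for illustration. The per-cell averaging and per-entry comparison are the kind of independent, parallel work a GPU would be suited to, although this sketch runs them serially on the CPU.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels (0..255), row-major


def average_hash(image: Image, grid: int = 8) -> int:
    """Toy descriptor: one bit per grid cell, set if the cell is brighter
    than the mean of all cells. Assumes the image is at least grid x grid."""
    height, width = len(image), len(image[0])
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            # Pixel bounds of this cell; each cell is independent of the others.
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


def best_match(query: int, database: Dict[str, int]) -> Tuple[str, int]:
    """Return (name, distance) of the closest descriptor in the database."""
    scored = ((name, hamming(query, desc)) for name, desc in database.items())
    return min(scored, key=lambda pair: pair[1])


def horizontal_gradient(size: int = 32, offset: int = 0) -> Image:
    """Synthetic test image: brightness increases left to right."""
    return [[min(255, x * 8 + offset) for x in range(size)] for _ in range(size)]


def vertical_gradient(size: int = 32) -> Image:
    """Synthetic test image: brightness increases top to bottom."""
    return [[min(255, y * 8)] * size for y in range(size)]


if __name__ == "__main__":
    # "Server" side: descriptors computed ahead of time for known images.
    database = {
        "horizontal": average_hash(horizontal_gradient()),
        "vertical": average_hash(vertical_gradient()),
    }
    # "Client" side: a slightly brighter capture of the same scene.
    query = average_hash(horizontal_gradient(offset=7))
    print(best_match(query, database))
    # Only the descriptor would cross the network, not the raw pixels.
    print("descriptor bytes:", 64 // 8, "raw pixel bytes:", 32 * 32)
```

The design choice worth noting is that the client sends a fixed-size descriptor instead of the image itself, so transmission cost no longer depends on image resolution. A real system would use a far more robust descriptor than this one, which is only tolerant of uniform brightness changes.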
MPEG Group Visual Search Development
I recently saw a new specification being developed by the folks at MPEG, the Moving Picture Experts Group, related to visual search (Compact Descriptors for Visual Search), and this is an exciting topic where GPU technology can make it shine. If you have not heard of MPEG, they are responsible for creating the video/audio transmission standards used by pretty much whatever video content you watch. Things like MP3 (audio), MPEG-2, MPEG-4, H.264, and HEVC have all come from that consortium. Since they are the 800-pound gorilla when it comes to video, they want to create a standard VS/MVS solution that will be interoperable across any device and platform and efficient (in bandwidth, algorithm computational overhead, etc.). This is great news, since a fragmented solution can break a superb idea, and Khronos is looking at how it can cooperate with MPEG and what role visual search can play in future APIs. This should open the door to exciting new applications on any device.