Images and videos have long been at the forefront of digital media consumption. All aspects of video and image capture, transmission and display have seen leaps of innovation in recent times. Content creation has evolved considerably over the years, for example from 2D to 3D capture on mobile devices, AR/VR capture, point clouds, 3D meshes and generative-AI-based image/video generation. Resolutions have increased manyfold, from SVGA (800x600) to HD, FHD and now beyond UHD. Additionally, display technology has evolved from LCD to OLED to AMOLED, with newer capabilities such as High Dynamic Range (HDR) and Wide Color Gamut (WCG). With these advancements, even the smallest visual artefacts can lead to a poor user experience. The end user of encoded video has also changed over time: use cases have emerged where content is consumed by Machine Learning (ML) algorithms rather than human eyes. A human eye may not notice very low-level details, but an ML algorithm might find them crucial for performing its task.
Video codecs have to adapt to this huge variation in all aspects of media consumption. Traditional video codecs have long relied on statistics and image processing techniques at their core. The success of AI in generic image processing tasks makes AI-powered video codecs a promising proposition. Exploration of such techniques in industry and in standards bodies such as the Moving Picture Experts Group (MPEG) and the Alliance for Open Media (AOM) has shown good evidence of viability, but has also revealed challenges and unsolved technical problems. AI-based pre- and post-processing have already been developed and deployed in many commercial solutions, and with the standardization of JPEG AI (neural-network-based image compression) [1], the prospects of AI in video compression look optimistic.
The popularity of a video codec relies heavily on its global standardization. A common set of rules for interpreting a compressed media bitstream is agreed through rigorous collaborative development and consensus among global stakeholders. As a result, standardized video codecs have seen mass adoption and usage across the industry. H.264/AVC and HEVC are two of the most widely used codecs in video compression applications.
Video codecs have evolved from the days of H.261 to the most recent VVC to cater to new kinds of content, richer user experiences and new display devices, as shown in Fig. 1. The earliest video codecs were limited to handling 2D media. Over the years, content such as AR/VR, screen content, point clouds and multi-camera capture has given rise to newer, more robust codecs with a wide variety of tools. Added to this is the ever-increasing rate of video consumption: according to estimates [2], video accounts for roughly 75% of internet traffic. With the advent of consumer-centric video delivery platforms, especially short-video apps, there is a growing need to achieve high quality at extremely low bitrates.
Figure 1. Display resolutions, display types and content variation in the recent past
Advancements in related domains also feed into the constant evolution of newer compression methodologies. Quality consideration for bit-budget allocation is an important factor in any compression method. Some quality metrics can even highlight inefficiencies in subjective quality, motivating coding tools that improve perceptual video quality. Increased compute power and advancements in hardware over the years have also enabled experimentation with and implementation of coding tools of higher complexity.
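To ground this, the snippet below computes PSNR, the classical objective quality metric, between two frames; the `psnr` helper is an illustrative sketch, not taken from any codec implementation. PSNR's blind spot is exactly what the preceding point alludes to: two distortions with identical PSNR can differ widely in perceived quality, which is why perceptual metrics such as SSIM and VMAF increasingly guide bit-budget allocation.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two 8-bit frames (higher is better)."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Uniform noise and localized blocking artefacts can yield the same PSNR
# while looking very different -- the gap that perceptual metrics try to close.
ref = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + np.random.randint(-4, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, noisy):.2f} dB")
```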
Combining AI, which has proven itself in these areas, with traditional compression, which has a proven record of achieving high compression ratios, seems the way forward. The need, the push and the opportunity opened up by deep learning for video coding make it a promising proposition.
The spectrum of deep learning in video codecs can be classified into three main categories, as shown in Fig. 2:
1. AI-based pre- and post-processing, where the codec itself is untouched and neural networks operate around it (e.g., down-scaling before encoding and up-scaling after decoding)
2. Hybrid codecs, where individual modules of a traditional codec are replaced or augmented by neural networks
3. End-to-end neural codecs, where the entire compression pipeline is a learned model
Figure 2. Penetration of Deep Learning in and around Traditional Video Codec
While standardization bodies are still exploring an NN-based codec standard, many in the industry have already started looking for an early advantage. Solutions like AI ScaleNet, which down-scale the input to the encoder and up-scale the output of the decoder using AI methods, are already deployed in consumer use cases to enable high-resolution video calling over the bandwidth of a low-resolution stream. End-to-end codecs with a Variational Auto-Encoder (VAE) at their core are still in early exploratory stages, as they come with significant complexity.
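Conceptually, the down-scale / encode / decode / up-scale pipeline behind such solutions can be sketched as below. This is a minimal illustration with hypothetical stand-in functions, not AI ScaleNet's actual (proprietary) models: the downscaler here is a plain box filter and the upscaler a nearest-neighbour placeholder where the trained super-resolution network would sit.

```python
import numpy as np

def downscale(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Box-filter downscale; in AI ScaleNet-style systems this step may itself be learned."""
    h, w, c = frame.shape
    h, w = h - h % factor, w - w % factor
    return frame[:h, :w].reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def encode_decode(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a conventional codec round trip (e.g., HEVC at a low bitrate)."""
    return frame  # lossless placeholder; a real codec introduces distortion here

def sr_upscale(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbour placeholder for the receiver-side super-resolution network."""
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)

# A 1080p frame travels the network at 540p and is restored on the device.
frame = np.random.randint(0, 256, (1080, 1920, 3)).astype(np.float64)
received = sr_upscale(encode_decode(downscale(frame)))
assert received.shape == frame.shape
```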
Figure 3. Major AI tools being explored in hybrid codecs
Currently, the most promising way of combining AI and video coding is the hybrid approach, wherein a few of the modules of the codec are replaced or augmented with AI. As these methods do not aim to replace the whole codec, they can be kept lightweight, and developers retain better control over and understanding of each module. Some of the technologies being explored using neural networks [3] are shown in Fig. 3.
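As a flavour of the hybrid approach, the sketch below shows a tiny residual CNN acting as a reconstruction filter, one of the tool categories shown in Fig. 3 and explored in [3]. The architecture and layer widths are illustrative assumptions, not taken from any specific JVET proposal.

```python
import torch
import torch.nn as nn

class RestorationCNN(nn.Module):
    """Minimal residual CNN in the spirit of NN-based loop/post filters."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, reconstructed: torch.Tensor) -> torch.Tensor:
        # Predict only the coding residual and add it back, so the network
        # learns the artefact pattern rather than the whole image.
        return reconstructed + self.body(reconstructed)

# Applied to a decoded luma block of shape (batch, channel, height, width):
decoded = torch.rand(1, 1, 64, 64)
filtered = RestorationCNN()(decoded)
```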
Though NN-based video coding seems to be the obvious choice going forward, it has its own challenges. To name a few:
- High computational complexity and memory requirements, which make real-time and on-device deployment difficult
- The lack of robust quality metrics that correlate well with subjective quality
- The dependence of trained models on their training data, which raises generalization concerns
- Hardware-unfriendly designs and the absence, so far, of an established standard
Though these challenges may seem daunting at first, the future seems bright for AI codecs. Recent trends and explorations clearly show a huge opportunity for this technology, with strong pull from both standardization and product perspectives. The standardization of JPEG AI, the development of newer and more robust quality metrics, data-independent models, continuous advancements in hardware capabilities and the hardware-friendly design of newer approaches are great leaps towards solving each of these challenges.
[1] "JPEG AI Common Training and Test Conditions", ISO/IEC JTC 1/SC 29/WG 1, 98th JPEG Meeting, Doc. 100421, Sydney, Australia, Jan. 2023.
[2] "Video streaming to the extreme", Ericsson Mobility Report, https://www.ericsson.com/en/reports-and-papers/mobility-report/articles/streaming-video
[3] E. Alshina, F. Galpin, Y. Li, D. Rusanovskyy, M. Santamaria, J. Ström, R. Chang and Z. Xie, "EE1: Summary report of exploration experiment on neural network-based video coding", JVET-AG0023, Joint Video Experts Team (JVET), Jan. 2023.