In recent years, rapid advances in realistic image and video synthesis have opened the floodgates for applications in numerous areas. One of the best-known examples is digital humans, which have already started to transform the way we create and communicate, and may in the future even find their way into the special effects industry.
We call such an artificial character an avatar of a particular person. The defining characteristic of an avatar is that a user can impose arbitrary motion on a static image or a 3D model. By this definition, the computer graphics models of humans used in movies are also avatars, albeit ones with a very costly creation and animation process.
Figure 1. Examples of head avatars produced by our model [5]. Each avatar is obtained from a single source image and animated using keypoints.
Modeling the human appearance is one of the most well-studied problems in computer graphics. One of its prime applications has always been special effects, with computer-generated humans now regularly appearing in feature films.
To create these scenes, special effects artists use 3D mesh models, a versatile representation that can be placed into 3D scenes, relit, and even used for physics simulations. However, the price of that versatility is a lengthy creation process, which typically requires days or even weeks of highly trained professional labor. It is true that modern techniques, like photogrammetry and motion capture, have significantly sped up the design and animation process compared to the early years of computer-generated imagery. However, creating these models, especially for actors from the past, may still be prohibitively costly, with a price tag that only major blockbuster movie or video game studios can afford. This price stems from the need for specialized and expensive equipment, as well as a large team of special effects artists, since many steps of the avatar creation process are still not automated.
The labor intensiveness of creating computer-generated humans is also associated with the so-called Uncanny Valley hypothesis [1]. It states that people prefer clearly unrealistic depictions of humans to imperfect realistic ones. A lack of facial expressiveness, a glassy look in the eyes, or a plastic skin texture can make synthetic images of humans uncanny or even revolting. This effect is purely psychological and, unfortunately, is even more pronounced in videos. Therefore, artists spend a lot of time and effort designing such characters to make them as lifelike as possible. The Uncanny Valley thus remains a significant roadblock on the way to genuinely realistic virtual humans.
Also, while full-body modeling is the ultimate goal of all avatar systems, solving this problem for the head separately from the rest of the body makes a lot of practical sense. Human heads have high photometric, geometric, and kinematic complexity, which mainly stems from the need to model the mouth cavity and hair. These areas have always been challenging for traditional mesh-based modeling. Another complicating factor is the sensitivity of the human visual system to even minor mistakes in the appearance modeling of humans, especially their heads, as discussed earlier. Therefore, in our research, we decided to focus on the problem of modeling talking heads. These models are quite useful on their own since they already allow us to build communication systems. Moreover, in contrast to mesh-based approaches, they are comparatively easy to obtain using novel techniques that apply neural networks to image synthesis.
The main appeal of approaches based on neural networks and computer vision is their ability to generalize, given enough training data. Some models are even able to work with (arguably) the hardest type of data: in-the-wild images, which are taken with arbitrary cameras, under unknown lighting conditions, and at potentially poor quality. This allows such methods to utilize the large amounts of image and video data available online instead of relying on scarce data obtained in controlled environments. It also makes these models easier to use, since the data they require can be captured with the camera of any modern smartphone.
A lot of previous work [2] in computer vision has been devoted to statistical modeling of the appearance of human faces using mesh-based models, with remarkably good results [3] obtained both with classical techniques and, more recently, with neural networks [4]. While face modeling is highly related to talking head modeling, the two tasks are not identical. The latter also involves modeling non-face parts such as hair, neck, mouth cavity, and often shoulders or upper garment. These non-face parts cannot be handled by a trivial extension of face modeling methods, since they are much less amenable to registration and often have higher variability and complexity than the face itself. In principle, the results of face or lip modeling can be stitched into an existing head video. Such a design, however, does not allow complete control over the head rotation in the resulting video and therefore does not yield a fully fledged talking head system.
The design of our first system [5], which we developed in 2019, borrows a lot from the progress in generative modeling of images. In recent years, alternative approaches to creating digital humans started to emerge, which directly synthesize images with neural networks. Advances in so-called generative adversarial networks (GANs) allowed these networks to produce highly realistic portrait images, close to being indistinguishable from real photos. Furthermore, the same systems could work across multiple diverse domains and synthesize natural scenery, cats, dogs, and human faces.
Figure 2. Example pipeline of motion transfer using a direct synthesis system [6]. Keypoints are detected in a source video, and the system then directly produces an image of the posed avatar. (Source.)
These advancements later led to the development of models [6] that could train an avatar of a person, given multiple posed images and enough training time. These systems typically used keypoints, encoding face or body parts, to guide the animation. The drawback most of them shared was the need for large amounts of data (hundreds of frames per new person) and a lengthy training process, which could take hours on modern graphics processing units (GPUs).
Figure 3. The scheme of our approach. The few available images are first fed into the embedder and encoded into adaptive parameters of the generator. Then, keypoints (landmarks) are processed by the generator network to produce an output image. We use perceptual and GAN-based losses, as well as auxiliary objectives that help us improve fine-tuning (these are depicted in the rest of the scheme). (Source of the images.)
To alleviate that issue while keeping a high degree of personalization of the avatars, we employed a simplified meta-learning strategy for training. During training, we showed the neural network a small sample of images of the same person and asked it to recreate an image of that individual, given only their pose encoded as facial keypoints. We used modulation via adaptive instance normalization layers to initialize some of the weights of our network in a way that yields the best possible quality. Then, at test time, we additionally fine-tune all the weights of the generator using all available images. The key insight of our work was that the generator network did not lose its generalization capabilities after fine-tuning on a tiny set of images, while the quality of the results improved.
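To make the modulation idea concrete, below is a minimal PyTorch-style sketch of an adaptive instance normalization layer driven by a person embedding. The class names, shapes, and block structure are illustrative assumptions for exposition and do not correspond to the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaIN(nn.Module):
    """Normalizes activations per channel, then applies a scale and bias
    predicted from the person embedding (illustrative sketch)."""

    def __init__(self, num_channels, embed_dim):
        super().__init__()
        # Maps the person embedding to per-channel (scale, bias) pairs.
        self.affine = nn.Linear(embed_dim, num_channels * 2)

    def forward(self, x, person_embedding):
        # x: (B, C, H, W) feature map; person_embedding: (B, embed_dim)
        scale, bias = self.affine(person_embedding).chunk(2, dim=1)
        x = F.instance_norm(x)  # zero-mean, unit-variance per channel
        return x * (1 + scale[..., None, None]) + bias[..., None, None]


class GeneratorBlock(nn.Module):
    """One generator block whose normalization is driven by the embedder
    output instead of learned constants."""

    def __init__(self, channels, embed_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain = AdaIN(channels, embed_dim)

    def forward(self, x, person_embedding):
        return torch.relu(self.adain(self.conv(x), person_embedding))
```

At fine-tuning time, these predicted scales and biases, together with the rest of the generator weights, simply become trainable parameters initialized from the embedder's output.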
The results we obtained encouraged us to continue our research, but the model still had multiple problems. The quality was still poor, inference was slow, and we had to perform fine-tuning, which limited the devices the model could run on to desktops or servers with a dedicated GPU. While the fine-tuning scheme was a really cheap way to improve the quality of our system, we had to come up with a model that would perform well without it.
Our main goal was to develop a neural network that would run on mobile devices, specifically Samsung devices. At that time, the fastest available framework for running such models was the Snapdragon Neural Processing Engine (SNPE) [7]. While extremely efficient, it supported a very limited set of layers, the building blocks of neural networks, and there was no way to add any additional layers that our models might require. Competitor frameworks, like TFLite [8], while more flexible, were significantly slower in our measurements (up to two times slower than SNPE).
Therefore, we decided to design a new architecture that would allow efficient real-time inference under the constraints mentioned above. This meant that we had to rule out some of the popular competing models, like the one by Siarohin et al. [9], and come up with a different pipeline.
The key approach to speeding up inference, and what differentiates us from the competitors, was to allocate as much computation as possible to the initialization of our model. For us, it is acceptable to spend ten seconds on the avatar creation process if that reduces the per-frame inference time. We employed two ideas to achieve this. First, we split the single generator network into two parts: one that is computationally expensive and runs once per avatar, and another that is lightweight and runs once per frame. Second, we employ an iterative strategy for updating the parameters of our model at test time, inspired by the learned gradient descent [10] method. Essentially, we still want to do fine-tuning, but in a way that can be ported to mobile devices.
Figure 4. In our Bi-Layer Model [11], the input image is first encoded by the embedder into a set of adaptive parameters for both generators. We then predict a high-frequency "texture" via a heavy texture generator network, which concludes the person-specific initialization. For inference, we do a forward pass through the inference generator to obtain a pose-specific warping of the texture and a low-frequency component of the output. The final image is obtained by a simple summation of the two components. (Source of the images.)
In order to split the generator, we also split the output image into two layers: low-frequency and high-frequency. The low-frequency layer is predicted directly by a lightweight generator, while the high-frequency layer is obtained by warping the output of a computationally heavy generator. The lightweight (inference) generator is driven by the output pose, while the heavy generator is pose-agnostic. We call the output of the heavy pose-agnostic network a "texture", and the heavy generator a "texture generator."
Figure 5. Detailed steps of the generation process. (Source of the images.)
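The sketch below illustrates how this per-avatar versus per-frame split can be organized: the heavy texture generator runs once during initialization, while the lightweight inference generator predicts a warp field and a low-frequency layer for every frame, and the two layers are summed. The function and module interfaces (embedder, texture_generator, inference_generator) are assumptions for exposition, not the actual released code.

```python
import torch.nn.functional as F

def create_avatar(source_image, embedder, texture_generator):
    """Person-specific initialization: expensive, but run only once per avatar."""
    adaptive_params = embedder(source_image)      # person-specific parameters
    texture = texture_generator(adaptive_params)  # high-frequency "texture", e.g. (1, 3, H, W)
    return adaptive_params, texture


def render_frame(keypoints, adaptive_params, texture, inference_generator):
    """Per-frame synthesis: lightweight enough for real-time mobile inference."""
    # The inference generator maps the driving pose (keypoints) to a dense
    # warp field for the texture and a low-frequency color layer.
    warp_field, low_freq = inference_generator(keypoints, adaptive_params)
    # warp_field: (1, H, W, 2) sampling grid in normalized [-1, 1] coordinates.
    high_freq = F.grid_sample(texture, warp_field, align_corners=False)
    # The final image is a simple sum of the low- and high-frequency layers.
    return low_freq + high_freq
```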
After training the base model, we also obtain a network that performs iterative texture enhancement. It works in a feedforward way, and we train it using an unrolling technique ("backpropagation through time"): during training, we backpropagate through all enhancement iterations. Then, at inference time, we can use it to enhance the texture estimate produced by the texture generator.
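Below is a minimal sketch of how such an updater can be trained with unrolling: a fixed number of enhancement steps is applied to the texture estimate, a loss is accumulated at every step, and gradients flow back through all iterations. The updater, renderer, and loss interfaces, as well as the number of steps, are illustrative assumptions.

```python
import torch

# Sketch of unrolled ("backpropagation through time") training of the texture
# enhancement network. `updater`, `render_fn`, and `loss_fn` are placeholder
# interfaces; NUM_STEPS is an arbitrary example value.

NUM_STEPS = 3

def unrolled_enhancement_loss(texture, updater, render_fn, target_frame, loss_fn):
    losses = []
    for _ in range(NUM_STEPS):
        # Each step proposes a residual correction to the current texture
        # estimate, similar in spirit to learned gradient descent.
        texture = texture + updater(texture)
        rendered = render_fn(texture)
        losses.append(loss_fn(rendered, target_frame))
    # Summing the per-step losses and backpropagating through the whole loop
    # trains the updater through all unrolled iterations at once.
    return torch.stack(losses).sum(), texture
```

At inference time, the same updater is simply applied for a few steps to the texture produced by the texture generator, without computing any gradients.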
Figure 6. Animation results. (Source of the images.)
This method allowed us to build a system capable of creating and rendering realistic avatars on devices with a Snapdragon chip, and, thanks to the simplicity of its components, the approach can, in theory, be ported to other frameworks. Rendering took 42 ms per frame on average, as measured on an Adreno 640 GPU in FP16 mode using the SNPE benchmarking tool.
Our Neural Talking Heads [5] was the first system that could initialize from just a few images and produce realistic-looking avatars. We achieved that using the proposed fine-tuning pipeline and a simplified meta-learning scheme. Our second work [11] proposed a simpler and more computationally efficient architecture than its competitors, which allowed realistic avatars to run on Samsung mobile devices in close-to-real-time settings. We hope that, in the future, these approaches will find their way into more mobile applications, or even products, that require real-time human head synthesis.
References
[1] M. Mori, "The uncanny valley", Energy, 1970.
[2] V. Blanz et al., "A Morphable Model for the Synthesis of 3D Faces", SIGGRAPH, 1999.
[3] J. Thies et al., "Face2Face: Real-Time Face Capture and Reenactment of RGB Videos", CVPR, 2016.
[4] S. Lombardi et al., "Deep Appearance Models for Face Rendering", ACM Transactions on Graphics, 2018.
[5] E. Zakharov et al., "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models", ICCV, 2019.
[6] C. Chan et al., "Everybody Dance Now", ICCV, 2019.
[7] Snapdragon Neural Processing Engine (SNPE), https://developer.qualcomm.com/docs/snpe/overview.html
[8] TensorFlow Lite, https://www.tensorflow.org/lite
[9] A. Siarohin et al., "First Order Motion Model for Image Animation", NeurIPS, 2019.
[10] M. Andrychowicz et al., "Learning to Learn by Gradient Descent by Gradient Descent", NeurIPS, 2016.
[11] E. Zakharov et al., "Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars", ECCV, 2020.