Face restoration aims to improve the quality of facial images by removing complex degradations and enhancing details. It is an inherently ill-posed task, because a single low-quality input can correspond to many plausible high-quality counterparts scattered throughout the high-quality image space. Moreover, most advanced face restoration methods cannot guarantee identity consistency.
Recently, some studies have used high-quality reference images of the same identity to enhance identity consistency in restored facial images. However, feature-alignment-based methods yield low restoration quality when the features are not well aligned, and diffusion-based fine-tuning methods typically require 5~20 reference images and are significantly affected by their quality.
In this paper, we propose FaceMe, a fine-tuning-free personalized blind face restoration method based on a diffusion model. Given a low-quality input and one or a few high-quality reference images of the same identity, FaceMe restores high-quality facial images while maintaining identity consistency, within seconds. Remarkably, changing identities does not require fine-tuning, and the reference images may have arbitrary pose, expression, or illumination. Furthermore, the quality of the reference images does not significantly affect the quality of the restored image. To our knowledge, this is the first approach that leverages a diffusion prior for personalized face restoration without requiring fine-tuning when changing identity.
Let $X$, $Y$, $D$, and $G$ denote the degraded facial image, the corresponding high-quality facial image, the degradation function, and the generation function, respectively. The objective of personalized face restoration is to generate $\hat{Y} = G(X \mid \mathrm{Ref})$ while satisfying the following three constraints: Consistency: $D(\hat{Y}) \equiv X$; Realness: $\hat{Y} \sim q(Y)$; Identity consistency: $\hat{Y} \sim \mathrm{ID}(Y)$, where $q(Y)$ denotes the distribution of high-quality facial images, and $\mathrm{Ref}$ and $\mathrm{ID}(Y)$ denote the reference images and the distribution of high-quality facial images with the same identity as $Y$, respectively.
Identity encoder. We combine the CLIP [2] image encoder $\epsilon$ and the ArcFace [1] facial recognition module $\psi$ to extract identity features from facial images. We use MLPs to align the two feature spaces and then merge them with MLPs, yielding $s_i \in \mathbb{R}^d$, where $i$ indexes the $N$ reference images and $d$ is the cross-attention dimension of the diffusion model.
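As a concrete illustration, the following is a minimal PyTorch sketch of such an identity encoder; the feature dimensions (1024 for CLIP, 512 for ArcFace) and the MLP shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Sketch: fuse CLIP image features and ArcFace embeddings with MLPs
    into a single identity token s_i of dimension d (the cross-attention
    dimension of the diffusion model)."""

    def __init__(self, clip_dim=1024, arcface_dim=512, d=2048):
        super().__init__()
        # MLPs that align each feature space to dimension d
        self.align_clip = nn.Sequential(nn.Linear(clip_dim, d), nn.GELU(), nn.Linear(d, d))
        self.align_arc = nn.Sequential(nn.Linear(arcface_dim, d), nn.GELU(), nn.Linear(d, d))
        # MLP that merges the two aligned features
        self.merge = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, clip_feat, arc_feat):
        # clip_feat: (N, clip_dim) from the CLIP image encoder epsilon
        # arc_feat:  (N, arcface_dim) from the ArcFace module psi
        f_clip = self.align_clip(clip_feat)
        f_arc = self.align_arc(arc_feat)
        s = self.merge(torch.cat([f_clip, f_arc], dim=-1))  # (N, d): one token per reference image
        return s
```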
Figure 1. Overview of the proposed FaceMe (left) and the training data construction pipeline (right).
Combining and replacing. We concatenate all $s_i$ into $s$. We use a simple prompt, "a photo of face.", during both training and testing. Let $c_{\text{text}} = \{e_1, \dots, e_5\}$ denote the embedding of this text. We then replace $e_4$, the embedding corresponding to "face", with $s$, yielding $c_{\text{id}} = \{e_1, e_2, e_3, s, e_5\}$.
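A minimal sketch of this combine-and-replace step is shown below; the 1-based token position of "face" and the absence of start, end, and padding tokens are simplifying assumptions, since a real text encoder's tokenizer also emits those special tokens.

```python
import torch

def build_identity_embedding(ctext, s, face_token_index=4):
    """Sketch: ctext is the text embedding of "a photo of face." with shape
    (5, d); s stacks the N reference tokens s_i with shape (N, d). The token
    e_4 corresponding to "face" is replaced by s, giving c_id."""
    before = ctext[: face_token_index - 1]      # e_1, e_2, e_3
    after = ctext[face_token_index:]            # e_5
    cid = torch.cat([before, s, after], dim=0)  # {e_1, e_2, e_3, s, e_5}
    return cid
```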
Training strategy. The model consists of two trainable modules, i.e., ControlNet and the ID encoder (Identity Encoder in Fig. 1). We propose a two-stage training strategy. In training stage I, we train ControlNet and the ID encoder simultaneously but only keep the ID encoder's weights. In training stage II, we fix the ID encoder and train only ControlNet. During this stage, we randomly replace the identity embedding $c_{\text{id}}$ with the non-identity embedding $c_{\text{text}}$ with 50% probability.
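A minimal sketch of this two-stage schedule, assuming the ControlNet and ID encoder are standard PyTorch modules:

```python
import torch

def configure_stage(controlnet, id_encoder, stage):
    """Sketch of the two-stage schedule: stage I trains ControlNet and the
    ID encoder jointly (only the ID encoder weights are kept afterwards);
    stage II freezes the ID encoder and trains ControlNet alone."""
    for p in id_encoder.parameters():
        p.requires_grad_(stage == 1)
    for p in controlnet.parameters():
        p.requires_grad_(True)

def sample_condition(ctext, cid, p_replace=0.5):
    """During stage II, swap the identity embedding c_id for the plain text
    embedding c_text with 50% probability."""
    return ctext if torch.rand(()) < p_replace else cid
```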
Inference strategy. We embed the low-quality input directly into the initial random Gaussian noise according to the training noise scheduler. We use classifier-free guidance (CFG) [3] for personalized guidance. In addition, to mitigate possible color shift, we apply wavelet-based color correction to the final result.
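The inference-time conditioning can be sketched as follows, assuming a diffusers-style `scheduler.add_noise` interface and a generic callable `denoise` in place of the ControlNet-augmented SDXL UNet; the guidance scale is an illustrative value, not the paper's setting.

```python
import torch

def init_latent(scheduler, z_lq, t_start):
    """Embed the low-quality latent z_lq into the initial noise according to
    the training noise scheduler, so sampling starts from a noised LQ latent
    rather than from pure Gaussian noise."""
    noise = torch.randn_like(z_lq)
    return scheduler.add_noise(z_lq, noise, t_start)

def cfg_noise(denoise, x_t, t, cid, ctext, guidance_scale=2.0):
    """Classifier-free guidance for personalized restoration: push the
    identity-conditioned prediction away from the non-identity (text-only)
    prediction."""
    eps_id = denoise(x_t, t, cid)      # identity-conditioned prediction
    eps_text = denoise(x_t, t, ctext)  # non-identity prediction
    return eps_text + guidance_scale * (eps_id - eps_text)
```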
To our knowledge, no publicly available facial dataset can support training diffusion models with multiple reference images of the same identity. In this study, we therefore employ synthetic facial images as reference images to construct our training data pool.
We synthesize multiple reference facial images with the same identity as a given facial image using Arc2Face equipped with ControlNet. Given a reference image and a pose image, Arc2Face can synthesize facial images that preserve the identity of the reference image while following the given pose.
Pose reference data pool. We extract a pose attribute and an expression attribute for each image in FFHQ. We then run K-Means clustering on these images based on the two attributes, obtaining $c_1$ and $c_2$ cluster centers respectively, which yields $c_1 \times c_2$ disjoint subsets that form the pool.
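A minimal scikit-learn sketch of this pool construction, assuming per-image pose and expression attributes have already been extracted as feature arrays; the cluster counts $c_1 = c_2 = 8$ are illustrative, not the paper's values.

```python
from sklearn.cluster import KMeans

def build_pose_pool(pose_attrs, expr_attrs, c1=8, c2=8, seed=0):
    """Sketch: cluster FFHQ images by pose (c1 centers) and by expression
    (c2 centers) separately, then group images by the pair of cluster labels,
    giving up to c1*c2 disjoint subsets.
    pose_attrs: (num_images, pose_dim), expr_attrs: (num_images, expr_dim)."""
    pose_labels = KMeans(n_clusters=c1, n_init=10, random_state=seed).fit_predict(pose_attrs)
    expr_labels = KMeans(n_clusters=c2, n_init=10, random_state=seed).fit_predict(expr_attrs)
    pool = {}
    for idx, (p, e) in enumerate(zip(pose_labels, expr_labels)):
        pool.setdefault((p, e), []).append(idx)  # image indices per (pose, expression) subset
    return pool
```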
Same identity. For each image, we randomly sample a pose image from a random subset of the pool and feed both into Arc2Face to obtain one reference image. We then measure the identity similarity between the input image and the generated reference. If the similarity falls below $\delta$, we re-sample a pose image and regenerate. If no acceptable result is obtained after 3 attempts, we stop generating references for that image. We name the reference images synthesized for the FFHQ dataset FFHQRef.
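The synthesis loop can be sketched as follows; `arc2face` and `id_similarity` are placeholders for the Arc2Face (with ControlNet) generation call and the ArcFace-based similarity measure, respectively.

```python
import random

def synthesize_reference(image, pool, arc2face, id_similarity, delta, max_attempts=3):
    """Sketch: sample a pose image from a random subset of the pool, generate
    a reference with Arc2Face, and accept it only if its identity similarity
    to the input exceeds delta; retry up to 3 times, otherwise give up."""
    for _ in range(max_attempts):
        subset = random.choice(list(pool.values()))
        pose_image = random.choice(subset)
        ref = arc2face(image, pose_image)
        if id_similarity(image, ref) >= delta:
            return ref
    return None  # no acceptable reference after 3 attempts
```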
Table 1. Quantitative comparison. The bold numbers represent the best performance.
Training datasets. Our training set consists of the FFHQ dataset and our synthesized FFHQRef dataset, with all images resized to 512×512. The corresponding degraded images are synthesized using a degradation model (see our paper).
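The sketch below implements the classical blind-face-restoration degradation pipeline (Gaussian blur, downsampling, additive noise, JPEG compression) as an illustration only; the actual degradation model and its parameter ranges used by FaceMe are specified in the paper, and the values here are not the paper's.

```python
import cv2
import numpy as np

def degrade(hq, sigma=3.0, scale=4, noise_std=10.0, jpeg_q=60):
    """Illustrative degradation: blur with a Gaussian kernel, downsample,
    add Gaussian noise, apply JPEG compression, then resize back to the
    original resolution. `hq` is a uint8 BGR image (e.g., 512x512)."""
    h, w = hq.shape[:2]
    x = cv2.GaussianBlur(hq, (0, 0), sigma)                                       # blur
    x = cv2.resize(x, (w // scale, h // scale), interpolation=cv2.INTER_LINEAR)   # downsample
    x = np.clip(x + np.random.normal(0, noise_std, x.shape), 0, 255).astype(np.uint8)  # noise
    _, enc = cv2.imencode(".jpg", x, [cv2.IMWRITE_JPEG_QUALITY, jpeg_q])          # JPEG compression
    x = cv2.imdecode(enc, cv2.IMREAD_COLOR)
    return cv2.resize(x, (w, h), interpolation=cv2.INTER_LINEAR)                  # LQ input at full size
```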
Implementation details. We employ the SDXL model stable-diffusion-xl-base-1.0 as our base diffusion model and the CLIP image encoder as part of our identity encoder, both of which are fine-tuned by PhotoMaker. We use the Adam optimizer with a learning rate of $5\times10^{-5}$ for both training stages. Training is implemented in PyTorch and conducted on eight A40 GPUs with a batch size of 4 per GPU. The two training stages run for 130K and 210K iterations, respectively.
Testing datasets. We use one synthetic dataset, CelebRef-HQ, and three real-world datasets, LFW-Test, WebPhoto-Test, and WIDER-Test, for testing. For the synthetic dataset, we randomly select 150 identities and one image per identity as ground truth, using 1~4 images of the same identity as reference images. For the real-world datasets, which lack reference images of the same identity, we first use a face restoration method, i.e., CodeFormer, to restore the low-quality input; the restored images are then fed into Arc2Face to generate reference images.
Tab. 1 shows the performance of FaceMe on the synthetic CelebRef-HQ dataset. Our method achieves the best performance in PSNR, FID, LMD, and IDS, and the second-best in LPIPS. It is particularly worth noting the significant improvement in IDS, which demonstrates FaceMe's personalization ability.
As presented in Tab. 1, FaceMe achieves the best FID score on the LFW-Test and WIDER-Test datasets. LFW-Test and WIDER-Test are mildly and heavily degraded real-world datasets, respectively. The strong performance on both indicates that FaceMe adapts to complex real-world degradation scenarios, demonstrating exceptional robustness.
The visualization results are shown in Fig. 2. The compared methods either produce many artifacts in the restored images or restore high-quality images but fail to maintain identity consistency with the ground truth. In contrast, our FaceMe restores high-quality images while preserving identity consistency.
Figure 2. Qualitative comparison on CelebRef-HQ.
Figure 3. Qualitative comparison on real-world faces.
Fig. 3 shows the visual comparisons of different methods. It is clear that, compared to state-of-the-art methods, our method can handle more complex scenes and restore high-quality images without introducing unpleasant artifacts.
We propose a method to address the issue of identity shift in blind face image restoration. Built on a diffusion model, our method uses identity-related features extracted by an identity encoder to guide the diffusion model in recovering face images with consistent identities. It supports any number of reference images by simply combining their identity-related features. In addition, the strong robustness of the identity encoder allows us to use synthetic images as reference images during training. Moreover, our method does not require fine-tuning when changing identities. Experimental results demonstrate the superiority and effectiveness of our method.
https://arxiv.org/abs/2501.05177
[1] Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 4690–4699.
[2] Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML, 8748–8763. PMLR.
[3] Ho, J.; and Salimans, T. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.