Multimodal interaction in talking face generation and analysis of MCR

Introduction

Along with the development of deep learning, the AI industry has attracted substantial investment and enabled advances in applications such as ChatGPT, Gemini, and SegTalker [1]. SegTalker, a model that generates a talking face video from a single image and an audio file, supports broader applications such as digital news reporting and video dubbing by modeling the different modalities adequately. However, although current generative techniques [5, 6, 7, 8, 9, 12] capture lip synchronization well and enable local editing with improved overall texture, they struggle to reflect complex, interactive scenarios. For example, they cannot generate a video of a reporter getting wet in the rain or snow while frowning and describing the current weather. Solving this issue through knowledge fusion in a shared modality space [2, 3, 4] is therefore necessary for realizing virtual reality and digital avatars.

Analysis

In recent years, there has been considerable effort in talking face generation, typically focused on synthesizing photo-realistic video and synchronizing lip movements with audio [5, 6, 7, 8, 9, 22, 23, 24]. However, these methods still require massive training datasets to reach acceptable performance. Furthermore, they cannot handle complicated interactions, such as generating a talking avatar that is startled by thunder and lightning. To address this problem, we can leverage large pretrained multimodal models [10, 11] and combine their abilities to expand the model's representations, inheriting their strengths and enhancing capacity without additional data via the C-MCR method [2]. Indeed, expanding a model's knowledge in this way has been shown to achieve higher performance on audio-visual, audio-text, and visual-text retrieval tasks than existing works [3].
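To make this idea concrete, the sketch below shows what a C-MCR-style connection between two frozen contrastive spaces could look like: CLIP's image-text space [10] and CLAP's audio-text space [11] are bridged through the text modality they share, using small trainable projection heads and a symmetric InfoNCE loss. All names and dimensions here are illustrative assumptions, not the released C-MCR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small trainable MLP that maps a frozen encoder's embedding
    into the shared space (hypothetical sizes)."""
    def __init__(self, in_dim, shared_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_connection_loss(clip_text_emb, clap_text_emb,
                                proj_clip, proj_clap, temperature=0.07):
    """Align the two spaces through their overlapping text modality:
    paired captions encoded by CLIP and CLAP should map to nearby
    points in the shared space (symmetric InfoNCE over the batch)."""
    z_a = proj_clip(clip_text_emb)          # (B, D) from CLIP's text tower
    z_b = proj_clap(clap_text_emb)          # (B, D) from CLAP's text tower
    logits = z_a @ z_b.t() / temperature    # pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with pre-extracted (frozen) embeddings of the same caption batch:
proj_clip, proj_clap = ProjectionHead(512), ProjectionHead(512)
clip_text = torch.randn(8, 512)   # placeholder for CLIP text embeddings
clap_text = torch.randn(8, 512)   # placeholder for CLAP text embeddings
loss = contrastive_connection_loss(clip_text, clap_text, proj_clip, proj_clap)
```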

In these methods, the authors effectively encode different modalities into a semantically aligned shared space by establishing inter- and intra-MCR (Multimodal Contrastive Representations) connections [2] or by aligning multiple existing MCRs onto a single base MCR [3]. However, these approaches are designed around one and only one shared modality, which restricts their utility. To mitigate this issue, additional modality connections must be considered. It therefore seems necessary to leverage the displacement and combination bonds of different modal encoders, improving the multimodal understanding of a unified model [4].
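As a rough illustration of what aligning and then combining spaces can mean in code, the snippet below projects an expert-space embedding into a base space with a previously learned mapping and blends it with the base embedding. This is only a schematic of the space-combination idea behind [3, 4], not their actual bonding procedures; the projection module and mixing weight are assumptions.

```python
import torch
import torch.nn.functional as F

def combine_spaces(base_emb, expert_emb, expert_to_base, alpha=0.5):
    """Schematic space combination: map the expert-space embedding into
    the base space with a (previously learned) projection, then take a
    normalized convex combination with the base embedding."""
    projected = F.normalize(expert_to_base(expert_emb), dim=-1)
    base = F.normalize(base_emb, dim=-1)
    fused = alpha * base + (1.0 - alpha) * projected
    return F.normalize(fused, dim=-1)

# Hypothetical usage: expert_to_base is a trained projection (e.g., the
# ProjectionHead sketched above) mapping CLAP audio embeddings into CLIP space.
expert_to_base = torch.nn.Linear(512, 512)
audio_emb = torch.randn(4, 512)   # placeholder expert-space embeddings
image_emb = torch.randn(4, 512)   # placeholder base-space embeddings
fused = combine_spaces(image_emb, audio_emb, expert_to_base)
```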

FreeBind [4] surpasses advanced audio-text and image-text expert spaces but cannot capture the temporal consistency of data; in fact, existing knowledge-fusion methods cannot handle video datasets. Yet most prior work in talking face generation treats the temporal awareness of the generative model, whether through sequence modeling [12] or intermediate representations [6, 7, 9], as essential to the task. For this reason, injecting temporal attention layers into the main model for video training seems unavoidable [16]. However, because we would like to use pretrained models without additional training, we also need a method for transferring a condition (e.g., getting wet in the rain) onto frames appropriately. The AdaIN operation [13], one of the representative techniques for style transfer in deep learning, could likely resolve this concern. Furthermore, adjusting cross-attention layers between models has been studied as a way to flexibly impose a style image onto a ground image [15], and ControlNet [14] shows how to fine-tune a model with an extra condition efficiently and effectively.
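Since AdaIN [13] is suggested as a training-free way to push a condition's "style" (e.g., a rainy appearance) onto frame features, the standard operation is reproduced below: the content feature is normalized with its own per-channel statistics and re-scaled with those of the style feature. How the condition feature map would be obtained in our pipeline is left as an assumption.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization (Huang & Belongie, 2017):
    AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y),
    with statistics computed per sample and per channel."""
    # content_feat, style_feat: (B, C, H, W)
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

# Example: restyle an intermediate frame feature with a "rain" condition feature.
frame_feat = torch.randn(1, 64, 32, 32)       # hypothetical frame features
condition_feat = torch.randn(1, 64, 32, 32)   # hypothetical condition features
styled = adain(frame_feat, condition_feat)
```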

To the best of our knowledge, there is no standard methodology for handling complex interactions in the talking face generation task, with or without training. Even though video editing techniques have advanced on the most versatile generative models, including GANs [19] and diffusion models [20], editing a talking face video remains non-trivial because of the difficulty of maintaining lip synchronization, texture quality, and the avatar's identity after reflecting additional conditions. We therefore suggest injecting copied, condition-adjusting layers into a segmentation-based diffusion model and leveraging knowledge-fusion encoders for conditional generation [1, 4, 17, 18]. This would make it possible to flexibly project multimodal prompts into the model while adequately maintaining lip and head movements.
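The proposed injection can be pictured as a ControlNet-style [14] trainable copy of a backbone block whose output re-enters the frozen path through a zero-initialized convolution, so training starts from the pretrained model's unmodified behavior. The following is only a minimal sketch under that assumption; the interfaces of the actual segmentation-based diffusion backbone [1] are not modeled.

```python
import copy
import torch
import torch.nn as nn

class ConditionInjector(nn.Module):
    """Minimal ControlNet-style injection: a trainable copy of a frozen
    backbone block processes the conditioned input and feeds back through
    a zero-initialized 1x1 conv, so the initial output equals the frozen path."""
    def __init__(self, frozen_block, channels):
        super().__init__()
        self.trainable_copy = copy.deepcopy(frozen_block)  # copy before freezing
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad_(False)            # keep the pretrained weights fixed
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)  # zero init: no effect at step 0
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, condition):
        base = self.frozen_block(x)
        # the copy sees the input plus the encoded condition (e.g., a fused
        # multimodal embedding broadcast to feature-map shape)
        residual = self.zero_conv(self.trainable_copy(x + condition))
        return base + residual

# Hypothetical usage with a toy backbone block:
block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU())
injector = ConditionInjector(block, channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)
out = injector(x, cond)
```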

Conclusion

Many AI-related applications and technologies have been developed in recent years. In particular, researchers have investigated talking face generation methods focused on preserving identity, lip synchronization, and video realism. Recently, the segmentation-based method has demonstrated the capability to edit talking face videos. Moreover, many experiments on coherence with multimodal prompts have been conducted and have performed well in zero-shot settings, which implies that these techniques can be leveraged to bring complicated interactions into generative models. Furthermore, image-to-video editing techniques such as diffusion inversion [18, 21] enable a model to respond to conditional prompts (e.g., text or audio). It is therefore a worthwhile attempt to insert temporal attention layers together with unified multimodal encoders and to apply diffusion inversion at the inference step, so that complex scenarios and interactions can be reflected in talking face generation without additional datasets or training. If successful, this could accelerate the development of applications such as editable digital avatars and virtual animation. In the future, we might see a virtual reporter talking about regional weather whose hair flutters when it rains and the wind blows.
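For the inference-time step mentioned above, a common building block is deterministic DDIM inversion, which maps a source latent back to a noisy starting point that the sampler can reconstruct from, after which an edited prompt can steer re-generation [18, 21]. The sketch below writes out that recursion against a generic noise predictor; the scheduler values and the predictor itself are placeholders, not a specific library's API.

```python
import torch

@torch.no_grad()
def ddim_invert(latent, eps_model, alphas_cumprod, timesteps, cond):
    """Deterministic DDIM inversion: step the clean latent forward through
    increasing noise levels so the reverse DDIM sampler can later
    reconstruct (or edit) it. eps_model(x, t, cond) -> predicted noise."""
    x = latent
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t, cond)
        # predicted clean sample at the current noise level
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # re-noise it to the next (higher) noise level
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x  # inverted latent used as the starting point for editing

# Hypothetical usage with a toy noise predictor:
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)   # decreasing with timestep
timesteps = torch.arange(0, 1000, 50)                # increasing noise levels
eps_model = lambda x, t, c: torch.zeros_like(x)      # placeholder predictor
z = ddim_invert(torch.randn(1, 4, 64, 64), eps_model,
                alphas_cumprod, timesteps, cond=None)
```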

References

[1] Xiong, Lingyu, et al. “SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing.” Proceedings of the 32nd ACM International Conference on Multimedia. 2024.

[2] Wang, Zehan, et al. “Connecting multi-modal contrastive representations.” Advances in Neural Information Processing Systems 36 (2023): 22099-22114.

[3] Wang, Zehan, et al. “Extending multi-modal contrastive representations.” arXiv preprint arXiv:2310.08884 (2023).

[4] Wang, Zehan, et al. “FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.” Forty-first International Conference on Machine Learning. 2024.

[5] Shen, Shuai, et al. “Difftalk: Crafting diffusion models for generalized audio-driven portraits animation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[6] Zhou, Yang, et al. “MakeItTalk: speaker-aware talking-head animation.” ACM Transactions On Graphics (TOG) 39.6 (2020): 1-15.

[7] Zhong, Weizhi, et al. “Identity-preserving talking face generation with landmark and appearance priors.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[8] Prajwal, K. R., et al. “A lip sync expert is all you need for speech to lip generation in the wild.” Proceedings of the 28th ACM international conference on multimedia. 2020.

[9] Zhang, Wenxuan, et al. “Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[10] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.

[11] Wu, Yusong, et al. “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation.” ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.

[12] Xu, Sicheng, et al. “Vasa-1: Lifelike audio-driven talking faces generated in real time.” arXiv preprint arXiv:2404.10667 (2024).

[13] Huang, Xun, and Serge Belongie. “Arbitrary style transfer in real-time with adaptive instance normalization.” Proceedings of the IEEE international conference on computer vision. 2017.

[14] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. “Adding conditional control to text-to-image diffusion models.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[15] Chen, Chun-Fu Richard, Quanfu Fan, and Rameswar Panda. “Crossvit: Cross-attention multi-scale vision transformer for image classification.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[16] Chen, Zhiyuan, et al. “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions.” arXiv preprint arXiv:2407.08136 (2024).

[17] Guo, Yuwei, et al. “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.” arXiv preprint arXiv:2307.04725 (2023).

[18] Mokady, Ron, et al. “Null-text inversion for editing real images using guided diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[19] Goodfellow, Ian, et al. “Generative adversarial networks.” Communications of the ACM 63.11 (2020): 139-144.

[20] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.

[21] Ceylan, Duygu, Chun-Hao P. Huang, and Niloy J. Mitra. “Pix2video: Video editing using image diffusion.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[22] Guo, Yudong, et al. “Ad-nerf: Audio driven neural radiance fields for talking head synthesis.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[23] Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. “Synthesizing obama: learning lip sync from audio.” ACM Transactions on Graphics (ToG) 36.4 (2017): 1-13.

[24] Shen, Shuai, et al. “Learning dynamic facial radiance fields for few-shot talking head synthesis.” European conference on computer vision. Cham: Springer Nature Switzerland, 2022.



