Real-Time, High-Fidelity Face Identity Swapping With a Vision Foundation Model

Nov 2025

Abstract

Many recent face-swapping methods based on generative adversarial networks (GANs) or autoencoders achieve strong performance under constrained conditions but degrade significantly in high-resolution or extreme-pose scenarios. Moreover, most existing models generate outputs at limited resolutions (128 × 128), which fall short of modern visual standards. Diffusion-based approaches have shown promise in handling such challenges, but they are computationally intensive and unsuitable for real-time applications. In this work, we propose FaceChanger, a real-time face identity swap framework designed to improve robustness across diverse poses and to produce outputs at 256 × 256 (double the linear resolution of typical 128 × 128 baselines). While maintaining compatibility with conventional GAN- and autoencoder-based pipelines, FaceChanger uniquely incorporates a vision foundation model (VFM) to extract richer semantic features, which enhance identity preservation, attribute control, and robustness to variations. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to obtain these features, which guide identity preservation and attribute control through newly designed VFM-based visual and textual semantic contrastive losses. Extensive evaluations on benchmarks such as the FaceForensics++ (FF++) dataset, the Multiple Pose, Illumination, and Expression (MPIE) dataset, and the large-pose Flickr face (LPFF) dataset demonstrate that FaceChanger matches or exceeds state-of-the-art performance under standard conditions and significantly outperforms it in high-resolution, pose-intensive scenarios.
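The abstract describes a VFM-based visual semantic contrastive loss that uses CLIP features to guide identity preservation. The paper's exact formulation is not reproduced here; the sketch below is a minimal, hypothetical InfoNCE-style version using NumPy, where `swap`, `src`, and `tgt` stand in for embeddings that would come from a frozen vision encoder such as CLIP's image tower. Each swapped-face embedding is pulled toward its source-identity embedding (positive pair) and pushed away from the target-face embeddings (negatives).

```python
import numpy as np

def l2norm(x):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def vfm_identity_contrastive_loss(swap, src, tgt, tau=0.07):
    """Hypothetical InfoNCE-style identity contrastive loss.

    swap, src, tgt: (B, D) arrays of embeddings from a frozen vision
    encoder (e.g. CLIP's image encoder). For each row, (swap_i, src_i)
    is the positive pair; all target embeddings serve as negatives.
    tau is the softmax temperature.
    """
    swap, src, tgt = l2norm(swap), l2norm(src), l2norm(tgt)
    pos = np.sum(swap * src, axis=-1) / tau      # (B,) positive similarities
    neg = swap @ tgt.T / tau                     # (B, B) negative similarities
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Numerically stable log-softmax; the positive sits in column 0.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

In a training loop, lowering this loss corresponds to the swapped face's CLIP embedding aligning more closely with the source identity than with the target face; the textual variant described in the abstract would follow the same pattern with CLIP text embeddings as anchors.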


Keywords

Face identity swap, face swap, vision foundation model, contrastive learning


Key Contributions

  • Demonstrates that VFM-driven contrastive supervision dramatically improves swapped-face realism.

  • Achieves robust performance under ±90° facial poses while preserving identity consistency.

  • Provides a scalable real-time architecture compatible with GAN/AE pipelines.

  • Establishes strong benchmarks across identity retrieval, pose error, and expression error evaluations.
