PVSDNet: Joint Depth Prediction and View Synthesis Via Shared Latent Spaces in Real-Time
Mid Sweden University, TU Berlin, HTW Berlin
TL;DR
PVSDNet jointly synthesizes novel views and predicts geometrically consistent depth maps from a single input image in real time, by attaching a depth decoder to a view synthesis network through a shared latent space and fine-tuning the combined model with LoRA.
Abstract
Recent advances in real-time view synthesis have significantly enhanced immersive augmented telepresence applications by providing new viewpoints. Accurate depth estimation enables precise placement of virtual objects and improves spatial understanding and interaction. Nevertheless, state-of-the-art monocular depth estimation methods, when applied independently to each synthesized view, often result in geometric inconsistencies and visual artifacts such as flickering. To address these limitations, we propose a unified multimodal network capable of jointly synthesizing new views and predicting consistent depth maps from a single input image. Our framework integrates an additional depth prediction module into a state-of-the-art view synthesis architecture by leveraging a shared latent representation, thereby ensuring geometric coherence between synthesized views and their depth maps. We introduce a two-stage training strategy, initially freezing the view synthesis branch of the network while training the depth decoder individually to establish accurate depth estimation. Subsequently, we fine-tune the combined architecture employing low-rank adaptation (LoRA), facilitating rapid convergence and improved multimodal accuracy. Extensive experiments validate that our proposed approach achieves superior viewpoint consistency, visual realism, and depth accuracy while maintaining real-time inference speeds, highlighting its suitability for live interactive augmented telepresence environments.
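The abstract outlines a two-stage training strategy: first the view synthesis branch is frozen while the depth decoder is trained, then the combined architecture is fine-tuned with low-rank adaptation (LoRA). Below is a minimal PyTorch sketch of that strategy for illustration only; the module names (PVSDNetSketch, LoRALinear), layer shapes, ranks, and learning rates are assumptions and do not reproduce the authors' actual architecture.

# Minimal sketch of the two-stage training strategy described in the abstract.
# All names and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update (W + scale * B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

class PVSDNetSketch(nn.Module):
    """Shared latent encoder feeding a view-synthesis decoder and a depth decoder."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3 * 64 * 64, latent_dim), nn.ReLU())
        self.view_decoder = nn.Linear(latent_dim, 3 * 64 * 64)   # pretrained view branch
        self.depth_decoder = nn.Linear(latent_dim, 64 * 64)      # added depth branch

    def forward(self, x):
        z = self.encoder(x)                     # shared latent representation
        return self.view_decoder(z), self.depth_decoder(z)

model = PVSDNetSketch()

# Stage 1: freeze the encoder and view-synthesis branch, train only the depth decoder.
for p in model.encoder.parameters():
    p.requires_grad = False
for p in model.view_decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.depth_decoder.parameters(), lr=1e-4)

# Stage 2: joint fine-tuning of the combined model via LoRA adapters on the frozen branch.
model.view_decoder = LoRALinear(model.view_decoder, rank=8)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)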
BibTeX
If you use our work in your research, please cite our publication:
@ARTICLE{11348070,
  author={Gond, Manu and Zerman, Emin and Knorr, Sebastian and Sjöström, Mårten},
  journal={IEEE Access},
  title={PVSDNet: Joint Depth Prediction and View Synthesis Via Shared Latent Spaces in Real-Time},
  year={2026},
  volume={14},
  number={},
  pages={9021-9037},
  keywords={Real-time systems;Depth measurement;Training;Geometry;Rendering (computer graphics);Accuracy;Three-dimensional displays;Telepresence;Neural radiance field;Visualization;Augmented reality;depth image;low-rank adaptation fine-tuning;monocular depth estimation;telepresence;view synthesis},
  doi={10.1109/ACCESS.2026.3653905}}