Real-Time View Synthesis with Multiplane Image Network using Multimodal Supervision

IEEE 27th International Workshop on Multimedia Signal Processing

TL;DR

We introduce a framework that predicts multiplane image (MPI) parameters directly from a single RGB image for real-time view synthesis. To guide the network toward correct parameter estimates, we introduce a training strategy that leverages joint supervision from both view synthesis and depth estimation losses to ensure visual fidelity.

Abstract

Recent advances in view synthesis from a single image have significantly improved the visual quality of newly synthesized viewpoints. However, the high computational cost of state-of-the-art methods remains a critical bottleneck, limiting their adoption in real-time applications such as immersive telepresence. To address this limitation, we present a multiplane image (MPI) network that achieves real-time view synthesis. Unlike existing approaches that often rely on a separate depth estimation network to guide the estimation of MPI parameters, our framework directly predicts the MPI parameters from a single RGB image. To guide the network toward correct parameter estimates, we introduce a training strategy that leverages joint supervision from both view synthesis and depth estimation losses to ensure visual fidelity. During inference, our method exclusively utilizes the optimized view synthesis branch, while the depth decoder is only used for training. Our end-to-end approach renders views from a single input image in real-time. Extensive experiments validate that our method provides a compelling rendering speed with visual quality on par with state-of-the-art methods, highlighting its suitability for live, interactive applications.
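For readers unfamiliar with the MPI representation, the sketch below illustrates the standard rendering step shared by MPI-based methods: a stack of RGBA planes, ordered by depth, is alpha-composited back-to-front with the "over" operator to produce an image. This is a generic, minimal NumPy illustration of the representation itself, not the paper's network or rendering pipeline; the array shapes and the toy plane values are assumptions for demonstration.

```python
import numpy as np

def composite_mpi(planes):
    """Alpha-composite MPI planes back-to-front with the 'over' operator.

    planes: array of shape (D, H, W, 4), ordered near-to-far, RGBA in [0, 1].
    Returns an (H, W, 3) rendered image.
    """
    out = np.zeros(planes.shape[1:3] + (3,))
    # Iterate from the farthest plane to the nearest one.
    for plane in planes[::-1]:
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out

# Toy example (hypothetical values): a half-transparent red near plane
# over an opaque blue far plane.
D, H, W = 2, 4, 4
planes = np.zeros((D, H, W, 4))
planes[0, ..., 0] = 1.0   # near plane: red
planes[0, ..., 3] = 0.5   # near plane: alpha 0.5
planes[1, ..., 2] = 1.0   # far plane: blue
planes[1, ..., 3] = 1.0   # far plane: opaque
img = composite_mpi(planes)  # every pixel blends red over blue
```

Novel views are obtained by warping each plane to the target camera (e.g. via per-plane homographies) before this compositing step, which is what makes MPI rendering fast enough for real-time use.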

Bibtex

If you use our work in your research, please cite our publication:

@inproceedings{Gond1989561,
author = {Gond, Manu and Shamshirgarha, Mohammadreza and Zerman, Emin and Knorr, Sebastian and Sj{\"o}str{\"o}m, M{\aa}rten},
booktitle = {2025 IEEE 27th International Workshop on Multimedia Signal Processing (MMSP), Beijing, China, Sept 21-23, 2025},
institution = {Mid Sweden University, Department of Computer and Electrical Engineering (2023-); Technical University of Berlin; HTW Berlin - University of Applied Sciences},
note = {Accepted version of paper that will be published in forthcoming IEEE conference proceeding.},
title = {Real-Time View Synthesis with Multiplane Image Network using Multimodal Supervision},
keywords = {View Synthesis, Rendering, Depth Estimation, Multimodal Vision},
abstract = {Recent advances in view synthesis from a single image have significantly improved the visual quality of newly synthesized viewpoints. However, the high computational cost of state-of-the-art methods remains a critical bottleneck, limiting their adoption in real-time applications such as immersive telepresence. To address this limitation, we present a multiplane image (MPI) network that achieves real-time view synthesis. Unlike existing approaches that often rely on a separate depth estimation network to guide the estimation of MPI parameters, our framework directly predicts the MPI parameters from a single RGB image. To guide the network toward correct parameter estimates, we introduce a training strategy that leverages joint supervision from both view synthesis and depth estimation losses to ensure visual fidelity. During inference, our method exclusively utilizes the optimized view synthesis branch, while the depth decoder is only used for training. Our end-to-end approach renders views from a single input image in real-time. Extensive experiments validate that our method provides a compelling rendering speed with visual quality on par with state-of-the-art methods, highlighting its suitability for live, interactive applications.},
year = {2025}
}