
CheapNVS: Real-Time On-Device Novel View Synthesis for Mobile Applications

By Mehmet Kerim Yucel, Samsung R&D Institute United Kingdom
By Albert Saa-Garriga, Samsung R&D Institute United Kingdom

Motivation

Novel View Synthesis (NVS)—the task of generating new perspectives of a scene from a single image—has vast potential in augmented reality (AR), robotics, and immersive media. NVS is a notoriously ill-posed problem, as it not only involves back-projecting an image into 3D space, but also requires completing the missing data in occluded regions. Despite recent advances in the literature, NVS still suffers from several critical bottlenecks:

Computational Overhead: Traditional pipelines rely on explicit 3D reconstruction or per-scene optimization (e.g., NeRF-based methods (Mildenhall, 2021)), which makes them computationally heavy and, consequently, impractical to deploy on-device for real-time applications. More recent multi-view diffusion-based solutions (Bourigault, 2024), accurate as they may be, are also expensive to deploy.

Limited Generalization: Many approaches only work on the specific camera baselines they are trained on (Zhou, 2023), or require scene-specific training (Kerbl, 2023; Mildenhall, 2021), which limits both scalability and practicality. Having to train a new model for each scene is undesirable, and certainly not practical in real-time scenarios.

CheapNVS tackles these challenges by reimagining NVS as an efficient, end-to-end task. It leverages lightweight modules to approximate 3D warping and performs inpainting in parallel to deliver real-time performance on mobile hardware, while maintaining competitive accuracy.

Novel View Synthesis – Reimagined

Single-view Novel View Synthesis aims to synthesize a new image I_t for a given target camera pose T, using the input image I_s and its depth map D. Formally, it can be written as

I_t = f( w(I_s, D, T; θw), M; θf )

where w(⋅; θw) is the function that implements image warping, M is the occlusion mask indicating the areas to be inpainted in the warped image, and f(⋅; θf) is the function that implements inpainting.

The issue with the formulation above is that f(⋅; θf) requires the output of w(⋅; θw), which makes the NVS process inherently sequential. This sequential nature creates a performance bottleneck, no matter how fast f(⋅; θf) or w(⋅; θw) individually are.

We hypothesize that this bottleneck can be resolved by performing inpainting and warping in parallel. We therefore propose our “reimagined” novel view synthesis formulation, written as

I_t = w(I_s, D, T; θw) ⊙ M + f(I_s, D, T; θf) ⊙ (1 − M),   M = ϕ(I_s, D, T; θϕ)

where ϕ(⋅; θϕ) is the function that outputs the occlusion mask. This formulation performs inpainting and warping in parallel, which alleviates the bottleneck. Furthermore, note that f(⋅; θf), w(⋅; θw) and ϕ(⋅; θϕ) take the same inputs, which can be exploited for further computational savings.
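For concreteness, below is a minimal Python sketch of this parallel formulation. The callables `warp`, `inpaint` and `predict_mask` are hypothetical stand-ins for w(⋅; θw), f(⋅; θf) and ϕ(⋅; θϕ), not the actual CheapNVS modules.

```python
def reimagined_nvs(warp, inpaint, predict_mask, I_s, D, T):
    """Sketch of the parallel NVS formulation (hypothetical callables).

    All three branches see the same inputs (I_s, D, T), so they can be
    evaluated independently -- e.g. as parallel heads of a single network --
    instead of inpainting having to wait for the warping output.
    """
    warped = warp(I_s, D, T)         # learned approximation of 3D warping
    painted = inpaint(I_s, D, T)     # hallucinated content for occluded regions
    M = predict_mask(I_s, D, T)      # soft occlusion mask in [0, 1]
    # Blend: keep warped pixels where valid, fill the holes with inpainting.
    return warped * M + painted * (1.0 - M)
```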

CheapNVS

Based on the reimagined NVS formulation, we propose to implement

F = [ δ(I_s, D; θδ), ∇(T; θ∇) ],   S = ŵ(F; θŵ),   M = ϕ(F; θϕ),   P = f(F; θf)

where S, P, F and ŵ(⋅; θŵ) are the predicted flow, the inpainting output, the shared embedding and the flow predictor, respectively. The novel view is then synthesized as

I_t = gs(I_s, S) ⊙ M + P ⊙ (1 − M)

where gs is the grid sampling operation that shifts each pixel by the offset values in S. We implement δ(⋅), ∇(⋅), ŵ(⋅), f(⋅) and ϕ(⋅) with neural networks by learning their parameters θδ, θ∇, θŵ, θf and θϕ.
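Below is a minimal PyTorch sketch of this synthesis step. The tensor layouts, and the conversion of the per-pixel offsets in S into the normalized sampling grid expected by `grid_sample`, are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F_nn

def synthesize_view(I_s, S, M, P):
    """Blend the grid-sampled (warped) source image with the inpainting output.

    I_s: source image, (B, 3, H, W)
    S:   predicted shift map in pixels, (B, 2, H, W)   (assumed layout)
    M:   soft occlusion mask in [0, 1], (B, 1, H, W)
    P:   inpainting decoder output, (B, 3, H, W)
    """
    B, _, H, W = I_s.shape
    # Base pixel grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, H, device=I_s.device),
        torch.linspace(-1.0, 1.0, W, device=I_s.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    # Convert pixel offsets to normalized offsets and shift the grid.
    offsets = torch.stack(
        (S[:, 0] * 2.0 / max(W - 1, 1), S[:, 1] * 2.0 / max(H - 1, 1)), dim=-1
    )
    warped = F_nn.grid_sample(I_s, base + offsets, align_corners=True)
    # I_t = gs(I_s, S) * M + P * (1 - M)
    return warped * M + P * (1.0 - M)
```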

Architecture

RGBD Encoder: We leverage an off-the-shelf depth estimation model (Ranftl, 2021) to generate D, and use a MobileNetV2 (Sandler, 2018) as the encoder.

Extrinsics Encoder: The goal of this module is to condition the warping on a target camera pose, and to generalize across different target camera poses. This encoder takes in a camera transformation matrix, and lifts it to a 256-dimensional latent vector using a lightweight, two-layer MLP.

Figure 1. CheapNVS embeds the target camera pose and RGB information into a shared latent space, which is used by the flow, mask and inpainting decoders to perform “learned” warping and inpainting and produce the synthesized novel view.

The outputs of both encoders are concatenated into the shared latent F, allowing decoders to jointly reason about scene content and camera geometry.
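As an illustration, the sketch below shows one plausible way to realize the two encoders and the shared latent F in PyTorch. Widening the first convolution to accept four (RGB-D) channels, the MLP hidden size and the spatial broadcasting of the pose embedding are all our assumptions, with torchvision's `mobilenet_v2` as a stand-in backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class RGBDEncoder(nn.Module):
    """MobileNetV2-based encoder over the concatenated RGB-D input (sketch)."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v2(weights=None)  # recent torchvision assumed
        # Widen the stem convolution from 3 to 4 input channels (assumption).
        backbone.features[0][0] = nn.Conv2d(4, 32, 3, stride=2, padding=1, bias=False)
        self.features = backbone.features

    def forward(self, rgb, depth):
        return self.features(torch.cat((rgb, depth), dim=1))  # (B, 1280, H/32, W/32)

class ExtrinsicsEncoder(nn.Module):
    """Two-layer MLP lifting a flattened camera transformation matrix to 256-D."""
    def __init__(self, in_dim=16, hidden=128, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, T):  # T: (B, 4, 4)
        return self.mlp(T.flatten(1))  # (B, 256)

def shared_latent(rgbd_enc, ext_enc, rgb, depth, T):
    """Broadcast the pose embedding spatially and concatenate with image features."""
    feat = rgbd_enc(rgb, depth)
    pose = ext_enc(T)[:, :, None, None].expand(-1, -1, *feat.shape[-2:])
    return torch.cat((feat, pose), dim=1)  # shared latent F
```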

Flow Decoder: The flow decoder leverages the shared latent F to predict a shift map, which contains the offset values to be applied to each pixel to perform the warping. Once the shift map is applied to the input image, we obtain the warped image, which we then multiply with the mask M so that it contains only the pixels coming from the source image. Unlike traditional 3D warping (which projects pixels using depth and camera matrices), this decoder learns to approximate the warping directly.

Mask Decoder: This decoder uses the shared latent F to generate the mask M, with which we blend the warped input image and the generated inpainting.

Inpainting Decoder: The inpainting decoder leverages the shared latent F to generate the inpainting output P, which is then weighted by the inverse of the mask M and combined with the masked warped image to produce the final synthesized view.

All these decoders run in parallel and share the same architecture, which is formed of several decoder blocks of bilinear sampling, convolution and ELU activations.
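A decoder block in this spirit might look like the following sketch, interpreting the bilinear sampling as bilinear upsampling; the channel widths and the number of blocks are assumptions on our part.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Bilinear upsampling followed by a convolution and an ELU activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ELU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def make_decoder(in_ch, out_ch, widths=(256, 128, 64, 32)):
    """Stack of identical decoder blocks, shared in structure by the flow,
    mask and inpainting heads; only the final output channels differ."""
    layers, ch = [], in_ch
    for w in widths:
        layers.append(DecoderBlock(ch, w))
        ch = w
    layers.append(nn.Conv2d(ch, out_ch, kernel_size=3, padding=1))  # task head
    return nn.Sequential(*layers)
```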

Multi-stage Training

We train our model in a two-stage regime: we first train the encoders as well as the flow and mask decoders for a few epochs, and then activate the inpainting decoder and train the entire network. This phased approach lets the inpainting decoder start from a point where the network has already learned warping to a reasonable degree, which stabilizes the overall training and improves the results.
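A rough sketch of this schedule is shown below; the epoch counts, the optimizer, the `inpainting_decoder` attribute and the `loss_fn` signature are assumptions rather than the exact training recipe.

```python
import torch

def train_two_stage(model, loader, loss_fn, warmup_epochs=5, total_epochs=50, lr=1e-4):
    """Stage 1: encoders + flow/mask decoders only; Stage 2: the whole network."""
    # Stage 1: keep the (hypothetical) inpainting decoder frozen while the
    # network learns to approximate warping.
    for p in model.inpainting_decoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)

    for epoch in range(total_epochs):
        if epoch == warmup_epochs:
            # Stage 2: activate the inpainting decoder and train everything.
            for p in model.inpainting_decoder.parameters():
                p.requires_grad = True
            opt = torch.optim.Adam(model.parameters(), lr=lr)
        for batch in loader:
            opt.zero_grad()
            loss = loss_fn(model, batch, use_inpainting=(epoch >= warmup_epochs))
            loss.backward()
            opt.step()
```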

Experimental Results

We train our models with a combination of several losses: an L1 loss for the flow decoder, a cross-entropy loss for the mask decoder and an L1 loss for the inpainting decoder. We train CheapNVS on COCO (Lin, 2014) to be comparable against AdaMPI (Han, 2022), but also propose to use a random 174K-image subset of OpenImages (Kuznetsova, 2020). We evaluate and compare our method using the SSIM, PSNR and LPIPS metrics, and use DPT (Ranftl, 2021) (COCO training/evaluation) and Marigold (Ke, 2024) (OpenImages training/evaluation) to generate depth maps.
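As a sketch, the combined objective described above could be written as follows; the loss weights and the use of binary cross-entropy with logits for the occlusion mask are our assumptions.

```python
import torch.nn.functional as F_nn

def cheapnvs_loss(pred_flow, gt_flow, pred_mask_logits, gt_mask, pred_view, gt_view,
                  w_flow=1.0, w_mask=1.0, w_inpaint=1.0):
    """L1 on the flow, cross-entropy on the occlusion mask, L1 on the final view."""
    loss_flow = F_nn.l1_loss(pred_flow, gt_flow)
    loss_mask = F_nn.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)
    loss_inpaint = F_nn.l1_loss(pred_view, gt_view)
    return w_flow * loss_flow + w_mask * loss_mask + w_inpaint * loss_inpaint
```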

Quantitative Results

We compare CheapNVS against AdaMPI (Han, 2022), the leading 3D photography method. We perform comparisons using both COCO and OpenImages as training sets.

Table 1. Results on OpenImages. The first two rows are trained on OpenImages, whereas the last two are trained on COCO. † indicates training on OpenImages by us. AdaMPI performs the conventional 3D warping we use as ground-truth, so we do not report its warping results.

Table 2. Results on COCO. The first two rows are trained on OpenImages, whereas the last two are trained on COCO. † indicates training on OpenImages by us. AdaMPI performs the conventional 3D warping we use as ground-truth, so we do not report its warping results.

Tables 1 and 2 show that CheapNVS outperforms AdaMPI on inpainting, and also performs warping successfully. Another finding is that training on OpenImages produces better models on both the COCO and OpenImages evaluations, which shows the value of OpenImages and its diversity.

Runtime performance

Table 3. Runtime analysis on a desktop GPU (RTX 3090) and a mobile GPU (Samsung Tab 9+), together with memory consumption.

Table 3 shows the runtime and memory consumption of AdaMPI and CheapNVS. CheapNVS runs 10 times faster than AdaMPI on a consumer GPU. Furthermore, while AdaMPI cannot be ported to mobile due to external dependencies (used for 3D warping), we manage to port CheapNVS to device and achieve a runtime of ~30 FPS. CheapNVS also consumes slightly less memory during inference than AdaMPI, further showing its value.

Qualitative Results

Figure 2 shows that CheapNVS accurately mimics 3D warping and learns to smooth occlusion masks, which is necessary for better blending. CheapNVS performs quite competitively in inpainting, even outperforming AdaMPI in the removal of object boundary artefacts (see the person in the 1st and 3rd rows).

Figure 2. Left to right: Input, ground-truth occlusion mask, CheapNVS occlusion mask, ground-truth warped RGB, CheapNVS warped RGB, AdaMPI inpainting and CheapNVS inpainting.

Future Directions

    
Larger baselines: CheapNVS is trained on narrow baselines, and naturally struggles to warp properly under larger camera baselines. We aim to train CheapNVS on a more diverse range of camera transformation matrices, which should address the larger-baseline issue.
    
Depth dependency: Our method relies on an external depth estimator, which might introduce errors in the pipeline. We aim to use higher quality depth estimators, or even integrate depth estimation itself into our pipeline.
    
Inpainting teacher: We aim to improve inpainting accuracy even further by using a diffusion-based inpainting teacher, potentially finetuned with warp-back data.

Conclusions

By replacing traditional 3D warping with learnable modules and executing warping and inpainting in parallel, CheapNVS achieves real-time single-image novel view synthesis on mobile devices without sacrificing quality. Its success on OpenImages highlights the value of scalable training data, while the phased training ensures robust optimization.

Publication

Our paper “CheapNVS: Real-Time On-Device Narrow-Baseline Novel View Synthesis” will appear at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025.

https://ieeexplore.ieee.org/document/10888972

References

Bourigault, E., et al. (2024). MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.

Han, Y., et al. (2022). Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images. ACM SIGGRAPH.

Ke, B., et al. (2024). Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Kerbl, B., et al. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics.

Kuznetsova, A., et al. (2020). The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. International Journal of Computer Vision.

Lin, T.-Y., et al. (2014). Microsoft COCO: Common Objects in Context. European Conference on Computer Vision.

Mildenhall, B., et al. (2021). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 99-106.

Ranftl, R., et al. (2021). Vision Transformers for Dense Prediction. Proceedings of the IEEE/CVF International Conference on Computer Vision.

Sandler, M., et al. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Zhou, Y., et al. (2023). Single-View View Synthesis with Self-Rectified Pseudo-Stereo. International Journal of Computer Vision, 2032-2043.