
Finding Waldo: Towards Efficient Exploration of NeRF Scene Spaces

By Mehmet Kerim Yucel, Samsung R&D Institute United Kingdom
By Albert Saa-Garriga, Samsung R&D Institute United Kingdom
By Bruno Manganelli, Samsung R&D Institute United Kingdom

Motivation


Neural Radiance Fields (NeRF) [1] have taken the 3D computer vision field by storm in the last few years, due to their ability to compactly represent complex scenes and their impressive results in novel view synthesis. NeRF methods quickly found use in various application areas, such as augmented reality, robotics, content creation and multimedia production [1] [2] [3].

NeRFs are 3D representations that encode a scene into the weights of a neural network. Just like we can perform computer vision tasks on 2D images, we can perform them on NeRFs as well; object detection [4], segmentation [5] and style transfer [6] methods have been proposed for NeRFs. These methods, however, have two major drawbacks: they work for only one task (e.g. segmentation), or they are tailored to a specific NeRF method.

In this work, we aim to address these two restrictions and focus on a more general task. First, we formally introduce the scene exploration framework, where the aim is to find inputs to the NeRF model with which one can render images that adhere to some user-provided criteria. A practical example of this is finding Waldo: assume your scene is encoded with a NeRF and Waldo is somewhere in it. The aim of the framework would then be to find NeRF inputs that render an image of the scene in which Waldo is visible.

Second, since the scene exploration framework has not been explored before, we propose three methods to tackle it: Guided Random Search (GRS), Pose Interpolation-based Search (PIBS) and Evolution-Guided Pose Search (EGPS). Finally, we perform experiments on three scenarios and show that EGPS performs favourably compared to the others.

Figure 1. The scene exploration framework aims to find the camera poses from which one can render novel views that adhere to some user-provided criteria; e.g. including an object, improving photo-composition, maximizing object saliency, etc. Efficiently exploring scenes in 3D can be imperative for content creation, multimedia production and VR/AR applications.

Exploring NeRF Scene Spaces


To understand what the scene exploration framework is, we first need to understand how NeRF methods work.

NeRF Preliminaries
NeRF methods estimate a 5D function f: (x, y, z, θ, φ) → (c, σ), where (x, y, z) is the camera position, (θ, φ) is the viewing angle, c is the color and σ is the volume density. The NeRF network is evaluated for samples along each ray, and the produced color values are integrated ray-wise to produce the final color for each ray/pixel. Using a set of images of a scene, NeRF learns to approximate this 5D function via the reconstruction loss.
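To make the ray-wise integration concrete, the snippet below is a minimal sketch of the standard NeRF volume-rendering quadrature, assuming per-sample colors and densities have already been predicted by the network (the function name volume_render and the toy inputs are ours, not from any particular NeRF codebase).

import numpy as np

def volume_render(colors, sigmas, deltas):
    """Composite per-sample colors along a single ray (standard NeRF quadrature).

    colors: (S, 3) colors predicted by the network for S samples along the ray
    sigmas: (S,)   volume densities for the same samples
    deltas: (S,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each ray segment
    trans = np.cumprod(1.0 - alphas + 1e-10)           # accumulated transparency
    trans = np.concatenate([[1.0], trans[:-1]])        # light reaching each sample
    weights = alphas * trans                           # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)     # final color of the ray/pixel

# Toy example: 64 random samples along one ray
rng = np.random.default_rng(0)
pixel = volume_render(rng.random((64, 3)), rng.random(64), np.full(64, 0.03))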

Scene Exploration Framework
Let f(·;ζ) be the NeRF model (parameterized by ζ) that models a scene and (x, y, z, θ, φ) be the 5D input pose to the NeRF model. Querying the NeRF model with this input pose yields an image I, which has an appearance A_I and geometry G_I, estimated by functions A(·) and G(·) respectively. We define the scene exploration framework as finding poses such that

∇(A_I, A_T) ≤ ε_A and ∇(G_I, G_T) ≤ ε_G,

where A_T and G_T are the target appearance and geometry implied by the user-provided criteria, ε_A and ε_G are small constants, and ∇ indicates the difference between the appearance/geometry of image I and the corresponding target. In simple words, we want to find camera poses with which we can render an image with the desired appearance and/or geometry. Note that using both A(·) and G(·) is the most general case; we can focus only on appearance or geometry, depending on the specific scenario.

A(·) and G(·) can be implemented by various models; they can be face detectors, saliency segmentation models, object detectors and many others. Furthermore, f(·;ζ) can be any NeRF model (as long as it takes in the 5D input). With our formulation, we address the two key drawbacks of existing NeRF-based methods.
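As an illustration, the acceptance test of the framework can be written as a simple predicate over a rendered image. This is only a sketch: the names render, A, G, the targets and the thresholds are placeholders for whichever NeRF model and A(·)/G(·) one plugs in, and in maximization-style use cases the thresholded difference would be replaced by a plain score to be maximized.

from typing import Callable, Optional
import numpy as np

def satisfies_criteria(
    pose: np.ndarray,                                    # 5D input (x, y, z, theta, phi)
    render: Callable[[np.ndarray], np.ndarray],          # f(.; zeta): pose -> image
    A: Optional[Callable[[np.ndarray], float]] = None,   # appearance criterion A(.)
    G: Optional[Callable[[np.ndarray], float]] = None,   # geometry criterion G(.)
    target_A: float = 0.0, target_G: float = 0.0,        # user-provided targets
    eps_A: float = 1e-2, eps_G: float = 1e-2,            # small constants
) -> bool:
    """Return True if the image rendered from the pose matches the targets within eps."""
    image = render(pose)
    ok_A = A is None or abs(A(image) - target_A) <= eps_A
    ok_G = G is None or abs(G(image) - target_G) <= eps_G
    return ok_A and ok_G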

Practical Scenarios
Our formulation is flexible and can thus be used for various practical scenarios, e.g. finding a viewpoint that includes or excludes an object, maximizing image quality, maximizing the number of visible faces, etc. Such scenarios naturally lend themselves to real-life use cases, ranging from optimal insertion of objects into a scene to interactive gaming/educational content in AR environments.

The Proposed Methods


Having defined our framework, we now introduce three different approaches aimed at addressing the scene exploration problem efficiently.

Guided Random Search
Since we are searching for camera poses in the 3D space of a scene, the most naïve way to find them is – you guessed it – choosing them randomly. However, this quickly becomes infeasible as the scene gets bigger. Instead, we propose Guided Random Search (GRS), where we constrain the random sampling of poses to lie between the poses of the training images. Note that NeRF methods inherently overfit to their training images; poses sampled close to the training poses (in 3D space) are therefore likely to be better candidates, regardless of the use case we aim for.

GRS can be briefly summarized as follows:

1. Take all the available images of the scene and compute their scores via A(·) and/or G(·).
2. Add the images and their scores to our population P.
3. Find the minimum and maximum coordinates of the available images.
4. At each epoch:
   a. Randomly sample a set of coordinates and viewing angles between the minimum and maximum.
   b. Render images with these poses as input (using the NeRF method).
   c. Compute the scores of the rendered images via A(·) and/or G(·).
   d. Add the new images and their scores to P.
5. Sort P with respect to the scores and return the population members that satisfy our criteria.
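The steps above translate almost directly into code. The sketch below assumes a render function wrapping the NeRF model and a score function wrapping A(·)/G(·); both are placeholders, and the epoch and sample counts are arbitrary.

import numpy as np

def guided_random_search(train_poses, train_scores, render, score,
                         epochs=10, samples_per_epoch=32, seed=0):
    """Guided Random Search: sample poses uniformly inside the bounding box
    spanned by the training poses, render and score them, and keep everything
    in a single population."""
    rng = np.random.default_rng(seed)
    population = list(zip(train_poses, train_scores))   # steps 1-2: seed with training images
    poses = np.asarray(train_poses)
    lo, hi = poses.min(axis=0), poses.max(axis=0)       # step 3: bounding box of training poses

    for _ in range(epochs):                             # step 4
        candidates = rng.uniform(lo, hi, size=(samples_per_epoch, poses.shape[1]))
        for pose in candidates:                         # 4a: random poses inside the box
            image = render(pose)                        # 4b: render with the NeRF
            population.append((pose, score(image)))     # 4c-d: score and store

    population.sort(key=lambda item: item[1], reverse=True)   # step 5
    return population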


Pose Interpolation-based Search
Our second method is Pose Interpolation-based Search (PIBS). PIBS does away with the pure randomness of GRS and instead generates new poses by interpolating between pairs of existing poses (via SLERP [7]). Randomness enters only at the sampling stage, when choosing which pairs of poses to interpolate between.

PIBS can be briefly summarized as follows:

1. Take all the available images of the scene and compute their scores via A(·) and/or G(·).
2. Add the images and their scores to our population P.
3. Initialize an empty list of pose pairs E.
4. At each epoch:
   a. Sort P with respect to the scores and sample pose pairs X (excluding pairs already in E).
   b. Add X to E to avoid sampling the same pairs again.
   c. Interpolate between the pairs in X to generate new poses.
   d. Render images with these new poses as input (using the NeRF method).
   e. Compute the scores of the rendered images via A(·) and/or G(·).
   f. Add the new images and their scores to P.
5. Sort P with respect to the scores and return the population members that satisfy our criteria.
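A sketch of one PIBS epoch is given below, assuming poses are stored as (position, viewing-direction) pairs. The paper interpolates camera poses with SLERP [7]; the helper below implements the same idea for unit viewing directions while positions are interpolated linearly. All function names and hyper-parameters are illustrative.

import numpy as np

def slerp(d0, d1, t):
    """Spherical linear interpolation between two unit viewing directions."""
    d0, d1 = d0 / np.linalg.norm(d0), d1 / np.linalg.norm(d1)
    omega = np.arccos(np.clip(np.dot(d0, d1), -1.0, 1.0))
    if omega < 1e-6:                                    # nearly identical directions
        return d0
    return (np.sin((1 - t) * omega) * d0 + np.sin(t * omega) * d1) / np.sin(omega)

def interpolate_pose(pose_a, pose_b, t):
    """Lerp the positions and slerp the viewing directions of two poses."""
    pos = (1 - t) * pose_a[0] + t * pose_b[0]
    return pos, slerp(pose_a[1], pose_b[1], t)

def pibs_epoch(population, render, score, seen_pairs, rng, num_pairs=8, steps=4):
    """One PIBS epoch: pick not-yet-used pose pairs from the sorted population,
    interpolate between them, then render and score the new poses."""
    population.sort(key=lambda item: item[1], reverse=True)     # 4a: sort by score
    for _ in range(num_pairs):
        i, j = sorted(rng.choice(len(population), size=2, replace=False))
        if (i, j) in seen_pairs:                                # skip pairs already used
            continue
        seen_pairs.add((i, j))                                  # 4b: remember the pair
        for t in np.linspace(0.0, 1.0, steps + 2)[1:-1]:        # 4c: interior interpolations
            pose = interpolate_pose(population[i][0], population[j][0], t)
            image = render(pose)                                # 4d: render with the NeRF
            population.append((pose, score(image)))             # 4e-f: score and store
    return population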


Evolution-Guided Search
GRS and PIBS are not designed with optimality or efficiency in mind. Since the problem at hand is essentially a constrained optimization problem, we look for better and more efficient alternatives. Although search methods are well known in the literature, there is virtually no method that is agnostic to key factors such as search-space convexity and search-criteria differentiability, which matter if we want our method to work regardless of A(·)/G(·) or the underlying NeRF method. To this end, we propose Evolution-Guided Pose Search (EGPS), which is based on a genetic algorithm [8] and meets the above criteria while being efficient and more accurate.

EGPS can be briefly summarized as follows:

1. Take all the available images of the scene and compute their scores via A(·) and/or G(·).
2. Add the images and their scores to our population P.
3. At each epoch:
   a. Sort P with respect to the scores and sample a parent pose pair X; this sampling favours higher-scored members of the population.
   b. Perform crossover and mutation on the parent pose pair X to generate new poses.
   c. Render images with these new poses as input (using the NeRF method).
   d. Compute the scores of the rendered images via A(·) and/or G(·).
   e. Add the new images and their scores to P.
4. Sort P with respect to the scores and return the population members that satisfy our criteria.
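Below is a sketch of one EGPS epoch. The fitness-proportional parent selection, uniform crossover and Gaussian mutation shown here are generic genetic-algorithm operators chosen for illustration; the exact operators and hyper-parameters used in the paper may differ.

import numpy as np

def egps_epoch(population, render, score, rng, offspring=16, mutation_sigma=0.05):
    """One EGPS epoch on 5D poses stored as (pose_vector, score) pairs."""
    population.sort(key=lambda item: item[1], reverse=True)     # 3a: sort by score
    poses = np.asarray([p for p, _ in population])
    scores = np.asarray([s for _, s in population], dtype=float)

    probs = scores - scores.min() + 1e-6                        # selection favours
    probs /= probs.sum()                                        # higher-scored members

    for _ in range(offspring):
        i, j = rng.choice(len(population), size=2, replace=False, p=probs)
        mask = rng.random(poses.shape[1]) < 0.5                 # 3b: uniform crossover ...
        child = np.where(mask, poses[i], poses[j])
        child = child + rng.normal(0.0, mutation_sigma, size=child.shape)  # ... plus mutation
        image = render(child)                                   # 3c: render with the NeRF
        population.append((child, score(image)))                # 3d-e: score and store

    population.sort(key=lambda item: item[1], reverse=True)     # step 4
    return population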


Experiments


We experiment on several real-life scenes [9] [10], and evaluate the success of the search methods via the CVIR and mCVIR metrics, which measure the rate of improvement of newly generated images over the already available images of the scene. In short, values above zero mean we managed to find better poses in the scene.

We assess the methods in two distinct regimes; low-pose and high-pose. As the names imply, in the low-pose regime each method produces a limited number of poses, whereas in the high-pose regime it produces a large number of poses.

Assessing these methods in our framework requires two choices; 1) NeRF method and 2) A(·) and/or G(·). For 1), we choose the popular Instant-NGP [11] framework due to its strong results and fast convergence. For 2), we choose three distinct use cases; photo-composition improvement, image quality maximization and saliency maximization.

Photo-composition improvement requires us to find poses, such that once we render images from these poses, we will have high photo-composition scores. We use SAMP-Net [12] as our A(·)/ G(·) to provide the photo-composition scores. Image quality maximization aims to find poses, such that once we render images from these poses, we will have images with high no-reference image quality assessment (IQA) scores. We use CONTRIQUE [13] as our A(·)/G(·) to provide the IQA scores. Saliency maximization aims to find poses, such that once we render images from these poses, the number of pixels predicted to be salient in the images will be high. For saliency prediction, we use BASNet [14] as our A(·)/G(·).
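As an example of how such models plug into the framework, the sketch below shows a saliency-maximization criterion that simply counts pixels predicted to be salient. The saliency_model callable stands in for a BASNet-style predictor returning a per-pixel saliency map in [0, 1]; its interface here is an assumption, not the actual BASNet API, and analogous wrappers would expose SAMP-Net or CONTRIQUE scores for the other two use cases.

import numpy as np

def saliency_score(image, saliency_model, threshold=0.5):
    """A(.) for the saliency-maximization use case: number of salient pixels.

    saliency_model: any callable mapping an image to an (H, W) saliency map in [0, 1];
    this interface is assumed for illustration.
    """
    saliency_map = saliency_model(image)
    return float((saliency_map > threshold).sum())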

Quantitative Results – Photo-Composition Improvement
Low-pose regime results show that PIBS and EGPS are close to each other in terms of accuracy, and they both lead the pack in two scenes. In the high-pose regime, EGPS cements its lead and we see improvements on all scenes, as the methods have more time to improve on the available images.

Table 1. Photo-composition improvement use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.

Quantitative Results – Image Quality Maximization
In this use case, we see that two scenes experience no improvement in either the low-pose or the high-pose regime. We find the reason to be the inherent NeRF artefacts in the generated novel views, and the surprising ability of CONTRIQUE to detect them. In the scenes where the methods do manage to introduce improvements, we see EGPS leading the pack as in the previous use case, further showing its effectiveness.

Table 2. Image quality maximization use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.

Quantitative Results – Saliency Maximization
In saliency maximization, the results are much clearer, since we see higher and more consistent improvements for all methods even in the low-pose regime. EGPS leads the pack here as well, and this time its lead is much more pronounced and consistent across all scenes and metrics.

Table 3. Saliency maximization use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.

Qualitative Results
Figure 2 shows examples for saliency maximization (see our paper for more qualitative results). From top to bottom, the rows show the best available images, the best images found by GRS, PIBS and EGPS. The images show that PIBS images lack diversity, whereas the other two do a good job. The results are interpretable here; saliency maximization aims to increase the number of salient pixels. The search methods optimize for this goal, which in practice takes the form of a “zooming-in” effect, as bigger objects often mean larger salient areas. This is visible especially for EGPS.

Figure 2. Qualitative results for saliency maximization use case. Overlaid on images in red are the number of salient pixels.

Limitations
One potential limitation of our framework is that, although it is NeRF-agnostic, it relies on the underlying NeRF to render novel views and is therefore bound by its accuracy. Furthermore, the scene exploration framework aims to find poses that are “better” than the already available images, and our metrics naturally reflect that. For some scenes, however, this simply might not be possible, as we might already have the “best” images; in other words, it might well be impossible to find better images than the available ones. It is therefore imperative to distinguish these cases from real failure cases, where the search methods fail to find good poses although such poses do exist.

Conclusion


Inspired by the recent use cases of NeRF, we introduce the scene exploration framework, where the goal is to find camera poses in a scene such that images rendered from these poses (via a NeRF method) adhere to some user-provided criteria. To alleviate the lack of adequate baselines, we first propose two baseline methods, GRS and PIBS, and then a third, optimization-based approach, EGPS, which finds these camera poses successfully and efficiently. Evaluation on three distinct use cases shows that EGPS outperforms the baselines, while leaving substantial room for future research and further improvements.

Publication


Our paper “Finding Waldo: Towards Efficient Exploration of NeRF Scene Spaces” will appear at ACM Multimedia Systems Conference (MMSys), 2024.

arXiv link: https://arxiv.org/abs/2403.04508

Bibliography


[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in European Conference on Computer Vision, 2020.
[2] M. Adamkiewicz, T. Chen, A. Caccavale, R. Gardner, P. Culbertson, J. Bohg and M. Schwager, “Vision-only robot navigation in a neural radiance world,” IEEE Robotics and Automation Letters, pp. 4606-4613, 2022.
[3] S. Li, C. Li, W. Zhu, B. Yu, Y. Zhao, C. Wan, H. You, H. Shi and Y. Lin, “Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction,” in Annual International Symposium on Computer Architecture, 2024.
[4] B. Hu, J. Huang, Y. Liu, Y.-W. Tai and C.-K. Tang, “NeRF-RPN: A general framework for object detection in NeRFs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[5] Z. Fan, P. Wang, Y. Jiang, X. Gong, D. Xu and Z. Wang, “Nerf-sos: Any-view self-supervised object segmentation on complex scenes,” in arXiv preprint, 2022.
[6] K. Zhang, N. Kolkin, S. Bi, F. Luan, Z. Xu, E. Shechtman and N. Snavely, “Arf: Artistic radiance fields,” in European Conference on Computer Vision, 2022.
[7] K. Shoemake, “Animating rotation with quaternion curves,” in Annual conference on Computer graphics and interactive techniques, 1985.
[8] J. Holland, Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence., MIT Press, 1992.
[9] A. Knapitsch, J. Park, Q.-Y. Zhou and V. Koltun, “Tanks and Temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics, pp. 1-13, 2017.
[10] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng and A. Kar, “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines,” ACM Transactions on Graphics, pp. 1-14, 2019.
[11] T. Müller, A. Evans, C. Schied and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics, pp. 1-15, 2022.
[12] B. Zhang, L. Niu and L. Zhang, “Image composition assessment with saliency-augmented multi-pattern pooling,” in arXiv preprint, 2021.
[13] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli and A. C. Bovik, “Image quality assessment using contrastive learning,” IEEE Transactions on Image Processing, pp. 4149-4161, 2022.
[14] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in IEEE/CVF conference on computer vision and pattern recognition, 2019.