Neural Radiance Fields (NeRF) [1] have taken the 3D computer vision field by storm in the last few years, due to their ability to compactly represent complex scenes and their impressive results in novel view synthesis. NeRF methods quickly found use in various application areas, such as augmented reality, robotics, content creation and multimedia production [1] [2] [3].
NeRFs are 3D representations that encode a scene into the weights of a neural network. Just like we can perform computer vision tasks on 2D images, we can perform them on NeRFs as well: object detection [4], segmentation [5] and style transfer [6] methods have been proposed for NeRFs. These methods, however, have two major drawbacks: they either work for only one task (e.g. segmentation), or they are tailored to a specific NeRF method.
In this work, we aim to address these two restrictions and focus on a more general task. First, we formally introduce the scene exploration framework, where the aim is to find inputs to the NeRF model such that the images rendered with these inputs adhere to criteria provided by the user. A practical example is finding Waldo: assume your scene is encoded with a NeRF and Waldo is somewhere in it. To find Waldo, the framework must identify NeRF inputs that render an image of the scene in which Waldo is visible.
Second, since the scene exploration framework has not been studied before, we propose three methods to tackle it: Guided Random Search (GRS), Pose Interpolation-based Search (PIBS) and Evolution-Guided Pose Search (EGPS). Finally, we perform experiments on three scenarios and show that EGPS performs favourably compared to the others.
Figure 1. The scene exploration framework aims to find the camera poses from which one can render novel views that adhere to some user-provided criteria; e.g. including an object, improving photo-composition, maximizing object saliency, etc. Efficiently exploring scenes in 3D can be imperative for content creation, multimedia production and VR/AR applications.
To understand what the scene exploration framework is, we first need to understand how NeRF methods work.
NeRF Preliminaries
NeRF methods estimate a 5D function f: (x, y, z, θ, φ) → (c, σ), where (x, y, z) is a 3D position, (θ, φ) is the viewing direction, c is the color and σ is the volume density. NeRF networks are evaluated for samples along each ray, and the produced color values are integrated ray-wise to produce the final color for each ray/pixel. Using a set of images of a scene, NeRF learns to approximate this 5D function via the reconstruction loss.
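For intuition, here is a minimal NumPy sketch of this ray-wise integration, i.e. the standard NeRF quadrature C = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i; it is an illustration, not code from any particular NeRF implementation.

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Standard NeRF volume rendering along a single ray.

    colors: (N, 3) RGB values predicted at the N samples on the ray.
    sigmas: (N,) volume densities at the samples.
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance T_i
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # final pixel color
```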
Scene Exploration Framework
Let f(·;ζ) be the NeRF model (parameterized by ζ) that models a scene, and let (x, y, z, θ, φ) be the 5D input pose to the NeRF model. Querying the NeRF model with this input pose produces an image I. Assume that I has an appearance A_I and geometry G_I, estimated by functions A(·) and G(·). We define the scene exploration framework as finding poses (x, y, z, θ, φ) such that

Δ(A_I, A*) ≤ ε_A and Δ(G_I, G*) ≤ ε_G,

where A* and G* are the user-desired appearance and geometry, ε_A and ε_G are small constants, and Δ indicates the difference between the appearance/geometry of image I and the desired values. In simple words, we want to find camera poses with which we can render an image with the desired appearance and/or geometry. Note that using both A(·) and G(·) is the most general case; we can focus only on appearance or geometry, depending on the specific scenario.
A(·) and G(·) can be implemented by various models; they can be face detectors, saliency segmentation models, object detectors and many others. Furthermore, f(·;ζ) can be any NeRF model (as long as it takes the 5D input). With this formulation, we address both key drawbacks of existing NeRF-based methods.
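To make the definition concrete, below is a minimal sketch of the feasibility test implied by the inequalities above. All names (nerf_render, appearance_fn, geometry_fn, distance, targets) are placeholders standing in for f(·;ζ), A(·), G(·), Δ and the user-provided targets; they are not code from the paper.

```python
def satisfies_criteria(pose, nerf_render, appearance_fn, geometry_fn,
                       target_a, target_g, eps_a, eps_g, distance):
    """Check whether a candidate 5D pose meets the user-provided criteria.

    pose: a tuple (x, y, z, theta, phi).
    nerf_render: renders an image from the pose, i.e. I = f(pose; zeta).
    appearance_fn / geometry_fn: A(.) and G(.), e.g. a detector or saliency model.
    distance: the difference measure Delta between estimated and desired values.
    """
    image = nerf_render(pose)
    appearance_ok = distance(appearance_fn(image), target_a) <= eps_a
    geometry_ok = distance(geometry_fn(image), target_g) <= eps_g
    return appearance_ok and geometry_ok
```

In the use cases studied below, the criteria reduce to maximizing a single score (composition, quality or saliency), so in practice the search methods rank candidate poses by that score rather than checking a hard threshold.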
Practical Scenarios
Our formulation is flexible and thus can be used for various practical scenarios, e.g. finding the viewpoint that includes or excludes an object, maximizing image quality, maximizing the number of visible faces, etc. Such scenarios naturally lend themselves to real-life use cases, ranging from optimal insertion of objects into a scene to interactive gaming/educational content in AR environments.
Having defined our framework, we now introduce three different approaches aimed at addressing the scene exploration problem efficiently.
Guided Random Search
Since we are searching for camera poses in the 3D space of a scene, the most naïve way to find these poses is – you guessed it – choosing them randomly. However, this quickly becomes infeasible as the scene gets bigger. Instead, we propose Guided Random Search (GRS), where we constrain the random sampling of poses to the region spanned by the poses of the training images. Note that NeRF methods inherently overfit to the training images; therefore, sampled poses close to the training poses (in 3D space) are likely to be better candidates, regardless of the use case we aim for.
GRS can be briefly summarized as follows: sample random poses within the region spanned by the training poses, render an image for each candidate, score it with the criteria model, and keep the best-scoring poses.
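The following is a minimal sketch of this idea (not the paper's exact algorithm); the 5D pose array layout, the score_fn helper (render with the NeRF, then score with A(·)/G(·)) and the sampling margin are illustrative assumptions.

```python
import numpy as np

def guided_random_search(train_poses, score_fn, n_samples=100, margin=0.1, rng=None):
    """Sample poses near the training poses and rank them by the criteria score.

    train_poses: (M, 5) array of (x, y, z, theta, phi) training poses.
    score_fn: renders a pose with the NeRF and returns a higher-is-better score.
    """
    rng = rng or np.random.default_rng()
    low, high = train_poses.min(axis=0), train_poses.max(axis=0)
    span = high - low
    low, high = low - margin * span, high + margin * span     # slightly enlarged box
    candidates = rng.uniform(low, high, size=(n_samples, train_poses.shape[1]))
    scores = np.array([score_fn(p) for p in candidates])
    order = np.argsort(-scores)                               # best first
    return candidates[order], scores[order]
```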
Pose Interpolation-based Search
Our second method is Pose Interpolation-based Search (PIBS). PIBS does away with the pure randomness of GRS and generates new poses by simply interpolating between pairs of existing poses (via SLERP [7]). However, PIBS keeps a layer of randomness in how the pairs of poses to interpolate between are sampled.
PIBS can be briefly summarized as follows: randomly sample pairs of training poses, generate intermediate poses between each pair via SLERP, render and score the resulting images, and keep the best-scoring poses.
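Below is a minimal sketch of this idea (again, not the paper's code). For clarity, the sketch represents each camera pose as a rotation matrix plus a position, which is the usual convention when applying SLERP; the dict layout, pair count and step count are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose_pair(pose_a, pose_b, n_steps=5):
    """Generate intermediate camera poses between two poses.

    pose_a, pose_b: dicts with 'rot' (3x3 rotation matrix) and 'pos' (3-vector).
    Orientations are interpolated with SLERP, positions linearly.
    """
    ts = np.linspace(0.0, 1.0, n_steps + 2)[1:-1]             # interior points only
    rots = Rotation.from_matrix(np.stack([pose_a["rot"], pose_b["rot"]]))
    new_rots = Slerp([0.0, 1.0], rots)(ts).as_matrix()        # SLERP for orientations
    new_pos = (1 - ts)[:, None] * pose_a["pos"] + ts[:, None] * pose_b["pos"]
    return [{"rot": r, "pos": p} for r, p in zip(new_rots, new_pos)]

def pibs(train_poses, score_fn, n_pairs=20, n_steps=5, rng=None):
    """Randomly pick pairs of training poses, interpolate between them and score."""
    rng = rng or np.random.default_rng()
    candidates = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(train_poses), size=2, replace=False)
        candidates.extend(interpolate_pose_pair(train_poses[i], train_poses[j], n_steps))
    scores = [score_fn(p) for p in candidates]
    return candidates, scores
```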
Evolution-Guided Pose Search
GRS and PIBS are not designed with optimality or efficiency in mind. Since the problem at hand is essentially a constrained optimization problem, we look for better and more efficient alternatives. Although search methods are well known in the literature, there is virtually no method that is agnostic to key factors such as search-space convexity and the differentiability of the search criteria; these factors matter if we want our method to work regardless of A(·)/G(·) or the underlying NeRF method. To this end, we propose Evolution-Guided Pose Search (EGPS), which is based on a genetic algorithm [8] and meets the above criteria while being efficient and more accurate.
EGPS can be briefly summarized as follows: initialize a population of candidate poses, evaluate each candidate by rendering it and scoring it with the criteria model, and iteratively evolve the population via selection, crossover and mutation, keeping the best-scoring poses found along the way.
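Below is a minimal sketch of such an evolutionary search over 5D poses (a generic genetic algorithm, not the paper's exact operators or settings); seeding the initial population from the training poses, the mutation scale and the selection scheme are illustrative assumptions.

```python
import numpy as np

def egps(train_poses, score_fn, pop_size=30, n_generations=10,
         mutation_scale=0.05, rng=None):
    """Evolve a population of 5D poses towards higher criteria scores.

    train_poses: (M, 5) array used here to seed the initial population.
    score_fn: renders a pose with the NeRF and returns a higher-is-better score.
    """
    rng = rng or np.random.default_rng()
    span = train_poses.max(axis=0) - train_poses.min(axis=0)
    seeds = train_poses[rng.choice(len(train_poses), size=pop_size)]
    pop = seeds + rng.normal(scale=mutation_scale, size=(pop_size, 5)) * span

    for _ in range(n_generations):
        fitness = np.array([score_fn(p) for p in pop])              # evaluate
        parents = pop[np.argsort(-fitness)[: pop_size // 2]]        # select the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), size=2, replace=False)]
            child = np.where(rng.random(5) < 0.5, a, b)             # uniform crossover
            child += rng.normal(scale=mutation_scale, size=5) * span  # mutation
            children.append(child)
        pop = np.vstack([parents, children])

    fitness = np.array([score_fn(p) for p in pop])
    order = np.argsort(-fitness)
    return pop[order], fitness[order]
```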
We experiment using several real-life scenes [9] [10], and evaluate the success of the search methods via the CVIR and mCVIR metrics, which measure the rate of improvement of newly generated images over the images already available for the scene. In short, if they are above zero, we managed to find better poses in the scene.
We assess the methods in two distinct regimes: low-pose and high-pose. As the names imply, in the low-pose regime we produce a limited number of poses for each method, whereas in the high-pose regime we produce a large number of poses.
Assessing these methods in our framework requires two choices: 1) the NeRF method and 2) A(·) and/or G(·). For 1), we choose the popular Instant-NGP [11] framework due to its strong results and fast convergence. For 2), we choose three distinct use cases: photo-composition improvement, image quality maximization and saliency maximization.
Photo-composition improvement requires us to find poses such that, once we render images from these poses, we will have high photo-composition scores. We use SAMP-Net [12] as our A(·)/G(·) to provide the photo-composition scores. Image quality maximization aims to find poses such that, once we render images from these poses, we will have images with high no-reference image quality assessment (IQA) scores. We use CONTRIQUE [13] as our A(·)/G(·) to provide the IQA scores. Saliency maximization aims to find poses such that, once we render images from these poses, the number of pixels predicted to be salient in the images will be high. For saliency prediction, we use BASNet [14] as our A(·)/G(·).
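As an example of how such a criteria model plugs into the framework, here is a hedged sketch of a saliency-based score; the saliency_model interface (an image in, a per-pixel saliency map in [0, 1] out) and the threshold are assumptions for illustration, not BASNet's actual API.

```python
import numpy as np

def saliency_score(image, saliency_model, threshold=0.5):
    """Score a rendered view by the number of pixels predicted to be salient.

    saliency_model: assumed to map an (H, W, 3) image to an (H, W) saliency
    map with values in [0, 1]; any saliency segmentation network could be used.
    """
    saliency_map = saliency_model(image)
    return int((saliency_map > threshold).sum())    # number of salient pixels
```

Swapping this function for a composition or IQA scorer changes the use case without touching the search methods.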
Quantitative Results – Photo-Composition Improvement
In the low-pose regime, PIBS and EGPS are close to each other in terms of accuracy, with both leading the pack in two scenes. In the high-pose regime, EGPS cements its lead and we see improvements on all scenes, as the methods have more time to improve on the available images.
Table 1. Photo-composition improvement use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.
Quantitative Results – Image Quality Maximization
In this use case, we see that two scenes experience no improvement in either the low- or the high-pose regime. We find the reason to be the inherent NeRF artefacts in the generated novel views, and the surprising ability of CONTRIQUE to detect them. In the scenes where the methods do manage to introduce improvements, EGPS leads the pack as in the previous use case, further showing its effectiveness.
Table 2. Image quality maximization use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.
Quantitative Results – Saliency Maximization
In saliency maximization, the results are much clearer, since we obtain much higher and more consistent improvements for all methods even in the low-pose regime. EGPS leads the pack here as well, and this time its lead is much more pronounced and consistent across all scenes and metrics.
Table 3. Saliency maximization use case. The results are given for all search methods in CVIR and mCVIR metrics (higher the better). The first three rows are results of low-pose regime, and the last three are of high-pose regime.
Qualitative Results
Figure 2 shows examples for the saliency maximization use case (see our paper for more qualitative results). From top to bottom, the rows show the best available images and the best images found by GRS, PIBS and EGPS. The images show that PIBS results lack diversity, whereas the other two methods produce a good variety of views. The results are also interpretable: saliency maximization aims to increase the number of salient pixels, and the search methods optimize for this goal, which in practice takes the form of a “zooming-in” effect, as bigger objects often mean larger salient areas. This is especially visible for EGPS.
Figure 2. Qualitative results for saliency maximization use case. Overlaid on images in red are the number of salient pixels.
Limitations
Although our framework is NeRF-agnostic, it relies on the underlying NeRF to render novel views and is therefore bound by that model's accuracy. Furthermore, the scene exploration framework aims to find poses that are “better” than the already available images, and our metrics naturally reflect that. However, for some scenes this simply might not be possible, as the “best” images may already be among the available ones. In other words, it might be impossible to find better images than those already available. Therefore, it is imperative to distinguish these cases from real failure cases, in which the search methods fail to find good poses although such poses do exist.
Inspired by recent use cases of NeRF, we introduce the scene exploration framework, where the goal is to find camera poses in a scene such that images rendered from these poses (via a NeRF method) adhere to some user-provided criteria. To alleviate the lack of adequate baselines, we first propose two baseline methods, GRS and PIBS, and then propose a third, optimization-based approach, EGPS, which finds such camera poses effectively and efficiently. Evaluation on three distinct use cases shows that EGPS outperforms the baselines, while leaving substantial room for future improvements.
Our paper “Finding Waldo: Towards Efficient Exploration of NeRF Scene Spaces” will appear at ACM Multimedia Systems Conference (MMSys), 2024.
arXiv link: https://arxiv.org/abs/2403.04508