Using sound to add scale to computer vision

By Zhijian Yang, Samsung AI Center - New York

Imagine seeing an image of a bright disk on a dark background. Is it a picture of a planet or just a ping pong ball? This example demonstrates that even though we might infer the shape of an object from an image, we cannot obtain the exact scale from a single image.

Figure 1. A single image does not contain scale information. In this toy example, the picture of a shorter person closer to the camera would look the same as the picture of a taller person who is further away from the camera. We show that one can use acoustic signals to supplement the camera image and obtain metric scale 3D reconstruction (and hence determine which of these people was imaged).

Scale ambiguity can cause problems in robotics tasks, where we need the exact three-dimensional (3D) location and pose of a target. For example, consider a home robot scenario where the robot wants to hand over a cup to a person. Without knowing the exact 3D physical location of the human's hand, the handover can fail. Moreover, guessing the scale from semantic information is not reliable either: the person could be an adult taller than 1.8 m or a child shorter than 1.2 m. In this work, we focus on the problem of obtaining this scale information.

How can we get scale?

There are specialized sensors that can provide scale information by merging information from multiple sources: stereo systems use multiple cameras, while RGB-D cameras actively emit signals and estimate pixel-level time of flight (ToF) from the reflected signal.

In this work, rather than using specialized sensors, we ask whether we can utilize existing smart-home infrastructure to recover metric-scale 3D human pose. Specifically, we show that we can use a regular camera along with audio to obtain metric information.

With automatic speech recognition and natural language understanding techniques becoming increasingly mature, it is now very common for smart appliances to have their own speakers and microphones. We aim to use these speakers and microphones as a complement to computer vision, because they can offer scale estimation. The principle is simple: sound signals emitted from the speakers bounce off the human body surface and arrive at the receiver. By estimating the time of flight of the sound, we can get the person's location, even without a depth camera. Even though this idea is intuitive, coupling sound with the semantic information coming from the camera, while addressing issues such as multipath propagation in indoor environments, is challenging.
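As a back-of-the-envelope illustration of this time-of-flight principle (our own sketch, not code from the paper), converting a measured echo delay into a distance looks like this:

```python
# Round-trip time of flight -> distance, assuming a co-located speaker
# and microphone: sound travels to the reflector and back, so halve it.
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 degrees C

def distance_from_tof(round_trip_seconds):
    """Distance to a reflector given the round-trip echo delay."""
    return SPEED_OF_SOUND * round_trip_seconds / 2.0

distance = distance_from_tof(0.02)  # a 20 ms echo -> ~3.4 m away
```

The hard part, addressed in the rest of the article, is that a real room produces many such echoes at once and the camera's semantic information must be fused with them.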

In the rest of this article, we will start with a technical formulation of the problem, and explain how we address these challenges and fuse sound with vision for metric scale 3D human pose estimation.

Problem formulation

We cast the problem of 3D human pose lifting as one of learning a function g_θ that predicts a set of 3D heatmaps {P_1, …, P_N} given an input image I ∈ R^{W×H×3}, where P_i(x) is the likelihood of the i-th landmark at a location x in 3D space, W and H are the width and height of the image, respectively, and N is the number of landmarks. In other words,

{P_1, …, P_N} = g_θ(I),     (1)

where g_θ is a learnable function, parametrized by its weights θ, that lifts a 2D image to the 3D pose. Given the predicted 3D heatmaps, the optimal 3D pose is given by

x_i* = argmax_x P_i(x),     (2)

so that x_i* is the optimal location of the i-th landmark. In practice, we use a regular voxel grid to represent each heatmap P_i.
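As a minimal sketch (our own illustration, not the paper's implementation), decoding the optimal landmark locations from 3D heatmaps defined over a regular voxel grid could look like:

```python
import numpy as np

def decode_3d_heatmaps(heatmaps, grid_origin, voxel_size):
    """Return the 3D location of each landmark as the argmax of its heatmap.

    heatmaps: (N, D, H, W) per-landmark likelihoods over a voxel grid.
    grid_origin: (3,) world coordinate of voxel index (0, 0, 0).
    voxel_size: edge length of one voxel in meters.
    """
    n = heatmaps.shape[0]
    locations = np.empty((n, 3))
    for i in range(n):
        # Flat argmax -> voxel index of the most likely cell for landmark i.
        idx = np.unravel_index(np.argmax(heatmaps[i]), heatmaps[i].shape)
        locations[i] = grid_origin + voxel_size * np.array(idx, dtype=float)
    return locations

# Toy example: one landmark, a 4x4x4 grid, peak at voxel (1, 2, 3).
hm = np.zeros((1, 4, 4, 4))
hm[0, 1, 2, 3] = 1.0
loc = decode_3d_heatmaps(hm, grid_origin=np.zeros(3), voxel_size=0.5)
```

With a 0.5 m voxel size, the peak voxel (1, 2, 3) decodes to the metric location (0.5, 1.0, 1.5).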

We extend Equation (1) by leveraging audio signals to reconstruct a metric-scale human pose, i.e.,

{P_1, …, P_N} = g_θ(I, K_1, …, K_M),

where K_i(t) is the time-domain audio feature heard by the i-th microphone and M is the number of microphones.

Time domain impulse response (pose kernel) as the audio feature

We use the time-domain impulse response as the audio feature input to our model. We call it the pose kernel K(t). Intuitively, the received signal is the sum of delayed and attenuated copies of the original source signal, where each reflection path contributes one copy. The pose kernel represents this reflection profile, that is, how the signals are delayed and attenuated. In this way, K(t) is independent of the source signal and is only a function of the environmental geometry (including the person's pose).

More formally, the received signal r(t) is the time-domain convolution of the pose kernel with the transmitted signal s(t):

r(t) = K(t) ∗ s(t).

Through a deconvolution operation, we can recover the pose kernel K(t).

In the presence of room multipath reflections, we first measure the empty-room impulse response K_room(t), and then subtract it from the full-room response K_full(t) to get the target-related pose kernel K(t) = K_full(t) − K_room(t).
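A minimal sketch of this recovery step, assuming a frequency-domain (Wiener-style regularized) deconvolution; the synthetic two-tap channel stands in for a real speaker/microphone recording:

```python
import numpy as np

def estimate_impulse_response(received, source, eps=1e-8):
    """Recover an impulse response by frequency-domain deconvolution:
    H(f) = R(f) S*(f) / (|S(f)|^2 + eps), regularized against near-zero bins."""
    n = len(received)
    S = np.fft.rfft(source, n)
    R = np.fft.rfft(received, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)

# Synthetic setup: random source and a channel with a direct path (tap 0)
# plus one body reflection (tap 40, half the amplitude).
rng = np.random.default_rng(0)
source = rng.standard_normal(256)
true_kernel = np.zeros(256)
true_kernel[0], true_kernel[40] = 1.0, 0.5
received = np.fft.irfft(np.fft.rfft(source) * np.fft.rfft(true_kernel), 256)

full = estimate_impulse_response(received, source)
# "Empty room" response here is just the direct path; subtracting it
# isolates the target-related reflection, as described above.
room = np.zeros(256)
room[0] = 1.0
pose_kernel = full - room
```

After subtraction, the remaining peak sits at tap 40 with amplitude ≈ 0.5, i.e., only the target-related reflection survives.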

Spatial encoding of pose kernel

The pose kernel is a superposition of impulse responses from the reflective points on the body surface S, i.e.,

K(t) = Σ_{x ∈ S} a(x) δ(t − τ(x)),

where δ(t − τ(x)) is the Dirac delta function at τ(x), τ(x) is the arrival time of the audio signal reflected by the point x on the body surface S, and a(x) is the reflection coefficient (gain) at x.

This means that each peak at time τ in the pose kernel K(t) corresponds to a potential reflector on an ellipsoid, where the total propagation time from the speaker to the reflector and finally to the microphone equals τ. In other words,

d(x_spk, x) + d(x, x_mic) = v · τ,

where d(x_spk, x) is the distance between the speaker and the reflector x, d(x, x_mic) is the distance between the reflector and the microphone, and v is the speed of sound. Figure 2 visualizes our pose kernel spatial encoding and how we deal with room multipath.
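The spatial encoding described above can be sketched as follows (our own illustration under simplified assumptions: co-located speaker and microphone, nearest-sample delays, no attenuation modeling):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def encode_pose_kernel(kernel, fs, points, spk, mic):
    """Spatially encode a time-domain pose kernel onto candidate 3D points.

    Each point x receives the kernel sample at the delay implied by the
    speaker -> x -> microphone path length, so all points on the same
    ellipsoid (constant total path) receive the same kernel value.
    kernel: (T,) impulse response; fs: sampling rate in Hz;
    points: (M, 3) candidate 3D locations; spk, mic: (3,) positions.
    """
    path = np.linalg.norm(points - spk, axis=1) + np.linalg.norm(points - mic, axis=1)
    delay = np.clip(np.round(path / SPEED_OF_SOUND * fs).astype(int),
                    0, len(kernel) - 1)
    return kernel[delay]

# Toy check: a reflector 1 m from a co-located speaker/microphone yields
# a kernel peak at the 2 m round-trip delay; only that point scores high.
fs = 48000
kernel = np.zeros(1024)
kernel[round(2.0 / SPEED_OF_SOUND * fs)] = 1.0
points = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
scores = encode_pose_kernel(kernel, fs, points, spk=np.zeros(3), mic=np.zeros(3))
```

Note that a single speaker-microphone pair cannot resolve where on the ellipsoid the reflector lies; combining several pairs, and the camera image, disambiguates this.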

Figure 2. Visualization of the spatial encoding (left column) of the time-domain impulse response (right column) through a sound simulation. Elliptical patterns can be observed in the spatial encoding, with focal points that coincide with the locations of the speaker and microphone. (a) The empty-room impulse response. (b) When an object is present, a strong impulse response reflected by the object surface can be observed; we show full responses that include the pose kernel. (c) When the object rotates, the kernel response changes. (d) We observe a delayed pose kernel due to translation.

Audio and vision fusion for metric scale 3D human pose estimation

After understanding the principles of signal propagation and spatial encoding, now let us revisit the 3D human pose estimation problem and see how we can fuse audio and vision information together using a learning based approach.

Figure 3 is an overview of our system. We assume there are multiple speakers and microphones, and one single-view camera, in the environment. We perform 2D keypoint detection on the single-view image to get the 2D joint locations. We also obtain the pose kernel from the audio signals. Our system takes the 2D joint locations as well as the pose kernel as input, and outputs the 3D human pose in metric scale.

Figure 3. System overview. (Left) Audio signals traverse in 3D across a room and are reflected by objects including human body surface. (Middle) Given the received audio signals, we compute a human impulse response called pose kernel by factoring out the room impulse response. (Right) We spatially encode the pose kernel in 3D space and combine it with the detected pose in an image using a 3D convolutional neural network to obtain the 3D metric reconstruction of human pose.

We calibrate the audio and vision sensors so that they share the same voxel space, and encode both kinds of information onto that space. Specifically, for audio information, we spatially encode the pose kernel as illustrated above. We visualize the audio encoding in the video below. For vision information, we use inverse projection to encode the 2D keypoints as rays in 3D space.
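The inverse projection step can be sketched with the standard pinhole model (our own illustration; the intrinsic values below are made up):

```python
import numpy as np

def keypoint_to_ray(u, v, K):
    """Back-project a detected 2D keypoint (u, v) to a unit ray through the
    camera center: every depth z along the ray gives a 3D candidate z * d,
    which is why a single image alone cannot fix the metric scale."""
    d = np.linalg.solve(K, np.array([u, v, 1.0]))
    return d / np.linalg.norm(d)

# Hypothetical intrinsics: 500 px focal length, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
ray = keypoint_to_ray(320.0, 240.0, K)  # the principal point maps to the optical axis
```

Voxels near this ray receive high visual evidence; intersecting it with the audio ellipsoids pins down the depth.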

Figure 4. Video illustration of the pose kernel spatial encoding. The encoding closely follows the person's movement.

We design a 3D convolutional neural network (3D CNN) that fuses the audio and vision encodings. Inspired by the design of the convolutional pose machine, the network is composed of six stages that increase the receptive field while avoiding vanishing gradients, as shown in Figure 5.

Figure 5. Network Architecture: We combine audio and visual features using a series of convolutions (audio features from multiple microphones are fused via max-pooling.). The audio visual features are convolved with a series of 3 × 3 × 3 convolutional kernels to predict the set of 3D heatmaps for joints.
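The max-pooling fusion of audio features from multiple microphones (mentioned in the caption above) can be illustrated in isolation; this is our own minimal numpy sketch, not the network code:

```python
import numpy as np

def fuse_microphone_features(features):
    """Fuse per-microphone 3D feature volumes with element-wise max-pooling.

    features: (num_mics, C, D, H, W). Taking the max over the microphone
    axis keeps, for each voxel, the strongest evidence from any microphone,
    and makes the fusion invariant to microphone count and ordering.
    """
    return features.max(axis=0)

# Two microphones, one feature channel, a 2x2x2 voxel grid.
feats = np.stack([
    np.full((1, 2, 2, 2), 0.1),   # mic 1: weak response everywhere
    np.full((1, 2, 2, 2), 0.7),   # mic 2: strong response everywhere
])
fused = fuse_microphone_features(feats)  # per-voxel maximum
```

The fused volume is then concatenated with the visual encoding and refined by the six convolutional stages.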

Results and visualization

We collect a new dataset, the PoseKernel dataset. It is composed of more than 6,000 frames of synchronized video and audio from six locations, including a living room, office, conference room, and laboratory. We train and test our network in different environments. Figure 6 visualizes some of our results. Our solution estimates 3D human pose well in diverse environments, but still fails under severe occlusion. We will focus on occlusion cases as a next step.

Figure 6. Qualitative results. We test our pose kernel lifting approach in diverse environments including (a) basement, (b) living room, (c) laboratory, etc. The participants are asked to perform daily activities such as sitting, squatting, and range of motion. (d) A failure case of our method: severe occlusion.

We compare our solution with a vision-only baseline. The scale for the baseline is obtained by assuming the person is 1.7 m in height. The video below shows the comparison between our solution and the baseline. It is evident that the baseline estimates the person's location with a severe offset, while ours estimates the absolute 3D location of the human joints very accurately.
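To see where the baseline's offset comes from, its scale guess can be sketched with the pinhole model (our own illustration; the 1.7 m assumption is the baseline's, while the focal length and pixel height below are made up):

```python
# Pinhole model: depth z = focal_px * real_height_m / pixel_height.
# The baseline fixes real_height at an assumed 1.7 m, so any deviation in
# the person's true height produces a proportional depth error.
ASSUMED_HEIGHT_M = 1.7

def depth_from_assumed_height(focal_px, person_height_px):
    """Baseline-style depth estimate from an assumed metric height."""
    return focal_px * ASSUMED_HEIGHT_M / person_height_px

z = depth_from_assumed_height(focal_px=500.0, person_height_px=250.0)  # 3.4 m
# A 1.9 m person at the same pixel height is actually at 3.8 m depth,
# a 0.4 m offset; this is the ambiguity the audio-based method avoids.
```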

Figure 7. Video visualization of our results. Left: raw video, middle: our estimation, right: baseline


Conclusion

We proposed a new method to reconstruct 3D human body pose with metric scale from a single image by leveraging audio signals. We hypothesized that the audio signals that traverse a 3D space are transformed by the human body pose through reflection, which allows us to recover the 3D metric-scale pose. We trained a 3D convolutional neural network that fuses the 2D pose detection from an image with the pose kernels to reconstruct the 3D metric-scale pose.

Our work has potential applications in smart homes, AR/VR, and robotics, where metric scale is important. With our solution, smart-home robotic assistants can better understand where the user is and what the user is doing, facilitating more seamless interaction. In the AR/VR domain, our work adds real-world geometry information to existing services, giving users a true sense of space. With more and more applications requiring a shared physical world between users and machines, we believe metric-scale scene understanding has great potential.

In future work, we plan to extend the current method to metric-scale 3D pose estimation for multiple people, environmental 3D reconstruction, and human-object interaction understanding using the fused audio-vision approach.

Link to the paper


Zhijian Yang, Xiaoran Fan, Volkan Isler, and Hyun Soo Park. "PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound." arXiv preprint arXiv:2112.00216 (2021). (To appear in CVPR 2022)