Human gesture interaction enables more natural and intuitive forms of communication. Unlike traditional input methods such as keyboards or touchscreens, gestures let users express intentions and commands through movements and poses that are closer to everyday human interaction. Moreover, gesture-based interaction is inclusive, offering accessibility opportunities for individuals with physical disabilities who may have difficulty with more traditional input methods.
Hand gesture recognition is becoming an increasingly important area of research, especially in the development of human-computer interaction systems, human-robot interaction systems, sign language interpretation, augmented reality (AR), virtual reality (VR) [1], and remote control applications [2]. Existing methods can be grouped into marker-based or glove-based approaches [3], [4], depth sensor-based approaches [5], [6], and vision-based approaches [7]–[9].
In the Visual Intelligence Team of SRUKR, we developed Gesture Interaction Solutions for TV [15] based on a vision-based approach. Our solution allows users to control the TV using a camera connected to it.
During the development of our AI solution for gesture recognition, we prioritized the following requirements: real-time performance on resource-constrained devices, accurate recognition across diverse lighting and distance conditions, and rejection of unintentional gestures.
This blog post explores the architecture and operation flow of an on-device hand gesture recognition solution. The proposed neural network (NN) architecture solves a multi-task problem to improve general hand understanding. We also propose a novel multi-cascade architecture that improves 2D joint accuracy and then resolves further tasks such as kinematic model construction and segmentation. The model can be used as a feature extractor for other tasks; for example, gesture recognition can be resolved separately using the model output. Additionally, we compare the performance of the proposed model with competitors [10]–[12] on our hand gesture dataset.
A lightweight, on-device, real-time solution for hand gesture recognition is presented. Our solution contains the following components: body detector, body skeleton model, hand tracking, hand skeleton model, and gesture classification (Fig. 1).
Figure 1. Operation flow
The body detector and body skeleton model components are required for hand localization. A special lightweight tracker tracks hand positions, especially during fast movements. The optional body skeleton is also used to track hand positions in cases of very fast movement, when the hand tracker or hand skeleton is inefficient and fails. The main component is the hand skeleton model, which was created for full hand understanding.
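As a rough illustration of how these components interact, the Python sketch below chains them into a per-frame loop. All class and method names here are hypothetical placeholders chosen for the example, not the actual implementation.

```python
# Minimal sketch of the operation flow in Fig. 1; all component interfaces
# below are hypothetical placeholders, not the actual implementation.

class GesturePipeline:
    def __init__(self, body_detector, body_skeleton, hand_tracker,
                 hand_skeleton, gesture_classifier):
        self.body_detector = body_detector            # localizes the person
        self.body_skeleton = body_skeleton            # coarse hand localization
        self.hand_tracker = hand_tracker              # lightweight tracker between frames
        self.hand_skeleton = hand_skeleton            # main multi-task hand model
        self.gesture_classifier = gesture_classifier  # small fully-connected NN

    def process_frame(self, frame):
        # Re-detect the hand only when the tracker has no valid position,
        # e.g. on the first frame or after very fast movement.
        hand_roi = self.hand_tracker.predict(frame)
        if hand_roi is None:
            body = self.body_detector.detect(frame)
            joints = self.body_skeleton.estimate(frame, body)
            hand_roi = self.hand_tracker.init_from_body(joints)

        hand = self.hand_skeleton.run(frame, hand_roi)   # joints, mask, flags
        if not hand.is_valid:                            # hand verification failed
            self.hand_tracker.reset()
            return None
        self.hand_tracker.update(hand.roi)
        return self.gesture_classifier.predict(hand.features)
```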
Fig. 2 presents the model's basic structure, consisting of the hand skeleton model block (a self-developed base feature extractor and 2D hand skeleton joint detection cascades), a kinematic model with a verification block, a hand-with-objects classification block, and a gesture classification block.
Figure 2. Proposed hand skeleton model
The main part of the proposed model is the base feature extractor, which is reused by the different blocks. Detection of 2D hand joints (Fig. 3) is implemented by CNN-based joint detection. Multiple cascades are used, where each subsequent cascade refines the output of the previous one. The number of cascades is configurable: each additional cascade decreases the 2D error but increases processing time.
Figure 3. Hand skeleton joints
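The multi-cascade refinement can be sketched as follows: each stage receives the shared base features together with the previous stage's joint heatmaps and predicts a residual correction. The PyTorch-style snippet below is only a simplified sketch; the layer sizes and the 21-joint assumption are ours, not the production architecture.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 21  # assumption: one heatmap per hand joint (Fig. 3)

class RefinementCascade(nn.Module):
    """One CNN stage that refines the heatmaps of the previous stage."""
    def __init__(self, feat_channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + NUM_JOINTS, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, NUM_JOINTS, 1),
        )

    def forward(self, features, prev_heatmaps):
        x = torch.cat([features, prev_heatmaps], dim=1)
        # Residual refinement: each stage corrects the previous estimate.
        return prev_heatmaps + self.refine(x)

def run_cascades(features, initial_heatmaps, stages):
    """More stages -> lower 2D error but higher processing time."""
    heatmaps = initial_heatmaps
    for stage in stages:
        heatmaps = stage(features, heatmaps)
    return heatmaps
```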
The proposed model is also trained to return a segmentation mask. The segmentation block improves the quality of segmentation based on an internal 24x24 pixel (px) segmentation mask and extends the segmentation to 96x96 px resolution. A semi-automatic labeling of the dataset was prepared. In the first step, a graph cut [13] algorithm was applied to produce hand segmentation from the labeled hand skeleton joints. The next step was to filter out bad cases manually. After filtering, only simple cases (good background, good lighting, simple poses) remained. Less than 1% of the samples have segmentation labels. Adding segmentation to the model increases the accuracy of the hand skeleton, especially in the case of a complicated background (Fig. 4).
Figure 4. Hand segmentation results examples
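For illustration, the semi-automatic labeling step described above could be approximated as follows, using OpenCV's grabCut as a stand-in for the graph cut algorithm of [13]; the seeding radius and padding values are assumptions.

```python
import cv2
import numpy as np

def hand_mask_from_joints(image_bgr, joints_2d, radius=6, pad=20):
    """Seed a graph-cut style segmentation (OpenCV grabCut here, as a
    stand-in for [13]) with the labeled 2D hand joints."""
    h, w = image_bgr.shape[:2]
    mask = np.full((h, w), cv2.GC_BGD, dtype=np.uint8)

    # Probable foreground: padded bounding box around all joints.
    x0, y0 = np.maximum(joints_2d.min(axis=0).astype(int) - pad, 0)
    x1, y1 = np.minimum(joints_2d.max(axis=0).astype(int) + pad, [w - 1, h - 1])
    mask[y0:y1, x0:x1] = cv2.GC_PR_FGD

    # Certain foreground: small disks around each labeled joint.
    for x, y in joints_2d.astype(int):
        cv2.circle(mask, (int(x), int(y)), radius, cv2.GC_FGD, -1)

    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```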
An additional block computes the kinematic hand model, which contains 39 parameters (including rotations in each joint, finger sizes, hand side, position, etc.).
Using this kinematic model, 2D and 3D hand joints can be computed. These joints will be close to the 2D joints from the heatmaps. However, the kinematic model is constrained to physically possible hand configurations and therefore has restrictions on possible hand positions.
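To make the idea concrete, the sketch below shows forward kinematics for a single finger chain and a pinhole projection back to 2D. The real model's 39-parameter layout, bone hierarchy, and camera model are not reproduced here; this is only an illustration of the principle.

```python
import numpy as np

def rot_x(angle):
    """Rotation about the x-axis (used here as a simple flexion rotation)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def finger_joints_3d(base_3d, bone_lengths, flexion_angles):
    """Chain rotations along one finger: each joint adds a flexion rotation
    and a translation along the bone (illustrative, not the real model)."""
    p = np.asarray(base_3d, dtype=float)
    joints = [p]
    R = np.eye(3)
    for length, angle in zip(bone_lengths, flexion_angles):
        R = R @ rot_x(angle)
        p = p + R @ np.array([0.0, 0.0, length])
        joints.append(p)
    return np.stack(joints)

def project_to_2d(joints_3d, focal, cx, cy):
    """Pinhole projection used to compare kinematic joints with 2D labels."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    return np.stack([focal * x / z + cx, focal * y / z + cy], axis=1)
```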
The output of the hand verification block classifies whether a hand is present in the image. Sometimes the user moves the hand outside the frame, or the tracker loses the hand. In these cases, we need to stop tracking and find the hand again.
The block also provides an additional output that classifies a hand with an object versus a hand without an object. This allows rejecting false positives when a user grabs an object and the resulting hand pose looks like a gesture.
For gesture classification, we use a separate additional NN with a simple fully-connected architecture. It takes 2D and 3D skeleton joints, skeleton joint scores, 3D joint angles, and Euler angles of the hand rotation in 3D as 740 input values. The output layer contains one score per gesture class to be classified. Since the architecture contains no convolutions and has few inputs and outputs, the model is very light for inference on embedded devices and takes less than one millisecond (ms) on a Samsung TV CPU. The inputs are normalised relative to the wrist joint (see Fig. 3), making the model robust to shifts and scale.
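A minimal sketch of such a classifier is shown below, assuming PyTorch; the hidden layer sizes and the normalization helper are illustrative assumptions, while the 740 inputs and nine output classes follow the description above.

```python
import torch
import torch.nn as nn

NUM_FEATURES = 740   # 2D/3D joints, joint scores, 3D joint angles, hand Euler angles
NUM_GESTURES = 9     # gestures in Fig. 5

class GestureClassifier(nn.Module):
    """Small fully-connected classifier; hidden sizes are assumptions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_FEATURES, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, NUM_GESTURES),
        )

    def forward(self, features):
        return self.net(features)   # one score per gesture class

def normalize_joints(joints, wrist, hand_size):
    """Normalize joint coordinates relative to the wrist joint (Fig. 3)
    so the classifier is robust to shifts and scale."""
    return (joints - wrist) / hand_size
```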
A small dataset is enough to train gesture classification. Fig. 5 presents nine different gestures supported by our model. The average model accuracy for these gestures exceeds a 98.4% detection rate with 0.6% false positives.
Figure 5. Recognized gestures
By analyzing the trajectory, dynamic gestures such as counterclockwise/clockwise rotations, swipes, and shakes (Wave) can be detected. To detect the movement trajectory, the hand tracking history is used, and an additional model classifies dynamic gestures. The Wave gesture is detected when the Palm gesture is recognized while the hand shakes left and right; dynamic trajectory analysis has been added to recognize this gesture.
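A toy version of the Wave trajectory check might look like the following; the window size, amplitude, and reversal thresholds are assumptions for illustration, not the actual dynamic gesture model.

```python
from collections import deque

class WaveDetector:
    """Toy dynamic-gesture check: a Wave is reported when the Palm gesture is
    held and the hand's horizontal position reverses direction several times.
    Window size and thresholds are illustrative assumptions."""
    def __init__(self, window=30, min_reversals=3, min_amplitude=0.05):
        self.history = deque(maxlen=window)   # recent (x, is_palm) samples
        self.min_reversals = min_reversals
        self.min_amplitude = min_amplitude    # in normalized image coordinates

    def update(self, hand_x, is_palm):
        self.history.append((hand_x, is_palm))
        if len(self.history) < 5 or not all(p for _, p in self.history):
            return False
        xs = [x for x, _ in self.history]
        if max(xs) - min(xs) < self.min_amplitude:
            return False
        # Count sign changes of the horizontal velocity.
        deltas = [b - a for a, b in zip(xs, xs[1:]) if abs(b - a) > 1e-4]
        reversals = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
        return reversals >= self.min_reversals
```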
For training, we use our own dataset collected in a special collection studio with controllable lighting emulation. This allows collecting different types of datasets depending on distance, camera type, and lighting conditions. Additionally, we prepared a multi-device tool to synchronize dataset collection from multiple webcams and smartphones.
The dataset for the hand skeleton model contains more than 2.8 million images (∼2745.7 GB) for training/validation and 190,000 images (∼76 GB) for testing. Each image is accompanied by a label containing 2D and/or 3D annotations.
Expanding the dataset increases accuracy and resolves issues for specific cases. Within the hand skeleton model dataset, over 600,000 images (∼182.53 GB) were collected in low-light conditions of about five lux. The illuminance was measured with a luxmeter by pointing the photocell toward the palm, thus measuring the light falling perpendicular to the palm. The camera boosted the brightness, so the photos look bright, but they contain the specific noise of low-light capture, and training on these images improved results in poor lighting.
The dataset for the gesture classification model contains 90,000 images (∼2.3 GB) for training/validation covering all nine gestures (Fig. 5) with various hand poses. This dataset doesn't require hand skeleton labeling. The proposed model can also be trained on other datasets and with fewer samples; a large dataset helps cover specific cases with extra rotations, unusual lighting, or specific poses.
2D and 3D ground truth labels (Fig. 6) were used to train the hand skeleton model. Several cameras were used simultaneously to prepare the 3D ground truth labels. To train the multi-task model, several loss functions were used simultaneously with different weights to set task priorities. These weights were selected experimentally.
Figure 6. Training scheme for hand skeleton model
The main loss for training the kinematic model is L2, based on the 2D joints. In the kinematic model block, 38 parameters from the NN output are converted into the kinematic model and 3D joints. The 3D joints are projected to 2D and compared with the 2D ground truth by an additional loss function, L3. The L3 loss is not mandatory, but it helps in cases where the 2D joint projections are similar but the gestures differ. Several experiments show that the L2 loss is the dominant one: if we remove the 3D loss L3, the results remain almost the same.
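A sketch of how such a weighted multi-task objective can be combined is shown below; the weights, tensor names, and the projection helper are illustrative assumptions, not the exact losses used in training (as noted above, the real weights were selected experimentally).

```python
import torch
import torch.nn.functional as F

def project_to_2d_torch(joints_3d, cam):
    """Hypothetical pinhole projection; `cam` holds focal length and principal point."""
    x, y, z = joints_3d[..., 0], joints_3d[..., 1], joints_3d[..., 2]
    return torch.stack([cam["f"] * x / z + cam["cx"],
                        cam["f"] * y / z + cam["cy"]], dim=-1)

def total_loss(pred, target, w2d=1.0, w3d=0.1, wseg=0.5, wcls=0.2):
    """Weighted multi-task loss sketch; weights are placeholders."""
    # L2: main loss on the 2D joints.
    l2 = F.mse_loss(pred["joints_2d"], target["joints_2d"])
    # L3: project the kinematic-model 3D joints to 2D and compare with the
    # same 2D ground truth (optional, helps ambiguous projections).
    l3 = F.mse_loss(project_to_2d_torch(pred["joints_3d"], target["camera"]),
                    target["joints_2d"])
    lseg = F.binary_cross_entropy_with_logits(pred["mask"], target["mask"])
    lcls = F.cross_entropy(pred["hand_logits"], target["hand_labels"])
    return w2d * l2 + w3d * l3 + wseg * lseg + wcls * lcls
```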
During training, several augmentations were applied to generalize the data and improve performance in real-world cases: pixel multiplication, scale change, blur, a special cutout [14], random shift, rotation of the image around its center by a random angle, and cropping.
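A simplified version of such an augmentation pipeline, with assumed parameter ranges, could look like this; note that the geometric transforms would also have to be applied to the joint labels.

```python
import cv2
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Illustrative augmentations from the text; ranges are assumptions."""
    h, w = image.shape[:2]

    # Pixel multiplication (brightness) and blur.
    image = np.clip(image.astype(np.float32) * rng.uniform(0.6, 1.4), 0, 255).astype(np.uint8)
    if rng.random() < 0.3:
        image = cv2.GaussianBlur(image, (5, 5), sigmaX=rng.uniform(0.5, 3.0))

    # Random rotation around the image center, scale change and shift.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-30, 30), rng.uniform(0.8, 1.2))
    M[:, 2] += rng.uniform(-0.05, 0.05, size=2) * (w, h)
    image = cv2.warpAffine(image, M, (w, h))

    # Cutout [14]: zero out a random square patch.
    s = int(0.2 * min(h, w))
    y, x = rng.integers(0, h - s), rng.integers(0, w - s)
    image[y:y + s, x:x + s] = 0
    return image
```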
We propose reusing the hand skeleton features from the kinematic model to train the gesture classification model. The full model is used during the training step, but the unnecessary blocks (marked in grey in Fig. 2) are removed during conversion.
The PCK@0.2 metric was used for performance evaluation. PCK@0.2 is the percentage of correct joints with a threshold of 20% of the hand width: a joint is correct if its distance from the ground truth is less than 20% of the hand width (see Fig. 3). The optimal conditions for the model are a bright environment over 150 lux with a 90-degree field-of-view camera at a distance of 2.5 m to 4 m. The poor conditions are a dark environment below 20 lux with a wide-angle camera at a long distance (over 4 m).
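The metric itself is straightforward to compute; a minimal sketch:

```python
import numpy as np

def pck(pred_joints, gt_joints, hand_width, alpha=0.2):
    """PCK@alpha: fraction of joints whose distance to the ground truth is
    below alpha * hand width (alpha = 0.2 gives PCK@0.2)."""
    dists = np.linalg.norm(pred_joints - gt_joints, axis=-1)
    return float(np.mean(dists < alpha * hand_width))
```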
An evaluation and comparison with the MediaPipe [10], [11] model were performed on a PC (Linux OS, CPU inference); see Table 1.
Table 1. Comparison with MediaPipe
The MediaPipe model shows similar results in typical cases, where the user shows different gestures (Fig. 5) at a distance of less than 1 m under normal lighting conditions; the test set for this case contains 1,288 samples. In low light (under 15 lux) at 1–2 m, our solution shows better results. That dataset contains simpler gestures such as Palm and Fist, but with higher variation of the base hand rotation, and its size is 1,840 images. Our solution performs much better in low-light conditions. Fig. 7 shows several failure cases of the MediaPipe model: a wrong wrist joint in the upper image, and fingers that are not bent enough in the lower image.
Figure 7. Comparison with MediaPipe in low light conditions
Model inference time is 15 ms on the TV, 22 ms on a Samsung Galaxy S20 GPU, 8 ms on a Qualcomm Snapdragon 865 DSP, and 7 ms on an Exynos 990 NPU.
The robustness of our model is verified at the test step. For testing, we use images captured under varied conditions: various hand poses, frontal and egocentric views, different hand rotations and gestures, lighting from 3 to 300 lux, indoor locations with different light directions and random light source configurations, distances from 1 m to 5 m, and image blur (no blur or sigma 3).
A novel neural network model was introduced for robust and efficient gesture recognition. Our model addresses a number of key challenges in the field, including the need for accurate recognition across diverse lighting and distance conditions and the rejection of unintentional gestures, while maintaining real-time performance on resource-constrained devices. Our model outperforms existing approaches in terms of the PCK@0.2 metric in scenarios involving complex gestures and poor lighting. Furthermore, our approach doesn't require specialized hardware, making it accessible for deployment on a wide range of consumer devices, including smartphones, TVs, and AR/VR systems.
The full paper is available at: https://ieeexplore.ieee.org/document/10888950
[1] V. Olshevsky, I. Bondarets, O. Trunov, and A. Shcherbina, “Realistic occlusion of virtual objects using three-dimensional hand model,” in HCI International 2021 - Posters, C. Stephanidis, M. Antona, and S. Ntoa, Eds. Cham: Springer International Publishing, 2021, pp. 295–301.
[2] Samsung. (2022) How to use palm gesture to take selfie on Samsung mobile? [Online]. Available: https://www.samsung.com/sg/support/mobile-devices/how-to-use-palm-gesture-to-take-selfie-on-samsung-mobile-device/
[3] D. Sturman and D. Zeltzer, “A survey of glove-based input,” IEEE Computer Graphics and Applications, vol. 14, no. 1, pp. 30–39, 1994.
[4] L. Dipietro, A. Sabatini, and P. Dario, “A survey of glove-based systems and their applications,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, pp. 461–482, 2008.
[5] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in CVPR 2011, 2011, pp. 1297–1304.
[6] H. Tang, Q. Wang, and H. Chen, “Research on 3D human pose estimation using RGBD camera,” 2019, pp. 538–541.
[7] P. Panteleris, I. Oikonomidis, and A. Argyros, “Using a single RGB frame for real time 3D hand pose estimation in the wild,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 436–445.
[8] X. Zhang, H. Huang, J. Tan, H. Xu, C. Yang, G. Peng, L. Wang, and J. Liu, “Hand image understanding via deep multi-task learning,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11261–11272.
[9] N. Santavas, I. Kansizoglou, L. Bampis, E. Karakasis, and A. Gasteratos, “Attention! A lightweight 2D hand pose estimation approach,” IEEE Sensors Journal, vol. 21, no. 10, pp. 11488–11496, 2021.
[10] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg, and M. Grundmann, “MediaPipe: A framework for perceiving and processing reality,” in Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019, 2019. [Online]. Available: https://mixedreality.cs.cornell.edu/s/NewTitle_May1_MediaPipe_CVPR_CV4ARVR_Workshop_2019.pdf
[11] Google Research, “MediaPipe,” 2019. [Online]. Available: https://github.com/google-ai-edge/mediapipe
[12] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, ser. FAT* ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 220–229. [Online]. Available: https://doi.org/10.1145/3287560.3287596
[13] S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[14] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” 2017. [Online]. Available: https://arxiv.org/abs/1708.04552
[15] Samsung, “Gesture Interaction.” [Online]. Available: https://www.samsung.com/levant/support/tv-audio-video/control-your-samsung-tv-with-gesture-interaction/