In advanced applications such as autonomous driving and surveillance systems, different vision tasks need to work together [1]-[3]. Vision task models usually consist of two main parts: a feature extractor and a prediction head. Most of the computation occurs in the feature extractor, so sharing the feature extractor among tasks, with each task keeping its own task-specific prediction head, is a common design in multi-task learning. This structure significantly speeds up inference and also produces more general visual representations.
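A minimal sketch of this shared-extractor design is shown below, assuming PyTorch; the backbone, layer sizes, and task heads are illustrative placeholders, not the architecture used in this paper.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Hard parameter sharing: one shared feature extractor, one head per task."""
    def __init__(self, feature_dim=256):
        super().__init__()
        # Shared feature extractor: where most of the computation lives.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feature_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Lightweight task-specific prediction heads reuse the shared features.
        self.heads = nn.ModuleDict({
            "segmentation": nn.Linear(feature_dim, 21),  # e.g., 21 classes
            "depth": nn.Linear(feature_dim, 1),
        })

    def forward(self, x):
        features = self.backbone(x)  # computed once per image
        return {name: head(features) for name, head in self.heads.items()}

model = MultiTaskModel()
outputs = model(torch.randn(2, 3, 64, 64))  # one forward pass, two task outputs
```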
However, multi-task learning faces two major challenges: balancing the training progress of tasks with different natures, and the high cost of annotating labels for every task across a large number of images.
Several methods have been introduced to balance training progress. Loss scale-based methods [4]-[6] aim to align the loss scales across tasks. However, simply matching the scales may not sufficiently balance the gradients, because distinct loss functions have different derivatives. Gradient-based methods [7]-[10] directly adjust the back-propagated gradients, but they do not always ensure balanced training progress because task difficulty varies: easier tasks converge rapidly, whereas more challenging ones progress more slowly even when gradients of the same magnitude are delivered.
In this paper, we propose a novel multi-task loss that effectively balances the training progress of various tasks without relying on losses or gradients. Our approach controls training progress based on accuracy achievement, defined as the ratio of current accuracy to single-task accuracy. Additionally, the proposed loss is composed as a weighted geometric mean of the individual task losses, instead of the conventional weighted sum, to address training imbalances caused by the different derivatives and scales of loss functions.
To tackle the high cost of label annotation, we suggest constructing a large-scale, partially annotated multi-task dataset by combining multiple task-specific datasets. This larger dataset can complement the smaller, fully annotated union dataset. However, inconsistency in the number of labels across tasks may worsen the imbalance in training progress. Our proposed loss effectively mitigates the variance in loss caused by absent labels in partially annotated datasets. We empirically validated that our method achieves superior multi-task accuracy on both union and partially annotated datasets compared with other benchmark losses.
The main contributions of this paper are as follows:
- An achievement-based multi-task loss that balances training progress using accuracy achievement, the ratio of current accuracy to single-task accuracy, without relying on losses or gradients.
- A weighted geometric mean formulation of the multi-task loss that is invariant to the scales of the individual task losses.
- A large-scale, partially annotated multi-task dataset built by merging task-specific datasets, validated against a fully annotated union dataset.
Multi-task loss: Most multi-task losses are represented as a weighted sum of the task losses:

$$L_{total} = \sum_{t=1}^{N_T} w_t L_t,$$
where $w_t$ and $L_t$ denote the task weight and task loss of the $t$-th task, $N_T$ is the number of tasks, and $L_{total}$ is the total multi-task loss. The task weights directly affect the accuracy of the corresponding tasks, so finding the best weight for each task is crucial, yet searching for them manually is prohibitively expensive. Thus, extensive research has been conducted on determining the task weights automatically.
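As a rough illustration, the weighted sum above can be computed as follows; the task names, loss values, and weights are placeholders, not values from our experiments.

```python
import torch

def weighted_sum_loss(task_losses, task_weights):
    """Conventional multi-task loss: L_total = sum_t w_t * L_t."""
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

losses = {"seg": torch.tensor(0.8), "depth": torch.tensor(2.5)}
weights = {"seg": 1.0, "depth": 0.5}        # hand-tuning such weights is expensive
total = weighted_sum_loss(losses, weights)  # tensor(2.0500)
```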
RLW [4] chooses random task weights, while DWA [5] modulates task weights so that the task losses decrease evenly. By simply defining the multi-task loss as the geometric mean of the task losses, GLS [6] effectively addresses scale variance. GradNorm [9] and IMTL [10] adjust the task weights to equalize the task gradients at the last shared layer. However, as mentioned above, even if gradients of the same magnitude are delivered for all tasks, the training speed may vary depending on task difficulty.
DTP [11], an accuracy-based method, introduces the concept of task difficulty to multi-task learning and estimates it from the current task accuracy. Regarding tasks with low current accuracy as difficult, DTP increases their task weights to expedite training, and vice versa. However, if an easy and a difficult task happen to have the same current accuracy, DTP treats their training progress as identical, regardless of how much each task's accuracy could still improve.
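Below is a hedged sketch of this focal-style prioritization as we read DTP [11]; the exact formula and constants are our paraphrase, not the authors' implementation.

```python
import math

def dtp_weight(current_accuracy, gamma=2.0):
    # Focal-style weighting on current accuracy: low accuracy -> large weight.
    return -((1.0 - current_accuracy) ** gamma) * math.log(current_accuracy)

# An easy task and a hard task with the same current accuracy receive
# identical weights, which is the limitation discussed above.
print(dtp_weight(0.6), dtp_weight(0.6))  # same value for both tasks
```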
Figure 1. Our proposed achievement-based multi-task loss treats single-task accuracy as task potential, speeding up the progression of tasks with low achievement while slowing down early-converged ones.
Achievement-based multi-task loss: The proposed multi-task loss is inspired by focal loss [12], which was introduced to resolve class imbalance in object detection. Images generally contain numerous background samples, while the foreground objects we aim to detect are few. As a result, most of the detection loss comes from the easily detected background, even though the hard-to-detect foreground objects are the critical ones. To focus on objects, focal loss modulates cross-entropy with the focal weighting term $(1-p_c)^{\gamma}$:

$$L_{focal} = -(1-p_c)^{\gamma} \log(p_c),$$
where $\gamma$ is the focusing factor and $p_c$ denotes the predicted probability of the foreground. Through focal weighting, focal loss diminishes the contribution of easy samples while enhancing the influence of difficult ones.
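A minimal sketch of this weighting, assuming a binary foreground/background setting with $p_c$ already computed:

```python
import torch

def focal_loss(p_c, gamma=2.0, eps=1e-8):
    # (1 - p_c)^gamma down-weights easy samples whose p_c is close to 1.
    return -((1.0 - p_c) ** gamma) * torch.log(p_c + eps)

easy = focal_loss(torch.tensor(0.95))  # near zero: easy sample contributes little
hard = focal_loss(torch.tensor(0.10))  # large: hard sample dominates the loss
```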
We introduce focal weighting as task weights for multi-task learning to address the imbalance of training progress across tasks. We define task achievement as the ratio of current accuracy to single-task accuracy and use it as follows:

$$w_t = \left(1 - \frac{acc_t}{p_t}\right)^{\gamma},$$
where $acc_t$ denotes the current accuracy of task $t$ and $p_t$ is the task potential, defined as the single-task accuracy. Like focal loss, the achievement-based task weight pushes tasks with low achievement to train faster while slowing down the early-converged ones.
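The sketch below illustrates the weight under these definitions; the accuracy values are invented for illustration only.

```python
def achievement_weight(current_accuracy, single_task_accuracy, gamma=2.0):
    # achievement = acc_t / p_t; weight follows the focal form (1 - achievement)^gamma.
    achievement = current_accuracy / single_task_accuracy
    return (1.0 - achievement) ** gamma

# Two tasks with the same current accuracy but different potentials:
w_low_potential = achievement_weight(0.60, 0.65)   # nearly converged -> small weight
w_high_potential = achievement_weight(0.60, 0.90)  # far from potential -> large weight
```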
Finally, to resolve the scale imbalance of individual task losses, the proposed achievement-based multi-task loss employs a weighted geometric mean instead of the conventional weighted sum:

$$L_{total} = \prod_{t=1}^{N_T} L_t^{w_t}.$$
Unlike the weighted sum, which is directly influenced by loss magnitudes, the weighted geometric mean averages ratios rather than absolute values and is therefore scale-invariant.
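A sketch of this combination follows, computed in log space for numerical stability; working in log space is our implementation choice, not necessarily the paper's.

```python
import torch

def weighted_geometric_mean_loss(task_losses, task_weights, eps=1e-8):
    # L_total = prod_t L_t^{w_t} = exp(sum_t w_t * log L_t).
    log_terms = [task_weights[t] * torch.log(task_losses[t] + eps)
                 for t in task_losses]
    return torch.exp(torch.stack(log_terms).sum())

losses = {"seg": torch.tensor(0.8), "depth": torch.tensor(2.5)}
weights = {"seg": 0.2, "depth": 0.9}  # e.g., achievement-based weights
total = weighted_geometric_mean_loss(losses, weights)
# Rescaling any task loss by a constant only shifts its log term additively,
# so no task can dominate purely through its loss scale.
```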
The accuracy and training speed of the proposed multi-task loss were compared with those of recent multi-task losses on both union and partially annotated datasets: scale-based losses [4]-[6], gradient-based losses [7]-[10], and an accuracy-based loss [11]. As a baseline, we simply summed all task losses with equal weights (Uniform).
Figure 2. The proposed multi-task loss achieved multi-task accuracy comparable to the state-of-the-art loss (IMTL-G) without incurring training-time overhead. “Achievement (%)” on the y-axis represents the average accuracy drop compared with single-task baselines.
Comparison on the union dataset (NYU v2): We evaluated the proposed and benchmark multi-task losses on the NYU v2 dataset for three tasks: semantic segmentation, depth estimation, and surface normal estimation. NYU v2 provides fully annotated labels for all three tasks on 795 training images. We used DeepLabV3 [13] as the baseline architecture and a dilated ResNet-50 [14] as the feature extractor.
RLW [4] and DWA [5], both scale-based methods, showed accuracy similar to the multi-task baseline. GradNorm [9] and IMTL-G [10] achieved similar or better accuracy than the scale-based methods, but incur training-time overhead for computing task gradients. Our proposed loss effectively balanced the training progress of the tasks and, as a result, achieved the best multi-task accuracy without slowing training.
Table 1. Our main ideas, achievement-based task weighting and weighted geometric mean, both contribute to the improvement of multi-task accuracy.
Effectiveness: We also evaluated the effectiveness of the proposed achievement-based weight and the weighted geometric mean (Table 1). Compared with DTP, which does not consider task potential, the achievement-based weight improved multi-task accuracy from 0.3660 to 0.3745. The weighted geometric mean further improved it to 0.3847.
Figure 3. To address the high cost of label annotation, we propose constructing a large-scale multi-task dataset by merging task-specific datasets.
Comparison on the partially annotated dataset (VOC+NYU): We constructed a large-scale, partially annotated multi-task dataset consisting of two task-specific datasets (PASCAL VOC [15] and NYU depth [16]), as shown in Figure 3. It contains 39,446 training images in total: 15,215 (38.57%) with object detection labels, 10,477 (26.56%) with semantic segmentation labels, and 24,231 (61.43%) with depth estimation labels. Some images from PASCAL VOC have labels for both detection and segmentation. We used EfficientDet [17] as the baseline architecture to address object detection, and EfficientNet-V2-small [18] as the feature extractor.
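One plausible way to train on such partially annotated batches is to mask out each task's loss for images that lack that task's label; the sketch below is our illustration, not the paper's actual training code.

```python
import torch

def masked_task_loss(per_sample_losses, label_mask):
    """per_sample_losses: (B,) loss per image for one task;
    label_mask: (B,) bool, True where this task's label exists."""
    if label_mask.any():
        return per_sample_losses[label_mask].mean()
    return torch.tensor(0.0)  # no labels for this task in the batch

losses = torch.tensor([0.9, 1.2, 0.4, 2.0])
mask = torch.tensor([True, False, True, False])  # e.g., only VOC images have boxes
print(masked_task_loss(losses, mask))  # tensor(0.6500)
```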
Figure 4. Our proposed method outperformed all benchmarks on the partially annotated dataset because the achievement, the ratio of current to single-task accuracy, is not disturbed by the imbalance in task labels. As above, “Achievement (%)” on the y-axis represents the average accuracy drop compared with single-task baselines.
Employing the achievement-based task weights, the proposed multi-task loss improved further and achieved the best multi-task accuracy without requiring additional training time. The results also demonstrate that a large-scale, partially annotated dataset enables a multi-task model to learn more general and powerful representations, surpassing the accuracy of single-task models.
In summary, we addressed the high cost of annotating labels for all tasks by constructing a large-scale, partially annotated multi-task dataset that integrates task-specific datasets. However, the disparity in the number of task labels may escalate the imbalance in training progress among tasks. We proposed a novel achievement-based multi-task loss to balance training across tasks of different difficulty. Furthermore, we formulated the proposed loss as a weighted geometric mean to capture the variations of task losses regardless of their magnitudes, effectively preventing any task from dominating. The proposed loss achieved superior multi-task accuracy on both the traditional fully annotated multi-task dataset and the newly composed partially annotated dataset.
[1] Sauhaarda Chowdhuri, Tushar Pankaj, and Karl Zipser. Multinet: Multi-modal multi-task learning for autonomous driving. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1496–1504. IEEE, 2019.
[2] Keishi Ishihara, Anssi Kanervisto, Jun Miura, and Ville Hautamaki. Multi-task learning with attention for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2902–2911, 2021.
[3] Keval Doshi and Yasin Yilmaz. Multi-task learning for video surveillance with limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3889–3899, 2022.
[4] Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. Transactions on Machine Learning Research, 2022.
[5] Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1871–1880, 2019.
[6] Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A Rawashdeh. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) Workshops, 2019.
[7] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
[8] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021.
[9] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR, 2018.
[10] Liyang Liu, Yi Li, Zhanghui Kuang, Jing-Hao Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. In International Conference on Learning Representations, 2021.
[11] Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 270–287, 2018.
[12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] Mark Everingham, SM Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[16] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 746–760. Springer, 2012.
[17] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790, 2020.
[18] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021.