Samsung's New Breakthroughs in Speech and Image Signal Processing

Seven papers by Samsung R&D Institute China - Beijing (SRC-B) have recently been accepted to the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023, the IEEE Signal Processing Society’s flagship conference on signal processing and its applications. In recent years, the annual number of participants has exceeded 3,000. The conference covers audio and acoustic signal processing; image, video, and multi-dimensional signal processing; and signal processing for the Internet of Things.

Researchers from Samsung R&D Institute China-Beijing (SRC-B)

Read on to learn more about SRC-B's papers accepted to ICASSP 2023.

1. "Context-Aware Face Clustering with Graph Convolutional Networks"

Face clustering is an essential tool in face-related algorithm research and is widely used in album management and unlabeled data management. Recent works that use Graph Convolutional Networks (GCNs) to extract global features have achieved impressive results on the face clustering task. However, these works share a main drawback: they ignore the influence of local features. In this paper, we propose a Context-Aware Graph Convolutional Network (CAGCN) that explicitly considers both global and local information. The CAGCN module, composed of GCN-Local and GCN-Global, enhances the expressive ability of features by extracting and fusing global and local features, further improving the accuracy and robustness of clustering. In addition, we propose a deduplication algorithm based on Jaccard similarity to improve the efficiency of face clustering: it saves 25% of the running time with almost no performance degradation.

Proposed CAGCN’s Overall Architecture
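The summary does not spell out the deduplication algorithm itself, but its core quantity is standard. As a rough illustration (the function names and the greedy keep-or-drop policy below are our own assumptions, not the paper's method), Jaccard similarity between the member sets of two candidate clusters can be used to discard near-duplicates:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate_clusters(clusters, threshold=0.8):
    """Greedily drop any cluster whose member set overlaps an
    already-kept cluster by at least `threshold` Jaccard similarity."""
    kept = []
    for cluster in clusters:
        if all(jaccard_similarity(cluster, k) < threshold for k in kept):
            kept.append(cluster)
    return kept
```

Because each candidate is compared only against the clusters already kept, heavily overlapping candidates are removed in a single pass, which is one plausible source of the reported time savings.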

2. "MRNet: Multi-Refinement Network for Dual-pixel Images Defocus Deblurring"

Defocus blur is an inevitable phenomenon in cameras. Although many methods have been proposed, the problem remains challenging because existing approaches suffer from low deblurring performance and long processing times. To address this, we propose an efficient Multi-Refinement Network (MRNet) for dual-pixel image defocus deblurring. We design a Siamese Pyramid Network (SPN) as an alignment module to alleviate the misalignment between the left and right views. In addition, a Multi-Scale Residuals Group Module (MSRGM) is proposed for the reconstruction module, which extracts and fuses features from different scales to obtain better deblurring performance. Experimental results on popular benchmarks show that the proposed method significantly improves defocus deblurring performance. Moreover, our MRNet won 1st place in the NTIRE 2021 challenge on Defocus Deblurring Using Dual-Pixel Images (@CVPR 2021).

Proposed MRNet’s Overall Architecture

3. "Class-Aware Contextual Information for Semantic Segmentation"

The overall architecture of CACINet

In this paper, we propose a Class-Aware Contextual Information Network (CACINet), which consists of a Semantic Affinity Module (SAM) and a Class Association Module (CAM), to generate class-aware contextual information among pixels at a fine-grained level. SAM analyzes whether any two given pixels belong to the same class or to different classes, producing intra-class and inter-class pixel contextual information. CAM classifies the image into different class regions globally and then encodes each pixel based on its degree of affiliation with each class in the image. In this way, the class affiliation of the pixels is incorporated into the corresponding context calculation. Comprehensive experiments demonstrate that the proposed method achieves competitive performance on two semantic segmentation benchmarks: ADE20K and PASCAL-Context.
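The intra-/inter-class distinction SAM reasons about can be made concrete with a pixel-pair affinity matrix derived from a label map. This is a simplified sketch of such a target, not the module's actual computation:

```python
import numpy as np

def class_affinity(label_map):
    """Binary pixel-pair affinity from a 2-D class label map: entry (i, j)
    is 1.0 when flattened pixels i and j carry the same class, else 0.0.
    Such a matrix separates intra-class pairs from inter-class pairs."""
    flat = np.asarray(label_map).reshape(-1)
    return (flat[:, None] == flat[None, :]).astype(np.float64)
```

For an H×W label map the result is an (H·W)×(H·W) symmetric matrix, which is why fine-grained pairwise context is typically computed on downsampled feature maps.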

4. "Heuristic Masking for Text Representation Pretraining"

Masked language model pretraining provides a standardized way to learn contextualized semantic representations: corrupted text sequences are reconstructed by estimating the conditional probabilities of randomly masked tokens given their context. We attempt to exploit language knowledge from the model itself to boost its pretraining in a lightweight, on-the-fly fashion. In this paper, a heuristic token masking scheme is studied, in which tokens for which deep and shallow networks make inconsistent predictions are more likely to be masked. The proposed method can be applied to BERT-like architectures, and its training procedure is consistent with BERT's, which preserves training effectiveness and efficiency. Extensive experiments show that a masked language model pretrained with the heuristic masking scheme consistently outperforms previous schemes on various downstream tasks.

On GLUE, it achieves 1.3% and 1.0% absolute average accuracy improvements in the sub-word and whole-word masking settings, respectively. Our model also obtains 91.7% and 83.8% F1 on SQuAD v1.1 and v2.0, respectively, outperforming other schemes.
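The selection step can be sketched as follows. The paper's exact disagreement measure and sampling policy are not given here, so this sketch uses argmax mismatch between the two networks' token predictions plus a small random tie-breaker; all names and constants are illustrative:

```python
import numpy as np

def select_mask_positions(shallow_logits, deep_logits, mask_ratio=0.15, rng=None):
    """Pick token positions to mask, favoring positions where a shallow and
    a deep network disagree on the predicted token. Inputs are
    (seq_len, vocab) logit arrays; returns indices of positions to mask."""
    rng = rng or np.random.default_rng(0)
    shallow_pred = shallow_logits.argmax(axis=-1)
    deep_pred = deep_logits.argmax(axis=-1)
    disagreement = (shallow_pred != deep_pred).astype(np.float64)
    # Small noise breaks ties so agreeing positions are still sampled randomly.
    score = disagreement + 0.01 * rng.random(disagreement.shape)
    n_mask = max(1, int(mask_ratio * len(score)))
    return np.argsort(-score)[:n_mask]
```

Keeping the overall mask ratio fixed (e.g. 15%, as in BERT) while reweighting *which* positions are masked is what keeps the scheme's cost comparable to standard random masking.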

5. "Semi-Supervised Sound Event Detection with Pre-Trained Model"

To address sound event detection with large amounts of unlabeled data, this paper, written in cooperation with the Beijing Institute of Technology, introduces a Mean Teacher-based model combined with a pre-trained model. The main innovations of this work are as follows: 1) features are extracted from both a CRNN and a pre-trained model, and two fusion methods are proposed to improve the model's effectiveness; 2) a weight-raised temporal contrastive loss function is proposed to improve the temporal coherence of the prediction results. We have evaluated the effectiveness of the proposed method through extensive experiments and ablation studies, which show that the model's performance increases by 8% after the pre-trained model is incorporated. In addition, the proposed model also outperforms the DCASE 2021 Task 4 winning model.
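The Mean Teacher scheme underlying this work maintains a teacher whose weights track an exponential moving average (EMA) of the student's; the teacher's predictions on unlabeled audio then supervise the student. A minimal sketch of the EMA step (the decay value is illustrative, not the paper's setting):

```python
def ema_update(teacher_weights, student_weights, alpha=0.999):
    """Mean Teacher update: each teacher weight becomes an exponential
    moving average of the corresponding student weight. Weights are given
    as name -> value dicts for simplicity."""
    return {name: alpha * teacher_weights[name] + (1 - alpha) * student_weights[name]
            for name in teacher_weights}
```

A high decay (close to 1) makes the teacher a slowly moving, smoothed copy of the student, which is what gives its pseudo-labels their stability on unlabeled data.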

6. "Distortion-Aware Convolutional Neural Network-Based Interpolation Filter for AVS3"

Motion compensation is a key technology in video coding for removing temporal redundancy between video frames, and interpolation-based sub-pixel prediction is a key factor in the accuracy of motion compensation. This study proposes a distortion-aware convolutional neural network-based interpolation filter (DA-NNIF) to improve the interpolation prediction accuracy of sub-pixels with a single model. Distortion parameters are introduced into the proposed network to reflect the quantization noise of the reference and current frames. Experimental results on the AVS3 video codec show that the proposed method achieves an average 1.47% BD-rate reduction on the Y component for Class B/C/D sequences under the random access configuration and common conformance test conditions.

DA-NNIF
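Two ideas from the summary can be illustrated concretely: the classical fixed sub-pixel filter that learned filters replace, and one common way to make a single network distortion-aware by feeding it the quantization parameter. Both snippets are simplified sketches; the paper's actual filter taps and input layout may differ:

```python
import numpy as np

def half_pel_horizontal(frame):
    """Classical fixed half-pixel interpolation: a 2-tap average between
    horizontally adjacent integer samples. Learned filters such as DA-NNIF
    replace fixed filters of this kind."""
    frame = np.asarray(frame, dtype=np.float64)
    return 0.5 * (frame[:, :-1] + frame[:, 1:])

def with_qp_plane(block, qp, qp_max=63.0):
    """One way to condition a single network on distortion level: stack a
    constant plane encoding the normalized quantization parameter onto the
    input block (an illustrative layout, not necessarily the paper's)."""
    block = np.asarray(block, dtype=np.float64)
    qp_plane = np.full_like(block, qp / qp_max)
    return np.stack([block, qp_plane], axis=0)
```

Conditioning on the QP is what lets one model cover all quantization levels instead of training a separate interpolation network per QP.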

7. "Target Speaker Extraction with Ultra-Short Reference Speech by VE-VE Framework"

Details of the proposed VE-VE framework

In this paper, a novel target speaker extraction framework is introduced. An RNN-based voice extractor is used to extract speaker characteristics; the extractor's network structure and weights are the same for both the enrollment and extraction steps, and the speaker characteristics are carried by the RNN state. This design solves the problem of fusing speaker characteristics with speech features while also supporting ultra-short enrollment speech. Experiments show that the framework achieves new state-of-the-art (SOTA) performance on the public WSJ0-2mix dataset. Furthermore, our approach supports ultra-short reference speech: it achieves 17.7 dB SI-SDRi with only 0.2 s of reference speech.
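The reported metric, SI-SDRi, is the improvement in scale-invariant signal-to-distortion ratio of the extracted signal over the unprocessed mixture, i.e. SI-SDR(estimate, target) minus SI-SDR(mixture, target). A minimal sketch of SI-SDR itself (using the common mean-subtracted variant; conventions differ slightly across papers):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.
    The estimate is projected onto the reference signal; the ratio of the
    projected target energy to the residual energy is returned, so the
    metric is unchanged by rescaling the estimate."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Scale invariance matters here because extraction networks make no guarantee about the output's absolute level, only about its content.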