The 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) accepted five research papers from Samsung R&D Institute China-Beijing (SRC-B). ICASSP, organized by the Institute of Electrical and Electronics Engineers (IEEE), is a premier international conference in signal processing. The conference took place in Seoul, South Korea, from April 14 to 19. The accepted papers showcase SRC-B’s technological and academic achievements and demonstrate the team’s research capabilities and innovation potential.
This paper proposes a novel cross-search method for obtaining high-quality parallel in-domain corpora. The method comprises two approaches: antagony-cross search and similarity-cross search. The former uses token-level control to generate monolingual data that closely matches the target domain, while the latter uses similarity scoring during back translation to keep source and target sentences aligned, improving semantic correspondence. With this method, the authors generated millions of high-quality parallel in-domain sentence pairs from low-resource monolingual data, yielding improvements of approximately 0.5–4 BLEU points across various domains.
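The similarity-scoring idea can be illustrated with a minimal sketch. It is not the paper's implementation: the bag-of-words cosine score, the `select_pair` helper, the round-trip text attached to each candidate, and the 0.5 threshold are all assumptions made for illustration, since the article does not describe the actual scoring function.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a hypothetical stand-in for the paper's score)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_pair(
    target: str,
    candidates: List[Tuple[str, str]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[Tuple[str, str]]:
    """Pick the back-translated source candidate that best preserves the target.

    Each candidate is (source_sentence, round_trip_text), where round_trip_text
    is the candidate translated back into the target language. Returns a
    (source, target) pair, or None when no candidate is similar enough.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: cosine_similarity(target, c[1]))
    if cosine_similarity(target, best[1]) < threshold:
        return None
    return (best[0], target)


target = "the patient received a low dose of aspirin"
candidates = [
    ("SRC_A", "the patient received a low dose of aspirin"),
    ("SRC_B", "the weather is nice today"),
]
pair = select_pair(target, candidates)
```

Here `SRC_A` and `SRC_B` are placeholders for source-language candidates. A pair is kept only when the best candidate clears the threshold, which is one simple way to filter misaligned sentence pairs.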
Figure 1. An example of using antagony-cross-search to generate a medical domain sentence
The authors of the paper
Link to the paper
https://ieeexplore.ieee.org/abstract/document/10447171
This study presents a test-time augmentation (TTA) method, dubbed SuccessTTA, that predicts successive augmentations within a single forward pass. The approach consistently improves model robustness against various corruptions, outperforming state-of-the-art (SOTA) TTA methods on the public CIFAR-100-C and ImageNet-C datasets while keeping runtime short.
Figure 2. (a) Illustration of SuccessTTA pipeline at the inference stage. (b) The main architecture of our successive loss predictor in SuccessTTA.
Recent studies on learnable TTA methods predict the transformations expected to provide the best performance gain on each test sample. However, these methods are either restricted to predicting a single transformation per sample or require multiple forward passes of the transformation predictor. To address these limitations, the method employs an RNN-based architecture that generates multiple transformations by predicting their induced losses, with sequential prediction enabled by a novel representation learning of augmented samples. The predicted transformations are then applied to the input image successively for a more robust result. The method reduces the mean error rate by up to 2.22% and 4.03% on the CIFAR-100-C and ImageNet-C datasets, respectively, while running nearly 24 times faster than SOTA methods.
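The selection-and-apply step can be sketched as follows. This is a toy illustration under stated assumptions, not the SuccessTTA implementation: the table of predicted losses stands in for the output of the RNN-based loss predictor, and the three transformations and the helper names `choose_transforms` and `apply_successively` are hypothetical.

```python
from typing import Callable, List, Sequence

Image = List[float]  # toy "image": a flat list of pixel values in [0, 1]
Transform = Callable[[Image], Image]

# Hypothetical candidate transformations (the paper's actual set is not listed here).
TRANSFORMS: List[Transform] = [
    lambda img: list(img),                          # 0: identity
    lambda img: [min(1.0, p + 0.25) for p in img],  # 1: brighten, clipped at 1.0
    lambda img: img[::-1],                          # 2: horizontal flip
]


def choose_transforms(predicted_losses: Sequence[Sequence[float]]) -> List[int]:
    """For each step, pick the transformation with the lowest predicted loss.

    predicted_losses[t][k] is the loss predicted for transformation k at step t.
    In this sketch it is a plain table standing in for the predictor's output.
    """
    return [min(range(len(step)), key=lambda k: step[k]) for step in predicted_losses]


def apply_successively(image: Image, transforms: Sequence[Transform], chosen: Sequence[int]) -> Image:
    """Apply the chosen transformations one after another to the input."""
    out = list(image)
    for idx in chosen:
        out = transforms[idx](out)
    return out


predicted = [[0.9, 0.2, 0.5], [0.3, 0.4, 0.1]]  # two steps, three candidates each
chosen = choose_transforms(predicted)
augmented = apply_successively([0.0, 0.5, 0.875], TRANSFORMS, chosen)
```

With this table, step one selects the brighten transform and step two selects the flip, so the augmented sample is the brightened image reversed. The classifier would then run on `augmented` instead of the raw input.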
The authors of the paper
Link to the paper
https://ieeexplore.ieee.org/document/10448390
The proposed deformAble receptive Super-Resolution (ArcSR) method addresses blind image super-resolution, which involves reconstructing high-resolution images from low-resolution versions affected by unknown degradation. Current approaches handle isotropic or anisotropic Gaussian blur, but their performance degrades under motion blur. ArcSR addresses this by treating each pixel individually through deformable receptive fields and parameters. Two novel modules, Deformable Mutual Convolution (DMconv) and Kernel Guided Convolution (KGconv), are employed for blur kernel estimation and super-resolution, respectively. DMconv facilitates channel-wise correlation and adaptive kernel generation, while KGconv applies kernel-based attention to rewrite high-frequency details. Experimental results underscore the method’s superiority.
Figure 3. Overview of the proposed ArcSR
The authors of the paper
Link to the paper
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446639
This research presents an unsupervised relapse detection system developed for the 2nd E-Prevention Challenge, focusing on psychotic and non-psychotic relapse detection through wearable-based digital phenotyping. The methodology leverages user identification as an auxiliary task to enhance behavioral pattern recognition from biosignals. The system implements three distinct Transformer-based feature extraction architectures coupled with patient-specific Elliptic Envelope models for anomaly detection. The system achieved third place in the challenge’s second track, attaining an average AUC score of 0.4964.
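The patient-specific anomaly detection step can be illustrated with a simplified sketch. It swaps in a plain (non-robust) Gaussian fit with a squared Mahalanobis distance in place of an Elliptic Envelope model, uses two-dimensional toy features instead of Transformer embeddings, and assumes a threshold of 9.0; the function names and data are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Model = Tuple[Tuple[float, float], Tuple[float, float, float]]  # (mean, (sxx, sxy, syy))


def fit_gaussian(points: List[Point]) -> Model:
    """Fit a 2-D Gaussian (sample mean and covariance) to one patient's normal data."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point: Point, model: Model) -> float:
    """Squared Mahalanobis distance of a point from the fitted Gaussian."""
    (mx, my), (sxx, sxy, syy) = model
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("singular covariance; more varied data is needed")
    dx, dy = point[0] - mx, point[1] - my
    # d^T * inverse(covariance) * d, written out for the 2x2 case
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def is_anomaly(point: Point, model: Model, threshold: float = 9.0) -> bool:
    """Flag a sample as a possible relapse when it lies far from the patient's norm."""
    return mahalanobis_sq(point, model) > threshold


# One model per patient, fitted only on that patient's own (toy) feature vectors.
training: Dict[str, List[Point]] = {
    "patient_1": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)],
}
models = {pid: fit_gaussian(pts) for pid, pts in training.items()}
far_score = mahalanobis_sq((3.0, 3.0), models["patient_1"])
```

Fitting a separate model per patient means that "anomalous" is always judged relative to that individual's own baseline, which is the motivation for patient-specific detectors in this setting.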
Figure 4. Framework of our relapse detection system
The authors of the paper
Link to the paper
https://sigport.org/documents/unsupervised-relapse-detection-using-wearable-based-digital-phenotyping-2nd-e-prevention
Figure 5. FSPEN framework and subband allocation
The authors of the paper
This study presents FSPEN, an ultra-lightweight network designed for real-time speech enhancement that achieves superior performance under resource constraints, with only 89M multiply-accumulate operations per second and 79k parameters. The architecture takes a dual-stream approach, combining full-band and subband processing for comprehensive feature extraction. The subband network employs frequency-dependent band allocation, prioritizing low-frequency signals to align with human auditory perception. The model achieves a PESQ score of 2.97 on the VoiceBank+Demand dataset while maintaining minimal computational complexity, with potential applications in Samsung TWS headsets and IoT devices.
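Frequency-dependent band allocation can be sketched as below. The geometric growth rule, the `allocate_subbands` helper, and the bin counts are assumptions for illustration only; the article does not give FSPEN's actual band layout.

```python
from typing import List, Tuple


def allocate_subbands(num_bins: int, base_width: int = 2, growth: int = 2) -> List[Tuple[int, int]]:
    """Split frequency bins into contiguous subbands that widen with frequency.

    Low frequencies get narrow bands (finer resolution) and high frequencies
    get wide ones. Each band is a half-open range (start, end) of bin indices.
    The final band is clipped to num_bins, so it may be narrower than its
    predecessor.
    """
    if num_bins <= 0 or base_width <= 0 or growth < 1:
        raise ValueError("num_bins and base_width must be positive, growth >= 1")
    bands: List[Tuple[int, int]] = []
    start, width = 0, base_width
    while start < num_bins:
        end = min(start + width, num_bins)
        bands.append((start, end))
        start = end
        width *= growth  # assumed rule: each band is `growth` times wider than the last
    return bands


bands = allocate_subbands(32)
```

For 32 bins this yields bands of width 2, 4, 8, and 16, followed by a clipped final band. A subband network would then process each range separately, spending more of its capacity on the low-frequency region.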
Link to the paper
https://ieeexplore.ieee.org/document/10446016?denied=