The rapid advancements in artificial intelligence (AI) have led to the development of increasingly sophisticated and powerful artificial neural networks (ANNs). While these models have achieved groundbreaking performance across various domains [1-3], their escalating size and computational demands render them impractical for resource-constrained environments. This issue is particularly concerning for real-time, energy-sensitive applications in consumer electronics, such as smartwatches, augmented reality (AR) glasses, and wireless earbuds.
Despite the growing processing power of microcontroller units (MCUs), their ability to run state-of-the-art ANNs remains severely limited. This challenge is further compounded by the slowdown of Moore's Law [4], which traditionally predicted consistent improvements in hardware performance. The deceleration of this trend [5] underscores a widening gap between the computational requirements of cutting-edge neural networks and the capabilities of compact, low-power devices [6].
Moreover, the pursuit of maximal model accuracy in the AI research community often comes at the expense of efficiency, resulting in solutions that are unsuitable for real-world, latency-sensitive systems. Current optimization techniques, such as pruning [7] and quantization [8], fall short of bridging this gap without introducing significant trade-offs in model performance.
Our paper is motivated by the need to address these challenges directly. We aim to develop a method that balances computational efficiency with model accuracy, enabling the deployment of high-performance neural networks on energy-constrained devices. By leveraging the inherent continuity and predictability of time-series data, we introduce Scattered Online Inference (SOI), a novel approach that reduces computational complexity through partial state predictions and efficient compression. SOI aligns with the growing demand for environmentally sustainable and economically viable AI solutions, pushing the boundaries of what is possible in compact, real-time systems.
The key principle of SOI is leveraging compression and extrapolation along the time axis. Instead of recalculating every layer of the network for each incoming data point, SOI compresses the data using strided convolutions and reconstructs missing states through extrapolation techniques, such as frame duplication. These operations are applied selectively to specific layers of the network, allowing portions of the computational graph to remain static across consecutive inferences. By caching and reusing partial network states, SOI significantly reduces the frequency of full model updates, thereby optimizing computational efficiency. To enhance understanding of the SOI algorithm in CNNs, Figure 1 defines three types of convolutional layers utilized in our method. For comparison, Figure 1 also includes standard convolution and strided convolution.
Figure 1. SOI for convolutional operations. For visualization purposes, data are shown as frames in the time domain. A) Standard convolution. B) Strided convolution. C) Strided-cloned convolution. D) Shifted convolution. E) Shifted strided-cloned convolution.
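As a rough illustration (not the exact implementation used in our experiments), the strided-cloned convolution of Figure 1C can be sketched in PyTorch as follows; channel counts, padding, and the trimming step are assumptions made for clarity:

```python
import torch
import torch.nn as nn

class StridedClonedConv1d(nn.Module):
    """Rough sketch of a strided-cloned convolution (Figure 1C): a strided
    convolution halves the temporal resolution, and the skipped frames are
    reconstructed by duplicating the computed ones (frame duplication).
    Layer sizes, padding, and the trimming step are illustrative assumptions."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Stride 2 along time: only every second frame is actually computed.
        self.compress = nn.Conv1d(channels, channels, kernel_size,
                                  stride=2, padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        z = self.compress(x)[..., : x.shape[-1] // 2]   # trim to half the input length (padding handling simplified)
        # Frame duplication restores the original time resolution.
        return torch.repeat_interleave(z, repeats=2, dim=-1)

x = torch.randn(1, 8, 16)                # (batch, channels, time)
print(StridedClonedConv1d(8)(x).shape)   # torch.Size([1, 8, 16])
```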
SOI operates in two primary modes: Partially Predictive (PP) and Fully Predictive (FP). In partially predictive mode, the model predicts the next state based on a combination of newly computed and cached partial states. This approach reduces the average computational cost without increasing the system's peak demand. In contrast, FP mode predicts multiple future states ahead of time, allowing entire sections of the model to be precomputed and eliminating the need for on-the-fly calculations. While FP mode is more complex to implement, it offers greater reductions in both latency and computational load, particularly for tasks with highly predictable patterns. The inference patterns for both modes are illustrated in Figure 2.
Figure 2. Inference patterns of each type of SOI based on U-Net architecture. A) Unmodified causal U-Net. B) Partially predictive (PP) SOI. C) Even inference of PP. D) Odd inference of PP. E) Fully predictive (FP) SOI. F) Even inference of FP. G) Odd inference of FP.
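A minimal sketch of the PP schedule is given below; the three-stage encoder/inner/decoder split, layer sizes, and caching policy are illustrative assumptions rather than the exact U-Net configuration we used:

```python
import torch
import torch.nn as nn

class PartiallyPredictiveRunner:
    """Sketch of the PP schedule: an expensive inner stage is recomputed only
    on even inferences and its cached output is reused on odd ones, lowering
    the average cost while the peak cost stays unchanged. The encoder/inner/
    decoder split is an illustrative assumption, not our U-Net layout."""

    def __init__(self, encoder: nn.Module, inner: nn.Module, decoder: nn.Module):
        self.encoder, self.inner, self.decoder = encoder, inner, decoder
        self.step = 0
        self.cached = None

    @torch.no_grad()
    def __call__(self, frame: torch.Tensor) -> torch.Tensor:
        h = self.encoder(frame)                     # always computed
        if self.step % 2 == 0 or self.cached is None:
            self.cached = self.inner(h)             # refreshed on even steps only
        self.step += 1
        # Skip connection: fresh encoder output joins the (possibly cached) deep state.
        return self.decoder(torch.cat([h, self.cached], dim=1))

runner = PartiallyPredictiveRunner(
    encoder=nn.Conv1d(1, 8, 3, padding=1),
    inner=nn.Conv1d(8, 8, 3, padding=1),
    decoder=nn.Conv1d(16, 1, 3, padding=1),
)
for t in range(4):                                  # stream of single-frame updates
    print(t, runner(torch.randn(1, 1, 4)).shape)    # (1, 1, 4) each step
```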
Another defining feature of SOI is the integration of skip connections, which maintain the flow of information from compressed input data to deeper layers of the network. These connections bridge the outputs of strided convolution layers and their corresponding reconstruction layers, ensuring that the model preserves critical context and causality. Skip connections also mitigate the risk of performance degradation caused by partial state predictions by allowing the model to incorporate new data into the computation graph without fully recalculating intermediate layers.
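The following sketch shows how such a skip connection can bridge the compressing convolution and the reconstruction layer; again, the block layout and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SOIBlock(nn.Module):
    """Sketch of the skip connection described above: the output of the
    strided (compressing) convolution is routed directly to the layer that
    reconstructs the skipped frames, so fresh input reaches deeper layers
    even when intermediate states are reused. Layer sizes are assumptions."""

    def __init__(self, ch: int = 8):
        super().__init__()
        self.compress = nn.Conv1d(ch, ch, 3, stride=2, padding=1)
        self.deep = nn.Conv1d(ch, ch, 3, padding=1)           # state that may be cached between inferences
        self.reconstruct = nn.Conv1d(2 * ch, ch, 1)           # merges skip and deep paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.compress(x)                                  # compressed along time
        d = self.deep(z)                                      # deep (reusable) path
        merged = self.reconstruct(torch.cat([z, d], dim=1))   # skip bridges compression -> reconstruction
        return torch.repeat_interleave(merged, 2, dim=-1)     # back to the input frame rate

print(SOIBlock()(torch.randn(1, 8, 16)).shape)                # torch.Size([1, 8, 16])
```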
To evaluate the effectiveness of SOI, we conducted experiments on three distinct tasks: speech separation, acoustic scene classification (ASC), and video action recognition. These tasks were selected to demonstrate SOI's ability to process time-series data across diverse domains. The performance of SOI was compared to baseline models, traditional optimization techniques such as pruning and resampling, and the Short-Term Memory Convolution (STMC) method [9], which served as a foundational technique for SOI. The results highlight the versatility of SOI in reducing computational complexity while maintaining acceptable levels of accuracy and efficiency.
For the speech separation task, the U-Net architecture was employed to separate clean speech from noisy backgrounds. The experiments compared the partially predictive and fully predictive SOI variants against the baseline and STMC models. Metrics included computational complexity (measured in MMAC/s) and scale-invariant signal-to-noise ratio improvement (SI-SNRi). The results are shown in Figure 3.
Figure 3. Results of speech separation experiment with A) the partially predictive SOI, and B) fully predictive SOI.
The partially predictive SOI model demonstrated an ability to reduce computational complexity by up to 64%, with only a minor impact on performance, measured as a decrease of 0.017 dB in SI-SNRi for every 1% complexity reduction. Similarly, the FP SOI model achieved a 50% reduction in complexity by precomputing 83.7% of the network's operations, making it particularly well-suited for latency-sensitive applications.
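For reference, SI-SNR follows the standard scale-invariant definition; a minimal PyTorch sketch (independent of our evaluation code) is shown below, with SI-SNRi obtained as the difference between the processed and unprocessed scores:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB over the last (time) dimension."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# SI-SNRi = si_snr(model_output, clean_speech) - si_snr(noisy_mixture, clean_speech)
```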
The ASC task utilized the GhostNet architecture, with the objective of classifying urban acoustic scenes. Models incorporating SOI were compared to baseline and STMC variants across seven model sizes. Top-1 accuracy and computational complexity were evaluated, with the results summarized in Table 1.
Table 1. Results of ASC experiment.
SOI achieved an average complexity reduction of 16% compared to the STMC method. In some configurations, SOI also improved model accuracy, likely due to partial state predictions enhancing the network’s generalization capabilities. Although the introduction of skip connections in SOI slightly increased the number of model parameters, this adjustment significantly reduced the overall computational load.
To explore SOI's applicability to non-audio time-series data, the ResNet-10 architecture and MoViNets were tested on the HMDB-51 dataset for the action recognition task. SOI was applied to 3D convolutional layers and evaluated on regular, small, and tiny ResNet-10 variants, as well as the A0 and A1 MoViNet variants. The results are presented in Table 2.
Table 2. Results of video action recognition experiment.
SOI achieved complexity reductions of 10% to 17%, with little to no loss in accuracy. In some cases, SOI even improved model accuracy by expanding the receptive field through the use of strided convolutions.
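A minimal illustration of temporal striding in a 3D convolution is given below; the kernel size, channel counts, and duplication step are assumptions and do not reproduce the exact ResNet-10 or MoViNet configurations:

```python
import torch
import torch.nn as nn

# Temporal-only striding in a 3D convolution: stride=(2, 1, 1) compresses the
# time axis while leaving spatial resolution untouched; skipped frames are
# reconstructed by duplication, as in the 1D case above.
conv3d = nn.Conv3d(3, 16, kernel_size=(3, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1))

clip = torch.randn(1, 3, 8, 32, 32)            # (batch, channels, time, H, W)
z = conv3d(clip)                               # -> (1, 16, 4, 32, 32)
z = torch.repeat_interleave(z, 2, dim=2)       # duplicate along time -> (1, 16, 8, 32, 32)
print(z.shape)
```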
Combining SOI with pruning in the STMC model surpassed the effect of pruning alone. Adding SOI enabled a further reduction in computational complexity of approximately 300 MMAC/s at the same model performance, corresponding to about 16% of the original model's complexity. Interestingly, the “SOI 2|6” model outperformed the “SOI 1” model at around 6 dB SI-SNRi. The experimental results are shown in Figure 4.
Figure 4. Pruning of STMC, SOI and 2xSOI models. Unpruned models are indicated by markers.
One of the critical objectives of SOI is to enhance the efficiency of neural network inference by reducing not only computational complexity but also inference time and memory usage. These aspects are particularly crucial in real-time applications, where latency and hardware limitations significantly affect system performance. We measured these metrics for SOI applied to the speech separation task, using U-Net as the test architecture. The results are summarized in Table 3.
Table 3. Results from experiments with partially predictive SOI for speech separation.
The experiments revealed that SOI models consistently achieved lower average inference times compared to baseline and STMC models. From the collected results, we observed a drop in inference time from 9.93 ms in the STMC model to as low as 5.28 ms, representing a reduction of nearly 47%. Additionally, the peak memory footprint of the models decreased significantly as computational complexity was reduced. The memory footprint dropped from 27.2 MB in the STMC model to 14.6 MB, a reduction of over 46%.
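For context, the sketch below shows one simple way to estimate average per-frame latency and peak memory on CPU; it is illustrative tooling rather than the benchmarking harness used for Table 3:

```python
import resource  # Unix-only; ru_maxrss is KiB on Linux (bytes on macOS)
import time
import torch
import torch.nn as nn

def profile_streaming(model: nn.Module, frame_shape=(1, 1, 64), steps: int = 200):
    """Wall-clock average per-frame latency plus process peak RSS as a
    rough memory proxy. Frame shape and step count are assumptions."""
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            model(torch.randn(*frame_shape))
    avg_ms = (time.perf_counter() - start) / steps * 1e3
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return avg_ms, peak_kib / 1024  # ms per frame, peak MiB (Linux units assumed)

avg_ms, peak_mib = profile_streaming(nn.Conv1d(1, 8, 3, padding=1))
print(f"{avg_ms:.2f} ms / frame, ~{peak_mib:.0f} MiB peak RSS")
```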
The relationship between these efficiency improvements and the model’s performance metrics was also analyzed. As shown in Figure 5, the inference time scaled linearly with the complexity reduction factor, demonstrating that SOI efficiently balanced the computational workload across the network.
Figure 5. Average inference time and peak memory footprint.
In this work, we presented a method for reducing the computational cost of convolutional neural networks (CNNs) by reusing partial network states from previous inferences, allowing these states to generalize over longer time periods. We discussed the effects of partial state prediction imposed by our method on the neural model and demonstrated its gradual application to balance model quality metrics and computational cost.
Our experiments highlight the significant potential for reducing the computational cost of CNNs, particularly for tasks where outputs remain relatively constant, such as event detection or classification. We achieved a 50% reduction in computational cost without any loss in metrics for the ASC task, and a 64.4% reduction in computational cost with a relatively small 9.8% decrease in metrics for the speech separation task. Additionally, we demonstrated SOI’s ability to control the trade-off between model quality and computational cost, enabling resource- and requirement-aware tuning.
The presented method offers an alternative to the STMC solution for strided convolution. While SOI reduces network computational complexity at the expense of some performance, STMC maintains performance metrics but at the cost of exponentially increased memory consumption. SOI is similar to methods like network pruning but does not rely on special sparse kernels for inference optimization. Importantly, these methods are not mutually exclusive. The STMC strided convolution handler, SOI, and pruning can coexist within a neural network to achieve the desired balance of model performance and resource efficiency.
https://arxiv.org/abs/2410.03813
[1] Thai Son Nguyen, Sebastian Stueker, and Alexander H. Waibel. Super-human performance in online low-latency recognition of conversational speech. In Interspeech, 2020.
[2] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
[3] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
[4] Gordon E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998.
[5] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. There’s plenty of room at the top: What will drive computer performance after Moore’s law? Science, 368(6495):eaam9744, 2020.
[6] Xiaowei Xu, Yukun Ding, Sharon Xiaobo Hu, Michael Niemier, Jason Cong, Yu Hu, and Yiyu Shi. Scaling for edge inference of deep neural networks. Nature Electronics, 1(4):216–222, 2018. doi: 10.1038/s41928-018-0059-3.
[7] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touretzky (ed.), Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann, 1989.
[8] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
[9] Grzegorz Stefański, Krzysztof Arendt, Paweł Daniluk, Bartłomiej Jasik, and Artur Szumaczuk. Short-term memory convolutions. In The Eleventh International Conference on Learning Representations, 2023.