Wearable photoplethysmography (PPG) has quietly become one of the most information-dense and ubiquitous sensing modalities in modern health and wellness. A single optical waveform encodes cardiac mechanics, vascular tone, autonomic regulation, respiratory coupling, and longer-horizon behavioral rhythms, all while being collected passively in free-living settings. Yet despite this richness, most machine learning approaches treat PPG as if there were a single “correct” temporal scale at which meaning resides. Signals are resampled, windowed, and flattened into representations that implicitly assume that one resolution fits all tasks.
This assumption is increasingly untenable. An arrhythmia might manifest as a subtle perturbation in beat-to-beat morphology over tens of milliseconds, whereas sleep stage transitions unfold over longer horizons and are expressed through slow changes in rhythm and variability. These observations motivate what we call the resolution hypothesis [1]: different health outcomes depend on distinct temporal resolutions of the same underlying physiological signal.
The resolution hypothesis reframes representation learning for wearables. Resolution is no longer a preprocessing hyperparameter to be tuned away, but a structural axis along which predictive information is organized. Architectures that collapse temporal scale into a single latent vector risk discarding task-relevant structure. Conversely, models that explicitly preserve and expose multi-resolution representations can align more naturally with the physiology they seek to model.
In this article, we argue that hierarchical convolutional architectures, particularly U-Net–style encoders, are a principled and efficient way to operationalize the resolution hypothesis. We contrast this approach with transformer-based models that rely on long-range attention, and show how U-Nets can capture long-range dependencies implicitly while preserving fine-scale detail, all at a fraction of the computational cost. Using PPG as a concrete example, we illustrate how different health outcomes align with different layers of a multi-resolution hierarchy.
We consider a PPG sequence as a discrete-time signal $x \in \mathbb{R}^{C \times L}$, where $C$ denotes the number of channels and $L$ the number of time steps. The central design choice is how representations aggregate information across time. Transformer-based models achieve this via self-attention [2], computing pairwise interactions between all time steps with $O(L^2)$ complexity. While expressive, this mechanism is agnostic to the inductive structure of physiological signals and expensive for long sequences.
In contrast, a hierarchical convolutional encoder constructs representations through successive downsampling. Each layer applies local convolutions followed by strided aggregation, increasing the effective receptive field exponentially with depth. If $k$ denotes the kernel size and $s_i$ the stride at layer $i$, the receptive field at depth $d$ grows roughly as
$R_d = R_{d-1} + (k - 1) \prod_{i=1}^{d-1} s_i,$
so that deeper layers integrate information over increasingly long temporal spans. Importantly, this growth occurs without explicitly computing global interactions.
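This recurrence is easy to make concrete. The sketch below computes the receptive field at each depth for an illustrative kernel size and stride schedule; the specific values are assumptions for illustration, not HiMAE's actual configuration:

```python
# Receptive-field growth for a strided 1-D convolutional encoder, following
# R_d = R_{d-1} + (k - 1) * prod(s_1 .. s_{d-1}).
# Kernel size and strides here are illustrative, not any real model's values.

def receptive_fields(kernel_size, strides):
    """Return the receptive field (in input samples) at each depth."""
    fields = [kernel_size]  # depth 1: one kernel's span
    jump = 1                # product of strides applied so far
    for s in strides[:-1]:
        jump *= s
        fields.append(fields[-1] + (kernel_size - 1) * jump)
    return fields

# Five layers, kernel 7, stride 2 everywhere: the span roughly doubles per layer.
print(receptive_fields(7, [2, 2, 2, 2, 2]))  # [7, 19, 43, 91, 187]
```

Even with modest kernels, stride-2 downsampling lets the deepest layer span hundreds of input samples, which is the exponential growth the text describes.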
A U-Net architecture [3] pairs this hierarchical encoder with a symmetric decoder and skip connections (Figure 1). Shallow layers retain high-resolution information about waveform morphology, while deeper layers encode coarse dynamics such as heart rate trends or circadian modulation. Skip connections concatenate features across resolutions, ensuring that global context does not overwrite local structure. The resulting latent space is not a single vector, but a stack of embeddings indexed by temporal scale.
Figure 1. HiMAE pre-training and evaluation pipeline. (1) Physiological sequences are split into temporal patches. (2) Selected patches are masked randomly or contiguously. (3) A U-Net–style CNN encoder–decoder reconstructs missing values, with loss applied only to masked regions. (4) Multi-resolution embeddings feed linear probes for classification and regression benchmarking. (5) Three categories of downstream tasks are evaluated.
Self-supervised masked autoencoding [4] provides a natural training objective for this architecture. Randomly masking segments of the PPG signal and reconstructing them from context forces the model to learn both local continuity and long-range dependencies. Crucially, reconstruction is only one means to an end. The more interesting outcome is that intermediate activations become probes of resolution-specific structure. Linear models trained on different layers implicitly ask which temporal scale carries predictive signal for a given outcome.
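As a minimal sketch of this objective, the following masks contiguous patches of a toy signal and scores reconstruction error only on the masked positions. A trivial predict-zero "reconstruction" stands in for the U-Net encoder–decoder, and the patch length and mask ratio are illustrative assumptions:

```python
# Masked-autoencoding objective in miniature: mask patches of a 1-D signal,
# "reconstruct" them, and compute loss only over the masked positions.
import math, random

def mask_patches(signal, patch_len, mask_ratio, seed=0):
    """Zero out a random subset of patches; return masked signal and mask."""
    rng = random.Random(seed)
    n_patches = len(signal) // patch_len
    masked_ids = set(rng.sample(range(n_patches), int(mask_ratio * n_patches)))
    mask = [i // patch_len in masked_ids for i in range(len(signal))]
    masked = [0.0 if m else v for v, m in zip(signal, mask)]
    return masked, mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only, as in MAE training."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)

signal = [math.sin(2 * math.pi * t / 32) for t in range(128)]
masked, mask = mask_patches(signal, patch_len=8, mask_ratio=0.5)
loss = masked_mse(masked, signal, mask)  # trivial "predict zero" baseline
print(round(loss, 3))
```

A trained encoder–decoder would replace the zero-fill with a genuine prediction from the unmasked context; restricting the loss to masked regions is what forces the model to infer structure rather than copy its input.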
From a computational perspective, this design yields a favorable tradeoff. Hierarchical convolutions operate in $O(L)$ time and memory, enabling long-context modeling on resource-constrained devices. Long-range dependencies are not learned through explicit attention, but emerge from progressive aggregation. In effect, the U-Net approximates global context through depth rather than width.
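A back-of-the-envelope comparison makes this tradeoff tangible. The sketch below counts pairwise attention interactions against the multiply-adds of a hypothetical strided convolutional hierarchy; kernel size, depth, and stride are assumed values, not those of any particular model:

```python
# Rough cost comparison: self-attention scores scale as O(L^2), while a
# strided convolutional hierarchy scales as O(L). Constants are illustrative.

def attention_score_ops(L):
    """Pairwise query-key interactions for one attention head."""
    return L * L

def conv_encoder_ops(L, kernel=7, depth=5, stride=2):
    """Multiply-adds per channel across a strided hierarchy; the lengths
    form a geometric series, so the total remains O(L)."""
    ops, length = 0, L
    for _ in range(depth):
        ops += length * kernel
        length = max(1, length // stride)
    return ops

for L in (1_000, 10_000, 100_000):
    ratio = attention_score_ops(L) / conv_encoder_ops(L)
    print(f"L={L:>7}: attention/conv ops ratio ~ {ratio:,.0f}")
```

The gap widens linearly with sequence length, which is why hour-scale wearable windows are far cheaper to encode hierarchically than to attend over.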
When applied to PPG-based health prediction, multi-resolution representations reveal striking task-specific patterns (Figure 2). Cardiovascular outcomes such as hypertension or arrhythmia detection tend to peak in performance at deeper layers of the hierarchy. These layers correspond to coarser temporal resolutions, consistent with the fact that cardiovascular risk is expressed through sustained trends rather than isolated beats.
Figure 2. HiMAE discovers task-specific structure across downstream tasks. AUROC across layers shows that tasks rely on distinct temporal scales, highlighting HiMAE as a tool for discovering the most informative resolution in clinical machine learning.
Sleep staging exhibits a similar preference for intermediate-to-coarse resolutions. While individual heartbeats carry little information about sleep state, slower oscillations in heart rate variability and vascular tone align closely with transitions between wakefulness, REM, and non-REM sleep. Representations that integrate over seconds to minutes naturally capture this structure.
In contrast, laboratory abnormalities such as hemoglobin or electrolyte imbalance show maximal predictability at much finer resolutions. These outcomes appear to depend on subtle, high-frequency distortions in PPG morphology that are washed out when signals are aggressively downsampled. Fine-scale embeddings retain sensitivity to these micro-patterns, supporting the resolution hypothesis in a domain where clinical intuition is less established.
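The layer-probing idea itself can be sketched in a few lines. Here a simple correlation score stands in for the linear probes and AUROC used in the article, and the per-layer features and labels are synthetic:

```python
# Probing which resolution carries predictive signal: score a scalar summary
# of each layer's (hypothetical) embedding against binary labels and keep
# the best-scoring layer. All data below is synthetic.

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def best_layer(layer_features, labels):
    """Return (layer index, |correlation|) of the most predictive layer."""
    scores = [abs(pearson(f, labels)) for f in layer_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

labels = [0, 0, 1, 1, 0, 1, 0, 1]
layers = [
    [0.1, 0.3, 0.2, 0.4, 0.2, 0.3, 0.1, 0.5],  # layer 1 summary: weak signal
    [0.0, 0.1, 0.9, 0.8, 0.1, 0.9, 0.2, 0.7],  # layer 2 summary: strong signal
]
idx, score = best_layer(layers, labels)
print(idx, round(score, 2))
```

In practice each layer contributes a full embedding and the probe is a fitted linear model, but the logic is the same: the layer whose representation best separates the outcome reveals the temporal scale where that outcome lives.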
Table 1. Linear probing classification performance compared against self-supervised baselines across tasks. AUROC is reported in percent with 95% confidence intervals. The best performance is shown in bold; the second-best model is underlined. ∗ denotes p < 0.05 and ∗∗ denotes p < 0.01 from a two-sided z-test comparing HiMAE with the second-best model.
Across tasks, hierarchical convolutional models achieve competitive or superior performance relative to transformer-based alternatives, despite being orders of magnitude smaller (Table 1). This suggests that architectural alignment with signal structure can substitute for brute-force capacity. The model does not need to attend everywhere if physiology itself is organized hierarchically.
The central lesson is that resolution matters. Physiological signals are inherently multi-scale, and health outcomes interrogate different slices of this temporal spectrum. Treating resolution as a first-class object of representation learning yields both practical and scientific benefits. Practically, it enables compact models that can run on-device, supporting continuous, private health monitoring. Scientifically, it offers a new lens for interpretability: not which feature matters, but at what temporal scale information emerges.
U-Net–style architectures provide a compelling alternative to transformers for wearable sensing. By encoding long-range dependencies implicitly through receptive field expansion, they sidestep the quadratic costs of attention while preserving fine detail through skip connections. This is not merely an efficiency argument. It is a statement about inductive bias. When the structure of the model mirrors the structure of the signal, learning becomes easier and representations become more meaningful.
Looking forward, the resolution hypothesis suggests a shift in how we evaluate and design foundation models for health. Rather than collapsing representations into monolithic embeddings, we should expose and interrogate their internal hierarchies. Different tasks, populations, and disease processes may live at different resolutions. Models that acknowledge this heterogeneity will be better positioned to translate wearable data into actionable clinical insight.
The research described here is work conducted at Samsung Research America's Digital Health Team. The following researchers contributed to this work: Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, and Sharanya Arcot Desai. We would also like to thank participants who contributed their data for this study.
[1] Lee, S. A., Tanade, C., Zhou, H., Lee, J., Thukral, M., Han, M., ... & Desai, S. A. (2025). HiMAE: Hierarchical masked autoencoders discover resolution-specific structure in wearable time series. arXiv preprint arXiv:2510.25785.
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[3] Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234-241). Cham: Springer International Publishing.
[4] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000-16009).