
Sentimental and Object-Preserving On-Device AI Wallpaper

By Juyong Song Samsung Research
By Somin Kim Samsung Research
By Hyeji Shin Samsung Research
By Saemi Choi Samsung Research
By Jungmin Kwon Samsung Research
By Abhinav Mehrotra Samsung AI Center - Cambridge
By Young Dae Kwon Samsung AI Center - Cambridge
By Sourav Bhattacharya Samsung AI Center - Cambridge

Introduction


The Samsung Galaxy S24, followed by the Galaxy Z Fold 6 and Z Flip 6, formed the first series of AI phones in 2024, bringing many new AI-based features that enrich the everyday lives of millions of Samsung users. A new modality of searching, real-time translation of calls across 13 languages, effortless photo editing, and text summarization are just a few examples of the AI features packed into the latest Samsung AI phones.

Figure 1. The illustrations of the ambient AI wallpapers generated on Samsung Galaxy S24 Ultra under different weather conditions and times of the day.

The Photo Ambient Wallpaper service was first introduced on the Galaxy S24 as a Labs feature, offering users a new experience. Since its initial introduction, the service has been improved through extensive evaluations and updates, resulting in its launch as an official feature on the Galaxy Z Flip 6 and Z Fold 6. The image editing model behind the service runs entirely on the S24/Fold 6/Flip 6 hardware, without requiring any cloud connectivity. The AI-based animation, developed by the Samsung Mobile eXperience (MX) Business, seamlessly conveys the outdoor ambient context to flagship AI phone users, enhancing the overall aesthetic experience. For example, Figure 1 illustrates the ambient AI Wallpaper feature, where a wallpaper image is modified on-device as the weather changes from a sunny noon to a snowy sunset and then to a rainy night (see Figure 2 for weather changes from a rainy day to a snowy day on a Galaxy Z Flip 6).

Figure 2. The illustrations of the ambient AI wallpapers generated on Samsung Galaxy Z Flip 6 under different weather conditions.

Figure 3. Introducing AI Wallpaper during Galaxy Unpacked (Paris, July 2024).

Our journey started at CES 2023, where Samsung introduced a vision for emotionally engaging AI services through generative AI. In 2024, at the Samsung Galaxy Unpacked event in Paris, the AI-powered Galaxy Z series phones were introduced with personalized Wallpaper as a full feature (see Figure 3). Wallpapers on Samsung AI phones now adapt to the current time and weather conditions using on-device generative AI technology developed at Samsung Research (SR). This post introduces a Text-guided Image-to-Image translation (TI2I) model, Sentimental object-preserving On-device AI Wallpaper (SOWall), which is the core foundation of the Photo Ambient AI Wallpaper feature on Samsung AI phones. SOWall visualizes the time and weather of the moment on the phone's background screen, the wallpaper, enabling users to experience the outdoor ambience of the moment.

Image-to-image translation tasks are challenging because they require translating only the elements of interest while accurately preserving all non-target elements. Broadly, TI2I tasks can be categorized into object-editing and object-preserving tasks. In object-editing, only the target object is modified while the background is preserved. In object-preserving tasks, by contrast, the identity, structure, and detailed features of the target object are maintained while the background and style of the source image are modified. The AI Wallpaper is an example of an object-preserving image-to-image translation task. To preserve the target objects in an image, we propose Seg-scale, a technique that combines segmentation and masked editing. It preserves the identity of the target object while the background is modified naturally by a diffusion model. To maintain the high quality of the generated wallpapers, we developed a novel CLIP-based metric and a textual inversion method; together, they capture the emotions of an image and the intent of an edit prompt well. To run the entire model on-device with limited memory and computation, we developed an efficient tiling-based diffusion inference scheme that significantly reduces the device resource requirements while still generating very high-quality wallpaper images.

Figure 4. The overall inference pipeline of SOWall. To preserve the important object, we develop Seg-scale, which uses a segmentation map as an importance map during the classifier-free guidance (CFG) process. Applying Seg-scale adds only one more (very light) CFG calculation. Image and text embeddings are injected into the wallpaper U-Net model as conditions; the text embedding is selected based on the current time/weather information. Our model follows the latent diffusion formulation, so a VAE encoder and decoder are also used.

Models

Diffusion models


Diffusion models are the new state of the art in image generation, built on the mathematics of diffusion and Markov processes. With these models, image editing [1] and text-guided image translation [2] are also possible. Figure 4 shows the wallpaper editing pipeline. On-device wallpaper editing is performed in three main steps:

1. Encoding
2. De-noising (repeated multiple times, T)
3. Decoding

During the encoding step, the original wallpaper image and the weather text are transformed into latent-space embeddings using a variational autoencoder (VAE) and a text encoder, respectively. These embeddings condition the U-Net during the de-noising step, which runs repeatedly (T times) on-device. Classifier-free guidance [3] is used to control how strongly the conditions are applied during generation. Finally, in the last step, the predicted de-noised latent is passed to the VAE decoder, which generates the edited wallpaper.
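
The three steps can be sketched in a few lines of code. The snippet below is a minimal, illustrative latent-diffusion editing loop with classifier-free guidance; the module names (vae, text_encoder, unet, scheduler) and all hyper-parameter values are placeholders, not the actual on-device implementation.

```python
import torch

def edit_wallpaper(image, weather_prompt, vae, text_encoder, unet, scheduler,
                   strength=0.6, guidance_scale=7.5, num_steps=20):
    """Illustrative latent-diffusion editing loop with classifier-free guidance.

    All modules (vae, text_encoder, unet, scheduler) are placeholders for the
    on-device components; hyper-parameters are assumptions, not product values.
    """
    # 1. Encoding: image -> latent, weather text -> conditioning embedding
    latents = vae.encode(image)
    cond = text_encoder(weather_prompt)          # e.g. "rainy night"
    uncond = text_encoder("")                    # empty prompt for CFG

    # Image-to-image editing starts from a partially noised latent
    start = int(num_steps * strength)
    timesteps = scheduler.timesteps[-start:]
    latents = scheduler.add_noise(latents, torch.randn_like(latents), timesteps[0])

    # 2. De-noising: repeated T times, conditioned via the U-Net
    for t in timesteps:
        noise_uncond = unet(latents, t, uncond)
        noise_cond = unet(latents, t, cond)
        # CFG: amplify the conditional direction by guidance_scale
        noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        latents = scheduler.step(noise, t, latents)

    # 3. Decoding: latent -> edited wallpaper
    return vae.decode(latents)
```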

Object preservation


Image translation driven by time and weather conditions may result in unintended transformations. To resolve this, we developed Seg-scale (orange box in Figure 4), a method that separates the segmented foreground from the background and applies different guidance scales to each, so that the identity of the primary object in the image is preserved. Note that concurrent work [7] exists; the implementation shipped on the Galaxy S24 was an earlier version of our method.

As depicted in Figure 5, the Seg-scale method effectively mitigates changes to the target area while successfully reflecting the specified conditions in other regions. For snowy conditions (see Figure 5a), without Seg-scale the clothes in the image tend to change into winter garments, such as down or leather jackets, that fit the weather but are not intended. With Seg-scale, we can apply less modification to the segmented region, so the clothes are preserved across the weather change. For night conditions (see Figure 5b), the main objects should not be excessively darkened. A diffusion model would darken the entire image regardless of our intention, so we inject our intention via Seg-scale; after applying it, the main object remains bright.

Figure 5. Seg-scale examples. Diffusion-based text-guided image-to-image models translate images by considering context, so for the 'snowy' condition the clothes may change into winter wear. Because such results were not what we expected, we developed Seg-scale to prevent unwanted changes.
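
To make the idea concrete, below is a minimal sketch of how Seg-scale can be realized: the scalar CFG guidance scale is replaced with a per-pixel scale map built from the segmentation mask, so the segmented object receives weaker guidance (less modification) than the background. The function name and the specific scale values are illustrative assumptions, not the shipped implementation.

```python
import torch

def seg_scale_guidance(noise_uncond, noise_cond, seg_mask,
                       fg_scale=2.0, bg_scale=7.5):
    """Spatially varying classifier-free guidance (Seg-scale sketch).

    seg_mask: (1, 1, H, W) tensor in [0, 1]; 1 marks the object to preserve.
    A smaller guidance scale inside the object keeps it close to the source
    image, while the background still follows the weather/time prompt.
    The scale values here are illustrative, not the product values.
    """
    # Per-pixel guidance scale: fg_scale inside the object, bg_scale outside
    scale_map = seg_mask * fg_scale + (1.0 - seg_mask) * bg_scale
    return noise_uncond + scale_map * (noise_cond - noise_uncond)
```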

We also developed a Structure Loss (SL), designed to assess how well the transformed image maintains the original structure. An XDoG filter [4] is applied to both the original and edited images to create sketch-like images in which only the contours remain. These images are then processed through a visual feature extractor such as VGGNet, and the mean squared error (MSE) between the two sets of features is computed.
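
A hedged sketch of this computation is shown below, using a plain difference-of-Gaussians edge map as a simplified stand-in for XDoG [4] and a pretrained VGG16 as the feature extractor; the blur parameters, the chosen VGG layer, and the lack of input normalization are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms.functional as TF

def dog_sketch(img, sigma1=1.0, sigma2=1.6):
    """Difference-of-Gaussians edge map: a simplified stand-in for XDoG [4].

    img: (B, 3, H, W) tensor with values in [0, 1].
    """
    gray = img.mean(dim=1, keepdim=True)
    edges = TF.gaussian_blur(gray, kernel_size=9, sigma=sigma1) \
          - TF.gaussian_blur(gray, kernel_size=9, sigma=sigma2)
    return edges.repeat(1, 3, 1, 1)      # back to 3 channels for VGG

# Pretrained VGG16 features; the cut-off layer (index 16) is an assumption
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()

def structure_loss(original, edited):
    """MSE between VGG features of the edge maps of the original/edited images."""
    with torch.no_grad():
        f_orig = vgg(dog_sketch(original))
        f_edit = vgg(dog_sketch(edited))
    return F.mse_loss(f_edit, f_orig)
```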

Sentimental and Emotional outputs


Prompt engineering is also one of the most important ways to improve the quality of generated images efficiently, without additional training. On top of our dedicated prompt engineering, we adopted Textual Inversion (TI) [5] to target the ambience of sample photos so that images are generated according to our intention. Although our approach is TI2I, TI tokens trained on text-to-image (T2I) models also work well.
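
As an illustration, the snippet below shows the standard way a learned TI embedding can be plugged into a CLIP text encoder as a new pseudo-token. The token name <rainy-ambience>, the embedding file, and the use of the public OpenAI CLIP checkpoint are all hypothetical stand-ins for the encoder actually used on the device.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Public CLIP text encoder as a stand-in for the on-device text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical embedding learned via Textual Inversion on a T2I model
learned_embed = torch.load("rainy_ambience_embedding.pt")   # shape (768,)

# Register the pseudo-token and overwrite its embedding with the learned vector
tokenizer.add_tokens(["<rainy-ambience>"])
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<rainy-ambience>")
text_encoder.get_input_embeddings().weight.data[token_id] = learned_embed

# The pseudo-token can now be used inside the edit prompt
prompt = "a photo of the scene on a <rainy-ambience> evening"
```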

We also developed and used a Customized CLIP Score (CC), which measures how well a modified image reflects a positive text prompt while avoiding a negative one. This dual assessment allows for a more nuanced evaluation of how text prompts influence visual modifications.
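
One plausible way to compute such a score with the public CLIP model is sketched below: the image embedding's similarity to the positive prompt minus its similarity to the negative prompt. The exact formulation and weighting used for the product are not disclosed here; this is an illustrative assumption.

```python
import torch
import clip  # OpenAI CLIP package
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def customized_clip_score(image_path, positive_prompt, negative_prompt):
    """Reward similarity to the positive prompt, penalize the negative one.

    This exact formulation is an illustrative assumption, not the product metric.
    """
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    tokens = clip.tokenize([positive_prompt, negative_prompt])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    pos_sim, neg_sim = (img_feat @ txt_feat.T).squeeze(0)
    return (pos_sim - neg_sim).item()

# Example: check that an edit moved towards the target condition
# score = customized_clip_score("edited.png", "a rainy night scene", "a sunny day scene")
```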

Enabling On-Device Deployment of the AI Wallpaper Model

To ensure users’ data privacy and support AI features without an internet connection, we deploy the entire AI wallpaper model on-device. This ambitious task goes well beyond the controlled environment of a research lab and requires improvements both in the efficiency of the core diffusion methodology and in on-device engineering. Below, we highlight some of the challenges that Samsung AI Center – Cambridge (SAIC-C) had to address to bring AI wallpaper to the Galaxy S24, Z Fold 6, and Z Flip 6 devices.

Low latency and limited memory requirements


To run a wallpaper image translation model on a smartphone with limited system resources, the key challenge is keeping overall memory use and latency as low as possible for good usability. Our on-device AI wallpaper feature, running on the GenAI platform of Samsung R&D Institute India-Bangalore (SRI-B), has been carefully optimized for memory and latency using a novel quantization scheme. While quantizing the model, we use a limited amount of data to learn accurate quantization statistics, which allows us to generate very high-quality AI wallpaper images on Samsung smartphones.
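
The calibration idea can be illustrated with a simple sketch: run a handful of representative inputs through the model, record per-layer activation ranges, and turn them into INT8 scale/zero-point pairs. This is a generic post-training-quantization example, not the quantization scheme actually used on the device.

```python
import torch

def calibrate_int8(model, calibration_batches):
    """Collect activation ranges on a small calibration set and derive
    per-tensor INT8 scale/zero-point pairs. A simplified sketch of
    post-training quantization calibration, not the on-device scheme."""
    ranges = {}

    def hook(name):
        def fn(module, inp, out):
            lo, hi = out.min().item(), out.max().item()
            old = ranges.get(name, (lo, hi))
            ranges[name] = (min(old[0], lo), max(old[1], hi))
        return fn

    # Attach hooks to leaf modules only
    handles = [m.register_forward_hook(hook(n))
               for n, m in model.named_modules() if len(list(m.children())) == 0]
    with torch.no_grad():
        for batch in calibration_batches:
            model(batch)
    for h in handles:
        h.remove()

    # Asymmetric INT8 quantization parameters per layer
    qparams = {}
    for name, (lo, hi) in ranges.items():
        scale = max(hi - lo, 1e-8) / 255.0
        zero_point = round(-lo / scale)
        qparams[name] = (scale, zero_point)
    return qparams
```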

Support for arbitrary high-resolution wallpapers


To facilitate fully fledged ambient wallpaper generation, our model must support arbitrary high-resolution images provided by mobile phone users, while still running under the strict memory and latency constraints discussed above. Interestingly, memory becomes an important bottleneck when supporting high-resolution wallpaper generation. Figure 6 illustrates the problem: we ran the quantized AI wallpaper model for varying input resolutions on a Samsung Galaxy S23's Hexagon NPU. Our measurements show that both latency and memory requirements become major bottlenecks for generating high-resolution wallpapers, especially for input images larger than 512x512.

Figure 6. On-device measurements of the latency and the memory footprint of running the quantized AI wallpaper model on Snapdragon 8 Gen2 HW for various input image sizes.

To overcome these bottlenecks, we developed an on-device tiling-based solution, based on [6], that allows high-resolution AI wallpapers to be generated with limited memory and latency. As shown in Figure 7, the tiling approach divides the input wallpaper image into a series of low-resolution tiles with fixed width and height. Each tile is processed by our model to generate an edited tile, and the edited tiles are then stitched together to produce the modified AI wallpaper. With this tiling approach, users can edit their own photos of any resolution into beautiful AI wallpapers.
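
A minimal sketch of the tile-split-and-stitch idea is given below. The production pipeline, following [6], blends overlapping tiles smoothly in latent space; this illustration simply crops overlapping tiles from the image, edits each with a caller-supplied function, and averages the results in the overlaps. The tile size and overlap values are assumptions.

```python
import torch

def edit_in_tiles(image, edit_fn, tile=512, overlap=64):
    """Split a high-resolution image into overlapping tiles, edit each tile,
    and stitch the results back with simple averaging in the overlaps.

    image: (C, H, W) tensor; edit_fn: callable that edits one tile.
    The real feature blends tiles in latent space following [6]; the tile
    size, overlap, and averaging used here are simplifying assumptions.
    """
    C, H, W = image.shape
    out = torch.zeros_like(image)
    weight = torch.zeros(1, H, W)
    step = tile - overlap
    for top in range(0, max(H - overlap, 1), step):
        for left in range(0, max(W - overlap, 1), step):
            bottom, right = min(top + tile, H), min(left + tile, W)
            tile_out = edit_fn(image[:, top:bottom, left:right])
            out[:, top:bottom, left:right] += tile_out
            weight[:, top:bottom, left:right] += 1.0
    return out / weight   # every pixel is covered by at least one tile
```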

Figure 7. Overview of tiling-aware ambient AI wallpaper generation for high-resolution images.

Ensuring responsiveness


Even with an optimized model and software stack, synthesizing high-resolution images with limited memory is not instantaneous. Given this, and the fact that the ambient AI wallpaper functionality needs to run as a background service, the execution of the pipeline is likely to overlap with other activities the user is performing on the phone. In the worst case, this overlap might even saturate the available computational resources and significantly degrade the user experience. The problem is particularly pronounced for high-priority, real-time functionality, such as an incoming phone call, especially when combined with the aforementioned real-time translation. To keep the phone responsive even in the worst case, we developed an efficient and swift pausing/cancelling mechanism within our on-device GenAI framework. Thanks to this, the wallpaper pipeline can suspend its execution and release memory when a high-priority request arrives from the OS, and later resume from where it was suspended.
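
The mechanism itself lives in the C++ GenAI framework; the following Python sketch only illustrates the core idea of checkpointing between de-noising steps so that the job can be paused on a high-priority request and resumed later. All names are hypothetical.

```python
import threading

class PausableDiffusionJob:
    """Sketch of a pausable/cancellable de-noising loop for a background service."""

    def __init__(self, latents, timesteps, denoise_step):
        self.latents = latents
        self.timesteps = list(timesteps)
        self.next_index = 0
        self.denoise_step = denoise_step        # callable(latents, t) -> latents
        self.pause_requested = threading.Event()

    def run(self):
        # Resume from wherever the previous run was suspended
        while self.next_index < len(self.timesteps):
            if self.pause_requested.is_set():
                return "paused"                 # caller may release memory here
            t = self.timesteps[self.next_index]
            self.latents = self.denoise_step(self.latents, t)
            self.next_index += 1
        return "done"

    def pause(self):
        # Called when a high-priority request (e.g. an incoming call) arrives
        self.pause_requested.set()

    def resume(self):
        self.pause_requested.clear()
        return self.run()
```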

Abstracting hardware details away from GenAI functionality


A unified platform, GenAI, was designed and developed to support all plausible generative AI features by seamlessly enabling inference for different use cases, such as the on-device ambient AI wallpaper. This framework and our AI wallpaper feature are developed in C++ for lower latency and better memory management. Additionally, they abstract over the different hardware platforms used across Samsung phones to provide a consistent user experience.

Results

Qualitative results


Our model, SOWall, supports 8 different themes, combining three time-of-day conditions (noon, evening, night) with three weather conditions (clear, rainy, snowy). The input image is assumed to be a photo taken on a clear day during the daytime (see Figure 8 for qualitative results).

Figure 8. Examples from our TI2I model, SOWall. It reflects the time and weather in real time, taking the information from the time and weather apps on the phone.

Quantitative results


A correlation analysis with qualitative evaluation results was performed to find the quantitative metrics most suitable for the wallpaper image translation task. Evaluators compared images converted by the old model and the new model, awarding points to the better side. The qualitative assessment criteria that could be replaced by quantitative evaluation involve the following aspects: (1) Is the weather or time condition appropriately reflected? (2) Is the structure of the objects in the image well maintained? As shown in Table 1, IS, CC, and SL show a high Kendall correlation, while DC has a low correlation.
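
For reference, the correlation itself can be computed with a few lines of SciPy; the scores below are placeholder values purely to demonstrate the calculation, not actual evaluation data.

```python
from scipy.stats import kendalltau

# Placeholder example: per-image human preference (old vs. new model) and the
# corresponding automatic-metric differences; the numbers are illustrative only.
human_scores = [1, -1, 1, 1, 0, -1, 1, 0]
metric_deltas = [0.12, -0.05, 0.20, 0.08, 0.01, -0.11, 0.15, 0.02]

tau, p_value = kendalltau(human_scores, metric_deltas)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```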

Table 1. Kendall rank correlation of each quantitative metric with the qualitative evaluation results. Quantitative metrics: image CLIP similarity (IS), directional CLIP score (DC), Customized CLIP score (CC), and Structure Loss (SL).

Discussion & Conclusion

We developed SOWall, an object-preserving TI2I translation model that reflects changing time and weather conditions. Seg-scale enables differentiated treatment of foreground and background, offering precise control over preservation and adaptation without substantial additional computation. A Guided Filter is applied effectively so that detailed structures are also preserved. Textual inversion captures more nuanced colors and ambience, and tiling addresses memory constraints to produce high-resolution images.

A wallpaper may seem like a simple backdrop, but it holds the memories of the day we chose to personalize our screens with our own photos. It refreshes and brings joy to our daily lives in an emotional way. With the AI-powered Galaxy Z series, we aim to enhance these emotions even further by incorporating real-time information such as weather and time: if it is raining outside, you will see rain effects on your screen. Your wallpapers can now transform dynamically using a state-of-the-art image generative model. Embrace this new experience and look forward to more delightful surprises from AI on Samsung devices. We conclude our post with a poem by the poet Kim Sowol, whose name has the same pronunciation as SOWall in Korean.

The setting sun guides us down the path of twilight / the clouds vanish into the dark in the distant mountains --- Kim Sowol, Desire to meet.

References

[1] MENG, Chenlin, et al. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

[2] BROOKS, Tim; HOLYNSKI, Aleksander; EFROS, Alexei A. Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 18392-18402.

[3] HO, Jonathan; SALIMANS, Tim. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[4] WINNEMÖLLER, Holger; KYPRIANIDIS, Jan Eric; OLSEN, Sven C. XDoG: An eXtended difference-of-Gaussians compendium including advanced image stylization. Computers & Graphics, 2012, 36.6: 740-753.

[5] GAL, Rinon, et al. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

[6] JIMÉNEZ, Álvaro Barbero. Mixture of diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412, 2023.

[7] SHEN, Dazhong, et al. Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 9370-9379.