Image Generators with Conditionally-Independent Pixel Synthesis

By Kirill Demochkin, Samsung AI Center – Moscow

Figure 1. Samples from our generators trained on several challenging datasets (LSUN Churches, FFHQ, Landscapes, Satellite-Buildings, Satellite-Landscapes) at resolution 256 × 256. The images are generated without spatial convolutions, upsampling, or self-attention operations. No interaction between pixels takes place during inference. image source

Introduction:

Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random latent vector and the coordinates of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capabilities of such generators when trained in an adversarial fashion, and find that the new generators achieve generation quality similar to or better than that of state-of-the-art convolutional generators such as NVIDIA's StyleGAN2. We also investigate several interesting properties unique to the new architecture.

State-of-the-art results in unconditional image generation are achieved using large-scale convolutional generators trained in an adversarial fashion [1, 3, 4]. While many nuances and ideas have contributed to the state of the art recently, ever since the introduction of DCGAN [9] such generators have been based around spatial convolutional layers, occasionally augmented with spatial self-attention blocks. Spatial convolutions are also invariably present in other popular generative architectures for images, including autoencoders [6], autoregressive generators, and flow models [5]. Thus, it may seem that spatial convolutions (or at least spatial self-attention) are an unavoidable building block for state-of-the-art image generators.

Recently, a number of works have shown that individual images, or collections of images of the same scene, can be encoded and synthesized using rather different deep architectures: deep multi-layer perceptrons of a special kind [8, 10]. Such architectures use neither spatial convolutions nor spatial self-attention, yet are able to reproduce images rather well. They are, however, restricted to individual scenes. In this work, we investigate whether deep generators for unconditional synthesis of whole image classes can be built using similar architectural ideas and, more importantly, whether the quality of such generators can be pushed to the state of the art.

Perhaps surprisingly, we arrive at a positive answer, at least for medium image resolution (256 × 256). We have designed and trained deep generative architectures for diverse classes of images that achieve generation quality similar to that of the state-of-the-art convolutional generator StyleGAN2 [4] by NVIDIA, even surpassing it on some datasets. Crucially, our generators do not use any form of spatial convolution or spatial attention in their pathway. Instead, they use coordinate encodings of individual pixels, as well as sidewise multiplicative conditioning (weight modulation) on a random vector. Aside from such conditioning, the color of each pixel in our architecture is predicted independently (hence we call our architecture the Conditionally-Independent Pixel Synthesis (CIPS) generator).

Figure 2. Samples from CIPS generators trained on various datasets. The top row of every grid shows real samples, and the remaining rows contain samples from the models. The samples from CIPS generators are plausible and diverse. image source

In addition to suggesting this class of image generators and comparing its quality with state-of-the-art convolutional generators, we also investigate the extra flexibility that is permitted by the independent processing of pixels. This includes an easy extension of synthesis to non-trivial topologies (e.g. cylindrical panoramas), for which the extension of spatial convolutions is known to be non-trivial [7, 2].

Figure 3. Samples of cylindrical panoramas, generated by the CIPS model trained on the Landscapes dataset. The training data contains standard landscape photographs from the Flickr website. No panoramas are provided to the model during training. image source

Furthermore, because pixels are synthesized independently, our generator permits sequential synthesis on memory-constrained computing architectures. This enables our model both to improve the quality of photos and to generate more pixel values in specific areas of an image (i.e., to perform foveated synthesis).
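Because no pixel depends on any other, inference can be split into arbitrary pixel chunks with identical results, which is what makes memory-constrained sequential synthesis possible. A minimal NumPy sketch of this idea, with a hypothetical `toy_g` standing in for the trained CIPS generator:

```python
import numpy as np

def toy_g(coords, z):
    # Hypothetical stand-in for the generator: any per-pixel function of the
    # (x, y) coordinates and the shared latent vector z, returning RGB values.
    return np.tanh(0.01 * coords.sum(axis=1, keepdims=True) + z)

def synthesize_in_chunks(g, coords, z, chunk_size=256):
    # Evaluate the generator one chunk of pixels at a time; peak memory is
    # bounded by chunk_size rather than by the full image resolution.
    parts = [g(coords[i:i + chunk_size], z)
             for i in range(0, len(coords), chunk_size)]
    return np.concatenate(parts, axis=0)

H, W = 32, 32
ys, xs = np.mgrid[0:H, 0:W]
coords = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
z = np.zeros(3)

full = toy_g(coords, z)                            # all pixels at once
chunked = synthesize_in_chunks(toy_g, coords, z)   # same result, less memory
```

Since each chunk is computed independently, `full` and `chunked` are identical up to floating-point evaluation order.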

Figure 4. Samples from CIPS generators learned with memory-constrained patch-based training. Within every grid, the top row contains images from models trained with patches of size 128 × 128 and the bottom row represents outputs from training on 64 × 64 patches. While the samples obtained with such memory-constrained training are meaningful, their quality and diversity are worse compared to standard training. image source

Main Ideas:

Our generator network synthesizes images of a fixed resolution H × W and has a multi-layer perceptron-type architecture G. In more detail, the synthesis of each pixel takes as input a random vector z ∈ Z, shared across all pixels, as well as the coordinates of that pixel (x, y) ∈ {0, …, W−1} × {0, …, H−1}. It then returns the RGB value c ∈ [0, 1]³ of that pixel, G : (x, y, z) → c. Therefore, to compute the whole output image I, the generator G is evaluated at each pair (x, y) of the coordinate grid, while keeping the random part z fixed:

Figure 5. The Conditionally-Independent Pixel Synthesis (CIPS) generator architecture. In the generation pipeline, the coordinates (x, y) of each pixel are encoded (yellow) and processed by a fully-connected (FC) network whose weights are modulated with a latent vector w shared by all pixels. The network returns the RGB value of that pixel. image source

I = {G(x, y; z) | (x, y) ∈ mgrid(H, W)}, where mgrid(H, W) = {(x, y) | 0 ≤ x < W, 0 ≤ y < H} is the set of integer pixel coordinates.
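The independence property in this formula — the value of any pixel can be recomputed in isolation and matches its value in the full image — can be illustrated with a small NumPy sketch, where `toy_g` is a hypothetical stand-in for the real generator G:

```python
import numpy as np

def mgrid(h, w):
    # mgrid(H, W) = {(x, y) | 0 <= x < W, 0 <= y < H}
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs, ys], axis=-1).reshape(-1, 2)

def toy_g(xy, z):
    # Hypothetical per-pixel generator: depends only on this pixel's
    # coordinates and the shared latent vector z.
    x, y = xy
    return np.tanh(0.01 * np.array([x, y, x + y]) + z)

z = np.full(3, 0.5)
H, W = 8, 8
# Evaluate G at every coordinate pair with z fixed, then reshape to an image.
image = np.array([toy_g(xy, z) for xy in mgrid(H, W)]).reshape(H, W, 3)
```

Re-evaluating the generator at a single coordinate pair reproduces exactly the corresponding pixel of `image` — no other pixel is needed.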

Next, a mapping network M (also a perceptron) turns z into a style vector w ∈ W, M : z → w, and all the stochasticity in the generating process comes from this style component.

We then follow the StyleGAN2 approach of injecting the style w into the generation process via weight modulation. To keep the paper self-contained, we briefly describe the procedure here.

Any modulated fully-connected (ModFC) layer of our generator can be written in the form ψ = Bφ + b, where φ ∈ Rⁿ is an input, B ∈ Rᵐˣⁿ is a learnable weight matrix modulated with the style w, b ∈ Rᵐ is a learnable bias, and ψ ∈ Rᵐ is an output. The modulation takes place as follows: at first, the style vector w is mapped with a small net to a scale vector s ∈ Rⁿ; then, the i-th column of the underlying learnable matrix is multiplied by sᵢ (followed by the demodulation step of StyleGAN2) to obtain B. After the linear mapping, a LeakyReLU activation is applied to ψ.
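A minimal NumPy sketch of a ModFC layer under these definitions. The shapes and the style-to-scale matrix `A` are illustrative assumptions, and the row-wise demodulation follows StyleGAN2's weight demodulation:

```python
import numpy as np

def modfc(phi, w_style, W, b, A, eps=1e-8):
    # Map the style vector to a per-input-channel scale s in R^n.
    s = A @ w_style
    # Modulate: the i-th column of the learnable matrix W is scaled by s_i.
    B = W * s[None, :]
    # Demodulate: normalize each output row to unit norm, as in StyleGAN2.
    B = B / np.sqrt((B ** 2).sum(axis=1, keepdims=True) + eps)
    # Affine map psi = B phi + b, followed by LeakyReLU (slope 0.2).
    psi = B @ phi + b
    return np.where(psi > 0, psi, 0.2 * psi)
```

Note that the style enters only through the weights; the per-pixel computation itself never looks at neighboring pixels.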

Finally, in our default configuration, we add skip connections every two layers from the intermediate feature maps to RGB values and sum the RGB contributions of the different layers. These skip connections naturally add values corresponding to the same pixel and do not introduce interactions between pixels.

We note that the independence of the pixel generation process makes our model parallelizable at inference time and, additionally, provides flexibility in the latent space. For example, as we show below, in some modified variants of synthesis each pixel can be computed with a different noise vector z, though gradual variation of z across the image is needed to achieve consistent-looking results.

The architecture described above needs an important modification in order to achieve state-of-the-art synthesis quality. Recently, two slightly different versions of positional encoding for coordinate-based, image-producing multi-layer perceptrons (MLPs) were described in the literature. First, SIREN [10] proposed a perceptron with a principled weight initialization and the sine activation function used throughout all layers. Second, the Fourier features approach employed a periodic activation function in the very first layer only. In our experiments, we apply a somewhat in-between scheme: the sine function is used only to obtain the Fourier embedding e_fo, while the other layers use the standard LeakyReLU function. The pixel coordinates are uniformly mapped to the range [−1, 1], and the weight matrix B_fo ∈ R²ˣⁿ is learnable, as in the SIREN paper.
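A NumPy sketch of this first-layer Fourier embedding under the definitions above (the embedding width and the contents of `B_fo` are illustrative; in the model, `B_fo` is learned):

```python
import numpy as np

def fourier_features(x, y, B_fo, H, W):
    # Map integer pixel coordinates uniformly to [-1, 1].
    xn = 2.0 * x / (W - 1) - 1.0
    yn = 2.0 * y / (H - 1) - 1.0
    # B_fo in R^{2 x n} is learnable; the sine nonlinearity is applied only
    # in this first layer, later layers use LeakyReLU.
    return np.sin(np.array([xn, yn]) @ B_fo)
```

The output lies in [−1, 1] for every coordinate, which keeps the scale of the first-layer activations bounded regardless of image size.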

However, using only the Fourier positional encoding turned out to be insufficient to produce plausible images. In particular, we found that the synthesized outputs tend to contain multiple wave-like artifacts. Therefore, we also train a separate vector e_co(x, y) for each spatial position and call these vectors coordinate embeddings; they amount to H × W learnable vectors in total. The full positional encoding e(x, y) is the concatenation of the Fourier features and the coordinate embedding, and serves as the input to the next perceptron layer: G(x, y; z) = G′(e(x, y); M(z)).
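Putting the two parts together, a self-contained NumPy sketch of the full positional encoding e(x, y) (the dimensions and the random initializations below are illustrative placeholders for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, n_fo, n_co = 16, 16, 8, 8

B_fo = rng.normal(size=(2, n_fo))      # learnable Fourier weight matrix
E_co = rng.normal(size=(H, W, n_co))   # H x W table of coordinate embeddings

def positional_encoding(x, y):
    # Fourier features: coordinates mapped to [-1, 1], then a sine layer.
    xn = 2.0 * x / (W - 1) - 1.0
    yn = 2.0 * y / (H - 1) - 1.0
    e_fo = np.sin(np.array([xn, yn]) @ B_fo)
    # Learned per-pixel coordinate embedding, looked up by position.
    e_co = E_co[y, x]
    # Concatenation e(x, y): the input to the perceptron G'.
    return np.concatenate([e_fo, e_co])
```

Note the asymmetry: the Fourier part is a smooth function of (x, y) shared across the image, while the coordinate embedding gives every pixel its own free parameters.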

In conclusion, we presented a new generator model called CIPS: a high-quality architecture with conditionally-independent pixel synthesis, in which each color value is computed using only a random latent vector and the pixel's coordinates.

Our key insight is that the proposed architecture, without spatial convolutions, attention, or upsampling operations, can compete with state-of-the-art convolutional generators and obtain decent quality in terms of FID and precision & recall; such results have not been reported earlier for perceptron-based models. Furthermore, in the spectral domain, the outputs of CIPS are harder to discriminate from real images. Interestingly, the CIPS-NE variant (trained without coordinate embeddings) is weaker in terms of plausibility, yet has a more realistic spectrum.

Figure 6. Result comparison between SAIC-Moscow's CIPS and NVIDIA's StyleGAN2. On some datasets, CIPS achieves better image quality than StyleGAN2. image source

Direct usage of a coordinate grid allows us to work with more complex topologies, such as cylindrical panoramas, simply by replacing the underlying coordinate system.
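One way to realize such a replacement is to parametrize the horizontal coordinate as a point on the unit circle, so that the left and right edges of the panorama meet seamlessly. The sketch below is illustrative; the paper's exact cylindrical parametrization may differ:

```python
import numpy as np

def cylindrical_coords(h, w):
    # Replace the x coordinate by (cos(theta), sin(theta)) on the unit
    # circle; the vertical coordinate is still mapped linearly to [-1, 1].
    xs = np.arange(w)
    ys = np.arange(h)
    theta = 2.0 * np.pi * xs / w
    cx, sx = np.cos(theta), np.sin(theta)
    yn = 2.0 * ys / (h - 1) - 1.0
    # Broadcast the three channels to a (h, w, 3) coordinate grid that can
    # be fed to the positional encoding in place of the planar (x, y) grid.
    return np.stack(np.broadcast_arrays(cx[None, :], sx[None, :],
                                        yn[:, None]), axis=-1)
```

Because cos and sin are 2π-periodic in theta, the encoded coordinates wrap around continuously, and the generator never sees a seam.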

In summary, our generator demonstrates quality on par with the state-of-the-art StyleGAN2 model and surpasses it on some datasets; moreover, it is applicable in a variety of scenarios. We have shown that the model can be successfully applied to generative formulations of foveated rendering and super-resolution. Future development of our approach will study these problems in their image-to-image formulations.

Publication

https://openaccess.thecvf.com/content/CVPR2021/papers/Anokhin_Image_Generators_With_Conditionally-Independent_Pixel_Synthesis_CVPR_2021_paper.pdf

Reference

[1] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

[2] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In International Conference on Learning Representations, 2018.

[3] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, pages 4396–4405, 2019.

[4] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, pages 8107–8116, 2020.

[5] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Proc. NeurIPS, pages 10215–10224, 2018.

[6] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[7] C. H. Lin, C. Chang, Y. Chen, D. Juan, W. Wei, and H. Chen. COCO-GAN: Generation by parts via conditional coordinating. In Proc. ICCV, pages 4511–4520, 2019.

[8] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Proc. ECCV, pages 405–421, Cham, 2020. Springer International Publishing.

[9] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.

[10] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit Neural Representations with Periodic Activation Functions. In Proc. NeurIPS. Curran Associates, Inc., 2020.