
FlowFormer: A Transformer Architecture for Optical Flow

By Chao Zhang, Samsung R&D Institute China - Beijing
By Qiang Wang, Samsung R&D Institute China - Beijing

Background

Optical flow aims at estimating per-pixel correspondences between a source image and a target image, in the form of a 2D displacement field. In many downstream video tasks, such as action recognition [45, 36, 60], video inpainting [28, 49, 13], video super-resolution [30, 5, 38], and frame interpolation [50, 33, 20], optical flow serves as a fundamental component that provides dense correspondences as valuable clues for prediction.

Recently, transformers have attracted much attention for their ability to model long-range relations, which can benefit optical flow estimation. Perceiver IO [24] is the pioneering work that learns optical flow regression with a transformer-based architecture. However, it operates directly on the pixels of image pairs and ignores the well-established domain knowledge of encoding visual similarities as costs for flow estimation. It thus requires a large number of parameters and training examples to capture the desired input-output mapping. We therefore raise a question: can we enjoy both the advantages of transformers and of the cost volume from the previous milestones? Such a question calls for designing novel transformer architectures for optical flow estimation that can effectively aggregate information from the cost volume. In this paper, we introduce the novel optical Flow TransFormer (FlowFormer) to address this challenging problem.

Our contributions are fourfold. 1) We propose a novel transformer-based neural network architecture, FlowFormer, for optical flow estimation, which achieves state-of-the-art flow estimation performance. 2) We design a novel cost volume encoder that effectively aggregates cost information into compact latent cost tokens. 3) We propose a recurrent cost decoder that recurrently decodes cost features with dynamic positional cost queries to iteratively refine the estimated optical flows. 4) To the best of our knowledge, we validate for the first time that an ImageNet-pretrained transformer can benefit the estimation of optical flow.

Method

The task of optical flow estimation is to output a per-pixel displacement field f : R² → R² that maps every 2D location x ∈ R² of a source image I_s to its corresponding 2D location p = x + f(x) in a target image I_t. To take advantage of recent vision transformer architectures as well as the 4D cost volumes widely used by previous CNN-based optical flow estimation methods, we propose FlowFormer, a transformer-based architecture that encodes and decodes the 4D cost volume to achieve accurate optical flow estimation. Fig. 1 shows the overall architecture of FlowFormer, which processes the 4D cost volume built from siamese features with two main components: 1) a cost volume encoder that encodes the 4D cost volume into a latent space to form a cost memory, and 2) a cost memory decoder that predicts a per-pixel displacement field based on the encoded cost memory and contextual features.
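
As a small illustration of the displacement-field definition above, the sketch below (ours, not the authors' code) adds a flow field f to a grid of source coordinates to obtain the corresponding target coordinates p = x + f(x):

```python
import torch

def target_coords(flow: torch.Tensor) -> torch.Tensor:
    """flow: (H, W, 2) displacement field f; returns p = x + f(x) for every pixel."""
    H, W, _ = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x = torch.stack((xs, ys), dim=-1).to(flow.dtype)  # source coordinates (x, y)
    return x + flow                                   # corresponding target coordinates

p = target_coords(torch.zeros(4, 6, 2))  # a zero flow maps every pixel to itself
```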

Figure 1.  Architecture of FlowFormer. FlowFormer estimates optical flow in three steps: 1) building a 4D cost volume from image features; 2) a cost volume encoder encodes the cost volume into a cost memory; 3) a recurrent transformer decoder decodes the cost memory, together with the source image's context features, into flows.

Building the 4D Cost Volume

A backbone vision network is used to extract an H × W × D_f feature map from an input H_I × W_I × 3 RGB image, where we typically set (H, W) = (H_I/8, W_I/8). After extracting the feature maps of the source image and the target image, we construct an H × W × H × W 4D cost volume by computing the dot-product similarities between all pairs of pixels from the source and target feature maps.
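
A minimal sketch of this all-pairs construction, assuming siamese feature maps of shape D_f × H × W (the function name, tensor shapes, and scaling are ours for illustration, not the released implementation):

```python
import torch

def build_cost_volume(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """feat_s, feat_t: (D_f, H, W) source/target features -> (H, W, H, W) cost volume."""
    D = feat_s.shape[0]
    # Dot-product similarity between every source pixel and every target pixel.
    cost = torch.einsum("dhw,dij->hwij", feat_s, feat_t)
    return cost / D ** 0.5  # scale by sqrt(D_f), a common normalization for dot products

cv = build_cost_volume(torch.randn(256, 46, 62), torch.randn(256, 46, 62))
print(cv.shape)  # torch.Size([46, 62, 46, 62])
```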

Cost Volume Encoder

To estimate optical flows, the corresponding positions in the target image of source pixels need to be identified based on source-target visual similarities encoded in the 4D cost volume. The built 4D cost volume can be viewed as a series of 2D cost maps of size H × W, each of which measures the visual similarities between a single source pixel and all target pixels. We denote source pixel x's cost map as M_x ∈ R^(H×W). Finding corresponding positions in such cost maps is generally challenging, as there might exist repeated patterns and non-discriminative regions in the two images. The task becomes even more challenging when only considering costs from a local window of the map, as previous CNN-based optical flow estimation methods do. Even for estimating a single source pixel's accurate displacement, it is beneficial to take its contextual source pixels' cost maps into consideration.

To tackle this challenging problem, we propose a transformer-based cost volume encoder that encodes the whole cost volume into a cost memory. Our cost volume encoder consists of three steps: 1) cost map patchification, 2) cost patch token embedding, and 3) cost memory encoding.
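
The toy module below is a much-simplified sketch of these three steps (patchify each source pixel's cost map, embed the patches as tokens, and run attention over them to form a cost memory). The real FlowFormer further summarizes patches into compact latent cost tokens with alternate-group transformer layers; the class name, layer choices, and hyper-parameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyCostEncoder(nn.Module):
    def __init__(self, patch=8, dim=128, layers=3):
        super().__init__()
        # Steps 1+2: cost-map patchification and patch-token embedding.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Step 3: cost memory encoding with self-attention.
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, cost_volume):                       # (H, W, H, W)
        H, W = cost_volume.shape[:2]
        maps = cost_volume.reshape(H * W, 1, H, W)        # one 2D cost map per source pixel
        tokens = self.patch_embed(maps).flatten(2).transpose(1, 2)  # (HW, N_patch, dim)
        return self.encoder(tokens)                       # cost memory: tokens per source pixel

memory = ToyCostEncoder()(torch.randn(16, 16, 16, 16))
print(memory.shape)  # torch.Size([256, 4, 128])
```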

Cost Memory Decoder for Flow Estimation

Given the cost memory encoded by the cost volume encoder, we propose a cost memory decoder to predict optical flows. Since the original resolution of the input image is H_I × W_I, we estimate optical flow at the H × W resolution and then upsample the predicted flows to the original resolution with a learnable convex upsampler [46]. However, in contrast to previous vision transformers that seek abstract semantic features, optical flow estimation requires recovering dense correspondences from the cost memory. Inspired by RAFT [46], we propose to use cost queries to retrieve cost features from the cost memory and iteratively refine flow predictions with a recurrent attention decoder layer.
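
The following is an illustrative sketch (not the authors' code) of this recurrent decoding idea: at each iteration a cost query, built from the context feature and the current flow estimate, cross-attends to the cost memory and produces a residual flow update. Module names, shapes, and the simple linear update head are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ToyRecurrentDecoder(nn.Module):
    def __init__(self, dim=128, iters=12):
        super().__init__()
        self.iters = iters
        self.query_proj = nn.Linear(dim + 2, dim)            # context + current flow -> cost query
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_delta = nn.Linear(dim, 2)                    # retrieved cost feature -> residual flow

    def forward(self, cost_memory, context):
        # cost_memory: (HW, N, dim); context: (HW, dim); flow starts from zero.
        flow = torch.zeros(context.shape[0], 2)
        for _ in range(self.iters):
            q = self.query_proj(torch.cat([context, flow], dim=-1)).unsqueeze(1)  # (HW, 1, dim)
            feat, _ = self.cross_attn(q, cost_memory, cost_memory)                # retrieve from cost memory
            flow = flow + self.to_delta(feat.squeeze(1))                          # iterative refinement
        return flow                                          # (HW, 2); convex upsampling would follow

flow = ToyRecurrentDecoder()(torch.randn(256, 4, 128), torch.randn(256, 128))
```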

Experiment

We evaluate our FlowFormer on the Sintel [3] and KITTI-2015 [14] benchmarks. Following previous works, we train FlowFormer on FlyingChairs [12] and FlyingThings [35], and then finetune it for the Sintel and KITTI benchmarks, respectively. FlowFormer achieves state-of-the-art performance on both benchmarks.

Experimental setup. We use the average end-point error (AEPE) and F1-all (%) metrics for evaluation. AEPE is the mean flow error over all valid pixels. F1-all is the percentage of outlier pixels, where a pixel counts as an outlier if its flow error exceeds both 3 pixels and 5% of the ground-truth flow magnitude. The Sintel dataset is rendered from the same 3D model in two passes, i.e., the clean pass and the final pass. The clean pass is rendered with smooth shading and specular reflections. The final pass uses the full rendering settings, including motion blur, camera depth-of-field blur, and atmospheric effects.
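
As a hedged sketch of the two metrics described above (the function is ours; the outlier thresholds follow the standard KITTI convention, in which a pixel is an outlier only when its error exceeds both 3 px and 5% of the ground-truth magnitude):

```python
import torch

def aepe_and_f1_all(pred, gt, valid):
    """pred, gt: (H, W, 2) flows; valid: (H, W) boolean mask of valid pixels."""
    err = torch.linalg.norm(pred - gt, dim=-1)[valid]    # per-pixel end-point error
    mag = torch.linalg.norm(gt, dim=-1)[valid]           # ground-truth flow magnitude
    aepe = err.mean()
    outlier = (err > 3.0) & (err > 0.05 * mag)           # KITTI outlier criterion
    f1_all = 100.0 * outlier.float().mean()
    return aepe.item(), f1_all.item()

aepe, f1 = aepe_and_f1_all(torch.zeros(4, 5, 2), torch.zeros(4, 5, 2),
                           torch.ones(4, 5, dtype=torch.bool))
```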

Table 1.  Experiments on the Sintel [3] and KITTI [14] datasets. * denotes methods that use the warm-start strategy [46], which relies on previous image frames in a video. ‘A’ denotes the AutoFlow dataset. ‘C + T’ denotes training only on the FlyingChairs and FlyingThings datasets. ‘+ S + K + H’ denotes finetuning on the combination of the Sintel, KITTI, and HD1K training sets. Our FlowFormer achieves the best generalization performance (C+T) and ranks 1st on the Sintel benchmark (C+T+S+K+H).

Figure 2.  Qualitative comparison on the Sintel test set. FlowFormer greatly reduces flow leakage around object boundaries (pointed to by red arrows) and recovers clearer details (pointed to by blue arrows).

Table 2.  Ablation study. We gradually change one component of RAFT at a time to obtain our FlowFormer model. MCR → LCT + CMD: replacing RAFT's decoder with our latent cost tokens + cost memory decoder. CNN → Twins: replacing RAFT's CNN encoder with the Twins-SVT transformer. Cost Encoding: adding intra-cost-map and inter-cost-map attention to form an Alternate-Group Transformer (AGT) layer in the encoder. 3 AGT layers are used in our final model.

Table 3.  FlowFormer vs. GMA. Ours (small) is a small version of FlowFormer that uses the CNN image feature encoder of GMA. GMA-L is a large version of GMA. GMA-Twins replaces GMA's CNN image feature encoder with the pre-trained Twins. (+x%) indicates that the model obtains x% larger error than ours.

Conclusion

We have proposed FlowFormer, a Transformer-based architecture for optical flow estimation. To the best of our knowledge, FlowFormer is the first method that deeply integrates transformers with cost volumes for optical flow estimation. Thanks to the compact cost tokens and the long-range relation modeling ability of transformers, FlowFormer achieves state-of-the-art accuracy and shows strong cross-dataset generalization.

Link to the paper

https://arxiv.org/abs/2202.06258

References

1. Black, M.J., Anandan, P.: A framework for the robust estimation of optical flow. In: 1993 (4th) International Conference on Computer Vision. pp. 231–236. IEEE (1993)

2. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision 61(3), 211–231 (2005)

3. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European conference on computer vision. pp. 611–625. Springer (2012)

4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)

5. Chan, K.C., Wang, X., Yu, K., Dong, C., Loy, C.C.: Basicvsr: The search for essential components in video super-resolution and beyond. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4947–4956 (2021)

6. Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12299–12310 (2021)

7. Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems 34 (2021)

8. Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting spatial attention design in vision transformers. arXiv preprint arXiv:2104.13840 (2021)

9. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019)

10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

11. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

12. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2758–2766 (2015)

13. Gao, C., Saraf, A., Huang, J.B., Kopf, J.: Flow-edge guided video completion. In: European Conference on Computer Vision. pp. 713–729. Springer (2020)

14. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013)

15. Guo, M.H., Cai, J.X., Liu, Z.N., Mu, T.J., Martin, R.R., Hu, S.M.: Pct: Point cloud transformer. Computational Visual Media 7(2), 187–199 (2021)

16. Hofinger, M., Bulo, S.R., Porzi, L., Knapitsch, A., Pock, T., Kontschieder, P.: Improving optical flow on a pyramid level. In: European Conference on Computer Vision. pp. 770–786. Springer (2020)

17. Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial intelligence 17(1-3), 185–203 (1981)

18. Huang, Z., Pan, X., Xu, R., Xu, Y., Zhang, G., Li, H., et al.: Life: Lighting invariant flow estimation. arXiv preprint arXiv:2104.03097 (2021)

19. Huang, Z., Zhou, H., Li, Y., Yang, B., Xu, Y., Zhou, X., Bao, H., Zhang, G., Li, H.: Vs-net: Voting with segmentation for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6101–6111 (2021)

20. Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Rife: Real-time intermediate flow estimation for video frame interpolation. arXiv preprint arXiv:2011.06294 (2020)

21. Hui, T.W., Tang, X., Loy, C.C.: Liteflownet: A lightweight convolutional neural network for optical flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8981–8989 (2018)

22. Hui, T.W., Tang, X., Loy, C.C.: A lightweight optical flow cnn—revisiting data fidelity and regularization. IEEE transactions on pattern analysis and machine intelligence 43(8), 2555–2569 (2020)

23. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2462–2470 (2017)

24. Jaegle, A., Borgeaud, S., Alayrac, J.B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E., et al.: Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795 (2021)

25. Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. arXiv preprint arXiv:2104.02409 (2021)

26. Jiang, S., Lu, Y., Li, H., Hartley, R.: Learning optical flow from a few matches. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16592–16600 (2021)

27. Jiang, W., Trulls, E., Hosang, J., Tagliasacchi, A., Yi, K.M.: Cotr: Correspondence transformer for matching across images. arXiv preprint arXiv:2103.14167 (2021)

28. Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5792–5801 (2019)

29. Kondermann, D., Nair, R., Honauer, K., Krispin, K., Andrulis, J., Brock, A., Gussefeld, B., Rahimimoghaddam, M., Hofmann, S., Brenner, C., et al.: The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 19–28 (2016)

30. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 624–632 (2017)

31. Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1833–1844 (2021)

32. Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., Wang, X., Dai, J., Li, H.: Fuseformer: Fusing fine-grained information in transformers for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14040–14049 (2021)

33. Liu, X., Liu, H., Lin, Y.: Video frame interpolation via optical flow estimation with image inpainting. International Journal of Intelligent Systems 35(12), 2087–2102 (2020)

34. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

35. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040–4048 (2016)

36. Piergiovanni, A., Ryoo, M.S.: Representation flow for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9945–9953 (2019)

37. Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4161–4170 (2017)

38. Sajjadi, M.S., Vemulapalli, R., Brown, M.: Frame-recurrent video super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6626–6634 (2018)

39. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4938–4947 (2020)

40. Sun, D., Roth, S., Black, M.J.: A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision 106(2), 115–137 (2014)

41. Sun, D., Vlasic, D., Herrmann, C., Jampani, V., Krainin, M., Chang, H., Zabih, R., Freeman, W.T., Liu, C.: Autoflow: Learning a better training set for optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10093–10102 (2021)

42. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8934–8943 (2018)

43. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: An empirical study of cnns for optical flow estimation. IEEE transactions on pattern analysis and machine intelligence 42(6), 1408–1423 (2019)

44. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8922–8931 (2021)

45. Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W.: Optical flow guided feature: A fast and robust motion representation for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1390–1399 (2018)

46. Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402–419. Springer (2020)

47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)

48. Xu, H., Yang, J., Cai, J., Zhang, J., Tong, X.: High-resolution optical flow from 1d attention and correlation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10498–10507 (2021)

49. Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3723–3732 (2019)

50. Xu, X., Siyao, L., Sun, W., Yin, Q., Yang, M.H.: Quadratic video interpolation. Advances in Neural Information Processing Systems 32 (2019)

51. Xu, Y., Lin, K.Y., Zhang, G., Wang, X., Li, H.: Rnnpose: Recurrent 6-dof object pose refinement with robust correspondence field estimation and pose optimization (2022)

52. Yan, W., Sharma, A., Tan, R.T.: Optical flow in dense foggy scenes using semi-supervised learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13259–13268 (2020)

53. Yang, G., Ramanan, D.: Volumetric correspondence networks for optical flow. Advances in neural information processing systems 32, 794–805 (2019)

54. Yang, L., Xu, Y., Yuan, C., Liu, W., Li, B., Hu, W.: Improving visual grounding with visual-linguistic verification and iterative reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9499–9508 (2022)

55. Yin, Z., Darrell, T., Yu, F.: Hierarchical discrete distribution decomposition for match density estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6044–6053 (2019)

56. Zeng, Y., Fu, J., Chao, H.: Learning joint spatial-temporal transformations for video inpainting. In: European Conference on Computer Vision. pp. 528–543. Springer (2020)

57. Zhang, F., Woodford, O.J., Prisacariu, V.A., Torr, P.H.: Separable flow: Learning motion cost volumes for optical flow estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10807–10817 (2021)

58. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16259–16268 (2021)

59. Zhao, S., Sheng, Y., Dong, Y., Chang, E.I., Xu, Y., et al.: Maskflownet: Asymmetric feature matching with learnable occlusion mask. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6278–6287 (2020)

60. Zhao, Y., Man, K.L., Smith, J., Siddique, K., Guan, S.U.: Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing 2020(1), 1–9 (2020)

61. Zheng, Y., Zhang, M., Lu, F.: Optical flow in the dark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6749–6757 (2020)