Learning Receptive Field Size by Learning Filter Size
Published
IEEE Winter Conference on Applications of Computer Vision (WACV)
Abstract
As a step towards visual reasoning, we present an attention-based structured captioning model that explicitly reasons about complex scenes. Our model consists of three modules: a sequential prediction module, an attention module, and a graph learner module. The sequential prediction module predicts a variable number of related objects conditioned on the generated words, the attention module locates those objects when matching objects are present in the current scene, and the graph learner module captures the relationships between the attended objects. The adaptive multi-glimpse network composed of these three modules is a generalization of existing attention-based networks. Empirical comparisons against state-of-the-art image captioning methods show that our adaptive multi-glimpse network not only performs significantly better on the standard evaluation metrics on COCO, but also generalizes well to zero-shot complex scenes that contain unseen compositions of objects and relationships.
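To make the three-module pipeline concrete, the following is a minimal PyTorch sketch of one decoding step, assuming precomputed region features and a word embedding as inputs. All dimensions, module internals, and the gated stopping rule that makes the glimpse count adaptive are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveMultiGlimpse(nn.Module):
    """Sketch of the three-module pipeline from the abstract (assumed design)."""
    def __init__(self, obj_dim=512, word_dim=300, hidden_dim=512, max_glimpses=4):
        super().__init__()
        self.max_glimpses = max_glimpses
        # Sequential prediction module (assumed): an LSTM cell that emits one
        # query per glimpse, conditioned on the previously generated word.
        self.rnn = nn.LSTMCell(word_dim + obj_dim, hidden_dim)
        self.stop_gate = nn.Linear(hidden_dim, 1)  # decides when to stop glimpsing
        # Attention module (assumed): scores each scene object against the query.
        self.att_obj = nn.Linear(obj_dim, hidden_dim)
        self.att_query = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Graph learner module (assumed): pairwise affinities among glimpses.
        self.graph_proj = nn.Linear(obj_dim, hidden_dim)

    def attend(self, objects, query):
        # objects: (B, N, obj_dim); query: (B, hidden_dim)
        e = torch.tanh(self.att_obj(objects) + self.att_query(query).unsqueeze(1))
        w = torch.softmax(self.att_score(e).squeeze(-1), dim=-1)  # (B, N)
        return torch.bmm(w.unsqueeze(1), objects).squeeze(1)      # (B, obj_dim)

    def forward(self, objects, word_emb):
        B = objects.size(0)
        h = objects.new_zeros(B, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        glimpse = objects.mean(dim=1)  # initial scene context
        glimpses = []
        for _ in range(self.max_glimpses):
            h, c = self.rnn(torch.cat([word_emb, glimpse], dim=-1), (h, c))
            glimpse = self.attend(objects, h)
            glimpses.append(glimpse)
            # Adaptive stopping (assumed rule): halt once the gate predicts
            # that no further related objects remain.
            if torch.sigmoid(self.stop_gate(h)).mean() < 0.5:
                break
        g = torch.stack(glimpses, dim=1)                    # (B, G, obj_dim)
        # Graph learner: relation-aware mixing of the attended objects.
        p = self.graph_proj(g)
        adj = torch.softmax(p @ p.transpose(1, 2), dim=-1)  # (B, G, G)
        return adj @ g                                      # (B, G, obj_dim)

# Usage: 36 region features per image, one word embedding per decoding step.
model = AdaptiveMultiGlimpse()
feats = torch.randn(2, 36, 512)
word = torch.randn(2, 300)
out = model(feats, word)
print(out.shape)  # torch.Size([2, G, 512]) with G <= 4 glimpses
```

Because the loop can exit early through the stop gate, the number of attended objects varies per decoding step, which is what distinguishes this adaptive scheme from fixed-glimpse attention networks.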