ACTA Scientiarum Naturalium Universitatis Pekinensis

Image Caption Method Fusing an Object Spatial Relation Mechanism

WAN Zhang, ZHANG Yujie†, LIU Mingtong, XU Jin’an, CHEN Yufeng

School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; † Corresponding author, E-mail: yjzhang@bjtu.edu.cn

Abstract Focusing on the specific information of positional relationships between objects in an image, a neural network image caption generation model integrating a spatial relationship mechanism is proposed, in order to provide key information (object position or trajectory) for downstream tasks such as visual question answering and voice navigation. To enhance the image encoder's ability to learn the positional relationships between objects, a geometric attention mechanism is introduced by improving the Transformer structure, explicitly integrating the positional relationships between objects into their appearance information. To support extraction and caption generation oriented toward this specific information, a data production method for relative positional relations is further proposed, and Re-position, an image caption dataset of positional relations between objects, is built on the basis of the SpatialSense dataset. Comparative evaluation against five typical models shows that the proposed model outperforms the others on five metrics on the public test set COCO, and on all six metrics on the Re-position dataset.

Key words image caption; positional relationship between objects; attention mechanism; Transformer structure
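The geometric attention described in the abstract can be sketched as follows. This is an illustrative reconstruction in the style of relation-aware attention, not the authors' exact implementation: the choice of relative-geometry features (log-scaled offsets and size ratios), the clipping constant, and all function names here are assumptions for the sketch. The idea is that a pairwise geometric bias, derived from bounding-box geometry, is added to the appearance-based attention logits so that spatial relations explicitly reweight attention.

```python
import numpy as np

def relative_geometry(boxes):
    """Pairwise relative-geometry features for N boxes given as (cx, cy, w, h).

    Returns an (N, N, 4) tensor of log-scaled center offsets and size ratios,
    a common parameterization for geometry-aware attention (assumed here).
    """
    cx, cy, w, h = boxes.T
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometric_attention(Q, K, V, geom_bias):
    """Single-head attention whose logits are fused with a geometric bias.

    Q, K, V: (N, d) appearance projections of the object features.
    geom_bias: (N, N) positive weights produced from relative_geometry
    by a small learned mapping (omitted here; a placeholder in this sketch).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # Fuse geometry into appearance: add the log of the (clipped) geometric
    # weight, so softmax multiplies appearance affinity by spatial affinity.
    logits = logits + np.log(np.clip(geom_bias, 1e-6, None))
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

In a full model, `geom_bias` would come from a learned projection of the `relative_geometry` output (e.g. a ReLU-clipped linear layer), and the fused attention would replace the standard self-attention inside each Transformer encoder layer.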
