JILSA Vol.12 No.1, February 2020
Application of Dual Attention Mechanism in Chinese Image Captioning
Abstract: Objective: Chinese image description combines computer vision and natural language processing, and is a typical multimodal, cross-domain problem for artificial intelligence algorithms. A Chinese image description model must output a Chinese description for each given test picture. The sentence must conform to natural language habits and point out the important information in the image, covering the main characters, scenes, actions and other content. Because current open-source datasets are mostly in English, research on image description has been conducted mainly in English. Chinese descriptions usually have greater flexibility in syntax and lexicalization, which makes the algorithms more challenging to implement. As a result, only a few researchers have studied image description, and fewer still have studied Chinese description. Methods: This study attempts to learn an image description generation model from the Flickr8k-cn and Flickr30k-cn datasets. At each time step of the description, the model can decide whether to rely more on image information or on text information. The model captures more of the important information in the image to improve the richness and accuracy of the Chinese description. The image description dataset used in this study is composed mainly of Chinese description sentences. The method consists of an encoder and a decoder. The encoder is based on a convolutional neural network. The decoder is based on a long short-term memory (LSTM) network and is composed of a multimodal caption generation network. Results: Experiments on the Flickr8k-cn and Flickr30k-cn Chinese datasets show that the proposed method is superior to existing Chinese caption generation models. Conclusion: The proposed method is effective. It greatly improves on the benchmark model and also outperforms existing Chinese caption generation models. In the next step, more visual prior information, such as action categories and the relationships between objects, will be incorporated into the model to further improve the quality of the description sentences and achieve the effect of “looking at a picture and writing about it”.
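The idea that the decoder decides at each step whether to rely more on the image or on the text can be illustrated with a small sketch. This is not the authors' model: the function names (`attend`, `gated_context`), the dot-product scoring, and the parameter-free gate are all assumptions made for illustration, and a real decoder would learn these quantities from data inside an LSTM.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, vectors):
    """Dot-product attention: weights over `vectors` and their weighted sum."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, context

def gated_context(hidden, image_regions, text_memory):
    """Mix a visual and a textual context with a scalar gate in (0, 1).

    Hypothetical gate: a sigmoid of how much better the visual context
    matches the decoder state than the textual context does.
    """
    _, visual = attend(hidden, image_regions)
    _, textual = attend(hidden, text_memory)
    gate = 1.0 / (1.0 + math.exp(-(dot(hidden, visual) - dot(hidden, textual))))
    mixed = [gate * v + (1.0 - gate) * t for v, t in zip(visual, textual)]
    return gate, mixed

# Toy inputs (made up): a 2-d decoder state, two image regions, two text vectors.
hidden = [1.0, 0.0]
image_regions = [[2.0, 0.0], [0.0, 2.0]]
text_memory = [[0.0, 1.0], [0.0, 1.0]]
gate, mixed = gated_context(hidden, image_regions, text_memory)
# The decoder state aligns with the first image region, so the gate is above 0.5
# and the mixed context leans on the image.
```

A gate near 1 means the step is driven mostly by visual features, and a gate near 0 means it is driven mostly by the text context, which is the behaviour the abstract describes in words.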
Cite this paper: Zhang, Y. and Zhang, J. (2020) Application of Dual Attention Mechanism in Chinese Image Captioning. Journal of Intelligent Learning Systems and Applications, 12, 14-29. doi: 10.4236/jilsa.2020.121002.
 
 