Integrating image features with convolutional sequence to sequence network for multilingual visual question answering

Journal of Computer Science and Cybernetics, 2024, 1-, DOI no. 1813-9663 18155

TRIET M. THAI, SON T. LUU

University of Information Technology, Ho Chi Minh City, Viet Nam
Vietnam National University, Ho Chi Minh City, Viet Nam

Abstract. Visual question answering is a task that requires computers to give correct answers to input questions based on images. Humans can solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task poses the visual question answering task in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three different languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a convolutional sequence-to-sequence network to generate the desired answers. Our results obtained up to by F1 score on the public test set and on the private test set.

Keywords. Visual question answering, Sequence-to-sequence learning, Multilingual, Multimodal.
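The abstract reports results by F1 score but does not say here how the score is computed. As an illustration only, the sketch below shows one common way to score a generated answer against a reference answer: token-level F1 over whitespace-separated tokens. The function name `token_f1`, the lowercasing, and the whitespace tokenization are assumptions made for this example, not details taken from the shared task; Japanese text in particular would need a proper word or character tokenizer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer.

    Assumption: tokens are lowercased and split on whitespace. This is a
    simplification chosen for illustration, not the official scorer.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == ref_tokens)

    # Multiset intersection: a shared token is counted at most as many
    # times as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that omits one of three reference tokens still gets partial credit (precision 1, recall 2/3), which is why token-level F1 is often preferred over exact match for free-form generated answers.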
Abbreviations
QA: Question answering
VQA: Visual question answering
VLSP: Association for Vietnamese language and speech processing
Seq2Seq: Sequence-to-sequence
ViT: Vision transformer
SOTA: State-of-the-art
GRU: Gated recurrent unit
GLU: Gated linear unit
LSTM: Long short-term memory
RNN: Recurrent neural network
API: Application programming interface
ConvS2S: Convolutional sequence-to-sequence network
Bi-RNN: Bi-directional recurrent neural networks
BERT: Bidirectional encoder representations from transformers

Corresponding author. E-mail addresses: 19522397@ sonlt@ . Luu .
2024 Vietnam Academy of Science & Technology

1. .