Scientific report: "Collective Generation of Natural Image Descriptions"
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi
Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400
{pkuznetsova, vordonezroma, aberg, tlberg, ychoi}@

Abstract

We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those phrases to generate a novel description for the query image. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization, and discourse structure. Evaluation by human annotators indicates that our final system generates more semantically correct and linguistically appealing descriptions than two nontrivial baselines.

1 Introduction

Automatically describing images in natural language is an intriguing but complex AI task, requiring accurate computational visual recognition, comprehensive world knowledge, and natural language generation. Some past research has simplified the general image description goal by assuming that relevant text for an image is provided (e.g., Aker and Gaizauskas 2010; Feng and Lapata 2010). This allows descriptions to be generated using effective summarization techniques with relatively surface-level image understanding. However, such text (e.g., news articles or encyclopedic text) is often only loosely related to an image's specific content, and many natural images do not come with associated text for summarization. In contrast, other recent work has focused more on the visual recognition aspect by detecting content elements (e.g., scenes, objects, attributes, actions, etc.) and then composing ...