Deep Learning for Picture Captioning: A Complete Review

Authors

  • G. Rachana
  • T. Ravi Kumar

Keywords

Content-based image retrieval, Convolutional neural network, Deep learning, Photographs, Recurrent neural network

Abstract

Image descriptions provide textual information about non-text content on a website, allowing it to be presented in whatever form is most helpful to the user: read aloud, rendered as visible text, and so on. To analyze photographs and transform their content into other usable forms, this project requires building a model that generates appropriate descriptions for images. Why are image descriptions necessary? Blind users and others who rely on screen readers cannot perceive visuals and depend entirely on text being read aloud to them. Text also frequently scales better than images, which can pixelate or overflow the screen and force users to scroll horizontally or vertically. Images have long carried incidental labels, and in the social media era they are additionally shaped by surrounding activities, attitudes, and events. Describing the contents of a picture is a major challenge in artificial intelligence, combining computer vision and natural language processing. The model is built on a CNN that compresses an image into a compact feature representation, followed by an RNN that generates corresponding sentences from the previously learned image features. The model performs at a level close to the state of the art, and the generated captions are fairly evocative of the objects and situations observed in the photographs.
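The CNN-then-RNN pipeline described above can be sketched in a few lines. This is a minimal, illustrative sketch only: the abstract does not specify the architecture, so the CNN is stood in for by a fixed image-feature vector, the vocabulary is a toy one, and the weights are random rather than trained. A real captioner of this kind would use a pretrained convolutional backbone and a trained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "dog", "runs"]  # toy vocabulary (assumed)
V, H, F = len(VOCAB), 8, 16                       # vocab, hidden, feature sizes

# Randomly initialised parameters; a real model would learn these from data.
W_feat = rng.normal(0, 0.1, (H, F))   # image features -> initial hidden state
W_emb  = rng.normal(0, 0.1, (V, H))   # token embeddings
W_hh   = rng.normal(0, 0.1, (H, H))   # recurrent weights
W_out  = rng.normal(0, 0.1, (V, H))   # hidden state -> vocabulary logits

def caption(image_features, max_len=10):
    """Greedy decoding: the image conditions the RNN's initial state, then
    each step feeds the previous token back in and picks the argmax word."""
    h = np.tanh(W_feat @ image_features)       # "CNN" features seed the RNN
    token, words = VOCAB.index("<start>"), []
    for _ in range(max_len):
        h = np.tanh(W_emb[token] + W_hh @ h)   # simple Elman-style recurrence
        token = int(np.argmax(W_out @ h))      # most likely next token
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

print(caption(rng.normal(size=F)))
```

With random weights the output is meaningless; the point is only the data flow: image features initialize the recurrent state, and the decoder emits one token per step until it predicts `<end>` or hits the length cap.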

Published

2023-05-07

Section

Articles