Image captioning is a task to give descriptive words or sentences for an input image. It is usually handled as a translation task, since image features can be regarded as a sentence to be translated. Therefore, the task is typically solved with an encoder-decoder model, even in recent advances in image captioning.

A simple and effective CNN-RNN model for image captioning was proposed by Vinyals et al. [1]. The architecture of the model is shown in Fig. The CNN focuses on extracting image features, whereas the RNN generates an image caption from the constant image features and sequentially generated word information. Although Vinyals's method can effectively deal with different modal data (images and texts), its captioning accuracy is greatly limited. This is because, as pointed out by Xu et al. [2], image features are extracted from the input image as a first step and then used in all subsequent steps of the RNN model without any adaptation. Such constant image features cause inappropriate expressions in the generated captions. Xu et al. proposed an attention-guided RNN to solve this problem. Given an image feature and a sequentially generated word as inputs, the attention module can find a related region and reallocate the weights of the image feature at each generation step, which helps to generate the next word. Left: An input image and its ground-truth caption. Right: An output caption of an attention-guided RNN model with its visualized attention for each word. A problem remains, however: the attention for the preposition "with" focuses only on the bottom of the image, which is not a meaningful region for captioning, and this leads to the absence of the words "table" and "painting" in the generated caption.

This research aims to improve the attention modules for image captioning so that they focus on more appropriate image regions according to each input word. The proposed method introduces middle-level information between text and image data to decrease the ambiguous weight allocation caused by semantically meaningless words. We introduce pixel-wise semantic information (hereinafter called "PSI"), calculated by pixel-wise classification within semantic segmentation, as middle-level information between image and text data. We prefer pixel-wise classification because the PSI represents more detailed object regions and background information than the results of normal object detection. We expect that this improvement enables the above-mentioned RNN model to generate more detailed and/or richer captions.

A variety of image segmentation techniques have been proposed, such as threshold-based, region-based, edge-based, fuzzy-theory-based, and neural-network-based ones [3]. Kaur et al. proposed a method for region- and fuzzy-based segmentation with the Watershed Transform [4], and a Watershed algorithm along with DBMF (linear and Decision Based Median Filtering) was proposed in [5]. A method for natural images using effective histogram thresholding techniques was also proposed [6]. Such traditional methods commonly still had difficulties with segmentation accuracy due to over-segmentation and/or excessive sensitivity to noise. In recent years, CNN-based semantic segmentation approaches have been studied. They, however, have a problem with the fineness of representation [7]: the fine spatial distribution of small objects in an image cannot be obtained due to continuous convolutions. DeepLabV3 [8], a model with atrous convolution, was proposed to deal with this problem and greatly improved the accuracy. Thanks to the development of Transformer-based models, semantic segmentation has achieved great results through an encoder-decoder structure.
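The attention mechanism described above (finding a related region and reallocating image-feature weights at each generation step) can be sketched as a soft, additive attention over spatial features. This is a minimal illustration, not the exact model of [2]; all shapes, parameter names, and the `attend` function are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_f, W_h, v):
    """Illustrative additive attention (not the paper's exact formulation).

    features: (L, D) spatial image features for L regions
    hidden:   (H,)   current RNN hidden state
    Returns the re-weighted context vector and the attention weights.
    """
    # Score each region against the current hidden state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    alpha = softmax(scores)                               # weights sum to 1
    context = alpha @ features                            # (D,) weighted feature
    return context, alpha

rng = np.random.default_rng(0)
L, D, H, A = 4, 8, 6, 5                      # regions, feature dim, hidden dim, attn dim
feats = rng.normal(size=(L, D))
h = rng.normal(size=H)
ctx, alpha = attend(feats, h,
                    rng.normal(size=(D, A)),
                    rng.normal(size=(H, A)),
                    rng.normal(size=A))
```

The key point is that `alpha` changes with the hidden state at every word-generation step, so the effective image feature `ctx` is no longer constant across the caption.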
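Pixel-wise classification, from which the PSI is derived, amounts to a per-pixel argmax over class scores, producing a label map that marks object regions and background. The sketch below only illustrates that step; the class names and score values are made up for the example.

```python
import numpy as np

# Illustrative class list (not from the paper).
CLASSES = ["background", "table", "painting"]

def label_map(logits):
    """logits: (H, W, C) per-pixel class scores -> (H, W) label indices."""
    return logits.argmax(axis=-1)

# Tiny 2x2 image with 3 classes; unset scores stay 0 ("background" wins).
logits = np.zeros((2, 2, 3))
logits[0, 0, 1] = 2.0   # top-left pixel scores highest for "table"
logits[1, 1, 2] = 1.5   # bottom-right pixel scores highest for "painting"
labels = label_map(logits)
```

Unlike a bounding box from object detection, this map assigns a class to every pixel, so it carries the finer region and background information the method relies on.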
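Atrous (dilated) convolution, the idea behind DeepLabV3 [8], spaces the kernel taps `rate` pixels apart, enlarging the receptive field without adding parameters or downsampling. A 1-D sketch, written from scratch for illustration:

```python
import numpy as np

def atrous_conv1d(x, kernel, rate=1):
    """Valid-mode 1-D convolution with dilated (atrous) kernel taps."""
    k = len(kernel)
    span = (k - 1) * rate + 1            # effective receptive field
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        # Taps are spaced `rate` samples apart instead of being adjacent.
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(8, dtype=float)            # [0, 1, ..., 7]
k = np.array([1.0, 1.0, 1.0])
dense = atrous_conv1d(x, k, rate=1)      # ordinary convolution
dilated = atrous_conv1d(x, k, rate=2)    # same 3 taps, twice the span
```

With `rate=2` the three taps cover five input samples instead of three, which is how stacked atrous layers retain fine spatial detail that repeated striding or pooling would lose.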