An Ensemble Vision-Language Transformer-Based Captioning Model With Rotary Positional Embeddings

Image captioning is a dynamic and crucial research area focused on automatically generating textual descriptions of images. Traditional models, which primarily employ an encoder-decoder framework with Convolutional Neural Networks (CNNs), often struggle to capture the complex spatial and sequential relationships inherent in visual data. This performance gap underscores the need for more sophisticated solutions.

The proposed work introduces an ensemble model that integrates CNN, Graph Convolutional Network (GCN), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures. Owing to the incorporation of Rotary Positional Encoding (RoPE), our approach achieves a substantial 97% increase in CIDEr score on the Flickr30K dataset and a 28.6% improvement on the Flickr8K dataset.
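The abstract does not include an implementation, but the idea behind RoPE can be illustrated concretely. The sketch below applies a standard rotary positional embedding (in the style of the original RoPE formulation) to query/key tensors before attention; the function name `apply_rope`, the `base` frequency, and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to a tensor of shape
    (batch, seq_len, head_dim), where head_dim is even.

    Each channel pair (2i, 2i+1) is rotated by an angle that depends on
    the token position and the pair index, so relative positions are
    encoded directly in the query/key dot products.
    """
    batch, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "head_dim must be even for RoPE"

    # Per-pair rotation frequencies: theta_i = base^(-2i / head_dim)
    idx = torch.arange(0, head_dim, 2, dtype=torch.float32, device=x.device)
    theta = base ** (-idx / head_dim)                       # (head_dim/2,)

    # Rotation angle for every (position, pair) combination
    pos = torch.arange(seq_len, dtype=torch.float32, device=x.device)
    angles = torch.outer(pos, theta)                        # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()

    # Split channels into even/odd halves and rotate each 2-D pair
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated_even = x_even * cos - x_odd * sin
    rotated_odd = x_even * sin + x_odd * cos

    # Interleave the rotated pairs back into the original layout
    out = torch.empty_like(x)
    out[..., 0::2] = rotated_even
    out[..., 1::2] = rotated_odd
    return out


# Example: rotate query and key projections before computing attention scores.
q = torch.randn(2, 16, 64)   # (batch, seq_len, head_dim)
k = torch.randn(2, 16, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)
scores = q_rot @ k_rot.transpose(-2, -1)   # relative-position-aware scores
```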

By strategically incorporating GCN and BiLSTM layers, our model captures the essential spatial and sequential relationships within the visual data (a minimal sketch of this wiring is given below). This research addresses key challenges in image captioning by combining these advanced architectures, and as a result the model generates markedly more accurate and contextually rich captions, making it well suited to automated image-to-text applications.
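For concreteness, the following is a minimal sketch of one plausible way to chain a GCN layer over CNN region features into a BiLSTM whose output serves as memory for a Transformer decoder. All module names, layer sizes, and the adjacency construction are assumptions made for illustration; the paper's actual configuration may differ.

```python
import torch
import torch.nn as nn

class RegionGraphEncoder(nn.Module):
    """Illustrative encoder: a single GCN layer over detected image
    regions (graph nodes) followed by a BiLSTM over the region sequence.
    Layer sizes and the adjacency matrix are placeholders, not the
    paper's exact settings.
    """
    def __init__(self, in_dim: int = 2048, hid_dim: int = 512):
        super().__init__()
        self.gcn_weight = nn.Linear(in_dim, hid_dim)
        self.bilstm = nn.LSTM(hid_dim, hid_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_regions, in_dim) CNN region features
        # adj:   (batch, num_regions, num_regions) normalized adjacency
        h = torch.relu(adj @ self.gcn_weight(feats))   # GCN message passing
        h, _ = self.bilstm(h)                          # sequential refinement
        return h                                       # decoder memory


# Example wiring: encoder output feeds a Transformer caption decoder.
encoder = RegionGraphEncoder()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

feats = torch.randn(2, 36, 2048)                 # 36 regions per image
adj = torch.softmax(torch.randn(2, 36, 36), -1)  # placeholder adjacency
memory = encoder(feats, adj)                     # (2, 36, 512)
tgt = torch.randn(2, 20, 512)                    # embedded caption tokens
out = decoder(tgt, memory)                       # (2, 20, 512)
```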

The proposed ensemble model with RoPE achieved impressive performance on the Flickr8k and Flickr30k datasets, respectively: BLEU-1 scores of 80.62 and 95.0, BLEU-2 scores of 72.01 and 90.51, BLEU-3 scores of 63.12 and 81.24, BLEU-4 scores of 48.32 and 68.8, METEOR scores of 74.26 and 81.89, ROUGE-L scores of 80.24 and 84.29, CIDEr scores of 118.94 and 155.77, and SPICE scores of 48.7 and 39.0.
