Open Access
EPJ Web Conf., Volume 341, 2025
2nd International Conference on Advent Trends in Computational Intelligence and Communication Technologies (ICATCICT 2025)
Article Number: 01046
Number of pages: 12
DOI: https://doi.org/10.1051/epjconf/202534101046
Published online: 20 November 2025