Open Access
EPJ Web Conf., Volume 341, 2025
2nd International Conference on Advent Trends in Computational Intelligence and Communication Technologies (ICATCICT 2025)
Article Number: 01023
Number of pages: 13
DOI: https://doi.org/10.1051/epjconf/202534101023
Published online: 20 November 2025
  1. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541-551.
  2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324; Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90.
  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  4. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012-10022).
  5. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (pp. 10347-10357). PMLR.
  6. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning (Vol. 139, pp. 813-824). PMLR.
  7. Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., … Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 568-578).
  8. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 22-31).
  9. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  10. Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965-3977.
  11. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video Swin Transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3202-3211.
  12. Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., & Ren, J. (2022). EfficientFormer: Vision Transformers at MobileNet Speed. arXiv preprint arXiv:2206.01191.
  13. Chen, Z., Zhong, Z., Liu, Z., & Li, S. (2022). EdgeViT: Efficient Visual Modeling for Edge Computing. arXiv preprint arXiv:2205.03436.
  14. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 9650-9660). IEEE.
  15. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the 40th International Conference on Machine Learning (ICML).
  16. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., & Simonyan, K. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS).
  17. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  18. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255. https://doi.org/10.1109/CVPR.2009.5206848
  19. Ridnik, T., Ben-Baruch, E., Noy, A., & Zelnik-Manor, L. (2021). ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972.
  20. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255. https://doi.org/10.1109/CVPR.2009.5206848
  21. Thapak, P., & Hore, P. (2020). Transformer++. arXiv preprint arXiv:2003.04974. https://arxiv.org/abs/2003.04974
  22. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61-80. https://doi.org/10.1109/TNN.2008.2005605
  23. Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. 5th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1609.02907
  24. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1710.10903
  25. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. European Conference on Computer Vision (ECCV). https://arxiv.org/abs/2005.12872
  26. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. https://doi.org/10.1109/CVPR.2016.90
  27. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (NeurIPS), 28. https://arxiv.org/abs/1506.01497
  28. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable transformers for end-to-end object detection. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2010.04159
  29. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. European Conference on Computer Vision (ECCV), 740-755. https://doi.org/10.1007/978-3-319-10602-1_48
  30. He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2980-2988. https://doi.org/10.1109/ICCV.2017.322
  31. Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7263-7271. https://doi.org/10.1109/CVPR.2017.690
  32. Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767. https://arxiv.org/abs/1804.02767
  33. Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. https://arxiv.org/abs/2004.10934
  34. Jocher, G., Chaurasia, A., & Qiu, J. (2022). YOLOv5 by Ultralytics. GitHub repository. https://github.com/ultralytics/yolov5
  35. Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696. https://arxiv.org/abs/2207.02696
  36. Song, H., Sun, D., Chun, S., Jampani, V., Han, D., Heo, B., Kim, W., & Yang, M.-H. (2021). ViDT: An efficient and effective fully transformer-based object detector. arXiv preprint arXiv:2110.03921.
  37. Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., & Ren, J. (2022). EfficientFormer: Vision Transformers at MobileNet Speed. arXiv preprint arXiv:2206.01191.
  38. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A Video Vision Transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 6836-6846.
  39. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
  40. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. https://arxiv.org/abs/1406.1078
  41. Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231. https://doi.org/10.1109/TPAMI.2012.59
  42. Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 4489-4497. https://doi.org/10.1109/ICCV.2015.510
  43. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (Vol. 139, pp. 813-824). PMLR.
  44. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video Swin Transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3202-3211.
  45. Patrick, M., Campbell, D., Asano, Y. M., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., & Henriques, J. F. (2021). Keeping your eye on the ball: Trajectory attention in video transformers. arXiv preprint arXiv:2106.05392.
  46. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., … Xing, L. (2021). TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
  47. Bannur, S., Hyland, S., Liu, Q., Pérez-Garcia, F., Ilse, M., Castro, D. C., Boecking, B., Sharma, H., Bouzid, K., Thieme, A., Schwaighofer, A., Wetscherek, M., Lungren, M. P., Nori, A., Alvarez-Valle, J., & Oktay, O. (2023). Learning to exploit temporal structure for biomedical vision-language processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  48. Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H., & Xu, D. (2022). Swin UNETR: Swin Transformers for semantic segmentation of brain tumors in MRI images. arXiv preprint arXiv:2201.01266.
  49. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597.
  50. Fuller, A., Millard, K., & Green, J. R. (2022). SatViT: Pretraining transformers for Earth observation. IEEE Geoscience and Remote Sensing Letters, 19, 1-5.
  51. Bandara, W. G. C., & Patel, V. M. (2022). A transformer-based Siamese network for change detection. In IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium (pp. 207-210). IEEE.
