| Issue | EPJ Web Conf., Volume 367, 2026: Fifth International Conference on Robotics, Intelligent Automation and Control Technologies (RIACT 2026) |
|---|---|
| Article Number | 04004 |
| Number of page(s) | 16 |
| Section | AI & Machine Learning |
| DOI | https://doi.org/10.1051/epjconf/202636704004 |
| Published online | 29 April 2026 |
Explainable Multi-Modal Skin Lesion Classification with a Hybrid CNN-Transformer
1 Department of Computer Science and Engineering, Ramaiah University of Applied Sciences, Bengaluru, India
2 Assistant Professor, Department of Computer Science and Engineering, Ramaiah University of Applied Sciences, Bengaluru, India
Abstract
Fast and accurate identification of skin lesions is important for patient outcomes. Visual evaluation of lesions is subjective, and poor-quality images may limit accuracy. Deep learning models offer an alternative; however, many of them lack interpretability or do not combine different types of data. This work presents an interpretable multimodal system for diagnosing skin lesions that addresses many of these limitations. A hybrid CNN-Transformer network with an EfficientNetV2-B0 backbone extracts visual patterns from dermoscopy images. This image network is integrated with a second network that uses the HAM10000 dataset to incorporate and process historical patient information. Classes were balanced with SMOTE (Synthetic Minority Over-sampling Technique) to ensure strong performance. Transparency is provided by Explainable AI (XAI) methods, primarily Grad-CAM for visual features and LIME for tabular features. Overall, the multimodal system yields an adaptable, reliable and effective diagnostic tool, with an overall classification accuracy of 80.04% and an Area Under the Curve (AUC) of 0.95. Our results suggest that combining multimodal data with a transparent hybrid architecture produces an effective tool for enhancing clinician support and diagnostic confidence, and provides a framework for clinical deployment in real-world practice.
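The abstract does not say which SMOTE implementation or settings were used. As a rough illustration of the class-balancing idea only, the following is a minimal SMOTE-style sketch in plain Python: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy feature values are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by SMOTE-style interpolation.

    minority    : list of feature vectors (lists of floats) from one class
    n_synthetic : number of synthetic samples to generate
    k           : number of nearest minority neighbours to choose from
    seed        : seed for the random generator (for reproducibility)
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy tabular features with hypothetical values, e.g. [age, lesion size].
minority_class = [[50.0, 4.0], [55.0, 5.0], [60.0, 6.0], [52.0, 4.5]]
new_samples = smote_oversample(minority_class, n_synthetic=6, k=2)
balanced = minority_class + new_samples
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the range spanned by the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains real samples exclusively.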
© The Authors, published by EDP Sciences, 2026
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

