Deep feature fusion for melanoma detection using ResNet-VGG networks with attention mechanism
- Academic Editor
- José Manuel Galán
- Subject Areas
- Adaptive and Self-Organizing Systems, Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning, Neural Networks
- Keywords
- Melanoma detection, Skin lesion classification, Deep learning, Channel attention, Spatial attention
- Copyright
- © 2025 Naeem et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
- Cite this article
- Naeem et al. 2025. Deep feature fusion for melanoma detection using ResNet-VGG networks with attention mechanism. PeerJ Computer Science 11:e3248 https://doi.org/10.7717/peerj-cs.3248
Abstract
Skin cancer, especially melanoma, the most severe type, has increased in recent decades. Early identification of skin carcinoma is important for improving treatment outcomes and patient survival. However, accurate identification of skin lesions remains challenging due to the large variability and heterogeneity of lesion characteristics, even though significant progress has been made in computer vision. To overcome these challenges, this study proposes the channel and spatial attention-based hybrid model (CA-SLNet), which combines the residual network-50 (ResNet50) and visual geometry group-16 (VGG-16) architectures with attention mechanisms: channel attention enhances the most informative feature maps, while spatial attention highlights the most relevant regions within them. CA-SLNet was evaluated on three dermatological benchmark datasets, ISIC-2016, ISIC-2017, and Ph2. The proposed model achieved accuracies of 98.59%, 96.80%, and 99.50% for the ISIC-2016, ISIC-2017, and Ph2 datasets, respectively, along with high area under the curve (AUC) values. Comparison with state-of-the-art methods emphasizes the robustness and precision of the model in differentiating between benign and malignant lesions. Although slightly more false negatives were observed in the ISIC-2017 dataset, the model’s overall performance proves its suitability for clinical application. CA-SLNet provides a reliable solution for early and accurate melanoma detection by effectively integrating complementary features and minimizing noise and redundancy.
Introduction
Melanoma, the most malignant skin cancer, is a global health challenge due to its rapid progression, high metastatic potential and high mortality rate; it is responsible for more than 75% of annual skin cancer-related deaths, underscoring the need for early detection and intervention (Hosny & Kassem, 2022). The human skin is the body’s largest organ (Iravanimanesh et al., 2021; Wang et al., 2025). While early-stage melanoma is highly treatable through surgical removal, late-stage cases often metastasize, resulting in poor survival rates and increased healthcare costs (Cassidy et al., 2022; Jojoa Acosta et al., 2021). Dermoscopy, a non-invasive imaging technique, has become indispensable for skin cancer diagnosis as it provides high-resolution images that improve the detection of malignant tumors (Sufia, Hossain & Hossain, 2025). However, manual dermoscopic analysis remains labor-intensive, subjective and dependent on the dermatologist’s expertise, leading to variability in results and highlighting the need for scalable, automated diagnostic systems (Brinker et al., 2019; Li & Shen, 2018). Identifying skin cancer is particularly difficult due to the visual and size-related similarities between malignant lesions and benign moles. Tumors that resemble non-progressive moles are classified as benign, while those that show uncontrolled growth and a disfigured appearance are considered malignant, as shown in Fig. 1.
Figure 1: Few samples from ISIC-2016, ISIC-2017 and Ph2 datasets.
Artificial intelligence (AI) and deep learning have transformed medical imaging, providing reliable, reproducible and highly accurate diagnostic tools (Wu et al., 2025; Zhao et al., 2025). Convolutional neural networks (CNNs) have automatically classified and segmented lesions (Song et al., 2024), achieving dermatologist-level performance with datasets such as the ISIC archive and Ph2 (Adegun & Viriri, 2019; Sahoo, Dash & Mohapatra, 2024). However, several challenges remain, such as the low contrast between melanomas and healthy skin, visual similarities between malignant and benign lesions, and artifacts such as hairs, blisters and shadows in dermoscopic images. Dataset biases, such as unbalanced classes and duplicate images, further distort training and hinder generalizability (Mabrouk et al., 2022). These limitations emphasize the need for innovative solutions for melanoma detection. Conventional CNN models often struggle to accurately delineate lesions and deal with bias, resulting in fuzzy segmentation and overfitting. While CNNs are effective in hierarchical feature extraction, they often lose spatial integrity, compromising segmentation results and clinical applicability (Tang et al., 2025; Al-Selwi et al., 2025).
This study examines recent AI-driven advances in skin cancer detection, focusing on CNNs, ensemble deep learning and transfer learning approaches. It identifies key challenges, including dataset diversity, overfitting and computational cost and proposes a novel hybrid deep learning model. The proposed model combines residual network-50 (ResNet50) and visual geometry group-16 (VGG-16) architectures with advanced attention mechanisms, channel attention and spatial attention to improve feature extraction and prioritize critical melanoma features. This approach aims to improve diagnostic accuracy and segmentation precision in dermoscopic images by refining the focus and attenuating irrelevant artifacts.
This study presents an innovative hybrid deep-learning system for melanoma detection. The main contributions of this study are as follows:
-
(i)
A hybrid model combining ResNet50 and VGG16 utilizes complementary features to improve the discrimination between benign and malignant lesions.
-
(ii)
The channel and spatial attention modules are integrated to prioritize features and focus on diagnostically significant regions.
-
(iii)
Random erasing, color jittering and affine transformations address class imbalances and strengthen training data.
-
(iv)
Performance is optimized by unfreezing select pre-trained layers to ensure efficient and accurate melanoma detection.
This work addresses the increasing need for reliable, scalable and clinically effective methods for early melanoma detection. The proposed model improves accuracy and generalizability by incorporating attention mechanisms and advanced data processing strategies, ultimately optimizing diagnostic workflows.
The remaining sections are organized as follows: ‘Related Work’ discusses related studies; ‘Materials and Methods’ describes the materials and methods; ‘Experiments Results’ explains the experimental results; ‘Discussion’ provides a discussion; and ‘Conclusion and Future Work’ concludes with findings and future directions.
Related work
The field of automated melanoma detection has evolved rapidly thanks to the transformative impact of deep learning on medical image analysis. Convolutional neural networks (CNNs) are fundamental as they extract complex hierarchical and discriminative features, outperforming traditional machine learning methods based on hand-crafted features (Khouloud et al., 2022; Shaik et al., 2025). The U-Net, a widely used architecture in biomedical segmentation, is particularly effective due to its encoder-decoder structure that captures local and global features (Ronneberger, Fischer & Brox, 2015; Gutman et al., 2016). However, standard U-Net models often cannot preserve spatial details, resulting in fuzzy lesion boundaries and lower segmentation accuracy (Kaneria & Jotwani, 2024).
To overcome these limitations, researchers have improved CNN architectures with multiscale feature extraction, residual connections and attention mechanisms to increase segmentation accuracy and generalizability (Su et al., 2024; Liu et al., 2024). Ensemble learning methods, such as those of Codella et al. (2018) combine CNNs with traditional techniques to achieve better performance in lesion segmentation and classification. Similarly, encoder-decoder architectures with skip connections improve spatial continuity and segmentation results for datasets such as ISIC (El-Khatib, Popescu & Ichim, 2020).
Despite the progress, biases in publicly available datasets such as ISIC and Ph2, including duplicate images, unbalanced classes and inconsistent annotations, remain a major challenge. These biases distort training and evaluation, inflating performance metrics and limiting generalizability (Akram et al., 2024; Nazari & Garcia, 2025). Robust data curation and augmentation strategies are crucial to solving these problems. Skin cancer, especially melanoma, is a significant health threat worldwide. Early detection is crucial for effective treatment, but diagnostic challenges include visual similarities between benign and malignant lesions, low contrast and complex features (Khan et al., 2020). Although melanoma accounts for only a small percentage of skin cancer cases, it is responsible for the majority of skin cancer-related deaths worldwide (Montaha et al., 2022).
Recent studies have shown significant advances in the automatic classification and segmentation of lesions using deep learning and computer-aided diagnosis (CAD) systems. Frameworks such as DermoExpert, which integrates segmentation, augmentation and transfer learning, have improved the classification accuracy on datasets such as ISIC-2016 and ISIC-2018 (Hasan et al., 2022). Other models, such as BF2SkNet, use deep feature fusion and fuzzy entropy optimization and provide competitive results for different datasets (Ajmal et al., 2023). MobileNet-V2, which incorporates attention mechanisms and multiscale feature extraction, is an example of lightweight models that achieve excellent accuracy (Nirupama, 2024).
Hybrid models that combine CNN frameworks with advanced preprocessing and feature extraction have come to the fore. Attention mechanisms, such as channel and spatial attention, improve segmentation accuracy by prioritizing diagnostically significant regions and minimizing the impact of irrelevant artifacts. Advanced data augmentation techniques, such as random erasing, color jittering and affine transformations, increase the robustness of the model by increasing the diversity of datasets and mitigating overfitting. AI-based diagnostic tools now achieve dermatologist-level accuracy. For example, faster region-based convolutional neural networks (R-CNN) achieved 91.5% accuracy in classifying clinical images, outperforming dermatologists on certain metrics (Jinnai et al., 2020). Ensemble approaches, such as the integration of Mask R-CNN and DeeplabV3+ by Goyal et al. (2019), achieved a high sensitivity of 89.93% and a specificity of 97.94% on ISIC-2017. Hybrid frameworks using optimization techniques such as the artificial bee colony algorithm have also shown impressive performance, achieving an F1-score of 93.12% (Farea et al., 2024). Transfer learning approaches that utilize pre-trained models such as VGG16, VGG19 and AlexNet are also effective, especially with limited datasets (Nakai, Chen & Han, 2022). Fine-tuned VGG-based models have achieved up to 98.18% accuracy in melanoma classification (Faghihi, Fathollahi & Rajabi, 2024). However, challenges such as dataset diversity and overfitting remain, necessitating robust preprocessing, augmentation and ensemble techniques.
This study proposes a hybrid model that integrates ResNet50 and VGG16 and effectively combines complementary features to account for lesion variability. The system improves segmentation and classification accuracy by incorporating channel and spatial attention mechanisms and compensating for dataset biases through robust data augmentation. These innovations represent a significant advance in automated melanoma detection and pave the way for scalable, clinically reliable diagnostic tools.
Materials and Methods
Pre-trained CNN models
CNNs are deep feed-forward architectures known for their remarkable performance in feature extraction and classification tasks (Saleem et al., 2024). Their architecture integrates convolutional, pooling and fully connected layers, enabling efficient hierarchical feature extraction. In this study, two pre-trained models, ResNet50 and VGG16, are used for feature extraction, selected for their robust architectures and proven accuracy in large datasets such as ImageNet.
ResNet50
ResNet50 is a residual network comprising 50 layers. Its architecture allows shortcut (bypass) connections that skip certain layers, mitigating the vanishing gradient problem and saving computation time (Ramanagiri, Mukunthan & Balamurugan, 2024). The network was pre-trained on ImageNet, so its earlier layers already encode general image features. ResNet50 takes input images of size 224 × 224 × 3 and uses kernels of 3 × 3 and 7 × 7, respectively. We fine-tuned the final layers to fit our particular dataset, so the model retains the pre-trained features from the earlier layers.
VGG16
VGG16 is a deep CNN architecture consisting of 16 layers (Faghihi, Fathollahi & Rajabi, 2024; Meitantya et al., 2024). It is built from uniform blocks of convolutional layers with small receptive fields (3 × 3) and a stride of 1, so that high accuracy can be maintained with efficient feature extraction. Like ResNet50, VGG16 was pre-trained on ImageNet, providing a strong baseline. It uses uniform 3 × 3 kernel sizes and an input image size of 224 × 224 × 3. We fine-tuned the final layers to optimize the model for the task.
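To make this setup concrete, the following minimal PyTorch sketch shows one way to load both pre-trained backbones and unfreeze their final layers for fine-tuning. Unfreezing the last 10 parameter tensors is an illustrative reading of the "last 10 layers" setting reported in Table 1, not necessarily the exact configuration used.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained backbones and keep only their convolutional trunks.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

resnet_trunk = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, 7, 7)
vgg_trunk = vgg.features                                      # -> (B, 512, 7, 7)

# Freeze everything, then unfreeze the final parameter tensors for fine-tuning
# (an assumed interpretation of the "last 10 layers" entry in Table 1).
for trunk in (resnet_trunk, vgg_trunk):
    params = list(trunk.parameters())
    for p in params:
        p.requires_grad = False
    for p in params[-10:]:
        p.requires_grad = True
```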
Proposed framework
This study uses a unified framework with feature extraction, attention mechanism and classification. By combining pre-trained models with innovative attention modules, the proposed framework solves the main challenges in binary classification tasks: the lack of distinguishable features and redundant data. Deep features were extracted using two pre-trained models, ResNet50 and VGG16, which use different architectural paradigms to capture different feature representations of the input images. A feature fusion approach was used to reduce the redundant and uninformative components that arise when merging features from multiple models; it combines the strengths of ResNet50 and VGG16 while preserving the important features. To focus the model on critical data, we added attention mechanisms to refine the fused features. The workflow of the proposed model is shown in Fig. 2.
Figure 2: Detailed flow of proposed skin lesion classification framework.
After the feature maps are independently extracted from ResNet50 and VGG16, they are flattened and concatenated into a single feature vector that retains both global and local representations. This fused feature is then successively refined with the channel and spatial attention modules. The channel attention module focuses on which features are important by adaptively reweighting the channels based on the global average and maximum pooled descriptors, boosting critical texture and lesion-specific filters while suppressing irrelevant ones. Next, the spatial attention module determines “where” to focus by analysing the mean and maximum projections of the feature maps along the channel axis, helping the model to highlight lesion-relevant spatial regions. These refined features are then passed to the final fully connected layers for binary classification. This integration allows CA-SLNet to utilise complementary feature hierarchies from both backbones while dynamically refining attention to discriminative content, improving sensitivity to subtle lesion patterns in different inputs.
Feature fusion strategy
The feature fusion strategy combines the feature maps extracted by ResNet50 and VGG16 by flattening their outputs into 1D vectors. These vectors are then concatenated along the channel dimension to form a unified feature vector that includes global and local image details. The channel and spatial attention modules are applied sequentially to enhance the fused features. Channel-based attention highlights significant channels, while spatial attention highlights prominent spatial regions in the combined feature map. This ensures a rich, focused feature representation for subsequent processing.
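A minimal sketch of the flatten-and-concatenate step follows, assuming the backbone output shapes that a 224 × 224 input would produce; the batch size and shapes are illustrative, and the attention modules described in the next subsections are then applied to the fused representation.

```python
import torch

# Assumed shapes for a 224 x 224 input: the ResNet50 trunk yields a
# 2048 x 7 x 7 map and the VGG16 trunk a 512 x 7 x 7 map.
f_res = torch.randn(8, 2048, 7, 7)   # batch of 8 ResNet50 feature maps
f_vgg = torch.randn(8, 512, 7, 7)    # batch of 8 VGG16 feature maps

# Flatten each map and concatenate into a single unified feature vector.
fused = torch.cat([f_res.flatten(1), f_vgg.flatten(1)], dim=1)
print(fused.shape)                    # torch.Size([8, 125440])
```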
To validate the discriminative power of the fused feature embeddings, we performed a t-SNE visualization of the high-dimensional feature space generated by the proposed model. As shown in Fig. 3, the feature embeddings form distinct clusters corresponding to the Ph2, ISIC-2016 and ISIC-2017 datasets. The clear separation between these clusters demonstrates the model’s ability to create compact and distinguishable feature representations, thus improving classification performance.
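As a sketch of how such a plot can be produced, the snippet below projects fused feature embeddings with scikit-learn’s t-SNE; the `embeddings` and `dataset_ids` arrays are placeholders standing in for features collected from the model’s penultimate layer.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features: in practice, collect fused embeddings per image.
embeddings = np.random.rand(600, 256)
dataset_ids = np.repeat(["Ph2", "ISIC-2016", "ISIC-2017"], 200)

proj = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)
for name in np.unique(dataset_ids):
    mask = dataset_ids == name
    plt.scatter(proj[mask, 0], proj[mask, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of CA-SLNet feature embeddings")
plt.show()
```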
Figure 3: The t-SNE plot visualizes the 2D projection of the feature embeddings of the proposed CA-SLNet model for the Ph2, ISIC-2016 and ISIC-2017 datasets.
The different dataset clusters emphasize the model’s ability to extract discriminative features, while the clear separation of benign and malignant classes demonstrates its effectiveness in classification. The compact Ph2 cluster reflects homogeneity, while the broader ISIC-2016 and ISIC-2017 clusters capture greater variability.

Channel attention module
The channel attention module uses adaptive average pooling and adaptive max pooling to compress the spatial dimensions of the input feature map into channel-wise descriptors, effectively summarizing each channel’s global and local importance. These pooled features are passed through a shared two-layer, fully connected network with a bottleneck structure (channel reduction followed by expansion). The outputs of the average and max pooling branches are then combined by element-wise addition. A sigmoid activation function generates channel-wise attention weights to re-weight the input feature map and highlight the most relevant channels for the classification task.
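A CBAM-style PyTorch sketch of this channel attention design follows; the reduction ratio of 16 is an assumption, as the paper does not report the bottleneck size.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: shared bottleneck MLP over average- and max-pooled
    descriptors, combined by addition and gated with a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(  # channel reduction followed by expansion
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * attn  # re-weight channels by their learned importance
```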
Spatial attention module
The spatial attention module improves the spatial relevance of features by learning the importance of different regions in the feature map, so that the network can localize and highlight key areas of the image. The refined features are then passed through fully connected layers with dropout to avoid overfitting and classified with a sigmoid activation function. Regularization and learning rate scheduling techniques were used to reduce overfitting and optimize performance. A low learning rate combined with a stepwise learning rate scheduler provides fine-tuning of the pre-trained weights for the particular dataset while preserving the learned representations.
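The following sketch shows a CBAM-style spatial attention module consistent with this description; the 7 × 7 kernel size is an assumption, since the convolution size is not reported.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: mean and max projections along the channel axis,
    concatenated and passed through a single convolution."""
    def __init__(self, kernel_size: int = 7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)    # (B, 1, H, W) mean projection
        max_map, _ = x.max(dim=1, keepdim=True)  # (B, 1, H, W) max projection
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # highlight lesion-relevant spatial regions
```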
Dataset
The ISIC-2016 dataset is an important resource in dermatology and computer vision and is often used to develop and evaluate machine learning algorithms for skin cancer detection from dermoscopic images. It consists of dermoscopic images categorized as melanoma or benign cases. Based on the ISIC archive, which contains images of different resolutions, a diverse set of skin lesion images was used for the study. The ISIC-2016 dataset contains a total of 1,279 images. The ISIC-2017 dataset includes 2,000 images of skin lesion cases and serves as an important benchmark for developing and validating machine-learning models for skin cancer diagnosis. Due to the high prevalence of keratoses, the images are often divided into two main categories: benign and malignant. The Ph2 dataset was also included in this study (Benyahia, Meftah & Lézoray, 2022). It contains 200 dermoscopic RGB images, curated at the Hospital Pedro Hispano in Matosinhos, together with clinical assessments by dermatologists indicating whether an image represents a normal case, a melanoma, or an atypical nevus. The ISIC-2016, ISIC-2017 and Ph2 datasets are well suited for evaluating melanoma detection, as they offer wide variety and good annotations; dermatologists have manually provided precise ground-truth annotations for all images. To assess class balance, we analysed the distribution of benign and malignant cases in all datasets. The ISIC-2016 dataset consists of 900 benign and 379 malignant samples, while ISIC-2017 contains 1,400 benign and 600 malignant cases. The Ph2 dataset contains 160 benign and 40 malignant images. These ratios reflect inherent class imbalances, especially in ISIC-2016 and Ph2, which may bias model learning towards the majority class. Therefore, we implemented customised augmentation and attention mechanisms to improve sensitivity and reduce misclassification, especially false negatives in minority classes.
Design and optimization of CA-SLNet for skin lesion classification
In this section, we develop and optimize the proposed CA-SLNet for skin lesion classification. To improve generalization, data augmentation techniques such as rotation, flipping and color jittering were applied to the data. The ResNet50 and VGG16 models are combined with attention modules for refined feature extraction. Binary cross-entropy loss, the Adam optimizer and a StepLR scheduler for learning rate decay were used in training; together, these choices speed convergence and improve classification performance. The detailed flowchart in Fig. 4 shows the stages of the proposed framework, from image acquisition to the final classification process.
Figure 4: Detailed flow of the proposed architecture for the classification of skin lesions.
Data augmentation
To improve the model’s generalization, data augmentation techniques were applied to all datasets: random rotation, flipping, affine transformations and color jittering. Random erasing was introduced to simulate occlusions. These transformations not only enlarged the training set, but also helped to reduce class imbalance by artificially increasing the variability of the minority-class (malignant) samples. For example, random erasing simulates occlusions that are common in dermoscopic images, while color jittering enhances robustness to lighting variation. These augmentations were applied more heavily to the underrepresented malignant class to promote balanced feature learning, a strategy that helped to reduce the number of false negatives. ImageNet means and standard deviations were used for normalization:
$$x_{\text{norm}} = \frac{x - \mu}{\sigma} \quad (1)$$

where $\mu = [0.485, 0.456, 0.406]$ and $\sigma = [0.229, 0.224, 0.225]$ are the channel-wise means and standard deviations used to normalize the inputs.
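The augmentation and normalization pipeline can be expressed with Torchvision transforms as in the sketch below; the values mirror Table 1, while the ColorJitter strengths are illustrative assumptions.

```python
from torchvision import transforms

# Training-time pipeline following Table 1; ColorJitter strengths are assumed.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # Eq. (1)
    transforms.RandomErasing(p=0.5),  # simulates occlusions such as hair
])
```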
Model architecture
The CA-SLNet model integrates ResNet50 and VGG16 architectures with channel and spatial attention mechanisms to optimize feature extraction. The feature maps generated by the ResNet50 and VGG16 backbones are concatenated as follows:
$$F_{\text{fused}} = [F_{\text{ResNet50}}; F_{\text{VGG16}}] \quad (2)$$

where $F_{\text{ResNet50}}$ and $F_{\text{VGG16}}$ are the flattened backbone feature vectors and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
To emphasize critical features, the channel attention module computes channel-wise weights:
$$M_c = \sigma\big(W_1(W_0(\text{AvgPool}(F))) + W_1(W_0(\text{MaxPool}(F)))\big) \quad (3)$$

where $\sigma$ is the sigmoid activation function and $W_0$, $W_1$ are learnable weights of the attention layer.
The spatial attention module enhances spatial focus:
$$M_s = \sigma\big(f([F_{\text{avg}}; F_{\text{max}}])\big) \quad (4)$$

where $F_{\text{avg}}$ and $F_{\text{max}}$ are the mean and max pooled features, $f(\cdot)$ is a convolution operation and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
The refined features are passed through fully connected layers for classification:
$$\hat{y} = \sigma(W_{fc}\, F_{\text{att}} + b) \quad (5)$$

where $F_{\text{att}}$ is the attended feature map and $W_{fc}$, $b$ are the weights and bias of the fully connected layer.
Training protocol
The binary cross-entropy (BCE) loss function was used to optimize the model:
$$\mathcal{L}_{\text{BCE}} = -\frac{1}{M}\sum_{i=1}^{M}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big] \quad (6)$$

where $M$ is the batch size, $y_i$ is the ground truth label and $\hat{y}_i$ is the predicted probability.
Optimization was performed using the Adam optimizer with the update rule:
$$\theta_{t+1} = \theta_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad (7)$$

where $\hat{m}_t$ is the first moment estimate of the gradient $g_t$, $\hat{v}_t$ is the second moment estimate and $\epsilon$ is a small constant for numerical stability.
A StepLR scheduler was employed to decay the learning rate after every seven epochs:
$$\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/7 \rfloor} \quad (8)$$

where $\gamma$ is the decay factor and $\alpha_0$ is the initial learning rate.
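A condensed training-loop sketch tying Eqs. (6)–(8) together is shown below; `model` and `train_loader` are assumed to be defined elsewhere, and the learning rate follows Table 1.

```python
import torch
import torch.nn as nn

# BCE-with-logits loss, Adam, and a StepLR scheduler (step size 7, decay 0.1).
criterion = nn.BCEWithLogitsLoss()                            # Eq. (6)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(100):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = model(images).squeeze(1)         # (B,) malignancy logits
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()                          # Adam update, Eq. (7)
    scheduler.step()                              # learning-rate decay, Eq. (8)
```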
Training environment
All training and evaluation experiments were performed on a dedicated local workstation equipped with an NVIDIA RTX 4070 GPU, an Intel Core i9 processor and 32 GB RAM running Windows 11. The model was implemented with Python 3.10 and PyTorch 2.1.0. Image preprocessing and augmentation were performed with Torchvision 0.16.0, while performance metrics and analysis were performed with scikit-learn 1.3.0 and Matplotlib 3.7.2. GPU acceleration was managed with CUDA 11.8. This environment ensured consistent model training, uniform evaluation and reproducibility of results.
Training hyperparameters
The CA-SLNet model was trained using the Adam optimizer with a learning rate of 1e-5, a batch size of 32 and a maximum of 100 epochs. All input images were resized to 224 × 224 × 3. A dropout layer with a rate of 0.5 was applied before the final fully connected layers to prevent overfitting. The backbones were initialized with pre-trained ImageNet weights, and early stopping with a patience of 10 epochs based on validation loss was applied to prevent overtraining. The learning rate was decayed with a StepLR scheduler (step size 7, decay 0.1). These hyperparameters were kept consistent across all datasets to ensure a fair and reproducible evaluation, as shown in Table 1.
| Hyperparameter | Value |
|---|---|
| Device | CUDA |
| Batch size | 32 |
| Input image size | 224 × 224 pixels |
| Normalization | Mean = [0.485, 0.456, 0.406], Standard Deviation = [0.229, 0.224, 0.225] |
| Data augmentation | Random Resized Crop, Horizontal Flip, Random Rotation (30°), Color Jitter, Random Erasing |
| Learning rate | 1e-5 |
| Optimizer | Adam |
| Loss function | Binary Cross-Entropy with Logits Loss |
| Scheduler | StepLR (Step size = 7 epochs, Decay = 0.1) |
| Dropout rate | 0.5 |
| Epochs | 100 |
| Pretrained models | ResNet50, VGG16 |
| Fine-tuned layers | Last 10 layers |
| Attention mechanisms | Channel attention, Spatial attention |
A five-fold cross-validation was performed for each experiment to ensure the robustness of the results and to avoid bias in the assessment. Each dataset was randomly divided into five equal folds, with four folds (80%) used for training and validation and one fold (20%) retained for testing. Within the training portion, an internal 80:20 split formed the training and validation sets. The splitting process was stratified by class to maintain the original distribution of benign and malignant samples across all folds. The performance metrics were averaged across all five folds to obtain stable and generalised results.
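The splitting protocol can be reproduced with scikit-learn as in the following sketch, shown here with the ISIC-2016 class counts as placeholder labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder labels: 900 benign (0) and 379 malignant (1), as in ISIC-2016.
labels = np.array([0] * 900 + [1] * 379)
indices = np.arange(len(labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trainval_idx, test_idx) in enumerate(skf.split(indices, labels)):
    # Internal 80:20 split of the training portion, stratified by class.
    train_idx, val_idx = train_test_split(
        trainval_idx, test_size=0.2,
        stratify=labels[trainval_idx], random_state=42)
    print(f"fold {fold}: {len(train_idx)} train, "
          f"{len(val_idx)} val, {len(test_idx)} test")
```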
Proposed model algorithm
The proposed algorithm, CA-SLNet (channel attention for skin lesion network), describes a systematic approach to melanoma detection that utilises attention mechanisms for robust feature extraction and classification (Algorithm 1). By incorporating preprocessing, feature fusion and attention-based feature refinement, the algorithm provides high accuracy, strong robustness and good generalisation in melanoma classification. The operation of CA-SLNet consists of several important steps. First, the data is pre-processed by normalising the input images with the ImageNet means and standard deviations. Data augmentation techniques such as random flipping, rotation and affine transformations are applied to artificially enlarge the data and improve the generalisation of the model.
| Require: Dataset $D$, batch size $M$, learning rate $\alpha$, epochs $T$, decay factor $\gamma$, folds $K$ |
| Ensure: Best model $\theta^{*}$, average validation accuracy $\bar{A}$ |
| 1: Initialize accuracy store $\mathcal{A} \leftarrow \emptyset$, model store $\Theta \leftarrow \emptyset$ |
| 2: for $k = 1$ to $K$ do |
| 3: Split $D_{\text{train}}^{(k)}$, $D_{\text{val}}^{(k)}$ from $D$ |
| 4: Initialize model parameters $\theta_k$ |
| 5: Preprocess $D_{\text{train}}^{(k)}$: Resize, normalize and apply data augmentation |
| 6: Extract features $F = [F_{\text{ResNet50}}; F_{\text{VGG16}}]$ using ResNet50 and VGG16 |
| 7: Compute channel attention $M_c$ and spatial attention $M_s$ |
| 8: for $t = 1$ to $T$ do |
| 9: Train model: compute predictions $\hat{y} = f_{\theta_k}(x)$ |
| 10: Apply dropout to prevent overfitting |
| 11: Compute binary cross-entropy loss $\mathcal{L}_{\text{BCE}}$ |
| 12: Update model parameters $\theta_k$ using gradient descent |
| 13: Adjust learning rate $\alpha$ using decay factor $\gamma$ |
| 14: end for |
| 15: Evaluate model on validation set $D_{\text{val}}^{(k)}$ |
| 16: Compute validation accuracy $A_k$ |
| 17: Store $A_k$ and $\theta_k$ |
| 18: end for |
| 19: Compute average validation accuracy $\bar{A} = \frac{1}{K}\sum_{k=1}^{K} A_k$ |
| 20: Select best model $\theta^{*} \leftarrow \theta_{k^{*}}$ with $k^{*} = \arg\max_k A_k$ |
| 21: return $\theta^{*}$, $\bar{A}$ |
In the following phase, feature extraction is performed using two deep learning architectures, ResNet50 and VGG16. The feature maps generated by these backbones are concatenated to form a comprehensive feature representation. After feature extraction, attention mechanisms refine the concatenated features. Channel attention is computed by combining global average pooling and global max pooling operations, followed by fully connected layers that assign weights to the important channels. At the same time, spatial attention is computed by applying a convolution operation to the combined average and max feature maps to highlight critical spatial regions for melanoma detection. These attention mechanisms work in tandem to produce a refined feature representation.
In the classification phase, the features are flattened and fed into a fully connected layer with a sigmoid activation function that generates the final prediction probabilities. Training is performed by iteratively optimizing the model parameters against the BCE loss objective. Efficient convergence of the weights is achieved through the first- and second-order moments of the gradient updates provided by the Adam optimizer, while a learning rate scheduler (StepLR) keeps the learning rate stable and optimal throughout training. The algorithm systematically employs feature fusion, attention mechanisms and robust optimization to ensure that the final model has accurate classification and discrimination capability. CA-SLNet integrates spatial and channel attention and effectively highlights critical lesion regions, improving its robustness across different datasets and clinical scenarios.
Evaluation metrics
The evaluation metrics used were accuracy, precision, recall, F1-score and the area under the receiver operating characteristic (ROC) curve. The metrics for precision and recall are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

where $TP$, $FP$ and $FN$ are the numbers of true-positive, false-positive and false-negative results.
The F1-score, a harmonic mean of precision and recall, is computed as:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (10)$$
The area under the receiver operating characteristic curve (AUC-ROC) measures the discriminatory ability of the model:

$$\text{AUC} = \int_{0}^{1} TPR \; d(FPR) \quad (11)$$

where the true-positive rate $TPR$ is expressed as a function of the false-positive rate $FPR$. The model’s performance and reliability in detecting melanoma were verified on the ISIC-2016, ISIC-2017 and Ph2 datasets.
To assess the performance of the model, we used several evaluation tools, including confusion matrices and the area under the receiver operating characteristic curve (AUC-ROC), alongside accuracy. Cross-validation was used to obtain a reliable and generalized assessment of performance. The efficiency of the model was improved by carefully selecting several hyperparameters, in particular the batch size, the loss function, the learning rate, the input size and the optimization function.
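These metrics can be computed directly with scikit-learn, as in the short sketch below; the label and probability arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Placeholders: ground-truth labels and predicted malignancy probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 decision threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # Eq. (9)
print("Recall   :", recall_score(y_true, y_pred))     # Eq. (9)
print("F1-score :", f1_score(y_true, y_pred))         # Eq. (10)
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))    # Eq. (11)
print(confusion_matrix(y_true, y_pred))
```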
Ablation study
The ablation study evaluates how the different components and parameters of the CA-SLNet architecture affect the model’s performance. The primary focus of this analysis is to identify the design elements, such as attention mechanisms, feature fusion and hyperparameters, that are essential for optimal performance. The attention mechanisms allow the model to focus on relevant features while suppressing upstream interference. Turning off the channel attention (CA) module degraded performance on ISIC-2016, ISIC-2017 and Ph2 by 3.25%, 2.98% and 3.85%, respectively. Switching off the spatial attention (SA) module led to a reduction of up to 4.2%. These results show that attention mechanisms are crucial for improving feature representation and increasing discriminative capacity. Likewise, fusing ResNet50 and VGG16 features was very important for capturing both global and local characteristics: using only ResNet50 or only VGG16 reduced average accuracy by 5.6% and 6.4%, respectively, showing that feature fusion is crucial for a complete representation.
Hyperparameter tuning was crucial for optimal performance. A learning rate of 1e-5 achieved the best balance between stability and convergence, as higher rates led to instability and lower rates slowed down training. A batch size of 32 balanced computational efficiency and accuracy, while a dropout rate of 0.5 effectively curbed overfitting. As shown in Table 2, these configurations were experimentally validated as the most effective for training CA-SLNet. Using pre-trained ResNet50 and VGG16 models was critical for achieving high performance: replacing the pre-trained weights with random initialization decreased accuracy by more than 10% for all datasets, with corresponding reductions in AUC values. This underlines the importance of transfer learning for improving generalization and accelerating convergence. These results confirm the robustness and optimality of the CA-SLNet architecture for automatic melanoma detection.
| Configuration | ISIC-2016 (%) | ISIC-2017 (%) | Ph2 (%) |
|---|---|---|---|
| Proposed model (CA-SLNet) | 98.59 | 96.80 | 99.50 |
| Without CA | 95.34 | 93.82 | 95.65 |
| Without SA | 94.39 | 92.76 | 95.30 |
| Only ResNet50 | 92.85 | 91.63 | 93.72 |
| VGG16 only | 92.19 | 91.02 | 93.11 |
| Random weights | 88.27 | 85.67 | 89.34 |
| Higher learning rate | 94.52 | 92.75 | 94.31 |
| Lower learning rate | 95.11 | 93.20 | 95.10 |
| Dropout rate (0.3) | 95.21 | 93.68 | 95.90 |
| Dropout rate (0.7) | 94.58 | 92.89 | 94.78 |
| Batch Size (16) | 94.92 | 93.45 | 95.13 |
| Batch Size (64) | 93.87 | 92.01 | 94.15 |
Feature localization and explainability in CA-SLNet
To assess the interpretability of the CA-SLNet model, we created visual explanations using Grad-CAM overlays, saliency maps and attention heat maps. These visualisations help to understand how the model prioritises different image regions when making classification decisions. Figure 5 illustrates this process using a skin lesion as an example. The base feature maps of ResNet50 and VGG16 (Figs. 5B and 5C) show general feature extraction, while the fused CA-SLNet feature map (Fig. 5D) localises the lesion-critical regions more precisely. The channel attention map (Fig. 5E) highlights the feature dimensions that have the greatest influence on classification, while the spatial attention map (Fig. 5F) identifies the lesion-relevant regions in the spatial domain. The saliency map (Fig. 5G) illustrates pixel-level influence, and the Grad-CAM overlay (Fig. 5H) provides an intuitive heatmap that confirms the model’s focus on lesion boundaries and morphological features. These interpretation tools increase the transparency of CA-SLNet and show that its predictions are based on medically meaningful image regions, which is crucial for clinical validation.
Figure 5: Visual explanation of CA-SLNet: (A) original lesion image, (B, C) ResNet and VGG base feature maps, (D) CA-SLNet feature map highlighting lesion critical regions, (E, F) channel and spatial attention heat maps highlighting diagnostic areas, (G) saliency map showing influential pixels and (H) Grad-CAM overlay for interpretability.
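For reference, a minimal hook-based Grad-CAM sketch in PyTorch is shown below; it is a generic implementation applied to an assumed `target_layer` (e.g., the last convolutional block), not the exact visualization code used in the study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Grad-CAM sketch: heatmap of class-discriminative regions for a
    single-logit (binary) model. `image` is a (3, H, W) tensor."""
    activations, gradients = {}, {}

    def fwd_hook(module, inp, out):
        activations["value"] = out.detach()

    def bwd_hook(module, grad_in, grad_out):
        gradients["value"] = grad_out[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    logit = model(image.unsqueeze(0))  # single malignancy logit (assumed head)
    logit.backward()
    h1.remove(); h2.remove()

    acts, grads = activations["value"], gradients["value"]  # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)          # GAP of gradients
    cam = F.relu((weights * acts).sum(dim=1))               # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    return cam.squeeze()
```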
Comparison with state-of-the-art methods
The proposed CA-SLNet performs better than existing state-of-the-art (SOTA) models across multiple dermatological benchmark datasets, including ISIC-2016, ISIC-2017 and Ph2. On the ISIC-2016 dataset, the classification results showed the robust performance of the model. Figure 6 shows high accuracy, with correct classification of 900 benign and 361 malignant samples, no false positives and only 18 false negatives. The minimal misclassification confirms that the model can reliably identify melanomas in clinical applications. The effectiveness of the model is shown in Fig. 7: it achieved excellent performance in classifying cancerous lesions, with an average AUC value of 0.9814 ± 0.0121. The individual AUC values obtained in the five cross-validation folds ranged from 0.9643 to 0.9968, demonstrating the reliable and consistent performance of the model. These results suggest that the model could be a robust aid for dermatologists in clinical diagnosis. Table 3 compares the classification accuracy of CA-SLNet with previous work on the ISIC-2016 dataset; it summarizes the performance of models developed in recent years and their respective accuracy levels.
Figure 6: Confusion matrix for the ISIC 2016 dataset.
Figure 7: ROC curve for the ISIC-2016 dataset.
The model achieved a mean AUC of 0.9814 ± 0.0121, demonstrating strong discriminatory power between benign and malignant lesions.

| Authors | Year | Problem type | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|---|---|
| Sahoo, Dash & Mohapatra (2024) | 2024 | Classification | 80.61 | – | – | – |
| Mabrouk et al. (2022) | 2022 | – | 88.39 | 88.39 | 87.81 | 87.51 |
| Hosny & Kassem (2022) | 2022 | – | 92.57 | – | – | – |
| Akram et al. (2024) | 2024 | – | 96.26 | 96.00 | 96.40 | 96.20 |
| Khouloud et al. (2022) | 2022 | – | 97.49 | – | – | – |
| Amin et al. (2020) | 2020 | – | 98.39 | – | – | – |
| Proposed model | 2024 | Classification | 98.59 | 95.25 | 99.00 | 97.57 |
Experiments results
Experiments evaluating the proposed CA-SLNet were performed using three publicly available datasets: Ph2 (Hosny & Kassem, 2022), ISIC-2017 (Hasan et al., 2022) and ISIC-2016 (Gutman et al., 2016). These benchmark datasets were selected because they contain diverse skin lesion examples.
Misclassification analysis
To analyse the model errors, we examined the confusion matrices for each dataset. In ISIC-2016, the model misclassified only 18 malignant cases as benign, with zero false positives. In ISIC-2017, 64 malignant cases were classified as benign, reflecting the visual complexity of the dataset and the overlap of lesions. In contrast, the model in the Ph2 dataset achieved near-perfect classification, with only one benign image incorrectly categorised as malignant. These results indicate that false-negative images were the main cause of misclassification, particularly in ISIC-2017. Our augmentation strategy and attention-based fusion mitigated the imbalance between classes and improved overall recall, but further improvements may be needed to reduce critical misclassifications.
Model inference time and hardware efficiency
To assess the clinical suitability of CA-SLNet, we measured the inference time and hardware efficiency with an NVIDIA RTX 4070 GPU and an Intel Core i9 processor. The model processes a dermatoscopic image in 42 ms, enabling near real-time support. Despite its dual backbone design (ResNet50 and VGG16), efficiency is maintained through selective layer tuning. CA-SLNet has a model size of 97 MB, 29.6 million parameters and uses 1.8 GB of GPU memory during inference, confirming its suitability for use in clinical environments with moderate hardware.
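A simple way to reproduce such a latency measurement is sketched below; it assumes a CUDA device and a constructed `model`, and uses warm-up iterations so that CUDA initialization is not counted.

```python
import time
import torch

model.eval().cuda()
dummy = torch.randn(1, 3, 224, 224, device="cuda")  # one dermoscopic image

with torch.no_grad():
    for _ in range(10):                  # warm-up: exclude CUDA startup cost
        model(dummy)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):                 # average over repeated forward passes
        model(dummy)
    torch.cuda.synchronize()
    per_image_ms = (time.perf_counter() - start) / 100 * 1000

n_params = sum(p.numel() for p in model.parameters())
print(f"{per_image_ms:.1f} ms/image, {n_params / 1e6:.1f} M parameters")
```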
The proposed model was developed to accurately discriminate between melanoma and non-melanoma skin lesions.
Table 3 shows that CA-SLNet achieves an accuracy of 98.59% and outperforms most other models, including Amin et al. (2020) at 98.39%. Beyond its high accuracy, CA-SLNet provides several additional advantages by incorporating the ResNet50 and VGG16 backbones and attention mechanisms into feature extraction. In particular, advanced data augmentation improves its generalization. These features make the classification robust and adaptable to different datasets, making CA-SLNet a practical, state-of-the-art tool for melanoma classification.
As shown in Fig. 8, the model achieves high classification accuracy on the ISIC-2017 dataset, correctly identifying 1,400 benign and 536 malignant samples. The model produced 0 false positives and 64 false negatives, demonstrating robust predictive ability and showing that it avoided overdiagnosis, which is crucial for avoiding unnecessary treatment. The slightly increased false-negative rate, however, leaves room for improvement in minimizing missed melanoma cases, which is important in clinical applications. The mean AUC value from Fig. 9 is 0.9626 ± 0.0215, demonstrating that CA-SLNet discriminates strongly between benign and malignant skin lesions and reinforcing its position as a reliable model for skin cancer diagnosis. The individual AUC values of the five cross-validation splits ranged from 0.9317 to 0.9873, indicating stable and reliable performance. The model classifies the images with high accuracy and an evident capacity to distinguish between skin diseases and other conditions, underlining its great potential to help dermatologists in clinical practice. Table 4 shows the performance of recent studies and the gradual advancement in melanoma classification sensitivity.
Figure 8: Confusion matrix for the ISIC 2017 dataset.
Figure 9: ROC curve for the ISIC-2017 dataset.
The CA-SLNet model achieved an average AUC of 0.9626 ± 0.0215 with five-fold cross-validation, with individual AUC values ranging from 0.9317 to 0.9873. This demonstrates the strong and stable ability of the model to discriminate between benign and malignant lesions in different clinical presentations.

| Authors | Year | Problem type | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|---|---|
| Nakai, Chen & Han (2022) | 2022 | Classification | 92.10 | 84.31 | 72.00 | 87.41 |
| Hasan et al. (2022) | 2022 | – | 93.10 | 92.80 | 92.00 | 91.80 |
| Khan et al. (2020) | 2020 | – | 93.40 | – | 94.40 | 94.29 |
| Song, Wang & Wang (2023) | 2023 | – | 95.60 | – | – | – |
| Akram et al. (2024) | 2024 | – | 95.71 | 95.90 | 95.70 | 95.70 |
| Proposed model | 2024 | Classification | 96.80 | 89.33 | 95.70 | 94.37 |
In addition, the results shown in Figs. 8 and 9 demonstrate the strong predictive power of the model, with a high AUC and minimal false positives. CA-SLNet performed better than other dermoscopy models because its attention layers extract richer features while focusing only on the diagnostically important parts of the dermoscopic images. The model also generalizes better owing to newer data augmentation methods such as random erasing and color jittering. This is reflected in the consistently high performance of CA-SLNet across datasets, which in turn emphasizes the robustness of the network for use in clinical practice.
Figure 10 shows that the model correctly identified all 40 malignant cases and 159 of 160 benign lesions, with only one benign image misdiagnosed as malignant and no missed malignant cases. The model thus achieves exceptional performance by completely suppressing missed melanoma diagnoses; the elimination of false negatives contributes significantly to early detection of melanoma and improved patient outcomes.
Figure 10: Confusion matrix for the Ph2 dataset.
The evaluation of the proposed model on the Ph2 dataset shows both reliable and valid results. The model showed an average AUC value of 0.9814 ± 0.0121 in Fig. 11, indicating its significant ability to discriminate between benign and malignant lesions. The AUC values measured in five-fold cross-validation ranged from 0.9644 to 0.9969, confirming the reliability of the model.
Figure 11: ROC curve for the Ph2 dataset.
The model showed exceptional classification ability with an average AUC of 0.9814 ± 0.0121, with five-fold AUC values ranging from 0.9644 to 0.9969. The consistently high AUC underlines the generalisability and accuracy of the model, even with smaller, high-quality clinical datasets.

For all three datasets examined, ISIC-2016, ISIC-2017 and Ph2, the model achieved very similar AUC values, which could be due to specific characteristics of the individual datasets, including the degree of disparity between classes, the range of lesions and the overall resolution of the images. The results on these datasets showed that the proposed model generalizes fully, with high classification accuracy and a nearly flawless AUC metric, which is excellent for clinical use. Several challenges were encountered when working with the different datasets; by overcoming them, the model provides accurate and reproducible results and supports dermoscopy-based diagnosis in different clinical cases related to melanoma. Table 5 shows the classification accuracy of CA-SLNet compared to other models for the Ph2 dataset. As the table shows, current melanoma classification models achieve promising results, and the CA-SLNet proposed in this work attains the highest accuracy among them.
| Authors | Year | Problem type | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|---|---|
| Ozkan & Koklu (2017) | 2017 | Classification | 92.50 | 90.86 | 92.38 | 90.45 |
| Xie et al. (2020) | 2020 | – | 94.20 | 94.00 | 95.00 | 94.60 |
| Hosny & Kassem (2022) | 2022 | – | 94.97 | 92.32 | 92.32 | 92.20 |
| Amin et al. (2020) | 2020 | – | 98.39 | – | – | – |
| Benyahia, Meftah & Lézoray (2022) | 2022 | – | 99.05 | – | – | – |
| Proposed model | 2024 | Classification | 99.50 | 99.40 | 98.80 | 99.77 |
The proposed CA-SLNet model shows exceptional performance, achieving 99.50% accuracy on the Ph2 dataset compared to SOTA, as shown in Table 6. This performance sets a new benchmark, exceeding the highest previously reported accuracy of 99.05% (Benyahia, Meftah & Lézoray, 2022). This significant improvement is due to CA-SLNet’s attention mechanisms, which increase focus on key lesion regions, and to a robust data augmentation strategy that improves generalisation. Figure 12 highlights the exceptional performance of the proposed model across multiple datasets. CA-SLNet consistently outperforms previous approaches, achieving near-perfect classification results on the Ph2 dataset and strong generalization on the ISIC-2016 and ISIC-2017 datasets. These results confirm the robustness and adaptability of the model and emphasize its potential for practical applications in clinical dermatology.
| Dataset | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| Ph2 dataset | 99.50 | 99.40 | 98.80 | 99.77 |
| ISIC-2016 dataset | 98.59 | 95.25 | 99.00 | 97.57 |
| ISIC-2017 dataset | 96.80 | 89.33 | 95.70 | 94.37 |
Figure 12: Performance metrics of CA-SLNet on Ph2, ISIC 2016 and ISIC 2017 datasets.
The integration of attentional mechanisms allows the model to extract more informative features from the lesion regions, ensuring higher precision, recall and F1-scores. Consequently, CA-SLNet proves to be a reliable solution for accurate and fast melanoma detection, which is promising for real-world applications.
Figure 12 visualizes the numerical results in Table 6 and confirms that the model achieves state-of-the-art results across different datasets. This makes CA-SLNet a scalable solution suitable for real-world clinical applications where high precision and generalization are required.
Statistical reliability of results
To validate the robustness and reproducibility of the CA-SLNet model, we performed a five-fold cross-validation for each dataset. Across all datasets, ISIC-2016, ISIC-2017 and Ph2, the model consistently delivered high performance with low variance. The mean AUC values, reported with their standard deviations in the respective ROC curve analyses, were 0.9814 ± 0.0121 for ISIC-2016, 0.9626 ± 0.0215 for ISIC-2017 and 0.9814 ± 0.0121 for Ph2. These small standard deviations reflect minimal variation in classification ability across folds and demonstrate the stability and reliability of the model’s predictions.
Discussion
The proposed CA-SLNet model has achieved significant results in classifying skin lesions across datasets, with channel and spatial attention applied on top of the ResNet50 and VGG16 backbones. The method achieved high accuracy, precision, recall and F1-scores by exploiting both global and local features. On the ISIC-2016 dataset, it achieved 98.59% accuracy, 95.25% recall, 99.00% precision and a 97.57% F1-score. On the ISIC-2017 dataset, the model achieved an accuracy of 96.80%, a recall of 89.33%, a precision of 95.70% and an F1-score of 94.37%; the false-negative rate is slightly higher due to the variety of lesions. On the Ph2 dataset, the model achieved near-perfect results with 99.50% accuracy, 99.40% recall, 98.80% precision and an F1-score of 99.77%. These statistics demonstrate the excellent generality of the model, even with smaller datasets.

Ablation studies confirmed the importance of each component. Disabling the CA module reduced accuracy by 3.25%, 2.98% and 3.85% for ISIC-2016, ISIC-2017 and Ph2, respectively, while removing the spatial attention (SA) module led to accuracy drops of up to 4.2%. Excluding ResNet50 or VGG16 resulted in average accuracy declines of 5.6% and 6.4%, respectively, emphasizing the complementary strengths of the dual-backbone architecture. These findings highlight the model’s ability to refine feature representation and suppress noise effectively.

As shown in Table 6, CA-SLNet outperformed state-of-the-art models such as EfficientNet and Dermo-Optimizer. On the Ph2 dataset, CA-SLNet achieved 99.50% accuracy, surpassing EfficientNet (98.70%) and Dermo-Optimizer (99.05%). Similarly, on the ISIC-2016 dataset, CA-SLNet’s 98.59% accuracy exceeded Deep CNN’s 96.3% and the Newton-Raphson framework’s 94.5%. On the ISIC-2017 dataset, which presents greater lesion variability, CA-SLNet maintained a competitive accuracy of 96.80%, outperforming most existing approaches. CA-SLNet’s high sensitivity, specificity and consistent performance across diverse lesion types establish it as a state-of-the-art automated skin lesion classification system. Its robustness and clinical relevance, particularly for early melanoma detection, indicate its potential to improve diagnostic accuracy and patient outcomes significantly. However, the model showed a comparatively higher false-negative rate on the ISIC-2017 dataset. This can be attributed to the dataset’s greater lesion diversity and the complex visual overlap between benign and malignant cases, which can reduce the sensitivity of the classifier. Such cases underscore the importance of minimising false negatives in clinical diagnostics, as they represent missed melanoma detections.
To evaluate the practicality of CA-SLNet, we compared its inference performance with baseline fusion strategies under the same hardware conditions. CNN models with a single backbone, such as ResNet50 and VGG16, showed the fastest inference times (28–30 ms per image). Concatenation-based fusion slightly increased this time (36 ms), while gated fusion caused more overhead (46 ms) due to its additional adaptive components. The proposed adaptive softmax-weighted fusion strategy in CA-SLNet achieved a balanced inference time of about 42 ms per image while maintaining high diagnostic accuracy. These results demonstrate that CA-SLNet is suitable for integration into clinical workflows that require real-time or near real-time inference, especially when deployed on modern GPU infrastructure. In addition, the dual-backbone design combining ResNet50 and VGG16 significantly improves feature extraction and classification performance but leads to higher computational complexity, which may limit the use of the model on edge devices or in resource-constrained clinical environments. Future work will reduce this overhead through architectural optimisation techniques such as model pruning, knowledge distillation, or replacing a backbone with a lightweight alternative such as MobileNet or EfficientNet.
Conclusion and future work
This study presented CA-SLNet, a hybrid model based on channel and spatial attention, developed for the classification of skin lesions. By integrating attention mechanisms into two backbone architectures, ResNet50 and VGG16, the model effectively captures both spatial and semantic features. The evaluation on three publicly available datasets, ISIC-2016, ISIC-2017 and Ph2, showed the strong performance of the model, which achieved AUC values of 0.9814, 0.9626 and 0.9814 with corresponding classification accuracies of 98.59%, 96.80% and 99.50%, respectively. The integration of attention mechanisms with complementary features improved the model’s ability to discriminate between melanoma and non-melanoma lesions and also made it possible to identify critical spatial regions in dermoscopic images. However, the model has some limitations, including increased computational complexity due to the use of two backbones and a slightly increased false negative rate, especially for the ISIC-2017 dataset. These limitations indicate challenges for use in resource-constrained clinical settings and emphasize the need for improved sensitivity in datasets with high visual similarity between classes. Future work will focus on reducing the computational burden by optimising the model architecture, increasing the training diversity by generating synthetic data and improving the interpretability of predictions through explainable AI techniques. In addition, extending validation to larger and more diverse datasets such as ISIC-2018 and HAM10000 will be crucial to ensure better generalization and clinical applicability. Ensemble and calibration techniques can also be explored to further improve sensitivity and reduce false negatives.