CVTF: CNN-Vision Transformer Fusion for adaptive steganographic malware detection in social media images
Abstract
The exponential growth of social media platforms has heightened the risk of image-based malware, in which malicious code is covertly embedded within otherwise benign images using steganographic techniques. Traditional detection systems struggle to identify these threats because they are static and unable to recognize adaptive concealment methods. To address these challenges, we propose CVTF (CNN-Vision Transformer Fusion), a hybrid framework that integrates Convolutional Neural Networks (CNNs) with Vision Transformers (ViTs) to exploit both local pixel patterns and global contextual cues through a novel attention-weighted feature fusion mechanism. A Dynamic Threshold Calibration Mechanism (DTCM) adaptively adjusts classification boundaries in real time to reduce false positives while maintaining high sensitivity, and an Adaptive Threat Profiling Module (ATPM) incrementally updates the model to respond effectively to newly emerging malware variants, thereby mitigating zero-day threats. Extensive experiments on real-world social media image datasets, comprising both benign and steganographically modified malware samples, demonstrate the robustness and scalability of CVTF. The system achieves 95.2% detection accuracy, a 2.3 ms average inference time, and a 3.5% accuracy improvement through continuous learning from 1,000 new malicious samples. These results validate the efficacy of CVTF in detecting stealthy malware payloads while maintaining low-latency, real-time performance. CVTF can be seamlessly integrated into existing security infrastructures, enhancing the detection of steganographic threats and promoting secure social media ecosystems.
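To make the attention-weighted feature fusion concrete, the following is a minimal PyTorch sketch of how pooled CNN features and ViT features could be combined with learned softmax weights before classification. The module name, feature dimensions, and the simple two-branch weighting scheme are illustrative assumptions, not the exact architecture described in the paper.

```python
import torch
import torch.nn as nn


class AttentionWeightedFusion(nn.Module):
    """Illustrative fusion head: combines CNN and ViT feature vectors
    with learned softmax attention weights (hypothetical dimensions)."""

    def __init__(self, cnn_dim=512, vit_dim=768, fused_dim=256, num_classes=2):
        super().__init__()
        # Project both branches into a shared embedding space.
        self.cnn_proj = nn.Linear(cnn_dim, fused_dim)
        self.vit_proj = nn.Linear(vit_dim, fused_dim)
        # Scalar attention score per branch, normalized with softmax.
        self.attn = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, cnn_feat, vit_feat):
        # cnn_feat: (batch, cnn_dim) pooled CNN features (local pixel patterns)
        # vit_feat: (batch, vit_dim) ViT [CLS] features (global context)
        branches = torch.stack(
            [self.cnn_proj(cnn_feat), self.vit_proj(vit_feat)], dim=1
        )  # (batch, 2, fused_dim)
        weights = torch.softmax(self.attn(branches), dim=1)  # (batch, 2, 1)
        fused = (weights * branches).sum(dim=1)              # (batch, fused_dim)
        return self.classifier(fused)                        # benign vs. stego-malware logits


# Usage with dummy pooled features standing in for the two backbone outputs.
fusion = AttentionWeightedFusion()
logits = fusion(torch.randn(4, 512), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

In this sketch the per-branch attention weights let the model lean on local pixel statistics or global context depending on the input; the decision threshold applied to the resulting logits would then be adjusted at inference time by a mechanism such as the DTCM described above.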