Advancements in cloud-native systems: A comprehensive survey on reliability, scalability, and architectural innovations in distributed and edge ecosystems
Abstract
Cloud-native systems, enabled by microservices, serverless computing, and edge intelligence, are reshaping the design and deployment of modern distributed applications, with a projected 35% CAGR by 2030 (IDC, 2025). While offering enhanced scalability and operational agility, these systems introduce significant challenges in ensuring reliability, observability, and security, particularly in latency-sensitive edge deployments. This survey systematically analyzes 300 high-impact peer-reviewed studies published between 2017 and September 2025 across key domains such as root cause analysis, chaos engineering, predictive autoscaling, and federated security. Noteworthy advancements include the Nezha framework, which achieves 89.77% top-1 accuracy in root cause analysis using multi-modal telemetry and outperforms traditional methods by 15%, and Kubernetes-based remediation frameworks that demonstrate 98.7% recovery precision under failure injection. Additional progress is observed in STEAM’s GNN-based trace sampling, low-latency FPGA-based anomaly detection, and RLNC-enhanced 5G packet recovery, which enable sub-10 ms responsiveness and have been validated in real-world AWS and Azure environments. Despite these innovations, the review identifies persistent gaps in explainability, cross-cluster observability, and the scalability of LLM-based remediation, with explainability scores dropping below 60% in complex scenarios. Real-world implementations such as Microsoft Teams and NATO defense clouds underscore the practicality of resilient, AI-driven cloud-native infrastructures that achieve 99.9% uptime in critical operations. The findings highlight that future cloud-native platforms must integrate ML-based diagnostics, hardware acceleration, and formal verification to achieve five-nines availability in mission-critical environments spanning healthcare, defense, and smart industry, a conclusion validated by a simulated case study with a 98% success rate.
Such systems must be inherently adaptive, self-healing, and secure to effectively manage the increasing architectural complexity and workload volatility characteristic of next-generation cloud ecosystems.