5-hour Microsoft 365 outage caused by faulty ECS deployment

www.bleepingcomputer.com

This outage alone is enough to drop Microsoft 365's cloud uptime below four nines (99.99%) of SLA uptime.

From the 2016 MMIE paper:

"A careful ‘cloud bug study’ (11 person/year effort) was undertaken to classify and manually annotate thousands of issues in six popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, Zookeeper, and Flume) along multiple dimensions. The study found that every implication, from failed operation, to performance degradation, downtime, data loss, data corruption, loss and staleness can be caused by virtually any software and hardware fault combination [4].

More recently, the same group homed in on 104 distributed concurrency (DC) bugs (bugs triggered by unexpected timing of events). They created a detailed multi-dimensional taxonomy; analyzed timing issues, trigger pre-conditions, error and fixing strategies. Their findings are striking and well worth studying, among them:

We lack tools to analyze complex protocol interactions. Distributed model checkers have triggering blind spots due to intractability of event state space. Injecting delays at runtime seems to prevent 40% of DC bugs from triggering, but may introduce hanging risks. Almost half of DC bugs lead to silent failures, and possible mysterious errors much later in time. Fix strategies are challenging because correctly implemented synchronization of globally distributed systems is a hard problem [5].

We understand predictably as of yet little about the event timings and hardware/software constellations which violate the implicit and explicit system assumptions"
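For anyone who wants to check the four-nines remark at the top of this comment, here is a rough back-of-the-envelope sketch. It assumes the full 5 hours from the headline counts as downtime and that uptime is measured over either a 365-day year or a 30-day month. Both windows are my assumptions, not something the article or any SLA text states, and the helper names `availability` and `nines` are made up for illustration.

```python
import math

def availability(downtime_hours: float, window_hours: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    return 1.0 - downtime_hours / window_hours

def nines(avail: float) -> float:
    """Express availability as a count of nines, e.g. 0.999 -> about 3.0."""
    return -math.log10(1.0 - avail)

OUTAGE_HOURS = 5.0       # taken from the headline
YEAR_HOURS = 365 * 24    # assumed window: one 365-day year
MONTH_HOURS = 30 * 24    # assumed window: one 30-day month

for label, window in (("year", YEAR_HOURS), ("30-day month", MONTH_HOURS)):
    a = availability(OUTAGE_HOURS, window)
    print(f"{label}: {a:.4%} available, about {nines(a):.2f} nines")
```

Over a year this lands a little above three nines (roughly 99.94%), and over a single month it is between two and three nines. For comparison, a four-nines budget is about 53 minutes of downtime per year, so a 5-hour outage overshoots it several times over.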