As engineers, we spend a lot of our time thinking about how best to shield our clients and customers from the risks inherent in the systems we build. We ask ourselves "What’s the worst that could happen?", then work hard to mitigate that risk. A common risk in most systems, particularly distributed ones, is the unexpected failure of a component. As a system’s complexity and its number of subsystems grow, so does the likelihood of a subsystem failure. Subsystem failures can compound in such a way that catastrophic system failure becomes a certainty; the only uncertainty is when the system will fail.

Chaos Engineering addresses the risks inherent in distributed systems that stem from unexpected component failure. It does so by running experiments that explore the impact of subsystem failures, deliberately inducing different types of failure in different components. Outcomes are then analysed and the learnings applied to improve the system’s resilience. These learnings deepen our understanding of the system and its failure modes, which in turn helps us identify new failure scenarios. This feedback loop informs subsequent rounds of experimentation, and so the cycle repeats. In addition, planned failures provide a safe environment for teams to improve their incident response and how they conduct subsequent postmortems.

Chaos experiments can take many forms, ranging from continuous, automated failure injection (made famous by the Netflix Chaos Monkey) to one-off Chaos Days (similar to Amazon’s Game Day (https://aws.amazon.com/gameday/)), where disruption is manually instigated. Chaos Engineering is similar to the ethos of "building quality in": it’s a mindset, not a toolset. You don’t need to be running EKS on AWS to benefit from being curious about failure modes and how to improve a system’s resilience to them. It just requires a focus on "building resilience in".

This session shares our experience of running Chaos Days over several years with one of our clients – a major Government department that hosts around 60 distributed, digital delivery teams. These teams design, deliver and support hundreds of microservices that serve online content to the department’s varied customers. The microservices all run on a single platform, itself run by seven Platform Teams, each responsible for a distinct area (infrastructure, security and so on). Inspired by the Netflix Chaos Monkey and Amazon’s Game Day, the Platform Teams have planned and executed several Chaos Days – to see just how well they and the Platform cope when everything that can go wrong does go wrong.

The session will explore why you’d run a Chaos Day, and how to know when you and your platform are ready to do so. We’ll share our learnings about the actual mechanics of running one: how to plan, execute and retrospect a Chaos Day. We’ll also share what hasn’t worked so well, and the areas we’d like to focus on in the future.

When real (unplanned) failures occur, they provide excellent opportunities to learn and improve a system’s resilience. We’ll explore how to make the most of these events by running effective postmortems, and how Chaos Days can further refine your postmortem approach.

The session will conclude by discussing how Chaos Engineering could be applied to attendees’ own contexts, presenting various possible starting points.

Check out more of our featured speakers and talks at https://ndcconferences.com/ and https://ndcoslo.com/