Scaling Reliability Engineering with Tools - Nikos Katirtzis & Daniel Albuquerque - NDC Porto 2022

Nikos Katirtzis & Daniel Albuquerque NDC Conferences

41:08

Published July 18, 2022

About this talk

Site Reliability Engineering at companies with thousands of production services and engineers can be approached in different ways. You may have SREs responsible for specific domains, or central SRE teams tasked with creating a centre of excellence for the entire company. You may also have teams building tools in that space. In this talk we will go through our approach to building tools for reliability engineering.

The first part will touch upon:

- Buy vs build vs open-source decisions
- Local vs global maxima: providing an on-road experience with a single platform and common tools
- Feedback loop between tools, teams, and events such as incident reviews and GameDays

The second part will focus on two of our projects: an automated region failover capability and chaos engineering. We will present our work and what we have learned so far, covering designs and architectures, technical challenges, the importance of good developer experience, challenges for adoption, and more.