Site Reliability Engineering at companies with thousands of production services and engineers can be approached in different ways. You may have SREs responsible for specific domains, or central SRE teams tasked with creating a centre of excellence for the entire company. You may also have teams building tools in that space. In this talk we will go through our approach to building tools for reliability engineering.

The first part will touch upon:
- Buy vs build vs open-source decisions
- Local vs global maxima: providing an on-road experience with a single platform and common tools
- Feedback loop between tools, teams, and events such as incident reviews and GameDays

The second part will focus on two of our projects: an automated region failover capability and chaos engineering. We will present our work and what we have learned so far, covering designs and architectures, technical challenges, the importance of good developer experience, challenges for adoption, and more.