---
title: SRE Site Reliability Engineering
tags: SRE,devops
description: View the slide with "Slide Mode".
---

# Site Reliability Engineering (SRE)

---

> “What happens when a software engineer is tasked with what used to be called operations”
> -- Ben Treynor, Google

---

## What is SRE?

- SRE is a set of practices for running software operations
- Operations are a software problem
- DevOps vs SRE: SRE implements DevOps
- DevOps is the more generic term -> SRE is one way to implement it

---

## Key Principles of SRE

- Blame-free culture: embrace mistakes as a learning experience for the whole team -> control emotions
- Eliminate toil
- Run systems based on Service Level Objectives (SLOs) and error budgets -> measure, analyze, iterate
- There are no heroes -> not a one-man show
- Be an advisor -> be involved early

---

## Set of Practices

- Post mortem = report of a failure/incident
- Playbooks = actionable how-tos
- Wheel of misfortune = a wheel that picks someone to solve a simulated failure
- Automation = software that handles recurring tasks

---

## SRE anti-patterns

- Do not mix business KPIs/bonuses with SLOs
- SLOs that cannot be measured -> start simple
- SPAM
- Non-actionable alerts
- Spending too much time on reactions (aim for 50-50)
- "This is bad, this sucks, etc." -> every implementation is valid; be an advisor, show the pros and cons, make decisions explicit

---

## SRE at Motius

- We do not have huge projects in production
- Still, we should create post mortems
- We should define SLOs (NOT in contracts)
- We should automate
- We should write playbooks
- We should focus on a blame-free culture (lead by example, skill profile!, make failures public)

---

## Template of a Post Mortem

- Date
- Authors
- Status
- Summary
- Impact, Root Cause, Trigger
- Action Items

---

## Template of a Post Mortem

- Lessons Learned
  - What went well
  - What went wrong
  - Where we got lucky
- Timeline
- Supporting Information

---

## Must-have playbooks (few of them exist in my projects)

- How do I access production/applications?
- How do I access the logs?
- How do I safely reboot the application?
- How do I create a manual backup?
- How do I restore from backup?
- How do I renew certificates/credentials?
- How do I react to alert xyz?

---

## Valuable Links

- [Playlist O'Reilly](https://learning.oreilly.com/playlists/a7f341f1-8d4b-41f6-93f7-830b216b0e02/)
- [SRE Books](https://sre.google/books/)
- [SRE Podcasts](https://sre.google/prodcast/)
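
---

## Sketch: error budget from an SLO

To make the "SLOs and error budgets" principle concrete, here is a minimal sketch of how an error budget could be derived from an availability SLO. The function names, the 99.9% target, and the 30-day window are assumptions chosen for illustration; they do not come from any Motius project.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available" (assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    The result drops below zero once downtime exceeds the budget,
    i.e. once the SLO is violated for this window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, 21.6 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 21.6):.0%}")
```

With these assumed numbers, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of incidents would leave roughly half of the budget. Starting with one simple, measurable SLO like this fits the "start simple" advice above.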