---
title: SRE Site Reliability Engineering
tags: SRE,devops
description: View the slide with "Slide Mode".
---
# Site Reliability Engineering (SRE)
---
> “What happens when a software engineer is tasked with what used to be called operations” — Ben Treynor, Google.
---
## What is SRE?
- SRE is a set of practices to run software operations
- Operations are a software problem
- DevOps vs SRE: DevOps is the more generic term; SRE is one concrete way to implement it
---
## Key Principles of SRE
- Blame-free culture: embrace mistakes as a learning experience for the whole team -> keep emotions in check
- Eliminate Toil
- Run systems based on Service Level Objectives (SLOs) and error budgets -> measure, analyze, iterate
- There are no heroes -> not a one-person show
- Be an advisor -> get involved early
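
To make the error budget idea concrete, here is a minimal sketch for a request-based SLO. The function name, the 99.9% target, and the request counts are assumptions chosen for illustration; they do not come from a real system.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget for a request-based SLO (illustrative sketch).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests
    if allowed_failures > 0:
        budget_consumed = failed_requests / allowed_failures
    else:
        budget_consumed = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "budget_consumed": budget_consumed,
    }


# Hypothetical numbers: 1,000,000 requests in the window, 400 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"{report['budget_consumed']:.0%} of the error budget is used")
```

With these assumed numbers, roughly 1,000 failures are allowed, so 400 failures use about 40% of the budget. A team could agree to pause risky releases once the consumed share approaches 100%.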
---
## Set of Practices
- Post Mortem = Report of a Failure/Incident
- Playbooks = actionable how-tos
- Wheel of misfortune = a wheel that picks someone to solve a simulated failure
- Automation = software that takes over recurring tasks
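
The wheel of misfortune can be as simple as a random pick. The sketch below is one possible version; the team members and scenarios are made-up placeholders.

```python
import random


def spin_wheel(team, scenarios, rng=None):
    """Pick one team member and one simulated failure at random."""
    if not team or not scenarios:
        raise ValueError("team and scenarios must not be empty")
    rng = rng or random.Random()
    return rng.choice(team), rng.choice(scenarios)


# Hypothetical team and failure scenarios.
team = ["Alex", "Sam", "Kim"]
scenarios = ["database is read-only", "TLS certificate expired", "disk is full"]

person, scenario = spin_wheel(team, scenarios)
print(f"{person} handles: {scenario}")
```

Passing a seeded `random.Random` as `rng` makes the pick reproducible, which can help when preparing a session in advance.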
---
## SRE anti-patterns
- Mixing business KPIs/bonuses with SLOs
- SLOs that cannot be measured -> start simple
- Alert spam
- Non-actionable alerts
- Spending too much time on reactive work -> aim for a 50-50 split between reactive work and engineering
- "This is bad", "This sucks", etc. -> every implementation is valid; be an advisor, show the pros and cons, make decisions explicit
---
## SRE at Motius
- We do not have huge projects in production
- Still, we should write post mortems
- We should define SLOs (NOT in contracts)
- We should automate
- We should write playbooks
- We should focus on a blame-free culture (lead by example, skill profile!, make failures public)
---
## Template of a Post Mortem
- Date
- Authors
- Status
- Summary
- Impact, Root Cause, Trigger
- Action Items
---
## Template of a Post Mortem (continued)
- Lessons Learned
- What went well
- What went wrong
- Where we got lucky
- Timeline
- Supporting Information
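
The template sections could also be captured in a small data structure, so a draft can be checked for gaps before it is published. This is only a sketch: the class name, field names, and sample values are assumptions derived from the section list, not an established format.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class PostMortem:
    """One field per section of the post mortem template."""
    date: str
    authors: list[str]
    status: str
    summary: str
    impact: str = ""
    root_cause: str = ""
    trigger: str = ""
    action_items: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_wrong: list[str] = field(default_factory=list)
    got_lucky: list[str] = field(default_factory=list)
    timeline: list[str] = field(default_factory=list)
    supporting_information: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of all sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft with only the header sections filled in.
draft = PostMortem(
    date="2024-01-15",
    authors=["Alex"],
    status="draft",
    summary="Login was unavailable for 20 minutes.",
)
print("Still to write:", ", ".join(draft.missing_sections()))
```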
---
## Must-have playbooks (few of them exist in my projects)
- How do I access production/applications?
- How do I access the logs?
- How do I safely reboot the application?
- How do I create a manual backup?
- How do I restore from backup?
- How do I renew certificates/credentials?
- How do I react to alert xyz?
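
One way to keep alerts actionable is to check that every configured alert points to a playbook. The sketch below assumes a plain mapping from alert name to playbook path; all alert names and paths are hypothetical.

```python
# Hypothetical alert names and playbook paths, for illustration only.
PLAYBOOKS = {
    "HighErrorRate": "playbooks/high-error-rate.md",
    "CertificateExpiresSoon": "playbooks/renew-certificates.md",
}


def alerts_without_playbook(alert_names, playbooks):
    """Return the alerts that have no playbook entry, sorted by name."""
    return sorted(name for name in alert_names if not playbooks.get(name))


configured_alerts = ["HighErrorRate", "CertificateExpiresSoon", "DiskAlmostFull"]
print(alerts_without_playbook(configured_alerts, PLAYBOOKS))
```

A check like this could run in CI so that a new alert cannot be added without a matching playbook.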
---
## Valuable Links
- [Playlist O'Reilly](https://learning.oreilly.com/playlists/a7f341f1-8d4b-41f6-93f7-830b216b0e02/)
- [SRE Books](https://sre.google/books/)
- [SRE Podcasts](https://sre.google/prodcast/)