LINE SRE Practice from Observability Viewpoint - 洪立遠（Yuan）

Welcome to DevOps Days 2019 Collaborative Notes



From "Distributed Systems Observability" by Cindy Sridharan


Observability
- Metrics: aggregatable
  - Prometheus
- Logging: distributed events
  - ELK
- Tracing: request scoped, microservice journey


Outage report meeting

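Monitoring has two layers: the first layer does not raise an alert directly; only the second layer sends a notification.

A ticket may contain many events.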

Log Level Guideline

Only log problems that require a person to handle them.

IMON Notification

All events will trigger notification

For logs, only log actionable information
For metrics, instrument every meaningful number available
- RED (Rate, Errors, Duration)
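The two guidelines above can be sketched together. This is a minimal illustration only, assuming an in-memory counter instead of a real metrics backend such as Prometheus; `RedMetrics`, `handle`, and the endpoint name are hypothetical and do not come from the talk.

```python
import logging
import time
from collections import defaultdict

logger = logging.getLogger("service")


class RedMetrics:
    """Hypothetical in-memory RED counters, keyed by endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)            # Rate: total requests seen
        self.errors = defaultdict(int)              # Errors: failed requests
        self.duration_seconds = defaultdict(float)  # Duration: summed latency

    def observe(self, endpoint, seconds, ok):
        self.requests[endpoint] += 1
        self.duration_seconds[endpoint] += seconds
        if not ok:
            self.errors[endpoint] += 1

    def error_ratio(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0


def handle(metrics, endpoint, func):
    """Run func, always record metrics, and log only when someone must act."""
    start = time.perf_counter()
    try:
        result = func()
    except Exception:
        metrics.observe(endpoint, time.perf_counter() - start, ok=False)
        # Actionable: a failed request is something on-call may need to handle.
        logger.error("request to %s failed", endpoint, exc_info=True)
        raise
    # Success is counted as a metric but not logged.
    metrics.observe(endpoint, time.perf_counter() - start, ok=True)
    return result
```

Successful requests show up only in the counters, so the log stays limited to entries a person is expected to act on.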

Outage Lifecycle

Outage happens
- Code change
- Mis-operation
- Saturation
- Physical resource
- Suddenly broke
IMON notification
- Fatal / Error log
- Metric > threshold
Debugging & Solving with IMON
- On-call team
Outage Report Meeting
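The two notification triggers in the lifecycle above (a Fatal / Error log line, or a metric above its threshold) can be sketched as follows. IMON is LINE's internal system and its real interface is not described in these notes, so every name here (`Event`, `event_from_log`, `event_from_metric`) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these log levels produce an event, per the notes above.
NOTIFY_LEVELS = {"FATAL", "ERROR"}


@dataclass
class Event:
    source: str  # "log" or "metric"
    reason: str  # human-readable text for the notification


def event_from_log(level: str, message: str) -> Optional[Event]:
    """Return an event for Fatal / Error log lines, otherwise None."""
    if level.upper() in NOTIFY_LEVELS:
        return Event("log", f"{level.upper()}: {message}")
    return None


def event_from_metric(name: str, value: float, threshold: float) -> Optional[Event]:
    """Return an event when a metric is strictly above its threshold."""
    if value > threshold:
        return Event("metric", f"{name}={value} exceeds threshold {threshold}")
    return None
```

Since every event triggers a notification, the filtering has to happen before an event is created, which is why both functions return `None` for anything that is not actionable.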

On-call Process

Weekly 1st & 2nd Responsible Engineers' Responsibilities
- Monitoring Channels
  - IMON
  - Trouble Ticket System
  - Prometheus [promgen] (LINE developed)
  - Slack
- About Pager
  - Discussing a company-wide pager & compensation mechanism
- Issue Triage
  - Scope of Users Impacted
    - 5: Very small
    - 4: Small
    - 3: Medium
    - 2: Large - significantly disrupts
    - 1: Very large - completely unusable
  - Severity of Impact
- Issue Response
  1. Record the Issue
    - Ticket, Slack thread
  2. Identify the Scope and Priority of the Issue
    - Issue triage
  3. Identify Possible Actions
    - Scope of impact, expected result, time to see result, alternatives
  4. Decide what Action to Take
    - Constant communication with (other) teams; when possible, do not take action alone
  5. Take Action
    - Tell people what you are doing; make clear who will do it; act and monitor
  6. Report the result
    - Tell people and communicate the outcome
- Outage Report Meeting (within two weeks)
  1. Only the responsible team has to attend; optional for others (the whole company receives the notification)
  2. Focus on flow/system/mechanism rather than people
  3. Can be good lesson to everyone
  4. Action items are required
  5. Prevention of the next outage
  6. Shared in Tech Leader report meeting (Senior managers / senior engineers) (top-down)
  7. Similar to the postmortem in Google SRE book
No finger pointing!
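The "Scope of Users Impacted" scale from issue triage can be written down as data, which makes the ordering explicit: a lower number means a wider impact. This is a sketch only; `UserScope`, `Issue`, and `triage_order` are hypothetical names, and the severity-of-impact axis is left out because the notes do not list its levels.

```python
from dataclasses import dataclass
from enum import IntEnum


class UserScope(IntEnum):
    """Scope of users impacted; 1 is the widest impact, as in the notes."""
    VERY_LARGE = 1  # completely unusable
    LARGE = 2       # significantly disrupts
    MEDIUM = 3
    SMALL = 4
    VERY_SMALL = 5


@dataclass
class Issue:
    title: str
    scope: UserScope


def triage_order(issues):
    """Return issues with the widest user impact first."""
    return sorted(issues, key=lambda issue: issue.scope)
```

For example, `triage_order` places an issue with `UserScope.VERY_LARGE` ahead of one with `UserScope.SMALL`.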

Share Story: SQL Operation

Preventing future outages
- Short-term
  - More reviews before the SQL operation
  - Use docker to set up a Production-cloned DB. Test on that DB first
- Long-term
  - Turn operations into an API or code
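As one possible reading of "operations as code", the sketch below wraps a manual SQL change in a function that rolls back unless the statement touches exactly the expected number of rows. It is an assumption-laden illustration, not the tooling described in the talk: it uses Python's built-in `sqlite3` with an in-memory database in place of a Docker-hosted production clone, and `guarded_update` and the `users` table are hypothetical.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run one UPDATE/DELETE; commit only if it changes exactly expected_rows rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            # Unexpected blast radius (for example a missing WHERE clause): undo it.
            conn.rollback()
            return False
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        raise


# Usage against a throwaway database standing in for the production clone.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO users (status) VALUES (?)",
    [("active",), ("active",), ("banned",)],
)
conn.commit()

# A statement without a WHERE clause would touch 3 rows, so it is rolled back.
applied = guarded_update(conn, "UPDATE users SET status = ?", ("banned",), 1)
```

The expected row count acts as a reviewable statement of intent: a reviewer can check it before the operation runs, and the function enforces it at run time.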



tags: DevOpsDays Taipei 2019
