
LINE SRE Practice from Observability Viewpoint - 洪立遠(Yuan)

Welcome to DevOps Days 2019 Collaborative Notes

Getting started from here: https://hackmd.io/@DevOpsDay/2019

From Distributed Systems Observability by Cindy Sridharan:

  • Observability
    • Metrics: aggregatable
      • Prometheus
    • Logging: distributed events
      • ELK
    • Tracing: request scoped, microservice journey

Outage report meeting

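Monitoring has two layers: the first layer does not raise an alert directly; only the second layer sends a notification.

A ticket may contain many events.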

Log Level Guideline

Log only problems that require a person to act on them.

IMON Notification

All events will trigger notification

  • For logs, only log actionable information
  • For metrics, instrument every meaningful number available
    • RED (Rate, Errors, Duration)
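
A minimal sketch of the two guidelines above, using only the Python standard library. The names `RedMetrics`, `handle`, and the logger name `service` are hypothetical and not from the talk: every request updates the RED counters, while a log line is written only when a failure needs someone's attention.

```python
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger("service")  # hypothetical logger name


@dataclass
class RedMetrics:
    """Rate, Errors, Duration counters for one endpoint (illustrative only)."""
    requests: int = 0                               # Rate: total requests seen
    errors: int = 0                                 # Errors: failed requests
    durations: list = field(default_factory=list)   # Duration: seconds per request

    def observe(self, seconds: float, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(seconds)

    def error_ratio(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def handle(metrics: RedMetrics, work) -> bool:
    """Run `work`, always record metrics, and log only actionable failures."""
    start = time.perf_counter()
    try:
        work()
        ok = True
    except Exception:
        # Actionable: a person has to look at this failure, so it is logged.
        logger.exception("request failed")
        ok = False
    # Metrics are recorded for every request, successful or not.
    metrics.observe(time.perf_counter() - start, ok)
    return ok
```

Successful requests produce no log output here; they are visible only through the counters, which keeps the log stream limited to events that call for action.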

Outage Lifecycle

  • Outage happens
    • Code change
    • Mis-operation
    • Saturation
    • Physical resource
    • Suddenly broke
  • IMON notification
    • Fatal / Error log
    • Metric > threshold
  • Debugging & Solving with IMON
    • On-call team
  • Outage Report Meeting
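
The notification step in the lifecycle above can be sketched as a small rule. This is only an illustration, not IMON's actual logic or API: the `Event` type, its field names, and `should_notify` are hypothetical. It notifies on a Fatal or Error log, or when a metric exceeds its threshold.

```python
from dataclasses import dataclass
from typing import Optional

# Log levels that trigger a notification, per the notes above.
NOTIFY_LEVELS = {"FATAL", "ERROR"}


@dataclass
class Event:
    """A monitoring event (hypothetical shape, not IMON's data model)."""
    source: str
    log_level: Optional[str] = None       # set for log events
    metric_value: Optional[float] = None  # set for metric events
    threshold: Optional[float] = None     # alert threshold for the metric


def should_notify(event: Event) -> bool:
    """Return True for a Fatal/Error log or a metric above its threshold."""
    if event.log_level is not None and event.log_level.upper() in NOTIFY_LEVELS:
        return True
    if event.metric_value is not None and event.threshold is not None:
        return event.metric_value > event.threshold
    return False
```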

On-call Process

  • Weekly 1st & 2nd Responsible Engineers' Responsibilities

    • Monitoring Channels
      • IMON
      • Trouble Ticket System
      • Prometheus with Promgen (developed by LINE)
      • Slack
    • About Pager
      • Discussing a company-wide pager & compensation mechanism
    • Issue Triage
      • Scope of Users Impacted

        • 5: very small
        • 4: small
        • 3: medium
        • 2: large - significantly disrupts
        • 1: very large - completely unusable
      • Severity of Impact

    • Issue Response
      1. Record the Issue
        • Ticket, Slack thread
      2. Identify the Scope and Priority of the Issue
        • Issue triage
      3. Identify Possible Actions
        • Scope of impact, expected result, time to see result, alternatives
      4. Decide what Action to Take
        • Constant communication with (other) teams; when possible, do not take action alone
      5. Take Action
        • Tell people what you are doing; make clear who will do it; act and monitor
      6. Report the result
        • Tell people and keep communicating
    • Outage Report Meeting (within two weeks)
      1. Only the responsible team has to attend; optional for others (the whole company is notified)
      2. Focus on flow/system/mechanism rather than people
      3. Can be good lesson to everyone
      4. Action items are required
      5. Prevention of the next outage
      6. Shared in Tech Leader report meeting (Senior managers / senior engineers) (top-down)
      7. Similar to the postmortem in Google SRE book

    No finger pointing!
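
The issue triage scale from this section can be written down as code. The scope levels below follow the notes; the notes do not spell out the severity scale or how the two scores are combined, so the 1 to 5 severity range and the "take the more urgent of the two" rule in `priority` are assumptions made purely for illustration.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Scope of users impacted, as listed in the notes (1 is the widest)."""
    VERY_LARGE = 1   # completely unusable
    LARGE = 2        # significantly disrupts
    MEDIUM = 3
    SMALL = 4
    VERY_SMALL = 5


def priority(scope: Scope, severity: int) -> int:
    """Combine scope and severity into one priority, where 1 is most urgent.

    Assumption: severity uses the same 1 (worst) to 5 (mildest) scale as
    scope, and the more urgent of the two scores wins.
    """
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return min(int(scope), severity)
```

For example, `priority(Scope.SMALL, 2)` gives 2: few users are affected, but the impact on them is severe enough to raise the priority.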

Share Story: SQL Operation

  • Preventing future outages
    • Short-term
      • More reviews before the SQL operation
      • Use docker to set up a Production-cloned DB. Test on that DB first
    • Long-term
      • Turn operations into an API or code
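
One way to read "operations as code" is to wrap a manual SQL change in a function that checks its own effect before committing. The sketch below is an assumption about what that could look like, not the speaker's implementation: `run_checked_update` is a hypothetical helper, and an in-memory SQLite database stands in for the production-cloned DB.

```python
import sqlite3


def run_checked_update(conn, sql, params, expected_rows):
    """Execute one UPDATE/DELETE and commit only if it touched the expected
    number of rows; otherwise roll back and raise."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, statement affected {cur.rowcount}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        # Undo the change so a surprising statement never reaches the data.
        conn.rollback()
        raise


def make_test_db():
    """Build a throwaway database that plays the role of the cloned DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO users (status) VALUES (?)",
        [("active",), ("active",), ("banned",)],
    )
    conn.commit()
    return conn
```

A statement that forgets its WHERE clause would affect every row, fail the row-count check, and be rolled back. The operator has to state the expected impact up front, which also gives reviewers something concrete to check.
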
tags: DevOpsDays Taipei 2019