Data Architecture and Analysis about OpenTelemetry Observability - 蘇揮原 (Mars)

Welcome to the SRE Conference collaborative notes


Collaborative notes index: https://hackmd.io/@sre-conf/2024

Start here

tags: SRE Conference 2024

01 Background & Problems

Find problems before users do (proactive monitoring)

  1. Server Monitoring
  2. Endpoint Monitoring
  3. Cloud Monitoring
  4. Network Monitoring

What is Observability?

  • clear judgement
  • information

Observability Signal (OpenTelemetry Framework)

  • Metrics
  • Traces
  • Logs
  • Trace data on Jaeger UI
    • Use SQL

Takeaway:

  • It can sit downstream of Prometheus or send data on to Prometheus; the point is processing the telemetry.

Rough Architecture version 1

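  • Uses Kafka as the message queue.
  • Prometheus + Grafana
  • Problems
    • Performance: the more services there are, the more Prometheus instances have to be started
    • Flexibility: poor; more services mean additional Prometheus instances
    • Correctness: time-ordering issues lead to correctness problems
    • Maintenance
  • Need something that can analyze large volumes of data quickly
    • Find patterns
    • An LLM can learn from the data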

Definition of concepts

  • Throughput: Can handle massive data ingestion and queries
  • Cost: Cost Control Strategy
  • Analysis

OLAP Data Warehouse

  • Data cleaning: string formatting/parsing (strip special characters)
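
A minimal sketch of the kind of string cleaning mentioned above, in Python. The talk did not show which characters were removed, so the allow-list here (word characters, whitespace, and `. : / = -`) and the function name `clean_field` are assumptions for illustration only.

```python
import re
import unicodedata

# Assumed allow-list: word characters, whitespace, and a few separators.
# Anything else is treated as a "special character" and dropped.
_DISALLOWED = re.compile(r"[^\w\s.:/=-]")

def clean_field(raw: str) -> str:
    """Normalize one raw string field before loading it into the warehouse."""
    # Fold full-width and other compatibility forms into canonical ones.
    text = unicodedata.normalize("NFKC", raw)
    # Turn every whitespace run (tabs, newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop control and format characters (Unicode categories starting with "C").
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    # Remove everything outside the allow-list.
    text = _DISALLOWED.sub("", text)
    # Collapse spaces left behind by the removals and trim the ends.
    return " ".join(text.split())
```

For example, `clean_field("  payment-service\t[ERROR]\x07 timeout!! ")` keeps the service name and message words but drops the brackets, the bell character, and the exclamation marks.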

ClickHouse

  • ETL: messages are cleaned and processed before loading (technical logic)
  • ELT: presentation of the loaded data (business logic)

ref. https://aws.amazon.com/tw/compare/the-difference-between-etl-and-elt/
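
The split noted above (cleaning as technical logic, presentation as business logic) can be sketched in Python with in-memory data. This is an illustration only: the record fields, `etl_clean`, and `elt_average_latency` are hypothetical names, and a real pipeline would run the second stage as a query inside the warehouse rather than in Python.

```python
import json
from collections import defaultdict

# Hypothetical raw telemetry lines, as they might arrive from a queue.
RAW_LINES = [
    '{"service": " Checkout ", "latency_ms": "120"}',
    '{"service": "checkout", "latency_ms": "80"}',
    '{"service": "Search", "latency_ms": "40"}',
]

def etl_clean(raw_lines):
    """ETL stage (technical logic): parse and normalize before loading."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            "service": record["service"].strip().lower(),
            "latency_ms": int(record["latency_ms"]),
        })
    return rows

def elt_average_latency(table):
    """ELT stage (business logic): aggregate rows that are already loaded."""
    totals = defaultdict(lambda: [0, 0])  # service -> [sum, count]
    for row in table:
        totals[row["service"]][0] += row["latency_ms"]
        totals[row["service"]][1] += 1
    return {svc: total / count for svc, (total, count) in totals.items()}

warehouse_table = etl_clean(RAW_LINES)          # load cleaned rows
report = elt_average_latency(warehouse_table)   # transform after loading
```

Keeping the cleaning step separate means the aggregation can change (a different report, a different grouping) without touching how the data is ingested.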

ClickHouse Keeper does better than Apache ZooKeeper on CPU and memory usage.

ref.
https://clickhouse.com/blog/clickhouse-keeper-a-zookeeper-alternative-written-in-cpp

Write: Local table, (node1 node2 node3 node4)
Read: Distributed table, Show aggregated data directly
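
A rough Python simulation of the write/read split above: each write lands in one node's local table, and a read fans out to every node and merges the partial results. The sharding key (`trace_id`), the CRC32 routing, and the function names are assumptions for illustration; the talk did not say how rows were assigned to nodes.

```python
import zlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4"]

def new_cluster():
    """One empty local table per node."""
    return {node: [] for node in NODES}

def write_local(cluster, row):
    """Write path: pick one node and append to its local table only."""
    # crc32 gives a stable hash across runs (the built-in hash() of str does not).
    index = zlib.crc32(row["trace_id"].encode("utf-8")) % len(NODES)
    node = NODES[index]
    cluster[node].append(row)
    return node

def read_distributed(cluster):
    """Read path: aggregate on every node, then merge the partial counts."""
    merged = Counter()
    for node in NODES:
        partial = Counter(row["service"] for row in cluster[node])
        merged.update(partial)
    return dict(merged)

cluster = new_cluster()
for i in range(10):
    write_local(cluster, {"trace_id": f"trace-{i}", "service": "checkout" if i % 2 else "search"})
span_counts = read_distributed(cluster)
```

The caller sees one aggregated result and does not need to know which node holds which rows.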

Feature: LSM tree (suits time-ordered data)
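
To make the LSM idea concrete, here is a generic toy LSM tree in Python: writes go to an in-memory buffer, full buffers are flushed as sorted immutable runs, and a compaction step merges runs. This is a textbook-style sketch under assumed names (`TinyLSM`, `memtable_limit`), not a description of how ClickHouse stores data internally.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: in-memory buffer plus sorted, immutable runs."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # newest writes, unsorted
        self.runs = []       # sorted lists of (key, value), oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted run and start a new buffer."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the buffer first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # (key,) sorts just before any (key, value) pair with the same key.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value per key."""
        merged = {}
        for run in self.runs:  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Appending in key order is cheap in this layout, which is why it tends to fit timestamped telemetry: new rows mostly arrive with increasing keys and old runs are rarely rewritten except during compaction.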

  • Takeaway
    Kafka + OpenTelemetry (ETL) -> ClickHouse Cluster (ELT) -> Grafana (view)

Rough Data Version 2 (ClickHouse as storage)

  • Performance: replace Prometheus with ClickHouse
  • Flexibility: improved performance and elastic scaling 👍
  • Maintenance: only partly improved (management became harder)
  • Disk usage down about 90% (~1TB -> 87GB)
  • Query performance:
    • (short period) much faster (OpenSearch: 227ms, ClickHouse: 3xms?)
    • (long period) other backends (e.g., Prometheus) may time out before responding
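
As a quick check of the disk figure above, the reduction from roughly 1TB to 87GB can be worked out in Python. Whether the speaker meant 1000GB or 1024GB per TB is not stated, so both are computed; the function name is illustrative.

```python
def reduction_percent(before_gb: float, after_gb: float) -> float:
    """Percentage of disk space saved going from before_gb to after_gb."""
    return (1 - after_gb / before_gb) * 100

decimal_tb = round(reduction_percent(1000, 87), 1)  # 1TB = 1000GB
binary_tb = round(reduction_percent(1024, 87), 1)   # 1TB = 1024GB
ratio = round(1000 / 87, 1)                         # how many times smaller

print(decimal_tb, binary_tb, ratio)  # 91.3 91.5 11.5
```

Either way the saving is slightly above 91%, consistent with the "about 90%" in the notes, and the stored data is roughly 11.5 times smaller.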

ML/AI

Ref: https://clickhouse.com/blog/powering-featurestores-with-clickhouse


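=Chat=
That went really fast. The slides will be shared afterward, right?
+1, need the slides

If it's not too much trouble, I'd love to have the PPT (making a wish)

Hi, All. They have already been provided at the very top~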
