Low-resource monitoring stack
===
## **Monitoring**
In my current job I am a MarkLogic administrator, which is quite a complex role. In the version we are currently running, monitoring MarkLogic events is a persistent problem, as the vendor itself admits that its standard metrics system may show unrealistic data. A few months ago I decided to research the options for monitoring MarkLogic with independent (open-source) applications. The product is not widely known, so there is not much to choose from, and I ended up building my own stack of monitoring components.
Proper application monitoring should track every event on the server where the application is hosted. Applications on UNIX-like systems typically use the system library that passes messages to syslog. So, to be sure the application is healthy, we should collect all application events, system logs, and system metrics.
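Every syslog message starts with a priority (PRI) value that encodes the facility and severity of the event; the Rsyslog forwarding template used later in this post re-emits it as `<%PRI%>`. As a quick illustration (a Python sketch, with only a few of the standard facility/severity codes listed):

```python
# Syslog priority arithmetic per RFC 5424: PRI = facility * 8 + severity.
# Only a handful of the standard codes are listed here for illustration.
FACILITIES = {"kern": 0, "user": 1, "daemon": 3, "local0": 16}
SEVERITIES = {"emerg": 0, "err": 3, "info": 6, "debug": 7}

def pri(facility: str, severity: str) -> int:
    # The PRI value that appears as "<NNN>" at the start of a syslog message.
    return FACILITIES[facility] * 8 + SEVERITIES[severity]

print(pri("local0", "info"))  # 134, i.e. the message starts with "<134>"
print(pri("daemon", "err"))   # 27
```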
The main goal was to create a solution that uses few system resources and integrates flexibly with other monitoring components. I have divided the description into two parts: first the components for logs, then the components for monitoring system metrics.
### **Logs**
To redirect log streams I chose Rsyslog, which has a whole list of advantages and is available on most UNIX distributions. I configured Rsyslog on all servers where the application runs, so that all system and application logs are redirected to a central Rsyslog server. Our production environment is divided into several projects, so the logs arriving at the central server are already pre-filtered and sorted.
The central Rsyslog server is configured to listen on a specific port for messages from all Rsyslog clients. Each message is written to a file according to a template and sorted into folders based on the hostname of the client that sent it. Writing to files was not strictly necessary; I could have redirected the logs directly to any application that indexes log streams. However, this initial segregation and saving makes it easier to configure the next monitoring components and gives me many options for archiving old messages.
<center>
<img src="https://i.imgur.com/9OBM6uo.png" alt="" loading="lazy">
</center>
##### **Client configuration:**
:::spoiler `/etc/rsyslog.d/client-collector.conf`
```c=
####################################
############# INIT MODULE ##########
####################################
$ModLoad imfile
$InputFilePollInterval 1
####################################
########## INIT STATIC LOG #########
####################################
input(type="imfile" File="/MarkLogicLogs/TaskServer_ErrorLog.txt" Tag="ml_TSERR" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/TaskServer_RequestLog.txt" Tag="ml_TSREQ" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/AuditLog.txt" Tag="ml_AUDIT" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/CrashLog.txt" Tag="ml_CRASH" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/ErrorLog.txt" Tag="ml_ERROR" Ruleset="FileNameTAG" addMetadata="on")
####################################
########## INIT WILDCARDS ##########
####################################
input(type="imfile" File="/MarkLogicLogs/7*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/8*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/9*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
###################################
############# TEMPLATE ############
###################################
template(name="LongTagForwardFormat" type="string" string="<%PRI%>%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag%%$.suffix:1:6:%%msg:::sp-if-no-1st-sp%%msg%")
###################################
############# RULESET #############
###################################
ruleset(name="FileNameRegex") {
    set $.suffix = re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "all.txt");
    call sendToLogserver
}
ruleset(name="FileNameTAG") {
    call sendToLogserver
}
ruleset(name="sendToLogserver") {
    action(type="omfwd" Target="centralhostname.com" Port="514" template="LongTagForwardFormat" Protocol="tcp")
}
###################################
########## END-CFG ################
###################################
```
:::
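The `FileNameRegex` ruleset above uses `re_extract` with the pattern `(.*)/([^/]*)` to pull the bare file name out of the full log path (with `all.txt` as the fallback when nothing matches). A Python model of that extraction (rsyslog uses POSIX regular expressions, but Python's `re` behaves the same for this pattern):

```python
import re

def extract_suffix(path: str) -> str:
    # Mirror of re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "all.txt"):
    # the greedy (.*) consumes everything up to the last "/", so capture
    # group 2 is the bare file name; "all.txt" is the no-match fallback.
    m = re.search(r"(.*)/([^/]*)", path)
    return m.group(2) if m else "all.txt"

print(extract_suffix("/MarkLogicLogs/7001_ErrorLog.txt"))  # 7001_ErrorLog.txt
print(extract_suffix("no-slash-here"))                     # all.txt
```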
##### **Central server configuration:**
:::spoiler `/etc/rsyslog.d/central-collector.conf`
```c=
####################################
########## JSON FILES ##############
####################################
lookup_table(name="def_host" file="/etc/rsyslog.d/def_name.json" reloadOnHUP="on")
####################################
########## INIT MODULE #############
####################################
module(load="imtcp")
module(load="mmrm1stspace")
input(type="imtcp" port="514")
$FileOwnerNum 1000
$FileGroupNum 1000
$DirGroupNum 1000
$DirOwnerNum 1000
$FileCreateMode 0644
$DirCreateMode 0755
###################################
########## TEMPLATE ###############
###################################
template(name="RemoteLogSavePath" type="list") {
    constant(value="/opt/monitoring/logs/")
    property(name="$.envTYPE")
    constant(value="/")
    property(name="$.clusterID")
    constant(value="/")
    property(name="$.LogMainFileName")
}
template(name="LogResultMSG" type="list") {
    property(name="$.def_hostid")
    constant(value=" ")
    property(name="$.LogMSGFileName")
    property(name="msg" compressspace="on" spifno1stsp="on")
    property(name="msg")
    constant(value="\n")
}
###################################
########## ACTION #################
###################################
set $.clusterDB = lookup("def_host", $hostname);
if ($.clusterDB == "") then {
    set $.envTYPE = "OTHER";
    set $.clusterID = $hostname;
    set $.def_hostid = $hostname;
} else {
    set $.envTYPE = substring($.clusterDB, 0, 3);
    set $.clusterID = substring($.clusterDB, 0, 4);
    set $.def_hostid = $.clusterDB;
}
action(type="mmrm1stspace")
set $.LookUpJSON = lookup("def_host", substring($programname, 3, 4));
if ($.LookUpJSON == "") then {
    set $.LogMainFileName = $programname;
    set $.LogMSGFileName = $programname;
} else {
    set $.LogMainFileName = $.LookUpJSON;
    set $.LogMSGFileName = $programname;
}
action(type="omfile" DynaFile="RemoteLogSavePath" Template="LogResultMSG")
stop
###################################
########## END-CFG ################
###################################
```
:::
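The ACTION section of the central config derives three per-message variables (`envTYPE`, `clusterID`, `def_hostid`) from the `def_host` lookup, and they in turn drive the dynamic save path. A Python sketch of that branching (the `classify` function name and dict-based table are mine, for illustration only):

```python
def classify(hostname: str, def_host: dict) -> tuple:
    """Model of the rsyslog ACTION block: map a client hostname to
    (envTYPE, clusterID, def_hostid) via the def_host lookup table."""
    entry = def_host.get(hostname, "")
    if entry == "":
        # Unknown host: logs are filed under OTHER/<hostname>/
        return ("OTHER", hostname, hostname)
    # Known host: envTYPE = first 3 chars, clusterID = first 4 chars
    # (mirrors substring($.clusterDB, 0, 3) and substring($.clusterDB, 0, 4))
    return (entry[:3], entry[:4], entry)

def_host = {"apphostname1.com": "SERV_1"}
print(classify("apphostname1.com", def_host))  # ('SER', 'SERV', 'SERV_1')
print(classify("unknown.com", def_host))       # ('OTHER', 'unknown.com', 'unknown.com')
```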
##### **Defined names (file structure example)**
:::spoiler `/etc/rsyslog.d/def_name.json`
```json=
{ "version": 1,
  "type": "string",
  "table": [
    {"index": "apphostname1.com", "value": "SERV_1"},
    {"index": "apphostname2.com", "value": "SERV_2"},
    {"index": "apphostname3.com", "value": "SERV_3"},
    {"index": "8001", "value": "ml_ADMIN"}
]}
```
:::
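Standard JSON forbids a trailing comma after the last table row, and a malformed file can prevent rsyslog from loading the lookup table at all, so it is worth validating the file before a reload. A quick hedged sanity check in Python (the table is inlined here; in practice you would read `/etc/rsyslog.d/def_name.json`):

```python
import json

# Inline copy of the lookup-table structure for illustration.
table_text = '''
{ "version": 1,
  "type": "string",
  "table": [
    {"index": "apphostname1.com", "value": "SERV_1"},
    {"index": "8001", "value": "ml_ADMIN"}
]}'''

table = json.loads(table_text)  # raises ValueError on malformed JSON
lookup = {row["index"]: row["value"] for row in table["table"]}
print(lookup["8001"])  # ml_ADMIN
```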
---
### **Promtail/Loki/Grafana Compose Configuration:**
The logs preprocessed by Rsyslog are already on the central server, so the next step is to configure the indexing/search and visualization software. In my solution I used the Promtail-Loki-Grafana stack, configured as a cluster on Docker. I run three Loki instances behind an Nginx gateway that routes the read and write loads from the clients (Grafana, Promtail), which makes the front-end application run much more smoothly.
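For reference, the write path the gateway exposes is Loki's push API (`POST /loki/api/v1/push`). A minimal sketch of the JSON body a client such as Promtail sends, with illustrative label names only:

```python
import json
import time

def build_push_payload(labels: dict, lines: list) -> str:
    # Loki push API body: a list of streams, each carrying a label set
    # and [timestamp_in_nanoseconds, log_line] pairs.
    ts = str(time.time_ns())
    return json.dumps({
        "streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]
    })

body = build_push_payload({"job": "MarkLogic", "type": "ml"}, ["ErrorLog line"])
# POST this body to http://loki-gateway:80/loki/api/v1/push with
# Content-Type: application/json (and the X-Scope-OrgID header when
# multi-tenancy is enabled, as it is in the nginx config below).
```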

##### **Configuration of docker-compose**
:::spoiler `docker-compose-ha.yaml`
```yaml=
version: "3.8"
services:
  grafana:
    image: grafana/grafana:7.5.6
    ports:
      - "3000:3000"
    networks:
      - loki
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
  promtail:
    image: grafana/promtail:2.2.1
    volumes:
      - /opt/monitoring/logs:/var/log
      - ./config:/etc/promtail/
    ports:
      - "9080:9080"
    command: -config.file=/etc/promtail/promtail.yaml
    networks:
      - loki
  loki-gateway:
    image: nginx:1.19
    volumes:
      - ./config/nginx-loki.conf:/etc/nginx/nginx.conf
    ports:
      - "80"
      - "3100"
    networks:
      - loki
  loki-frontend:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
    ports:
      - "3100"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=query-frontend"
    networks:
      - loki
    deploy:
      mode: replicated
      replicas: 2
  loki-1:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
  loki-2:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
  loki-3:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
networks:
  loki:
```
:::
:::spoiler `promtail.yaml`
```yaml=
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: "debug"
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki-gateway:80/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - labels:
          job: MarkLogic
          __path__: /S*/*
          type: ml
      - labels:
          job: System
          __path__: /var/log/*log
          type: srv
```
:::
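Promtail's `__path__` values are glob patterns matched against file paths. A rough Python model with `fnmatch` (Promtail itself uses Go's doublestar globbing, which behaves the same for simple patterns like these):

```python
import fnmatch

candidates = ["/var/log/syslog", "/var/log/messages", "/var/log/auth.log"]
# "/var/log/*log" matches any file under /var/log whose name ends in "log"
matched = [p for p in candidates if fnmatch.fnmatch(p, "/var/log/*log")]
print(matched)  # ['/var/log/syslog', '/var/log/auth.log']
```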
:::spoiler `nginx-loki.conf`
```nginx=
error_log /dev/stderr;
pid /tmp/nginx.pid;
worker_rlimit_nofile 8192;

events {
  worker_connections 4096; ## Default: 1024
}

http {
  default_type application/octet-stream;
  log_format main '$remote_addr - $remote_user [$time_local] $status '
                  '"$request" $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';
  access_log /dev/stderr main;
  sendfile on;
  tcp_nopush on;

  upstream distributor {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream querier {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream query-frontend {
    server loki-frontend:3100;
  }

  server {
    listen 80;
    proxy_set_header X-Scope-OrgID docker-ha;
    location = /loki/api/v1/push {
      proxy_pass http://distributor$request_uri;
    }
    location = /ring {
      proxy_pass http://distributor$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass http://querier$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass http://query-frontend$request_uri;
    }
  }

  server {
    listen 3100;
    proxy_set_header X-Scope-OrgID docker-ha;
    location ~ /loki/api/.* {
      proxy_pass http://querier$request_uri;
    }
  }
}
```
:::
:::spoiler `loki-memberlist.yaml`
```yaml=
auth_enabled: false
http_prefix:
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: debug
memberlist:
  join_members: ["loki-1", "loki-2", "loki-3"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    ring:
      replication_factor: 2
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 0
  chunk_block_size: 262144
schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  filesystem:
    directory: /loki/chunks
limits_config:
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 30m
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
chunk_store_config:
  max_look_back_period: 336h
table_manager:
  retention_deletes_enabled: true
  retention_period: 336h
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h
frontend:
  log_queries_longer_than: 5s
  downstream_url: http://loki-gateway:3100
  compress_responses: true
querier:
  query_ingesters_within: 2h
```
:::
---
### **Metrics monitoring**
For monitoring system metrics I chose the following components:
- **Telegraf** is a plugin-driven server agent for collecting and reporting metrics and events from databases, systems, and IoT sensors.
- **InfluxDB** is an open-source time-series database written in Go, optimized for fast, high-availability storage and retrieval of time-series data. It works well for operations monitoring, application metrics, and real-time analytics.

Telegraf is installed on every server hosting the monitored application. Its configuration file tells the agent to collect system metrics and send them to the central server, where the InfluxDB container is running.
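Under the hood, Telegraf ships each measurement to InfluxDB in the line protocol format. A simplified Python sketch of that format (real Telegraf also escapes special characters and types the field values, which is omitted here):

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    # InfluxDB line protocol: measurement,tag=val[,tag=val] field=val[,field=val] timestamp
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_set} {field_set} {ts_ns}"

print(to_line_protocol("cpu", {"host": "influxdb"}, {"usage_idle": 97.2}, 1_000_000_000))
# cpu,host=influxdb usage_idle=97.2 1000000000
```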
##### **Configuring Telegraf on the application server side**
:::spoiler `telegraf.conf`
```toml=
[global_tags]

# Configuration for the telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = "influxdb"
  omit_hostname = false

# InfluxDB v2 output
[[outputs.influxdb_v2]]
  urls = ["$INFLUX_HOST"]
  token = "$TELEGRAF_WRITE_TOKEN"
  organization = "$INFLUX_ORG"
  bucket = "$INFLUX_BUCKET"

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  # no configuration

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  # no configuration

[[inputs.netstat]]
  # no configuration

[[inputs.interrupts]]
  # no configuration

[[inputs.linux_sysctl_fs]]
  # no configuration
```
:::
---
##### **Configuring InfluxDB on the central server side**
The InfluxDB service definition was added to the `docker-compose-ha.yaml` file:
:::spoiler `docker-compose-ha.yaml`
```yaml=
  influxdb:
    container_name: influxdb
    image: influxdb:2.0.6
    restart: always
    networks:
      loki:
    ports:
      - 8086:8086
    env_file:
      - './config/influxdb.env'
    volumes:
      - type: "volume"
        source: influxdb2-data
        target: /var/lib/influxdb/

volumes:
  influxdb2-data:
    name: influxdb2-data
```
:::
:::spoiler `influxdb.env`
```bash=
INFLUXDB_DATA_ENGINE=tsm1
INFLUXDB_REPORTING_DISABLED=false
INFLUX_DB=metrics
INFLUXDB_ADMIN_ENABLED=true
INFLUX_USERNAME=xxx
INFLUX_PASSWORD=yyy
INFLUX_RETENTION=0
```
:::
---
### **Summary**
At the very end, all that remains is to start all the components and connect the data sources in Grafana from the browser. This solution gives us the ability to monitor any application, both on our own server infrastructure and in the cloud.
### **Functional diagram of the complete solution**
