
Low-resource monitoring stack

Monitoring

In my current job I am a MarkLogic administrator, and MarkLogic is quite a complex product. In the version we are currently running, monitoring MarkLogic events is a persistent need, as the vendor itself admits that its standard metrics system may show unrealistic data. A few months ago I decided to research the possibilities of monitoring MarkLogic with independent (open-source) applications. The product is not very well known, so there is not much choice, and I ended up building my own stack of monitoring components.

Proper application monitoring should cover all events on the server where the application is hosted. Applications running on UNIX systems should use the system logging library, which passes messages to syslog. So, to be sure the application is healthy, we should collect all application events, system logs, and system metrics.
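Before wiring any collectors in, it is worth checking that the local syslog path works end to end. A quick sanity check with the standard logger(1) utility (the ml_TEST tag is just an illustrative example, not something MarkLogic emits):

# Send a test message to the local syslog socket with an arbitrary tag
logger -t ml_TEST "monitoring pipeline test message"
# The message should then appear in the local log
tail -n 5 /var/log/messages   # or /var/log/syslog on Debian/Ubuntu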

The main goal was to create a solution that does not consume many system resources and integrates flexibly with other monitoring components. I have divided the description of the solution into two parts: the first covers the components for logs, and the second covers the components for monitoring system metrics.

Logs

For redirecting log streams I chose Rsyslog, which has a whole list of advantages and is available on most UNIX distributions. I configured Rsyslog on every server where the application runs so that all system and application logs are redirected to a central Rsyslog server. Our production environment is divided into several projects, so the logs arriving at the central server are already pre-filtered and sorted.

The central Rsyslog server is configured to listen on a specific port and receive messages from all Rsyslog clients. Each message is written to a file according to a template and sorted into folders based on the hostname of the client that sent it. Writing to files is not strictly necessary; I could have redirected the logs directly to any application that indexes log streams. However, this initial segregation and saving makes it easier to configure the remaining monitoring components and leaves many options for archiving old messages.


Client configuration:

/etc/rsyslog.d/client-collector.conf
####################################
############# INIT MODULE ##########
####################################
$ModLoad imfile
$InputFilePollInterval 1

####################################
########## INIT STATIC LOG #########
####################################
# MarkLogic logs with fixed file names get a fixed tag
input(type="imfile" File="/MarkLogicLogs/TaskServer_ErrorLog.txt" Tag="ml_TSERR" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/TaskServer_RequestLog.txt" Tag="ml_TSREQ" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/AuditLog.txt" Tag="ml_AUDIT" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/CrashLog.txt" Tag="ml_CRASH" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/ErrorLog.txt" Tag="ml_ERROR" Ruleset="FileNameTAG" addMetadata="on")

####################################
########## INIT WILDCARDS ##########
####################################
# Per-app-server logs (named after the port number) get the db_ tag prefix
input(type="imfile" File="/MarkLogicLogs/7*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/8*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/9*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")

###################################
############# TEMPLATE ############
###################################
# Forwarding format: appends the first 6 chars of the source file name to the tag
template(name="LongTagForwardFormat" type="string"
         string="<%PRI%>%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag%%$.suffix:1:6:%%msg:::sp-if-no-1st-sp%%msg%")

###################################
############# RULESET #############
###################################
ruleset(name="FileNameRegex") {
    # Extract the bare file name from the full path in the imfile metadata
    set $.suffix=re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "all.txt");
    call sendToLogserver
}
ruleset(name="FileNameTAG") {
    call sendToLogserver
}
ruleset(name="sendToLogserver") {
    action(type="omfwd" Target="centralhostname.com" Port="514" template="LongTagForwardFormat" Protocol="tcp")
}
###################################
########## END-CFG ################
###################################
Central server configuration:
/etc/rsyslog.d/central-collector.conf
####################################
########## JSON FILES ##############
####################################
# Lookup table mapping hostnames and port numbers to friendly names
lookup_table(name="def_host" file="/etc/rsyslog.d/def_name.json" reloadOnHUP="on")

####################################
########## INIT MODULE #############
####################################
module(load="imtcp")
module(load="mmrm1stspace")
input(type="imtcp" port="514")

$FileOwnerNum 1000
$FileGroupNum 1000
$DirGroupNum 1000
$DirOwnerNum 1000
$FileCreateMode 0644
$DirCreateMode 0755

###################################
########## TEMPLATE ###############
###################################
# Dynamic save path: /opt/monitoring/logs/<envTYPE>/<clusterID>/<LogMainFileName>
template(name="RemoteLogSavePath" type="list") {
    constant(value="/opt/monitoring/logs/")
    property(name="$.envTYPE")
    constant(value="/")
    property(name="$.clusterID")
    constant(value="/")
    property(name="$.LogMainFileName")
}
template(name="LogResultMSG" type="list") {
    property(name="$.def_hostid")
    constant(value=" ")
    property(name="$.LogMSGFileName")
    property(name="msg" compressspace="on" spifno1stsp="on")
    property(name="msg")
    constant(value="\n")
}

###################################
########## ACTION #################
###################################
# Resolve the sending host to its defined name (or fall back to the raw hostname)
set $.clusterDB = lookup("def_host", $hostname);
if ($.clusterDB == "") then {
    set $.envTYPE = "OTHER";
    set $.clusterID = $hostname;
    set $.def_hostid = $hostname;
} else {
    set $.envTYPE = substring($.clusterDB, 0, 3);
    set $.clusterID = substring($.clusterDB, 0, 4);
    set $.def_hostid = $.clusterDB;
}
action(type="mmrm1stspace")
# Map a port number embedded in the tag (e.g. db_8001...) to a log file name
set $.LookUpJSON = lookup("def_host", substring($programname, 3, 4));
if ($.LookUpJSON == "") then {
    set $.LogMainFileName = $programname;
    set $.LogMSGFileName = $programname;
} else {
    set $.LogMainFileName = $.LookUpJSON;
    set $.LogMSGFileName = $programname;
}
action(type="omfile" DynaFile="RemoteLogSavePath" Template="LogResultMSG")
stop
###################################
########## END-CFG ################
###################################
Defined names (file structure example)
/etc/rsyslog.d/def_name.json
{ "version" : 1, "type" : "string", "table" : [ {"index" : "apphostname1.com", "value" : "SERV_1"}, {"index" : "apphostname2.com", "value" : "SERV_2"}, {"index" : "apphostname3.com", "value" : "SERV_3"}, {"index" : "8001", "value" : "ml_ADMIN"}, ]}

Promtail/Loki/Grafana Compose Configuration:

The logs preprocessed by Rsyslog are already on the central server, so the next step is to configure the indexing/search and visualization software. In my solution I used the Promtail-Loki-Grafana stack, configured as a cluster on Docker. I used three Loki instances behind an Nginx gateway that routes the read and write load from the clients (Grafana, Promtail). As a result, the front-end application runs much more smoothly.


Configuration of docker-compose
docker-compose-ha.yaml
version: "3.8" services: grafana: image: grafana/grafana:7.5.6 ports: - "3000:3000" networks: - loki deploy: replicas: 1 placement: constraints: - node.role == manager restart_policy: condition: on-failure promtail: image: grafana/promtail:2.2.1 volumes: - /opt/monitoring/logs:/var/log - ./config:/etc/promtail/ ports: - "9080:9080" command: -config.file=/etc/promtail/promtail.yaml networks: - loki loki-gateway: image: nginx:1.19 volumes: - ./config/nginx-loki.conf:/etc/nginx/nginx.conf ports: - "80" - "3100" networks: - loki loki-frontend: image: grafana/loki:2.2.1 volumes: - ./config:/etc/loki/ ports: - "3100" command: "-config.file=/etc/loki/loki-memberlist.yaml -target=query-frontend" networks: - loki deploy: mode: replicated replicas: 2 loki-1: image: grafana/loki:2.2.1 volumes: - ./config:/etc/loki/ - ./chunks:/loki/chunks/ ports: - "3100" - "7946" command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all" networks: - loki restart: on-failure loki-2: image: grafana/loki:2.2.1 volumes: - ./config:/etc/loki/ - ./chunks:/loki/chunks/ ports: - "3100" - "7946" command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all" networks: - loki restart: on-failure loki-3: image: grafana/loki:2.2.1 volumes: - ./config:/etc/loki/ - ./chunks:/loki/chunks/ ports: - "3100" - "7946" command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all" networks: - loki restart: on-failure networks: loki:
promtail.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: "debug"

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki-gateway:80/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - labels:
          job: MarkLogic
          __path__: /S*/*
          type: ml
      - labels:
          job: System
          __path__: /var/log/*log
          type: srv
nginx-loki.conf
error_log /dev/stderr;
pid /tmp/nginx.pid;
worker_rlimit_nofile 8192;

events {
  worker_connections 4096;  ## Default: 1024
}

http {
  default_type application/octet-stream;
  log_format main '$remote_addr - $remote_user [$time_local] $status '
                  '"$request" $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';
  access_log /dev/stderr main;
  sendfile on;
  tcp_nopush on;

  upstream distributor {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream querier {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream query-frontend {
    server loki-frontend:3100;
  }

  server {
    listen 80;
    proxy_set_header X-Scope-OrgID docker-ha;
    location = /loki/api/v1/push {
      proxy_pass http://distributor$request_uri;
    }
    location = /ring {
      proxy_pass http://distributor$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass http://querier$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass http://query-frontend$request_uri;
    }
  }

  server {
    listen 3100;
    proxy_set_header X-Scope-OrgID docker-ha;
    location ~ /loki/api/.* {
      proxy_pass http://querier$request_uri;
    }
  }
}
loki-memberlist.yaml
auth_enabled: false
http_prefix:

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: debug

memberlist:
  join_members: ["loki-1", "loki-2", "loki-3"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946

ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    ring:
      replication_factor: 2
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 0
  chunk_block_size: 262144

schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  filesystem:
    directory: /loki/chunks

limits_config:
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 30m
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

chunk_store_config:
  max_look_back_period: 336h

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h

frontend:
  log_queries_longer_than: 5s
  downstream_url: http://loki-gateway:3100
  compress_responses: true

querier:
  query_ingesters_within: 2h
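Once everything is up and the Loki data source in Grafana points at the gateway, the logs can be queried with LogQL from the Explore view. A few illustrative queries against the labels defined in promtail.yaml above (the available label values depend on what Promtail actually scraped):

{job="MarkLogic"}                 # all MarkLogic log streams
{job="MarkLogic"} |= "Error"      # only lines containing "Error"
{job="System", type="srv"}        # system logs collected from /var/log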

Metrics monitoring

I chose the following components to monitor system metrics:

  • Telegraf is a plugin-driven server agent for collecting and reporting metrics and events from databases, systems, and IoT sensors.
  • InfluxDB is an open-source time-series database written in Go, optimized for fast, high-availability storage and retrieval of time-series data. It works well for operations monitoring, application metrics, and real-time analytics.

Telegraf is installed on all servers running the monitored application. The configuration file tells the agent to collect system metrics and send them to the central server, where the InfluxDB container is running.

Configuring Telegraf on the application server side
telegraf.conf
[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = "influxdb"
  omit_hostname = false

# InfluxDB v2
[[outputs.influxdb_v2]]
  urls = ["$INFLUX_HOST"]
  token = "$TELEGRAF_WRITE_TOKEN"
  organization = "$INFLUX_ORG"
  bucket = "$INFLUX_BUCKET"

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  # no configuration

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  # no configuration

[[inputs.netstat]]
  # no configuration

[[inputs.interrupts]]
  # no configuration

[[inputs.linux_sysctl_fs]]
  # no configuration
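Before enabling the agent as a service, the configuration can be verified locally: telegraf --test runs the input plugins once and prints the gathered metrics to stdout without writing anything to InfluxDB. The environment variables referenced in the output section must be set first (the values below are placeholders, not the ones from my environment):

export INFLUX_HOST="http://centralhostname.com:8086"   # example value
export INFLUX_ORG="example-org"                        # example value
export INFLUX_BUCKET="metrics"
export TELEGRAF_WRITE_TOKEN="<write token>"
telegraf --config /etc/telegraf/telegraf.conf --test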

Configuring InfluxDB on the central server side

InfluxDB settings were added to the docker-compose-ha.yaml file

docker-compose-ha.yaml
influxdb:
  container_name: influxdb
  image: influxdb:2.0.6
  restart: always
  networks:
    loki:
  ports:
    - 8086:8086
  env_file:
    - './config/influxdb.env'
  volumes:
    - type: "volume"
      source: influxdb2-data
      target: /var/lib/influxdb/

volumes:
  influxdb2-data:
    name: influxdb2-data
influxdb.env
INFLUXDB_DATA_ENGINE=tsm1
INFLUXDB_REPORTING_DISABLED=false
INFLUX_DB=metrics
INFLUXDB_ADMIN_ENABLED=true
INFLUX_USERNAME=xxx
INFLUX_PASSWORD=yyy
INFLUX_RETENTION=0
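Once metrics are flowing, they can be inspected with a Flux query, either in the InfluxDB UI on port 8086 or through an InfluxDB data source in Grafana. A sketch, assuming the bucket is named metrics as the env file suggests:

from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_idle")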

Summary

At the very end, all that remains is to run all the components and connect the data sources in Grafana from the browser. This solution gives us the ability to monitor any application, both on our own server infrastructure and in the cloud.
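Instead of clicking the data sources together, they can also be provisioned from a file that Grafana reads at startup (mounted into /etc/grafana/provisioning/datasources/). A minimal sketch; the file name is hypothetical, and the X-Scope-OrgID value must match the header set in nginx-loki.conf:

grafana-datasources.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway:80
    jsonData:
      httpHeaderName1: 'X-Scope-OrgID'
    secureJsonData:
      httpHeaderValue1: 'docker-ha'
  - name: InfluxDB
    type: influxdb
    url: http://influxdb:8086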

Functional diagram of the complete solution
