Low-resource monitoring stack
===
## **Monitoring**
In my current job I am a MarkLogic administrator, which is quite a complex role. In the version we are currently running, monitoring MarkLogic events is a persistent problem, as the vendor itself admits that its standard metrics system may show unrealistic data. A few months ago I decided to research the options for monitoring MarkLogic with independent (open-source) applications. The product is not widely known, so there is not much to choose from, and I ended up building my own stack of monitoring components.
Proper application monitoring should track every event on the server where the application is hosted. Applications on UNIX-like systems typically use the system library that passes messages to syslog. So, to be sure the application is healthy, we should collect all application events, system logs, and system metrics.
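Every syslog message starts with a priority (PRI) value that encodes the facility and severity of the event; the Rsyslog forwarding template used later in this post re-emits it as `<%PRI%>`. As a quick illustration (a Python sketch, with only a few of the standard facility/severity codes listed):

```python
# Syslog priority arithmetic per RFC 5424: PRI = facility * 8 + severity.
# Only a handful of the standard codes are listed here for illustration.
FACILITIES = {"kern": 0, "user": 1, "daemon": 3, "local0": 16}
SEVERITIES = {"emerg": 0, "err": 3, "info": 6, "debug": 7}

def pri(facility: str, severity: str) -> int:
    # The PRI value that appears as "<NNN>" at the start of a syslog message.
    return FACILITIES[facility] * 8 + SEVERITIES[severity]

print(pri("local0", "info"))  # 134, i.e. the message starts with "<134>"
print(pri("daemon", "err"))   # 27
```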
The main goal was to create a solution that uses few system resources and integrates flexibly with other monitoring components. I have divided the description into two parts: first the components for logs, then the components for monitoring system metrics.
### **Logs**
To redirect log streams I chose Rsyslog, which has a whole list of advantages and is available on most UNIX distributions. I configured Rsyslog on all servers where the application runs, so that all system and application logs are redirected to a central Rsyslog server. Our production environment is divided into several projects, so the logs arriving at the central server are already pre-filtered and sorted.
The central Rsyslog server is configured to listen on a specific port for messages from all Rsyslog clients. Each message is written to a file according to a template and sorted into folders based on the hostname of the client that sent it. Writing to files was not strictly necessary; I could have redirected the logs directly to any application that indexes log streams. However, this initial segregation and saving makes it easier to configure the next monitoring components and gives me many options for archiving old messages.
<center>
<img src="https://i.imgur.com/9OBM6uo.png" alt="" loading="lazy">
</center>
##### **Client configuration:**
:::spoiler `/etc/rsyslog.d/client-collector.conf`
```c=
####################################
############# INIT MODULE ##########
####################################
$ModLoad imfile
$InputFilePollInterval 1
####################################
########## INIT STATIC LOG #########
####################################
input(type="imfile" File="/MarkLogicLogs/TaskServer_ErrorLog.txt" Tag="ml_TSERR" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/TaskServer_RequestLog.txt" Tag="ml_TSREQ" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/AuditLog.txt" Tag="ml_AUDIT" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/CrashLog.txt" Tag="ml_CRASH" Ruleset="FileNameTAG" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/ErrorLog.txt" Tag="ml_ERROR" Ruleset="FileNameTAG" addMetadata="on")
####################################
########## INIT WILDCARDS ##########
####################################
input(type="imfile" File="/MarkLogicLogs/7*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/8*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
input(type="imfile" File="/MarkLogicLogs/9*_*Log.txt" Tag="db_" Ruleset="FileNameRegex" addMetadata="on")
###################################
############# TEMPLATE ############
###################################
template(name="LongTagForwardFormat" type="string" string="<%PRI%>%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag%%$.suffix:1:6:%%msg:::sp-if-no-1st-sp%%msg%")
###################################
############# RULESET #############
###################################
ruleset(name="FileNameRegex") {
    set $.suffix = re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "all.txt");
    call sendToLogserver
}
ruleset(name="FileNameTAG") {
    call sendToLogserver
}
ruleset(name="sendToLogserver") {
    action(type="omfwd" Target="centralhostname.com" Port="514" template="LongTagForwardFormat" Protocol="tcp")
}
###################################
########## END-CFG ################
###################################
```
:::
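The `FileNameRegex` ruleset above uses `re_extract` with the pattern `(.*)/([^/]*)` to pull the bare file name out of the full log path (with `all.txt` as the fallback when nothing matches). A Python model of that extraction (rsyslog uses POSIX regular expressions, but Python's `re` behaves the same for this pattern):

```python
import re

def extract_suffix(path: str) -> str:
    # Mirror of re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "all.txt"):
    # the greedy (.*) consumes everything up to the last "/", so capture
    # group 2 is the bare file name; "all.txt" is the no-match fallback.
    m = re.search(r"(.*)/([^/]*)", path)
    return m.group(2) if m else "all.txt"

print(extract_suffix("/MarkLogicLogs/7001_ErrorLog.txt"))  # 7001_ErrorLog.txt
print(extract_suffix("no-slash-here"))                     # all.txt
```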
##### **Central server configuration:**
:::spoiler `/etc/rsyslog.d/central-collector.conf`
```c=
####################################
########## JSON FILES ##############
####################################
lookup_table(name="def_host" file="/etc/rsyslog.d/def_name.json" reloadOnHUP="on")
####################################
########## INIT MODULE #############
####################################
module(load="imtcp")
module(load="mmrm1stspace")
input(type="imtcp" port="514")
$FileOwnerNum 1000
$FileGroupNum 1000
$DirGroupNum 1000
$DirOwnerNum 1000
$FileCreateMode 0644
$DirCreateMode 0755
###################################
########## TEMPLATE ###############
###################################
template(name="RemoteLogSavePath" type="list") {
    constant(value="/opt/monitoring/logs/")
    property(name="$.envTYPE")
    constant(value="/")
    property(name="$.clusterID")
    constant(value="/")
    property(name="$.LogMainFileName")
}
template(name="LogResultMSG" type="list") {
    property(name="$.def_hostid")
    constant(value=" ")
    property(name="$.LogMSGFileName")
    property(name="msg" compressspace="on" spifno1stsp="on")
    property(name="msg")
    constant(value="\n")
}
###################################
########## ACTION #################
###################################
set $.clusterDB = lookup("def_host", $hostname);
if ($.clusterDB == "") then {
    set $.envTYPE = "OTHER";
    set $.clusterID = $hostname;
    set $.def_hostid = $hostname;
} else {
    set $.envTYPE = substring($.clusterDB, 0, 3);
    set $.clusterID = substring($.clusterDB, 0, 4);
    set $.def_hostid = $.clusterDB;
}
action(type="mmrm1stspace")
set $.LookUpJSON = lookup("def_host", substring($programname, 3, 4));
if ($.LookUpJSON == "") then {
    set $.LogMainFileName = $programname;
    set $.LogMSGFileName = $programname;
} else {
    set $.LogMainFileName = $.LookUpJSON;
    set $.LogMSGFileName = $programname;
}
action(type="omfile" DynaFile="RemoteLogSavePath" Template="LogResultMSG")
stop
###################################
########## END-CFG ################
###################################
```
:::
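The ACTION section of the central config derives three per-message variables (`envTYPE`, `clusterID`, `def_hostid`) from the `def_host` lookup, and they in turn drive the dynamic save path. A Python sketch of that branching (the `classify` function name and dict-based table are mine, for illustration only):

```python
def classify(hostname: str, def_host: dict) -> tuple:
    """Model of the rsyslog ACTION block: map a client hostname to
    (envTYPE, clusterID, def_hostid) via the def_host lookup table."""
    entry = def_host.get(hostname, "")
    if entry == "":
        # Unknown host: logs are filed under OTHER/<hostname>/
        return ("OTHER", hostname, hostname)
    # Known host: envTYPE = first 3 chars, clusterID = first 4 chars
    # (mirrors substring($.clusterDB, 0, 3) and substring($.clusterDB, 0, 4))
    return (entry[:3], entry[:4], entry)

def_host = {"apphostname1.com": "SERV_1"}
print(classify("apphostname1.com", def_host))  # ('SER', 'SERV', 'SERV_1')
print(classify("unknown.com", def_host))       # ('OTHER', 'unknown.com', 'unknown.com')
```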
##### **Defined names (file structure example)**
:::spoiler `/etc/rsyslog.d/def_name.json`
```json=
{ "version": 1,
  "type": "string",
  "table": [
    {"index": "apphostname1.com", "value": "SERV_1"},
    {"index": "apphostname2.com", "value": "SERV_2"},
    {"index": "apphostname3.com", "value": "SERV_3"},
    {"index": "8001", "value": "ml_ADMIN"}
]}
```
:::
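Standard JSON forbids a trailing comma after the last table row, and a malformed file can prevent rsyslog from loading the lookup table at all, so it is worth validating the file before a reload. A quick hedged sanity check in Python (the table is inlined here; in practice you would read `/etc/rsyslog.d/def_name.json`):

```python
import json

# Inline copy of the lookup-table structure for illustration.
table_text = '''
{ "version": 1,
  "type": "string",
  "table": [
    {"index": "apphostname1.com", "value": "SERV_1"},
    {"index": "8001", "value": "ml_ADMIN"}
]}'''

table = json.loads(table_text)  # raises ValueError on malformed JSON
lookup = {row["index"]: row["value"] for row in table["table"]}
print(lookup["8001"])  # ml_ADMIN
```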
---
### **Promtail/Loki/Grafana Compose Configuration:**
The logs preprocessed by Rsyslog are already on the central server, so the next step is to configure the indexing/search and visualization software. In my solution I used the Promtail-Loki-Grafana stack, configured as a cluster on Docker. I run three Loki instances behind an Nginx gateway that routes the read and write loads from the clients (Grafana, Promtail), which makes the front-end application run much more smoothly.
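For reference, the write path the gateway exposes is Loki's push API (`POST /loki/api/v1/push`). A minimal sketch of the JSON body a client such as Promtail sends, with illustrative label names only:

```python
import json
import time

def build_push_payload(labels: dict, lines: list) -> str:
    # Loki push API body: a list of streams, each carrying a label set
    # and [timestamp_in_nanoseconds, log_line] pairs.
    ts = str(time.time_ns())
    return json.dumps({
        "streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]
    })

body = build_push_payload({"job": "MarkLogic", "type": "ml"}, ["ErrorLog line"])
# POST this body to http://loki-gateway:80/loki/api/v1/push with
# Content-Type: application/json (and the X-Scope-OrgID header when
# multi-tenancy is enabled, as it is in the nginx config below).
```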

##### **Configuration of docker-compose**
:::spoiler `docker-compose-ha.yaml`
```yaml=
version: "3.8"
services:
  grafana:
    image: grafana/grafana:7.5.6
    ports:
      - "3000:3000"
    networks:
      - loki
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
  promtail:
    image: grafana/promtail:2.2.1
    volumes:
      - /opt/monitoring/logs:/var/log
      - ./config:/etc/promtail/
    ports:
      - "9080:9080"
    command: -config.file=/etc/promtail/promtail.yaml
    networks:
      - loki
  loki-gateway:
    image: nginx:1.19
    volumes:
      - ./config/nginx-loki.conf:/etc/nginx/nginx.conf
    ports:
      - "80"
      - "3100"
    networks:
      - loki
  loki-frontend:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
    ports:
      - "3100"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=query-frontend"
    networks:
      - loki
    deploy:
      mode: replicated
      replicas: 2
  loki-1:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
  loki-2:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
  loki-3:
    image: grafana/loki:2.2.1
    volumes:
      - ./config:/etc/loki/
      - ./chunks:/loki/chunks/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki-memberlist.yaml -target=all"
    networks:
      - loki
    restart: on-failure
networks:
  loki:
```
:::
:::spoiler `promtail.yaml`
```yaml=
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: "debug"
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki-gateway:80/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - labels:
          job: MarkLogic
          __path__: /S*/*
          type: ml
      - labels:
          job: System
          __path__: /var/log/*log
          type: srv
```
:::
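Promtail's `__path__` values are glob patterns matched against file paths. A rough Python model with `fnmatch` (Promtail itself uses Go's doublestar globbing, which behaves the same for simple patterns like these):

```python
import fnmatch

candidates = ["/var/log/syslog", "/var/log/messages", "/var/log/auth.log"]
# "/var/log/*log" matches any file under /var/log whose name ends in "log"
matched = [p for p in candidates if fnmatch.fnmatch(p, "/var/log/*log")]
print(matched)  # ['/var/log/syslog', '/var/log/auth.log']
```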
:::spoiler `nginx-loki.conf`
```nginx=
error_log /dev/stderr;
pid /tmp/nginx.pid;
worker_rlimit_nofile 8192;

events {
  worker_connections 4096; ## Default: 1024
}

http {
  default_type application/octet-stream;
  log_format main '$remote_addr - $remote_user [$time_local] $status '
                  '"$request" $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';
  access_log /dev/stderr main;
  sendfile on;
  tcp_nopush on;

  upstream distributor {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream querier {
    server loki-1:3100;
    server loki-2:3100;
    server loki-3:3100;
  }
  upstream query-frontend {
    server loki-frontend:3100;
  }

  server {
    listen 80;
    proxy_set_header X-Scope-OrgID docker-ha;
    location = /loki/api/v1/push {
      proxy_pass http://distributor$request_uri;
    }
    location = /ring {
      proxy_pass http://distributor$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass http://querier$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass http://query-frontend$request_uri;
    }
  }

  server {
    listen 3100;
    proxy_set_header X-Scope-OrgID docker-ha;
    location ~ /loki/api/.* {
      proxy_pass http://querier$request_uri;
    }
  }
}
```
:::
:::spoiler `loki-memberlist.yaml`
```yaml=
auth_enabled: false
http_prefix:
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: debug
memberlist:
  join_members: ["loki-1", "loki-2", "loki-3"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    ring:
      replication_factor: 2
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 0
  chunk_block_size: 262144
schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  filesystem:
    directory: /loki/chunks
limits_config:
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 30m
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
chunk_store_config:
  max_look_back_period: 336h
table_manager:
  retention_deletes_enabled: true
  retention_period: 336h
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h
frontend:
  log_queries_longer_than: 5s
  downstream_url: http://loki-gateway:3100
  compress_responses: true
querier:
  query_ingesters_within: 2h
```
:::
---
### **Metrics monitoring**
For monitoring system metrics I chose the following components:
- **Telegraf** is a plugin-driven server agent for collecting and reporting metrics and events from databases, systems, and IoT sensors.
- **InfluxDB** is an open-source time-series database written in Go, optimized for fast, high-availability storage and retrieval of time-series data. It works well for operations monitoring, application metrics, and real-time analytics.

Telegraf is installed on every server hosting the monitored application. Its configuration file tells the agent to collect system metrics and send them to the central server, where the InfluxDB container is running.
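Under the hood, Telegraf ships each measurement to InfluxDB in the line protocol format. A simplified Python sketch of that format (real Telegraf also escapes special characters and types the field values, which is omitted here):

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    # InfluxDB line protocol: measurement,tag=val[,tag=val] field=val[,field=val] timestamp
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_set} {field_set} {ts_ns}"

print(to_line_protocol("cpu", {"host": "influxdb"}, {"usage_idle": 97.2}, 1_000_000_000))
# cpu,host=influxdb usage_idle=97.2 1000000000
```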
##### **Configuring Telegraf on the application server side**
:::spoiler `telegraf.conf`
```toml=
[global_tags]

# Configuration for the telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = "influxdb"
  omit_hostname = false

# InfluxDB v2 output
[[outputs.influxdb_v2]]
  urls = ["$INFLUX_HOST"]
  token = "$TELEGRAF_WRITE_TOKEN"
  organization = "$INFLUX_ORG"
  bucket = "$INFLUX_BUCKET"

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  # no configuration

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  # no configuration

[[inputs.netstat]]
  # no configuration

[[inputs.interrupts]]
  # no configuration

[[inputs.linux_sysctl_fs]]
  # no configuration
```
:::
---
##### **Configuring InfluxDB on the central server side**
The InfluxDB service definition was added to the `docker-compose-ha.yaml` file:
:::spoiler `docker-compose-ha.yaml`
```yaml=
  influxdb:
    container_name: influxdb
    image: influxdb:2.0.6
    restart: always
    networks:
      loki:
    ports:
      - 8086:8086
    env_file:
      - './config/influxdb.env'
    volumes:
      - type: "volume"
        source: influxdb2-data
        target: /var/lib/influxdb/

volumes:
  influxdb2-data:
    name: influxdb2-data
```
:::
:::spoiler `influxdb.env`
```bash=
INFLUXDB_DATA_ENGINE=tsm1
INFLUXDB_REPORTING_DISABLED=false
INFLUX_DB=metrics
INFLUXDB_ADMIN_ENABLED=true
INFLUX_USERNAME=xxx
INFLUX_PASSWORD=yyy
INFLUX_RETENTION=0
```
:::
---
### **Summary**
At the very end, all that remains is to start all the components and connect the data sources in Grafana from the browser. This solution gives us the ability to monitor any application, both on our own server infrastructure and in the cloud.
### **Functional diagram of the complete solution**
