# SEE 15.06.2022 Notes - Monitoring
## Overview / Current context
We have a SaaS microservices platform with multiple horizontally scalable services. We have a way to collect individual logs, but it is impossible to trace events load-balanced across a service (type, not instance). [We need factor XI of the 12-factor app - logs](https://12factor.net/logs).
Apart from service monitoring of our platform, there are other types of monitoring and different agents that we can use, like:
- Exception monitoring - we already do that client-side with Sentry and server-side with Winston
- Usage monitoring - currently not done systematically for our application
- Infrastructure monitoring - we collect data with Prometheus
- Service health monitoring
- Network monitoring
- Resource usage monitoring
- Security monitoring
- Performance monitoring
- ...

## How does the Elastic stack work?
Roughly speaking, the basic ELK flow looks like this:
COLLECT + SHIP (agents) --> [TRANSFORM] (Logstash) --> INDEX (Elasticsearch) --> VISUALIZE (Kibana)
Here, it is important to highlight the following:
- COLLECT + SHIP - the first step collects and streams logs via agents. It is important to highlight that there isn't just one type of agent in the ELK stack. We have the following:
  - filebeat - collects file logs and ships them
  - metricbeat - infrastructure metrics
  - apm - performance monitoring
  - heartbeat - uptime
  - RUM agent - user experience
  - winlogbeat - Windows logs
  - packetbeat - wire data
  - auditbeat - Linux monitoring of user and process activity
  - security agent - https://www.elastic.co/security/endpoint-security
For our application, the collect phase is currently done via ECK (Elastic Cloud on Kubernetes). The relevant configuration is in the infra ops repo, orchestration/k8s/base/elk/05-filebeat.yml:
```yml
# Can't have elastic reference + output configured at the same time
# elasticsearchRef:
#   name: elasticsearch
#   namespace: elk
kibanaRef:
  name: kibana
  namespace: elk
config:
  filebeat:
    autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints:
            enabled: true
            default_config:
              type: container
              paths:
                - /var/log/containers/*${data.kubernetes.container.id}.log
  processors:
    - add_cloud_metadata:
        namespace: default
    - add_host_metadata:
        namespace: default
  output.logstash:
    hosts: ["logstash.elk.svc.cluster.local:5044"]
```
This configuration means we are using the k8s autodiscover provider: it finds our nodes, finds where the container logs are, and ships them to Logstash. If we uncomment the first four lines and comment out the last two, we ship directly to Elasticsearch instead.
Here we also have a couple of processors - and processors are at the heart of filebeat. They add metadata to our logs, which is then indexed by Elasticsearch and can be searched on later. **NB - we need to discuss what metadata is needed, as logging a lot of data costs money at the end of the day.**
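To give the metadata discussion something concrete to point at, below is a rough, non-exhaustive sketch (written as a TypeScript interface, with field names following Beats/ECS conventions) of the kind of document a single container log line becomes after the autodiscover provider and the metadata processors have enriched it. The exact set of fields depends on which processors we keep enabled, so treat this as an assumption, not the final schema:

```ts
// Rough, non-exhaustive shape of one shipped log event. Field names follow
// Beats/ECS conventions; which of these we actually keep is the metadata /
// cost discussion above, so this is an assumption, not the agreed schema.
interface ShippedLogEvent {
  '@timestamp': string;                   // when the line was read by filebeat
  message: string;                         // the raw log line from the container
  log?: { file?: { path?: string } };      // source file on the node
  kubernetes?: {                           // added by the kubernetes autodiscover provider
    namespace?: string;
    pod?: { name?: string };
    container?: { name?: string };
    labels?: Record<string, string>;
  };
  host?: { name?: string };                // added by add_host_metadata
  cloud?: {                                // added by add_cloud_metadata
    provider?: string;
    instance?: { id?: string };
  };
}
```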
- TRANSFORM - Logstash is an optional step in the Elastic setup. Logs can be shipped unformatted, straight from a Beats agent to the Elasticsearch DB. That is the fastest way to get a PoC and start logging data (which is what we have done), but it also ends up with information that is hard to filter and hard to visualize.
The Logstash configuration is in orchestration/k8s/base/elk/04-logstash.yml. The most important part is logstash.conf (this is WIP):
```conf
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  geoip {
    source => "clientip"
    ecs_compatibility => disabled
  }
}
output {
  elasticsearch {
    hosts => [ "${ES_HOSTS}" ]
    user => "${ES_USER}"
    password => "${ES_PASSWORD}"
    cacert => '/etc/logstash/certificates/ca.crt'
  }
}
```
This configuration defines where we get the data from (Beats), how it is filtered and transformed, and where we output it to (Elasticsearch).
- INDEX - indexing in Elasticsearch is very important because it is at the heart of everything Elastic does. Data needs to be formatted in order to be indexed, and indexed data can be filtered and searched very quickly (see the query sketch after the role list below).
We have a 2-node Elasticsearch instance in acceptance with the following [roles](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html):
- master
- data
- ingest
- transform
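To make the "fast to filter / search" point concrete, here is a minimal query sketch using the official @elastic/elasticsearch TypeScript client (v8-style API). The filebeat-* index pattern, the node URL, and the container name are assumptions for illustration, not our agreed setup:

```ts
// Minimal sketch: querying the indexed logs for recent errors of one service.
// Assumptions: default filebeat-* indices, @elastic/elasticsearch v8 client,
// ECK's default *-es-http service name, and a hypothetical container name.
import { Client } from '@elastic/elasticsearch';

const client = new Client({
  node: process.env.ES_HOSTS ?? 'https://elasticsearch-es-http.elk.svc.cluster.local:9200', // assumed URL
  auth: {
    username: process.env.ES_USER ?? 'elastic',
    password: process.env.ES_PASSWORD ?? '',
  },
});

export async function recentErrors() {
  const result = await client.search({
    index: 'filebeat-*',                                               // assumed index pattern
    size: 50,
    sort: [{ '@timestamp': 'desc' }],                                  // newest first
    query: {
      bool: {
        filter: [
          { term: { 'kubernetes.container.name': 'alkemio-server' } }, // hypothetical container name
          { range: { '@timestamp': { gte: 'now-15m' } } },             // last 15 minutes
        ],
        must: [{ match: { message: 'error' } }],                       // free-text match until we have a severity field
      },
    },
  });
  return result.hits.hits;
}
```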
- VISUALIZE - Kibana is an extremely flexible tool with a rich library of graphs and can visualize data in many different ways. With Kibana plus proper error logging (the right severity), we can e.g. analyze how many errors vs. warnings we have, how many errors occurred in a specific time period, etc. If we, for example, add geolocation to the index via Logstash, we can then also filter event types per location.
There is already an Alkemio Logs dashboard, where we visualize the logs of our relevant services (we can have multiple dashboards if we need to).

We can filter based on the indexes we have, but our Alkemio logs are pretty free-form, so we can't apply meaningful transformations to them (idea: transform the log message into a JSON object?). We also don't have location, severity, etc. to filter on.
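To make the "transform the log message to a JSON object" idea concrete, here is a minimal sketch of structured logging in one of our Node services, assuming we keep Winston as the server-side logger; the service name and field names are illustrative, not agreed:

```ts
// Minimal sketch: structured JSON logs to stdout with Winston (assumption: we
// keep Winston as the server-side logger). Each field becomes separately
// indexable/filterable in Elasticsearch instead of one free-form string.
import { createLogger, format, transports } from 'winston';

export const logger = createLogger({
  level: process.env.LOG_LEVEL ?? 'info',
  // timestamp + json produce machine-parseable output that needs little or
  // no grok work in Logstash
  format: format.combine(format.timestamp(), format.json()),
  defaultMeta: {
    service: 'alkemio-server',            // hypothetical service name
    version: process.env.SERVICE_VERSION, // e.g. injected at build/deploy time
  },
  // 12-factor: write the event stream to stdout and let filebeat pick it up
  transports: [new transports.Console()],
});

// Example usage: severity, message and context travel as separate fields.
logger.warn('membership sync took too long', {
  context: 'MembershipService', // hypothetical context value
  durationMs: 3200,
});
```

With output like this, the grok step in Logstash could be replaced (or complemented) by a simple JSON parse, and Kibana could filter on severity, service, and context directly.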
Not covered / prioritized / discussed so far:
- different agents than filebeat
- format of the transformation we want
- data to collect
- log dashboards - do we have one dashboard or multiple dashboards (e.g. alkemio-*, auth (traefik + oathkeeper + kratos), routing (traefik + oathkeeper), etc.)
- what do we want to get out of elasticsearch (e.g., there is an ML role that I have turned off)
## Goals of the meeting
- share the context and terminology of the ELK stack so we are aligned on our short-, mid- and long-term goals
- understand the complexity vs business value of different approaches
- prioritize most suitable next set of actions
## Possible solutions
- logs --> severity (level), timestamp, service name, message, location (geoip), log context, service version?
- notifications - how can we notify users / groups? Kibana alerting
- correlationId - a UUID assigned to every request/transaction/"action in general" in our system, added to every log entry and forwarded between microservices to help track all the log entries related to a given transaction. This way, if we see an error with corrId=XX, we can filter by that corrId and see all the debug/info/warn entries associated with that transaction, while filtering out log entries from other concurrent users' actions (a rough middleware sketch follows this list). More info:
- https://www.oreilly.com/library/view/building-microservices-with/9781785887833/1bebcf55-05bb-44a1-a4e5-f9733b8edfe3.xhtml
- https://www.rapid7.com/blog/post/2016/12/23/the-value-of-correlation-ids/
- reporting dashboards
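As an input for the correlationId spike, here is a rough middleware sketch showing how an id could be created per incoming request, stamped on every log entry via a child logger, and forwarded to downstream services. Express, the `x-correlation-id` header name, and the `./logger` module are all assumptions:

```ts
// Sketch only: per-request correlation id. Assumptions: Express, Node's
// crypto.randomUUID, a Winston logger like the one sketched above (the
// ./logger module is hypothetical), and the header name x-correlation-id.
import { randomUUID } from 'crypto';
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';

const CORRELATION_HEADER = 'x-correlation-id'; // assumed header name

export function correlationMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Reuse the id if an upstream service already set it, otherwise create one.
  const correlationId = req.header(CORRELATION_HEADER) ?? randomUUID();

  // A child logger stamps correlationId on every log entry for this request.
  const requestLogger = logger.child({ correlationId });
  (req as Request & { log?: typeof requestLogger }).log = requestLogger;

  // Echo the id back, and forward the same header on any outbound calls made
  // while handling this request, e.g.
  //   fetch(url, { headers: { [CORRELATION_HEADER]: correlationId } })
  res.setHeader(CORRELATION_HEADER, correlationId);
  next();
}
```

Filtering on correlationId in Kibana would then give exactly the "all entries for one transaction" view described above.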
## Actions
- spike: investigate notifications - kibana alerting
- spike: investigate correlationId
- logstash - proper set of fields sent to elasticsearch (above)
- dashboards - vNext (after logstash) - define dashboard with e.g. errors per service per release
## Agreements
###### tags: `SEE`