---
title: '15 AWS Monitoring - CloudWatch, X-Ray and CloudTrail'
disqus: hackmd
---
:::info
AWS Certified Developer Associate DVA-C01
:::
15 AWS Monitoring - CloudWatch, X-Ray and CloudTrail
===
<style>
img{
/* border: 2px solid red; */
margin-left: auto;
margin-right: auto;
width: 90%;
display: block;
}
</style>
## Table of Contents
[TOC]

Monitoring in AWS
---
- why is monitoring impt?
- need to know how to deploy apps
- safely
- automatically
- using infra as code
- leveraging the best aws components
- ensure app is always working at best of ability
- app latency
- inc over time?
- app outages
- customer experience shld not be degraded
- troubleshooting and remediation
- internal monitoring
- can we prevent issues before it happens?
- perf and cost
- trends
- scaling patterns
- learning and improvement
### AWS Monitoring
- aws cloudwatch
- metrics - collect and track key metrics
- logs - collect, monitor, analyse and store log files
- events
- send notifs when certain events happen in your aws
- alarms
- react in realtime to metrics/events
- aws x-ray
- troubleshooting app perf and errors
- distributed tracing of microservices
- aws cloudtrail
- internal monitoring of api calls made
- audit changes to aws res by users

AWS CloudWatch
---
### Metrics
- cloudwatch provides metrics for __every__ service in aws
- __metric__ is a var to monitor
- Eg. cpu utilisation, network bandwidth etc.
- metrics belong to namespaces
- __dimension__ is attribute of metric
- Eg. instance id, env etc.
- up to 30 dimensions per metric (limit was 10 at the time of DVA-C01)
- metrics have __timestamps__
- can create cloudwatch dashboard of metrics
#### CloudWatch EC2 Detailed Monitoring
- ec2 instance metrics are reported every 5 mins by default
- with detailed monitoring, u get data every 1 min at a cost
- use detailed monitoring if want to more promptly scale your asg
- aws free tier allows for 10 detailed monitoring metrics
- NOTE
- ec2 memory usage is by default not pushed
- must be pushed from inside the instance as custom metric
#### Custom Metrics
- can define and send own custom metrics to cloudwatch
- ability to use dimensions (attr) to segment metrics
- instance.id
- env.name
- metric resolution - `StorageResolution` API param has 2 possible vals
- standard
- 1 min
- default
- high resolution
- 1 second
- higher cost
- have very fine-grained metrics
- use api called `PutMetricData`
- to send metric to cloudwatch
- use exponential backoff in case of throttle errors
- eg. send too many metrics to cloudwatch
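
A minimal sketch of the `PutMetricData` notes above, assuming a generic `send` callable instead of a real AWS client. `ThrottlingError`, `build_metric_datum` and `send_with_backoff` are hypothetical names for illustration. With boto3 the payload would typically be passed as `put_metric_data(Namespace=..., MetricData=[...])`.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a throttling error from CloudWatch (hypothetical name)."""


def build_metric_datum(name, value, dimensions, high_resolution=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Count",
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        # StorageResolution: 1 = high resolution (1 sec), 60 = standard (1 min)
        "StorageResolution": 1 if high_resolution else 60,
    }


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send(payload); on throttling, retry with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Base delay doubles on every retry: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The jitter spreads retries out so that many throttled clients do not all retry at the same moment.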
### Alarms
- alarms used to trigger notifs for any metric
- can go to auto scaling, ec2 actions, sns notifs
- various options
- sampling, %, max, min etc.
- alarm states
- OK
- INSUFFICIENT_DATA
- ALARM
- period
- length of time in seconds to evaluate the metric
- high resolution custom metrics - alarm period can be 10 or 30 sec (or a multiple of 60 sec)
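
The three alarm states can be illustrated with a simplified model. This is a sketch, not CloudWatch's actual evaluation logic (which has configurable handling of missing data). `evaluate_alarm` and its parameters are hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return the alarm state for the most recent evaluation periods.

    datapoints: per-period metric values, oldest first; None marks a missing period.
    Simplified rule: ALARM only if every recent period is above the threshold.
    """
    recent = datapoints[-evaluation_periods:]
    # Assumption: too few periods, or any missing period, means not enough data
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"
```

For example, with a threshold of 80 over 3 periods, the values `[85, 90, 70]` give `OK` because one period is below the threshold.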
### Logs
- apps can send logs to cloudwatch using sdk
- cloudwatch can collect logs from
- elastic beanstalk
- collection of logs from app
- ecs
- collection from containers
- aws lambda
- collection from func logs
- vpc flow logs
- vpc specific logs
- api gateway
- cloudtrail based on filter
- cloudwatch log agents
- Eg. ec2 machines
- these agents will get log files and send to cloudwatch
- route53
- log dns queries
- logs can go to
- batch export to s3 for archival
- stream to elasticsearch cluster for further analytics
- logs can use filter expressions
- logs storage architecture
- log grps
- arbitrary name, usually representing an app
- log stream
- instances within app/log files/containers
- can define log expiration policies
- nvr expire, 30 days etc.
- nvr expire by default
- defined at log grp lvl
- using aws cli, we can tail cloudwatch logs
- to send logs to cloudwatch, ensure iam perms are correct
- security
- encryption of logs using kms at grp lvl
#### Logs for EC2
- by default no logs from ec2 will go to cloudwatch
- need to run cloudwatch agent on ec2 to push log files u want
- make sure iam perms correct
- role that can send logs to cloudwatch
- cloudwatch log agent can be set up on on-premise servers too

### Log Agents and Unified Agent
- both for virtual servers
- Eg. ec2 instances, on-premise servers
- cloudwatch log agents
- old ver of agent
- can only send to cloudwatch logs
- cloudwatch unified agent (new)
- collect extra system-lvl metrics like ram, processes etc.
- collect logs to send to cloudwatch logs
- can do both metrics and logs (hence unified)
- centralised config using SSM param store
- log agent dont have this
#### Unified Agent - Metrics
- collect directly on linux server/ec2 machine
- cpu
- active
- guest
- idle
- system
- user
- steal
- disk metrics
- free, used, total
- disk IO
- writes, reads, bytes, iops
- ram
- free, inactive, used, total, cached
- netstat
- num of tcp and udp conns, net packets, bytes
- processes
- total, dead, blocked, idle, running, sleep
- swap space
- free, used, used %
- reminder
- out of box metrics for ec2 - disk, cpu, network (high lvl)
- cloudwatch unified agent allows you to get a lot more metrics and more granular details than normal monitoring of ec2 instances
#### Logs Metric Filter
- cloudwatch logs can use filter expressions
- eg find specific ip inside log or count occurrences of error in logs
- metric filters can be used to trigger alarms
- filters dont retroactively filter data
- only publish metric data pts for events that happen after filter created
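
The no-backfill behaviour can be sketched as below. This is a simplified model with a plain substring match, not the real filter pattern syntax. `publish_metric_points` is a hypothetical helper.

```python
from datetime import datetime, timezone


def publish_metric_points(events, term, filter_created_at, metric_value=1):
    """Return (timestamp, value) points for log events that match the filter.

    events: iterable of (timestamp, message) tuples.
    Events older than the filter are ignored, since filters do not backfill.
    """
    return [
        (ts, metric_value)
        for ts, message in events
        if ts >= filter_created_at and term in message
    ]
```

An alarm could then be set on the resulting metric, Eg. on the count of `ERROR` points per period.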

### Cloudwatch Events
- schedule events - cron jobs
- event pattern
- event rules to react to service doing sth
- Eg. codepipeline state change
- triggers to lambda funcs, sqs, sns, kinesis msgs
- cloudwatch event creates small json doc to give info abt change
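
A sketch of how an event pattern relates to the json doc of an event, using the codepipeline state change example. The matcher only covers exact value matching (a small subset of the real pattern rules). `matches` and the pipeline name are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    Pattern leaves are lists of allowed values; nested dicts match nested fields.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern: react only when a pipeline execution fails
PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}

# Trimmed-down sample event; real events carry more fields
SAMPLE_EVENT = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"pipeline": "my-pipeline", "state": "FAILED"},
}
```

Fields not mentioned in the pattern (Eg. `pipeline`) are ignored, so the rule fires for any pipeline that fails.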
### Console
#### Metrics


- 801 metrics grped by category

- can get graph representation of metric when selected
- can choose duration of data shown - 1h, 3d, 1w etc

- can add graphs to your own dashboard


- in asg, have option to enable group metrics collection
- by default not enabled

- for ec2 can also go to monitoring tab to view all metrics
#### Alarms

- shown here is default alarms created by beanstalk

- if app gets less than 2mb of data out for 5mins, decrease the size of our asg

- can also view scaling policies in asg for a brief descrip of alarms

- create alarm by first choosing a metric

- set threshold for metric
- can also set alarm actions
- sending notif
- sending to an email
- need to cfm your email

- new alarm created
- wait for state to become OK
#### Logs

- default log grp created by codebuild


- by clicking search logs, can filter your logs

- edit retention setting
- when does event expire?
- export data to amazon s3
- stream logs into elasticsearch/lambda subscription filter


##### Configuring Beanstalk to Stream Logs to Cloudwatch

- beanstalk env > configuration > edit software config

- can stream logs to s3 or cloudwatch


- alternative can also scroll to monitoring config and edit (above)
#### Metric Filter

- create metric filter

- state pattern

- can send custom log data or get directly from logs to test


- assign metric attributes
- metric name and namespace
- filter name

- state metric val
- what val it returns when filter matches Eg. 1
- default val - if no filter matches what does it return

- metric created wont appear in metric tab yet
- since there arent any probs to generate logs to filter yet

- go to one of your services and restart to generate some logs

- our custom metric shld appear now
- click on it to generate a graph view

- graph is empty as metric filters dont backfill data
- only consists of data after filter is created

- can click create alarm to do some automations


- state conditions for alarm

- set alarm trigger
- and also whr to send notif to

#### Events

- set event schedule
- happen every 5mins etc.
- can also setup cron expressions
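
A small sketch of the two schedule expression forms, `rate(...)` and the six-field `cron(...)`. The helper names are hypothetical; the formats follow the schedule expression syntax as far as I know it.

```python
def rate_expression(value, unit):
    """Build a rate expression, Eg. rate(5 minutes). Unit is singular when value is 1."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be minute, hour or day")
    if value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def cron_expression(minutes="*", hours="*", day_of_month="*", month="*",
                    day_of_week="?", year="*"):
    """Build a six-field cron expression: minutes hours day-of-month month day-of-week year."""
    # One of day_of_month / day_of_week is expected to be "?" in this cron dialect
    return f"cron({minutes} {hours} {day_of_month} {month} {day_of_week} {year})"
```

Eg. `cron_expression(minutes="0", hours="12")` describes a rule that runs every day at 12:00 UTC.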

- can view sample json obj of the event
- can also edit event directly from its json obj

- define target to invoke event on

- can also invoke event based on event pattern

- example
- event invoke on codepipeline state change to FAILED

- finalise rule details

Amazon EventBridge
---
- eventbridge is next evolution of cloudwatch events
- event buses
- __default event bus__
- generated by aws services
- eg. cloudwatch events
- __partner event bus__
- receive events from saas service or apps
- other parties can send events to your aws acc and u can react to it
- Eg. zendesk, datadog, segment, auth0 etc.
- __custom event buses__
- for own apps to publish own events and have other apps react to it
- event buses can be accessed by other aws accs
- allows for cross acc event buses
- rules
- how to process the events
- similar to cloudwatch events
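
A sketch of what an app might publish to a custom event bus. The entry shape follows the `PutEvents` API (`Source`, `DetailType`, `Detail`, `EventBusName`). The bus name, source and detail values are made up for illustration.

```python
import json


def build_event_entry(bus_name, source, detail_type, detail):
    """Build one PutEvents entry; Detail must be a JSON string, not a dict."""
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
    }


# Hypothetical custom event from our own app
ENTRY = build_event_entry(
    bus_name="my-custom-bus",
    source="com.example.orders",
    detail_type="OrderPlaced",
    detail={"orderId": "1234", "total": 42.5},
)
```

With boto3, a list of such entries would typically go to `put_events(Entries=[...])` on the `events` client, and rules on the bus then match on `source` / `detail-type` / `detail`.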
### Schema Registry
- eventbridge can analyse the events in your bus and infer __schema__
- schema - how the data is structured
- the schema registry allows you to generate code for your app that will know in advance how data is structured in the event bus
- save time
- add safety
- schema can be versioned
- can make schema/events evolve over time

### EventBridge VS CloudWatch Events
- eventbridge builds upon and extends cloudwatch events
- uses same service API and endpt and same underlying service infrastructure
- eventbridge allows extension to add event buses for your custom apps and 3rd party saas apps
- eventbridge has schema registry capability
- eventbridge has diff name to mark the new capabilities
- over time, cloudwatch events will be replaced by eventbridge
### Console

- alr have default event bus

- create custom event bus
- can give access to other accs or orgs

- custom event bus created
- own apps can publish event to

- partners with eventbridge
- list will grow in future
- ea partner will have setup instructions

- example symantec
- copy aws acc info
- create event bus for symantec

- default event bus
- to view its rules, click on the bus and go to the rules tab

- 2 rules in eventbridge default bus are the exact same rules as in cloudwatch events
- hence highlighting fact that they're built on same infra

- creating new rule for your bus

- define pattern

- many configs for service provider of rule

- select event bus to invoke event on and select target

- schema registry tab

- many schemas included
- can search through all
- get info for schemas by aws
- or self discovered schemas by our own event buses
- or create own custom schema registry

- example codepipeline schema for actionexecutionstatechange
- have entire schema of ver 1
- defines what you can expect in every field out of event coming from codepipeline for the event type

- can download code bindings

AWS X-Ray
---
#### Limitations of Debugging
- debugging in production the good old way
- test locally
- add log statements everywhr
- redeploy in prod
- log formats differ across apps using cloudwatch
- makes analytics hard
- have to centralise insights
- debugging
- monolith - easy
- distributed services - hard
- hundred of microservices running
- no common views of your architecture
### Intro to X-Ray
- provides a visual analysis of our apps

### Advantages
- troubleshooting performance
- bottlenecks
- understand dependencies in microservice architecture
- pinpoint service issues
- review request behaviour
- find errors and exceptions
- give insights
- meeting time SLA (service lvl agreement)?
- where are you throttled?
- identify users impacted
### Compatibility
- aws lambda
- elastic beanstalk
- ecs
- elb
- api gateway
- ec2 instances or app server
- even on premise
### Leverages Tracing
- tracing is an end to end way to follow a request
- ea component dealing with req adds its own trace
- made up of segments and sub segments
- annotations can be added to traces to provide extra info
- ability to trace
- every req
- sample req - as % for example or rate per min
- x-ray security
- iam for authorisation
- kms for encryption at rest
### Enabling X-Ray
- through code
- Eg. java, python, go, nodejs, .net
- must import aws xray sdk
- very little modification needed
- app sdk will then capture
- calls to aws services
- http/https requests
- database calls - mysql, postgresql, dynamodb
- queue calls - sqs
- install x-ray daemon or enable x-ray aws integration
- xray daemon works as low lvl udp packet interceptor
- linux/windows/macs
- aws lambda/other aws services alr run xray daemon for you
- ea app must have iam rights to write data to xray
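
A rough sketch of how an sdk hands a segment to the daemon over udp, assuming the daemon listens on the default `127.0.0.1:2000` and accepts a json header line followed by the segment document. Function names are hypothetical; in practice the xray sdk does this for you.

```python
import json
import socket

# Header line the daemon expects before the segment document
DAEMON_HEADER = '{"format": "json", "version": 1}'


def build_daemon_datagram(segment):
    """Serialise a segment dict into the header + newline + JSON datagram."""
    return (DAEMON_HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send_to_daemon(datagram, host="127.0.0.1", port=2000):
    """Fire-and-forget UDP send; the daemon batches and uploads to X-Ray."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (host, port))
```

Because it is udp, the app does not wait for a reply, which keeps the tracing overhead in the app low.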

- xray collects data from all diff services
- service map computed from all segments and traces
- xray is graphical so even non tech people can help troubleshoot
### Troubleshooting
- if xray not working on ec2
- ensure ec2 iam role has proper perms
- ensure ec2 instance is running xray daemon
- to enable on aws lambda
- ensure it has iam execution role with proper policy
- `AWSXRayWriteOnlyAccess`
- ensure xray is imported in code
- and xray integration enabled on lambda
### X-Ray Instrumentation
- __instrumentation__ - measure of product's perf, diagnose errors and to write trace info
- to instrument your app code, use xray sdk
- many sdk need config changes
- can modify app code to customise and annotate data that the sdk sends to xray using __interceptors, filters, handlers, middleware etc.__

### X-Ray Concepts
- segments
- ea app/service will send them
- subsegments
- if need more details in segments
- trace
- segments collected tgt to form end-to-end trace
- sampling
- decrease amt of reqs sent to xray
- reduce cost
- annotations
- key val pairs used to index traces and use with filters
- traces can be searched by annotations since they are indexed (unlike metadata)
- metadata
- key val pairs
- not indexed
- not used for searching
- xray daemon/agent has config to send traces cross acc
- ensure iam perms correct - agent will assume the role
- allows to have central acc for all the app tracing
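
A sketch of a segment document showing whr annotations and metadata sit. Field names follow the segment document format as far as I know it; `build_segment` and the sample values are hypothetical, and a real sdk fills these in for you.

```python
import os
import time


def new_trace_id(now=None):
    """Trace id format: 1-<8 hex digits of epoch seconds>-<24 random hex digits>."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def build_segment(name, start, end, annotations=None, metadata=None):
    """Build a minimal segment document as a dict."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),          # 16 hex digit segment id
        "trace_id": new_trace_id(start),
        "start_time": start,                # epoch seconds (float)
        "end_time": end,
        "annotations": annotations or {},   # indexed, usable in filter expressions
        "metadata": metadata or {},         # not indexed, extra debugging info only
    }
```

Eg. putting `{"env": "dev"}` in annotations makes traces searchable by env, while a large debug payload is better placed in metadata.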
### Sampling Rules
- with sampling rules, can control the amt of data u record
- can modify sampling rules w/o changing code
- more reqs, more u need to pay
- by default, xray sdk records 1st req ea second and 5% of any extra reqs
- 1 req per sec is __reservoir__
- this ensures at least 1 trace is recorded ea sec as long as service is serving reqs
- reservoir sets the min num of reqs per sec that are always sent to xray (as long as that many reqs arrive)
- 5% is the __rate__ at which extra reqs beyond the reservoir size are sampled
- rate only applies to the extra reqs, i.e. once reqs per sec exceed the reservoir amt
- Eg. reservoir - 50, fixed rate - 10%
- 100 reqs per second matches the rule - 50 + 5 = 55 reqs per sec sent
- 50 come from the reservoir, 5 is 10% of the remaining 50 reqs (100 - 50), not of the reservoir itself
- [Read the Docs](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html)
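
The reservoir + rate arithmetic can be written out as a small function. This is a sketch of the average behaviour only (actual sampling is per request). `sampled_per_second` is a hypothetical name.

```python
def sampled_per_second(requests_per_second, reservoir, rate):
    """Expected number of requests sampled per second for one sampling rule.

    The reservoir is filled first; the rate applies only to requests beyond it.
    """
    from_reservoir = min(requests_per_second, reservoir)
    extra = requests_per_second - from_reservoir
    return from_reservoir + extra * rate
```

With 100 reqs per sec, reservoir 50 and rate 0.10 this gives 50 + 10% of the remaining 50, i.e. 55 reqs per sec.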
#### Custom Sampling Rules
- can create own rules with reservoir and rate

- reservoir 10 so abt 10 reqs per sec is sent to xray

- rate is 1 so all reqs are sent
- since its debugging
- though very expensive
- NOTE
- if change sampling rules in xray console, dont need to restart apps
### X-Ray APIs
#### Write APIs
- used by xray daemon to write to xray
- `PutTraceSegments` - uploads segment documents to xray
- necessary to write to xray
- `PutTelemetryRecords` - used by aws xray daemon to upload telemetry
- `SegmentsReceivedCount`, `SegmentsRejectedCount`, `BackendConnectionErrors` etc.
- for xray daemon to upload info about how many segments were received, rejected and backend conn errors
- helps with metrics
- `GetSamplingRules` - retrieve all sampling rules
- to know what/when to send
- for xray daemon to know what changed in rules when we update it
- `GetSamplingTargets` & `GetSamplingStatisticSummaries` - advanced
- related to get APIS
- xray daemon needs to have iam policy authorising the correct api calls to func correctly

#### Read APIs
- `GetServiceGraph` - main graph
- `BatchGetTraces`
- retrieves list of traces specified by id
- ea trace is collection of segment documents that originate from single req
- `GetTraceSummaries`
- retrieves IDs and annotations for traces avail for specified time frame using optional filter
- to get full traces, pass trace ids to `BatchGetTraces`
- `GetTraceGraph`
- retrieves a service graph for 1 or more specific trace ids

### X-Ray with Elastic Beanstalk
- aws elastic beanstalk platforms include xray daemon
- can run daemon by setting option in beanstalk console or with config file
- config file in `.ebextensions/xray-daemon.config` (see below it's just 1 line)
- make sure to give instance profile the correct iam perms so xray daemon can function correctly
- make sure app code is instrumented with xray sdk
- NOTE
- xray daemon is not provided for multicontainer docker

### ECS + X-Ray
#### Integration Options
- ecs cluster
- xray container as daemon itself
- xray daemon container running on every instance

- ecs cluster
- xray container as side car
- 1 xray daemon alongside ea app container
- they will connect from a networking standpt
- xray daemon run side to side with app

- fargate cluster
- xray container as side car

#### Example Task Definition

- port mapping mapped to udp port 2000 for container
- app name - scorekeep-api
- env name called `AWS_XRAY_DAEMON_ADDRESS`
- need to set so xray will know how to find xray daemon
- links section links the port mapping and xray daemon tgt from a networking standpoint
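
A sketch of how an app could resolve the daemon address from `AWS_XRAY_DAEMON_ADDRESS`, assuming the simple `host:port` form and a fallback of `127.0.0.1:2000`. `daemon_address` is a hypothetical helper; the xray sdk reads this env var on its own.

```python
import os


def daemon_address(env=None, default="127.0.0.1:2000"):
    """Return (host, port) of the X-Ray daemon from the environment.

    Only the plain host:port form is handled in this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("AWS_XRAY_DAEMON_ADDRESS", default)
    host, _, port = raw.rpartition(":")
    return host, int(port)
```

In a side car setup the value would point at the daemon container, Eg. `xray-daemon:2000`.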
### Console

- getting started
- sample app or instrument own app


- add xray sdk to your app

- xray sdk dont send traces directly so need to setup daemon too
- they want to avoid throttling

- when creating sample app will launch u into cloudformation to create a new stack

- a bunch of xray res created alongside

- xray sample app
- start btn will generate signups every 6 seconds
- creates duplicate signups every min which will generate error

- xray service map generated
- 169.254.169.254 is the metadata service from aws
- can see that there's errors happening in front end dev app which is coming from dynamodb

- error is highlighted in yellow

- can click error > view traces to view all traces of errors
- 66% is signup while rest is favicon

- more details on a trace
- frontend made a req to dynamodb
- POST > put item

- can even show what goes on in dynamodb

- this is an example of an OK trace
#### Sampling Rules


- default sampling rule
- priority 10000 (the highest number, so it is evaluated last)
- default rate is 5%

- matching criteria to match reqs
- now its set to match all

#### Xray with Beanstalk
- go to beanstalk env > configuration > software edit

- just enable xray daemon

- make sure ec2 instance of beanstalk has correct roles

- elasticbeanstalk web tier for ec2 iam role

- alr has read/write perms for xray

- json ver for that

AWS CloudTrail
---
- provides governance, compliance and audit for your aws acc
- enabled by default
- get history of events/API calls made within your aws acc
- console
- sdk
- cli
- aws services
- can put logs from cloudtrail into cloudwatch logs
- so u can keep the events
- if res deleted in aws, look in cloudtrail first
### Console


- example event


- can filter event history

CloudTrail VS CloudWatch VS X-Ray
---
- cloudtrail
- audit api calls made by users/services/aws console
- useful to detect unauthorised calls or root cause of changes
- cloudwatch
- cloudwatch metrics over time for monitoring
- cloudwatch logs for storing app log
- cloudwatch alarms to send notifs in case of unexpected metrics
- xray
- automated trace analysis & central service map visualisation
- latency, errors and fault analysis
- request tracking across distributed systems
###### tags: `AWS Developer Associate` `Notes`