# Complete PromQL Tutorial with Real-World Examples
## Table of Contents
1. [Introduction to PromQL](#introduction-to-promql)
2. [Basic Concepts](#basic-concepts)
3. [Data Types](#data-types)
4. [Selectors and Matchers](#selectors-and-matchers)
5. [Operators](#operators)
6. [Functions](#functions)
7. [Real-World Exercise Cases](#real-world-exercise-cases)
8. [Best Practices and Guidelines](#best-practices-and-guidelines)
9. [Tag-Based Exercise Cases](#tag-based-exercise-cases)
10. [Advanced Patterns](#advanced-patterns)
11. [Common Pitfalls to Avoid](#common-pitfalls-to-avoid)
## Introduction to PromQL
PromQL (Prometheus Query Language) is a functional query language that lets you select and aggregate time series data in real time. It's designed to be flexible and powerful for monitoring and alerting scenarios.
### Key Features:
- Real-time querying of time series data
- Rich set of operators and functions
- Support for mathematical operations
- Aggregation capabilities
- Vector matching for complex queries
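A single query typically combines several of these features. The hedged example below (metric and label names are illustrative) selects a counter, converts it to a per-second rate, and aggregates the result by a label:
```promql
# Per-second HTTP request rate over the last 5 minutes, summed per service
sum by(service) (rate(http_requests_total[5m]))
```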
## Basic Concepts
### Metrics
Metrics are identified by a metric name and optional key-value pairs called labels:
```promql
http_requests_total{method="POST", handler="/api/v1/users"}
```
### Time Series
A time series is a sequence of timestamped values sharing the same metric and label set.
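For example, the following selectors refer to two distinct time series: they share the metric name but differ in their label sets:
```promql
http_requests_total{method="GET", handler="/api/v1/users"}
http_requests_total{method="POST", handler="/api/v1/users"}
```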
### Vector Types
- **Instant Vector**: Set of time series containing a single sample for each time series, all sharing the same timestamp
- **Range Vector**: Set of time series containing a range of data points over time for each time series
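The distinction matters because functions expect a specific type: `rate()`, for instance, takes a range vector and returns an instant vector. A short illustration using the request counter from above:
```promql
# Instant vector: one sample per series at the evaluation time
http_requests_total
# Range vector: the last 5 minutes of samples per series
http_requests_total[5m]
# rate() consumes the range vector and returns an instant vector
rate(http_requests_total[5m])
```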
## Data Types
### 1. Scalar
A simple numeric floating point value
```promql
42
3.14159
```
### 2. Instant Vector
```promql
# Current CPU usage
cpu_usage_percent
# HTTP requests in the last minute
rate(http_requests_total[1m])
```
### 3. Range Vector
```promql
# HTTP requests over the last 5 minutes
http_requests_total[5m]
# CPU usage over the last hour
cpu_usage_percent[1h]
```
### 4. String
Currently only used as literals in certain functions
```promql
label_replace(up, "new_label", "value", "instance", ".*")
```
## Selectors and Matchers
### Basic Label Matching
```promql
# Exact match
http_requests_total{method="GET"}
# Multiple labels
http_requests_total{method="GET", status="200"}
# All time series for a metric
http_requests_total
```
### Label Matching Operators
```promql
# Equality
{method="GET"}
# Inequality
{method!="GET"}
# Regular expression match
{method=~"GET|POST"}
# Regular expression not match
{method!~"GET|POST"}
```
### Range Vector Selectors
```promql
# Last 5 minutes
http_requests_total[5m]
# Subquery: the last 1 hour, evaluated at a 30s resolution
http_requests_total[1h:30s]
```
## Operators
### Arithmetic Operators
```promql
# Addition
node_memory_MemTotal_bytes + node_memory_MemFree_bytes
# Subtraction - Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Multiplication - Convert to percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Division - Average request duration
http_request_duration_seconds_sum / http_request_duration_seconds_count
# Modulo
node_time_seconds % 86400
# Power
rate(cpu_seconds_total[5m]) ^ 2
```
### Comparison Operators
```promql
# Greater than - High CPU usage
cpu_usage_percent > 80
# Less than - Low disk space
disk_free_bytes < 1000000000
# Equal to
http_response_status == 200
# Not equal to
http_response_status != 200
# Greater than or equal to
memory_usage_percent >= 90
# Less than or equal to
response_time_seconds <= 0.5
```
### Logical Operators
```promql
# AND - High CPU AND low memory
(cpu_usage_percent > 80) and (memory_available_bytes < 1000000000)
# OR - Error status codes
(http_response_status == 400) or (http_response_status == 500)
# UNLESS - All instances except maintenance ones
up unless on(instance) maintenance_mode
```
### Vector Matching
```promql
# One-to-one matching
method_pos{method="GET"} / ignoring(method) method_total
# Many-to-one matching
rate(http_requests_total[5m]) / on(instance) group_left(job) instance_info
# One-to-many matching
instance_info * on(instance) group_right(job) rate(http_requests_total[5m])
```
## Functions
### Rate Functions
```promql
# Rate - Per-second average rate of increase
rate(http_requests_total[5m])
# irate - Instantaneous rate based on last two data points
irate(http_requests_total[5m])
# increase - Total increase over time range
increase(http_requests_total[1h])
```
### Aggregation Functions
```promql
# Sum - Total across all instances
sum(rate(http_requests_total[5m]))
# Sum by labels - Total per service
sum by(service) (rate(http_requests_total[5m]))
# Average CPU usage per job
avg by(job) (cpu_usage_percent)
# Maximum memory usage
max(memory_usage_bytes)
# Minimum response time
min(response_time_seconds)
# Count number of instances
count(up == 1)
# Standard deviation
stddev(response_time_seconds)
# Quantiles
quantile(0.95, response_time_seconds)
```
### Mathematical Functions
```promql
# Absolute value
abs(temperature_celsius)
# Ceiling
ceil(response_time_seconds)
# Floor
floor(cpu_cores)
# Round
round(memory_usage_gb, 0.1)
# Square root
sqrt(variance_metric)
# Exponential
exp(log_metric)
# Natural logarithm
ln(exponential_metric)
# Logarithm base 2
log2(power_of_two_metric)
# Logarithm base 10
log10(decimal_metric)
```
### Time Functions
```promql
# Current timestamp
time()
# Day of month (1-31)
day_of_month()
# Day of week (0-6, Sunday=0)
day_of_week()
# Hour of day (0-23)
hour()
# Minute of hour (0-59)
minute()
# Month of year (1-12)
month()
# Year
year()
```
### String Functions
```promql
# Replace label values
label_replace(up, "new_instance", "$1", "instance", "([^:]+):.*")
# Join labels
label_join(up, "new_label", "_", "instance", "job")
```
### Aggregation Over Time Functions
```promql
# Last value over time range
last_over_time(cpu_usage_percent[1h])
# Maximum value over time range
max_over_time(response_time_seconds[5m])
# Minimum value over time range
min_over_time(cpu_temperature[10m])
# Average value over time range
avg_over_time(memory_usage_percent[1h])
# Count values over time
count_over_time(http_requests_total[1h])
# Quantile over time
quantile_over_time(0.95, response_time_seconds[5m])
# Standard deviation over time
stddev_over_time(cpu_usage_percent[1h])
```
## Real-World Exercise Cases
### Exercise 1: Web Server Monitoring
**Scenario**: Monitor a web application's performance and health.
**Metrics Available**:
- `http_requests_total{method, status, handler}`
- `http_request_duration_seconds{method, handler}` (histogram)
- `http_request_size_bytes{method, handler}`
**Tasks**:
1. **Request Rate per Second**:
```promql
# Total requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by HTTP method
sum by(method) (rate(http_requests_total[5m]))
```
2. **Error Rate**:
```promql
# 4xx and 5xx error rate
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```
3. **Average Response Time**:
```promql
# Average response time in milliseconds (from the histogram's _sum and _count)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) * 1000
```
4. **95th Percentile Response Time**:
```promql
# 95th percentile response time
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
```
### Exercise 2: Infrastructure Monitoring
**Scenario**: Monitor server infrastructure health.
**Metrics Available**:
- `node_cpu_seconds_total{mode, cpu}`
- `node_memory_MemTotal_bytes`
- `node_memory_MemAvailable_bytes`
- `node_filesystem_size_bytes{device, mountpoint}`
- `node_filesystem_free_bytes{device, mountpoint}`
**Tasks**:
1. **CPU Usage Percentage**:
```promql
# Overall CPU usage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU usage by core
100 - (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)
```
2. **Memory Usage Percentage**:
```promql
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
3. **Disk Usage Percentage**:
```promql
# Disk usage percentage by mount point
(1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100
```
4. **Available Disk Space (GB)**:
```promql
# Available disk space in GB
node_filesystem_free_bytes / 1024 / 1024 / 1024
```
### Exercise 3: Database Performance Monitoring
**Scenario**: Monitor PostgreSQL database performance.
**Metrics Available**:
- `pg_stat_database_tup_returned`
- `pg_stat_database_tup_fetched`
- `pg_stat_database_blks_hit`
- `pg_stat_database_blks_read`
- `pg_locks_count{mode, type}`
- `pg_stat_activity_count{state}`
**Tasks**:
1. **Query Rate**:
```promql
# Rows returned per second (a rough proxy for query throughput)
sum(rate(pg_stat_database_tup_returned[5m]))
```
2. **Cache Hit Ratio**:
```promql
# Buffer cache hit ratio
(sum(rate(pg_stat_database_blks_hit[5m])) / (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))) * 100
```
3. **Active Connections**:
```promql
# Active database connections
pg_stat_activity_count{state="active"}
```
4. **Lock Contention**:
```promql
# Exclusive-style locks held (a signal of potential contention)
sum(pg_locks_count{mode=~"ExclusiveLock|ShareUpdateExclusiveLock"})
```
### Exercise 4: Application Performance Monitoring
**Scenario**: Monitor a microservices architecture.
**Metrics Available**:
- `service_request_duration_seconds{service, endpoint, method}`
- `service_requests_total{service, endpoint, method, status}`
- `service_cpu_usage{service}`
- `service_memory_usage_bytes{service}`
**Tasks**:
1. **Service Availability**:
```promql
# Service availability (percentage of successful 2xx requests)
sum by(service) (rate(service_requests_total{status=~"2.."}[5m])) /
sum by(service) (rate(service_requests_total[5m])) * 100
```
2. **Service Response Time SLA**:
```promql
# Percentage of requests served under 500ms
(
sum by(service) (rate(service_request_duration_seconds_bucket{le="0.5"}[5m])) /
sum by(service) (rate(service_request_duration_seconds_count[5m]))
) * 100
```
3. **Top 5 Slowest Services**:
```promql
# Top 5 services by average response time (from the histogram's _sum and _count)
topk(5, sum by(service) (rate(service_request_duration_seconds_sum[5m])) / sum by(service) (rate(service_request_duration_seconds_count[5m])))
```
4. **Resource Usage Correlation**:
```promql
# Services with high CPU and memory usage
(service_cpu_usage > 80) and (service_memory_usage_bytes > 1000000000)
```
### Exercise 5: Alerting Scenarios
**Real-world alerting queries**:
1. **High Error Rate Alert**:
```promql
# Alert when error rate exceeds 5% for 5 minutes
(
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) * 100 > 5
```
2. **Service Down Alert**:
```promql
# Alert when service is down
up == 0
```
3. **High Response Time Alert**:
```promql
# Alert when 95th percentile response time exceeds 2 seconds
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
```
4. **Resource Exhaustion Alert**:
```promql
# Alert when memory usage exceeds 90%
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
```
## Best Practices and Guidelines
### 1. Query Performance
**Use Appropriate Time Ranges**:
```promql
# Good: Reasonable time range for rate calculation
rate(http_requests_total[5m])
# Avoid: Too short time range (noisy)
rate(http_requests_total[30s])
# Avoid: Too long time range (delayed detection)
rate(http_requests_total[1h])
```
**Optimize Label Selection**:
```promql
# Good: Specific label selection
http_requests_total{service="api", method="GET"}
# Avoid: Too broad selection
http_requests_total
```
### 2. Aggregation Best Practices
**Use Appropriate Aggregation Functions**:
```promql
# Good: Sum for counters
sum(rate(http_requests_total[5m]))
# Good: Average for gauges
avg(cpu_usage_percent)
# Good: Max for capacity metrics
max(memory_usage_percent)
```
**Group By Relevant Labels**:
```promql
# Good: Group by meaningful dimensions
sum by(service, environment) (rate(http_requests_total[5m]))
# Avoid: Too many grouping labels (high cardinality)
sum by(service, environment, instance, method, handler) (rate(http_requests_total[5m]))
```
### 3. Counter vs Gauge Handling
**For Counters (always increasing)**:
```promql
# Use rate() for per-second rates
rate(http_requests_total[5m])
# Use increase() for total increase
increase(http_requests_total[1h])
```
**For Gauges (can go up and down)**:
```promql
# Use directly or with aggregation
avg(cpu_usage_percent)
# Use with time range functions
max_over_time(cpu_usage_percent[1h])
```
### 4. Histogram and Summary Metrics
**For Histograms**:
```promql
# Calculate quantiles
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
# Calculate average
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```
**For Summaries**:
```promql
# Use pre-calculated quantiles
http_request_duration_seconds{quantile="0.95"}
# Calculate average
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```
### 5. Label Management
**Avoid High Cardinality Labels**:
```promql
# Good: Limited cardinality
http_requests_total{method="GET", status="200"}
# Avoid: High cardinality (user IDs, request IDs)
http_requests_total{user_id="12345", request_id="abcdef"}
```
**Use Consistent Label Names**:
```promql
# Good: Consistent naming
service_cpu_usage{service="api"}
service_memory_usage{service="api"}
# Avoid: Inconsistent naming
api_cpu{name="api"}
memory_usage{service_name="api"}
```
### 6. Recording Rules
Create recording rules for complex, frequently used queries:
```yaml
# recording_rules.yml
groups:
  - name: http_requests
    rules:
      - record: job_instance:http_requests:rate5m
        expr: sum by(job, instance) (rate(http_requests_total[5m]))
      - record: job:http_requests:rate5m
        expr: sum by(job) (job_instance:http_requests:rate5m)
```
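Once evaluated, recorded series can be queried and alerted on like any other metric; a minimal sketch (the `job="api"` label and the threshold of 100 are illustrative assumptions):
```promql
# Use the pre-computed per-job rate in a dashboard or alert expression
job:http_requests:rate5m{job="api"} > 100
```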
### 7. Alerting Rules Best Practices
```yaml
# alerting_rules.yml
groups:
  - name: alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% across all instances"
```
## Tag-Based Exercise Cases
### Exercise 6: Multi-Environment Monitoring
**Scenario**: Monitor applications across different environments (dev, staging, prod).
**Tagged Metrics Available** (the pipe-separated values below list each label's possible values; they are not literal selectors):
```promql
# HTTP requests with environment tags
http_requests_total{method="GET|POST", status="200|404|500", environment="dev|staging|prod", service="api|web|auth"}
# CPU usage with environment and region tags
cpu_usage_percent{environment="dev|staging|prod", region="us-east|us-west|eu-central", instance="host-1|host-2|host-3"}
# Request duration with environment tags
http_request_duration_seconds{environment="dev|staging|prod", service="api|web|auth"}
# Database connections with multiple tags
db_connections_active{environment="dev|staging|prod", database="users|orders|inventory", pool="primary|readonly"}
```
**Tasks**:
1. **Environment-Specific Request Rates**:
```promql
# Total requests per second by environment
sum by(environment) (rate(http_requests_total[5m]))
# Requests per second for production only
sum(rate(http_requests_total{environment="prod"}[5m]))
# Compare request rates between environments
sum by(environment) (rate(http_requests_total[5m]))
```
2. **Service Performance Across Environments**:
```promql
# Average response time by service and environment
avg by(service, environment) (http_request_duration_seconds)
# Production services with response time > 1 second
avg by(service) (http_request_duration_seconds{environment="prod"}) > 1
# Development vs Production performance comparison
avg by(service) (http_request_duration_seconds{environment="prod"}) /
avg by(service) (http_request_duration_seconds{environment="dev"})
```
3. **Cross-Environment Error Analysis**:
```promql
# Error rate by environment and service
(
sum by(environment, service) (rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum by(environment, service) (rate(http_requests_total[5m]))
) * 100
# Services with higher error rates in production than staging
(
sum by(service) (rate(http_requests_total{status=~"5..", environment="prod"}[5m])) /
sum by(service) (rate(http_requests_total{environment="prod"}[5m]))
) > (
sum by(service) (rate(http_requests_total{status=~"5..", environment="staging"}[5m])) /
sum by(service) (rate(http_requests_total{environment="staging"}[5m]))
)
```
### Exercise 7: Multi-Region Infrastructure Monitoring
**Scenario**: Monitor infrastructure across multiple regions and availability zones.
**Tagged Metrics Available**:
```promql
# Node metrics with region and AZ tags
node_memory_MemAvailable_bytes{region="us-east-1|us-west-2|eu-central-1", az="a|b|c", instance_type="t3.micro|t3.small|t3.medium"}
node_cpu_seconds_total{region="us-east-1|us-west-2|eu-central-1", az="a|b|c", mode="idle|user|system", instance_type="t3.micro|t3.small|t3.medium"}
node_load1{region="us-east-1|us-west-2|eu-central-1", az="a|b|c", instance_type="t3.micro|t3.small|t3.medium"}
```
**Tasks**:
1. **Regional Resource Usage**:
```promql
# Average CPU usage by region
100 - (avg by(region) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage by region and instance type
avg by(region, instance_type) (
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
)
# Top 3 regions by CPU usage
topk(3, 100 - (avg by(region) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
```
2. **Availability Zone Analysis**:
```promql
# Load balancing across AZs (should be roughly equal)
avg by(region, az) (node_load1)
# Identify overloaded AZs (load > 2.0)
avg by(region, az) (node_load1) > 2.0
# AZ with highest memory pressure per region
max by(region) (
avg by(region, az) (
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
)
)
```
3. **Instance Type Performance**:
```promql
# CPU usage per core by instance type (lower is better; assumes a node_cpu_cores gauge)
(100 - (avg by(instance_type) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) /
avg by(instance_type) (node_cpu_cores)
# Memory utilization by instance type
avg by(instance_type) (
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
)
```
### Exercise 8: Multi-Tenant SaaS Monitoring
**Scenario**: Monitor a multi-tenant SaaS application with customer isolation.
**Tagged Metrics Available**:
```promql
# Application metrics with tenant tags
app_requests_total{tenant_id="tenant-1|tenant-2|tenant-3", plan="free|pro|enterprise", method="GET|POST|PUT|DELETE"}
app_response_time_seconds{tenant_id="tenant-1|tenant-2|tenant-3", plan="free|pro|enterprise", endpoint="/api/v1/users|/api/v1/orders"}
app_db_queries_total{tenant_id="tenant-1|tenant-2|tenant-3", plan="free|pro|enterprise", query_type="select|insert|update|delete"}
```
**Tasks**:
1. **Tenant Usage Analysis**:
```promql
# Requests per second by tenant
sum by(tenant_id) (rate(app_requests_total[5m]))
# Top 10 most active tenants
topk(10, sum by(tenant_id) (rate(app_requests_total[5m])))
# Usage by subscription plan
sum by(plan) (rate(app_requests_total[5m]))
```
2. **Performance by Tenant Tier**:
```promql
# Average response time by plan
avg by(plan) (app_response_time_seconds)
# Enterprise customers with response time > 500ms
avg by(tenant_id) (app_response_time_seconds{plan="enterprise"}) > 0.5
# Performance comparison between plans
avg by(plan) (app_response_time_seconds) / scalar(avg(app_response_time_seconds{plan="free"}))
```
3. **Resource Usage Fairness**:
```promql
# Database query rate by tenant (identify heavy users)
sum by(tenant_id) (rate(app_db_queries_total[5m]))
# 95th percentile of per-tenant query rate (a threshold for spotting heavy users)
quantile(0.95, sum by(tenant_id) (rate(app_db_queries_total[5m])))
# Free tier users with enterprise-level usage
sum by(tenant_id) (rate(app_requests_total{plan="free"}[5m])) >
scalar(avg(sum by(tenant_id) (rate(app_requests_total{plan="enterprise"}[5m]))))
```
### Exercise 9: Kubernetes Pod and Container Monitoring
**Scenario**: Monitor Kubernetes workloads with rich metadata tags.
**Tagged Metrics Available**:
```promql
# Container metrics with K8s tags
container_cpu_usage_seconds_total{namespace="kube-system|default|monitoring", pod="pod-1|pod-2", container="app|sidecar|init", node="worker-1|worker-2"}
container_memory_usage_bytes{namespace="kube-system|default|monitoring", pod="pod-1|pod-2", container="app|sidecar|init", node="worker-1|worker-2"}
kube_pod_info{namespace="default|monitoring", pod="pod-1|pod-2", node="worker-1|worker-2", created_by_kind="Deployment|StatefulSet|DaemonSet"}
```
**Tasks**:
1. **Namespace Resource Usage**:
```promql
# CPU usage by namespace
sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD"}[5m]))
# Memory usage by namespace (in GB)
sum by(namespace) (container_memory_usage_bytes{container!="POD"}) / 1024 / 1024 / 1024
# Top 3 resource-consuming namespaces
topk(3, sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD"}[5m])))
```
2. **Pod Performance Analysis**:
```promql
# Pods using more than 0.8 CPU cores
sum by(namespace, pod) (rate(container_cpu_usage_seconds_total{container!="POD"}[5m])) > 0.8
# Memory usage by pod and container
sum by(namespace, pod, container) (container_memory_usage_bytes{container!="POD"})
# Pods with memory usage > 1GB
sum by(namespace, pod) (container_memory_usage_bytes{container!="POD"}) / 1024 / 1024 / 1024 > 1
```
3. **Node Distribution Analysis**:
```promql
# Pod count by node
count by(node) (kube_pod_info)
# CPU usage by node
sum by(node) (rate(container_cpu_usage_seconds_total{container!="POD"}[5m]))
# Spread in pod count between the busiest and least busy node
max(count by(node) (kube_pod_info)) -
min(count by(node) (kube_pod_info))
```
### Exercise 10: API Gateway and Load Balancer Monitoring
**Scenario**: Monitor API gateway with upstream service tags.
**Tagged Metrics Available**:
```promql
# API Gateway metrics
gateway_requests_total{method="GET|POST|PUT|DELETE", status="200|400|401|403|404|500|502|503", upstream="service-a|service-b|service-c", path="/api/v1/users|/api/v1/orders|/api/v1/products"}
gateway_request_duration_seconds{method="GET|POST", upstream="service-a|service-b|service-c", path="/api/v1/users|/api/v1/orders"}
gateway_upstream_response_time_seconds{upstream="service-a|service-b|service-c"}
```
**Tasks**:
1. **Upstream Service Health**:
```promql
# Request rate by upstream service
sum by(upstream) (rate(gateway_requests_total[5m]))
# Error rate by upstream service
(
sum by(upstream) (rate(gateway_requests_total{status=~"5.."}[5m])) /
sum by(upstream) (rate(gateway_requests_total[5m]))
) * 100
# Upstream services with error rate > 1%
(
sum by(upstream) (rate(gateway_requests_total{status=~"5.."}[5m])) /
sum by(upstream) (rate(gateway_requests_total[5m]))
) * 100 > 1
```
2. **API Endpoint Analysis**:
```promql
# Most popular endpoints
topk(10, sum by(path) (rate(gateway_requests_total[5m])))
# Slowest endpoints by average response time
topk(5, avg by(path) (gateway_request_duration_seconds))
# Endpoints with high 4xx error rates
(
sum by(path) (rate(gateway_requests_total{status=~"4.."}[5m])) /
sum by(path) (rate(gateway_requests_total[5m]))
) * 100 > 5
```
3. **Load Balancing Effectiveness**:
```promql
# Request distribution across upstream services
sum by(upstream) (rate(gateway_requests_total[5m])) /
scalar(sum(rate(gateway_requests_total[5m]))) * 100
# Response time variation between upstream services
stddev(avg by(upstream) (gateway_upstream_response_time_seconds))
# Identify slow upstreams
avg by(upstream) (gateway_upstream_response_time_seconds) >
scalar(avg(gateway_upstream_response_time_seconds)) * 1.5
```
### Exercise 11: CDN and Edge Location Monitoring
**Scenario**: Monitor CDN performance across different edge locations and content types.
**Tagged Metrics Available**:
```promql
# CDN metrics with geographic and content tags
cdn_requests_total{edge_location="us-east|us-west|eu-west|ap-south", content_type="image|video|api|static", cache_status="hit|miss|stale"}
cdn_response_time_seconds{edge_location="us-east|us-west|eu-west|ap-south", content_type="image|video|api|static"}
cdn_bandwidth_bytes{edge_location="us-east|us-west|eu-west|ap-south", content_type="image|video|api|static", direction="in|out"}
```
**Tasks**:
1. **Cache Performance by Location**:
```promql
# Cache hit rate by edge location
(
sum by(edge_location) (rate(cdn_requests_total{cache_status="hit"}[5m])) /
sum by(edge_location) (rate(cdn_requests_total[5m]))
) * 100
# Locations with low cache hit rates (< 80%)
(
sum by(edge_location) (rate(cdn_requests_total{cache_status="hit"}[5m])) /
sum by(edge_location) (rate(cdn_requests_total[5m]))
) * 100 < 80
```
2. **Content Type Performance**:
```promql
# Response time by content type and location
avg by(content_type, edge_location) (cdn_response_time_seconds)
# Bandwidth usage by content type (GB per hour)
sum by(content_type) (rate(cdn_bandwidth_bytes{direction="out"}[1h])) * 3600 / 1024 / 1024 / 1024
# Cache efficiency by content type
sum by(content_type) (rate(cdn_requests_total{cache_status="hit"}[5m])) /
sum by(content_type) (rate(cdn_requests_total[5m])) * 100
```
3. **Geographic Performance Analysis**:
```promql
# Average response time by region
avg by(edge_location) (cdn_response_time_seconds)
# Request distribution across edge locations
sum by(edge_location) (rate(cdn_requests_total[5m])) /
scalar(sum(rate(cdn_requests_total[5m]))) * 100
# Identify underperforming edge locations
avg by(edge_location) (cdn_response_time_seconds) >
scalar(avg(cdn_response_time_seconds)) * 1.2
```
### Exercise 12: Advanced Tag-Based Alerting
**Scenario**: Create sophisticated alerts using tag-based logic.
**Tasks**:
1. **Environment-Aware Alerting**:
```promql
# Alert only for production services with high error rates
(
sum by(service) (rate(http_requests_total{status=~"5..", environment="prod"}[5m])) /
sum by(service) (rate(http_requests_total{environment="prod"}[5m]))
) * 100 > 1
# Alert for any non-production environment with 100% error rate
(
sum by(service, environment) (rate(http_requests_total{status=~"5..", environment!="prod"}[5m])) /
sum by(service, environment) (rate(http_requests_total{environment!="prod"}[5m]))
) * 100 >= 100
```
2. **Multi-Dimensional Resource Alerts**:
```promql
# Alert for high CPU usage, but exclude known batch processing nodes
avg by(instance) (100 - (irate(node_cpu_seconds_total{mode="idle", instance!~"batch-.*"}[5m]) * 100)) > 85
# Alert for memory pressure, considering instance type
(
avg by(instance, instance_type) (
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
) > 90 and on(instance_type) (
avg by(instance_type) (node_memory_MemTotal_bytes) < 8 * 1024 * 1024 * 1024
)
) or (
avg by(instance, instance_type) (
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
) > 95
)
```
3. **Tag-Based Alert Routing**:
```promql
# Critical alerts for enterprise customers only
avg by(tenant_id, plan) (app_response_time_seconds{plan="enterprise"}) > 1
# Alerts that should page during business hours only
(
avg(app_error_rate) > 5 and on() (
hour() >= 9 and hour() < 17 and
day_of_week() > 0 and day_of_week() < 6
)
)
# Region-specific alerts with different thresholds
(
avg by(region) (cpu_usage_percent{region="us-east-1"}) > 80
) or (
avg by(region) (cpu_usage_percent{region="eu-central-1"}) > 70
)
```
## Advanced Patterns
### 1. Absent and Missing Data Handling
```promql
# Check if metric is absent
absent(up{job="prometheus"})
# Check if no instances are up
absent(up == 1)
# Provide default value for missing metrics
up or on() vector(0)
```
### 2. Subquery Pattern
```promql
# Change in the request rate over time (acceleration) - use deriv(), since the inner rate() result is a gauge
deriv(rate(http_requests_total[5m])[1h:])
# Maximum rate over the last hour
max_over_time(rate(http_requests_total[5m])[1h:])
```
### 3. Complex Vector Matching
```promql
# Join instance metadata with metrics
rate(http_requests_total[5m]) * on(instance) group_left(datacenter, rack) instance_metadata
# Calculate per-service error rate with instance information
(
sum by(service) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by(service) (rate(http_requests_total[5m]))
) * on(service) group_left(team) service_metadata
```
### 4. Time-based Calculations
```promql
# Business hours filter (9 AM to 5 PM UTC)
http_requests_total and on() (hour() >= 9 and hour() < 17)
# Weekend detection
http_requests_total and on() (day_of_week() == 0 or day_of_week() == 6)
# Month-over-month comparison
increase(http_requests_total[30d]) / increase(http_requests_total[30d] offset 30d)
```
### 5. Prediction and Forecasting
```promql
# Linear prediction - predict disk usage in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4*3600)
# deriv() - per-second derivative (rate of change) of a gauge
deriv(node_filesystem_free_bytes[1h])
```
### 6. Custom SLI/SLO Calculations
```promql
# Availability SLI (99.9% target)
(
sum(rate(http_requests_total{status!~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))
) * 100
# Error budget remaining (assume 99.9% SLO)
(
(sum(rate(http_requests_total[30d])) * 0.001) -
sum(rate(http_requests_total{status=~"5.."}[30d]))
) / (sum(rate(http_requests_total[30d])) * 0.001) * 100
```
## Common Pitfalls to Avoid
1. **Using rate() on gauges**: `rate()` should only be used on counters
2. **Inappropriate time ranges**: Too short ranges cause noise, too long cause delays
3. **High cardinality**: Avoid labels with many unique values
4. **Missing rate() on counters**: Always use `rate()` or `increase()` with counters
5. **Ignoring stale data**: Consider using `absent()` or `or vector(0)` patterns
6. **Complex queries without recording rules**: Pre-calculate complex queries
7. **Not considering timezone**: Be aware of UTC vs local time in time-based queries
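The sketch below illustrates pitfalls 1 and 4 side by side, using the hypothetical `cpu_usage_percent` gauge and the `http_requests_total` counter from earlier sections:
```promql
# Pitfall 1: rate() on a gauge is meaningless
# rate(cpu_usage_percent[5m])            # avoid
deriv(cpu_usage_percent[5m])             # per-second trend of a gauge
avg_over_time(cpu_usage_percent[5m])     # or smooth it over a time window
# Pitfall 4: querying a raw counter instead of its rate
# http_requests_total                    # avoid: monotonically increasing
rate(http_requests_total[5m])            # per-second rate of increase
```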