# HashiCorp Vault at Numberly
- Part 1: SSL certificates -> hosting at the scale of a full-blown hoster
- Part 2: PKI, VPN, Kafka, Kubernetes -> batteries included
- Part 3: secrets / databases -> software engineering++
## Part 1: A tale of managing thousands of TLS certificates at numberly
----
### Context
Numberly creates and hosts thousands of websites and web-facing APIs developed by more than a hundred developers.
Representing more than 4000 Gitlab projects, these interfaces and endpoints become mission-critical parts of our clients' businesses once in production.
It's been a long-time necessity for us to protect access to those resources, starting with on-the-wire encryption and HTTPS.
Over the years, we faced many of the scaling challenges you need to solve to overcome both the friction that SSL certificate management represents for production delivery and the automation of certificate life cycle and maintenance. In many ways, our challenges are similar to the ones that full-scale hosters such as OVH have been facing.
There's a big gap between managing a dozen SSL certificates and managing thousands: one can't just optimize the time humans spend on creating/installing/monitoring/renewing them, at some point you need to just make those operations transparent to your developers and infrastructure to scale.
We wrote this series of articles to share our 20-year journey and experience in managing secrets, starting from SSL certificate management at scale and moving on to generalized secret management for teams and applications.
### History
Numberly has been around since 2000; we went through the Internet bubble and had the chance to iterate a lot on our hosting strategy.
Some things never changed though: our autonomy and technical independence.
- **2011**: Numberly gets its own [Autonomous System Number](https://en.wikipedia.org/wiki/Autonomous_system_(Internet)) and migrates email routing and web hosting to our own IPs
- **2012**: We buy a bunch of F5s to handle the traffic load for the launch of our RTB tracking service (30k RPS)
- **2016**: We start migrating some workloads to Kubernetes and slowly move out of F5 for our internal usages
- **2017**: Numberly becomes a [Local Internet Registry](https://fr.wikipedia.org/wiki/Local_Internet_registry)
- **2018**: We host most of our web facing and data pipelining workloads on Kubernetes
- **2020**: We handle SSL certificates at scale, used in thousands of websites and APIs
- **2021**: Our internal and external network rework allowed us to scale our Kubernetes hosting to multiple datacenters seamlessly
This last challenge made us rethink our whole SSL strategy to cover both publicly and internally accessible services (not only websites).
Before this project, the stack of our web hosting platforms was composed of:
- **Load-balancers**: a bunch of F5 blackboxes
- **SSL certificates**: handled by Digicert, with imperfect automation for issuing and renewing certificates, not to mention its substantial cost
- **Automation**: an infamous Google Spreadsheet shared between Project Managers so they could tell us, through Gitlab issues, which websites required an SSL certificate renewal
From that standpoint, any design overhaul would give us better results!
#### Our needs
- Not using F5 blackboxes anymore, because of their lack of automation, license costs and poor observability
- Not using human-backed procedures to generate SSL certificates
- Being able to generate SSL certificates at scale
- Having all of these certificates monitored by default, with automated alerting
- Having these certificates securely stored
- Doing all that without additional costs
#### What we decided to do
- Replace the F5s with commodity hardware servers running [NGINX](https://nginx.org/) to leverage our new BGP anycast network topology spanning both our datacenters
- Use **[Let's Encrypt](https://letsencrypt.org/)** for certificate generation
- Leverage our existing **[Prometheus](https://github.com/prometheus/prometheus)** and **[AlertManager](https://github.com/prometheus/alertmanager)** stack for monitoring and alerting respectively
- Use Vault as a secure and highly-available storage
Automating this work required us to create the following pipeline:
```
create certificate -> store certificate -> serve certificate without application reload
                            |
                            |-> set up certificate monitoring
```
### Securely storing our certificates using Hashicorp Vault
Here comes HashiCorp Vault. SSL certificate storage was the entry point of this technology at Numberly; we'll cover the other use cases in later blog posts (stay tuned).
Vault solved the problem of storing and exposing sensitive data such as SSL certificates in a secure and highly available manner.
**Access & Audit**: only members of the infrastructure team have access to this KV mountpoint, and everything is logged with [Vault audit logs](https://www.vaultproject.io/docs/audit).
**Hosting & Networking**: our Vault nodes are hosted in our two datacenters, with AWS acting as a third one.
They all announce the same anycast service IP address on our internal network, so any client is routed to the nearest Vault server.
And if that Vault server isn't the leader, it transparently forwards the request so it gets processed anyway.
#### Vault storage format for SSL certificates
```
$ vault secrets enable -version=2 -path=kv-certificates kv
```
We needed all our SSL certificate secrets to follow the same key schema. It looks like this:

* **cert**: to store the SSL certificate in PEM format
* **chain**: the Let's Encrypt chain
* **fullchain**: the concatenation of the `cert` and `chain` keys
* **key**: the SSL certificate key
* **owner**: some information about the owner in case it's a customer's certificate
* **timestamp**: timestamp for the certificate creation
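As an illustration, the data written under each certificate's path can be assembled like this (a minimal Python sketch; `build_cert_secret` is a hypothetical helper, not part of our actual tooling):

```python
import time

def build_cert_secret(cert, chain, key, owner=""):
    """Assemble the KV v2 data for one certificate, following our key schema.

    cert, chain and key are PEM strings; fullchain is derived by
    concatenating the leaf certificate and the chain (leaf first,
    the order web servers expect).
    """
    return {
        "cert": cert,
        "chain": chain,
        "fullchain": cert + chain,
        "key": key,
        "owner": owner,
        "timestamp": int(time.time()),
    }
```

A hook can then write this dictionary as the secret data of `kv-certificates/<domain>` through Vault's KV v2 API.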
#### Automating Policy and AppRole deployment with [terraform](https://github.com/hashicorp/terraform)
Because SSL certificates are very sensitive, we leverage the AppRole feature of Vault.
That way, our applications never have the same Vault token and can be made aware of their token expiration so they can renew it.
We've automated that part using terraform:
```hcl=
resource "vault_auth_backend" "approle" {
  type = "approle"

  tune {
    default_lease_ttl = "60s"
  }
}

data "vault_policy_document" "loadbalancer" {
  rule {
    description  = "Used by nginx load-balancers to read SSL certificates"
    path         = "kv-certificates/data/*"
    capabilities = ["read"]
  }
}

resource "vault_policy" "loadbalancer" {
  name   = "loadbalancer"
  policy = data.vault_policy_document.loadbalancer.hcl
}

resource "vault_approle_auth_backend_role" "loadbalancer" {
  backend        = vault_auth_backend.approle.path
  role_name      = "loadbalancer"
  token_policies = [vault_policy.loadbalancer.name]
  token_ttl      = 600
}
```
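On the client side, an application trades its AppRole credentials for a short-lived token against Vault's documented AppRole login endpoint, then renews the token before its TTL elapses. A hedged sketch of that exchange (the helper names and the 80% renewal margin are illustrative choices, not our production code):

```python
import json

def approle_login_request(role_id, secret_id):
    """Build the request for Vault's AppRole login endpoint."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id})
    return "POST", "/v1/auth/approle/login", body

def parse_login_response(raw):
    """Extract the token and its TTL from the login response so the
    application can schedule a renewal (here at 80% of the TTL)."""
    auth = json.loads(raw)["auth"]
    return auth["client_token"], auth["lease_duration"], auth["lease_duration"] * 0.8
```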
### Automation pipeline
At Numberly we run thousands of jobs a day thanks to **Gitlab CI**.
Our Gitlab runners run in our on-premises Kubernetes clusters and sometimes we use external runners to absorb peaks.
It was only logical to use our existing CD platforms for that automation job.
Let's Encrypt implements the ACME protocol, so we had to find a hackable ACME client to handle the integrations we wanted. More than 50 ACME clients are referenced on the [Let's Encrypt website](https://letsencrypt.org/fr/docs/client-options/).
We're huge fans of the [KISS principle](https://en.wikipedia.org/wiki/KISS_principle) and some of us had previous experience with a client written in bash: **[dehydrated](https://github.com/dehydrated-io/dehydrated)**.
The dehydrated project implements a hook design that lets you write custom behaviors in simple bash, which came in handy for our next goals.
Using dehydrated, our remaining challenges were:
- Find a hook for handling DNS challenges with AWS Route53: existing hooks are available on Github, such as [dehydrated-route53-hook-script](https://github.com/whereisaaron/dehydrated-route53-hook-script).
- Find a hook for pushing our certificates to HashiCorp Vault: we forked an existing project to fix some issues with the [KV v2 store](https://www.vaultproject.io/docs/secrets/kv/kv-v2) and came up with [dehydrated-vault-hook](https://github.com/sebPomme/dehydrated-vault-hook).
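dehydrated invokes its hook as `hook.sh <operation> <args...>`; for the `deploy_cert` operation, the arguments are the domain followed by the generated key, cert, fullchain and chain files plus a timestamp. A sketch of the dispatch such a hook performs (in Python for illustration; `push_to_vault` is a hypothetical action name):

```python
def handle_hook(operation, *args):
    """Dispatch one dehydrated hook invocation (sketch).

    For deploy_cert, dehydrated passes:
    DOMAIN KEYFILE CERTFILE FULLCHAINFILE CHAINFILE TIMESTAMP
    """
    if operation == "deploy_cert":
        domain, keyfile, certfile, fullchainfile, chainfile, timestamp = args
        # a real hook would read the PEM files here and push them to Vault
        return ("push_to_vault", domain, certfile, keyfile)
    return None  # ignore other operations (clean_challenge, exit_hook, ...)
```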
After implementing all of this, our `.gitlab-ci.yml` looked like this:
```yaml=
image: registry/docker-images/alpine:latest

stages:
  - test
  - trigger

before_script:
  - apk --update-cache add curl

lint:
  # Some linting to make sure we didn't declare wrong domains
  stage: test
  script:
    - apk add bash grep
    - ./check.sh

main:
  stage: trigger
  script:
    # Generating DNS challenges with the AWS Route53 hook
    - dehydrated --config /etc/dehydrated/config --cron --hook /var/lib/dehydrated/dehydrated-route53-hook-script/hook.sh --keep-going
    # Pushing generated certificates with our HashiCorp Vault hook
    - dehydrated --config /etc/dehydrated/config --cron --hook /var/lib/dehydrated/dehydrated-vault-hook/vault-hook.sh --keep-going
  only:
    - master
```
### Using SSL certificates seamlessly
One of the evolutions of the infrastructure was to decommission the F5 load-balancers in favor of NGINX, more precisely [OpenResty](https://github.com/openresty/openresty), an enhanced version of NGINX with LuaJIT support.
We use catch-all NGINX server names and the [`ssl_certificate_by_lua_block`](https://github.com/openresty/lua-nginx-module#ssl_certificate_by_lua_block) directive to automatically fetch the SSL certificate of a website / API.
The only known limitation is that clients must support SNI, so that we get the server name during the TLS handshake.
Our Lua code reads the server name out of the SNI extension and queries Vault through its HTTP API.
We leverage an AppRole with a low TTL (600s). Its token is saved in a `lua_shared_dict` **shm storage** that every [OpenResty](https://github.com/openresty/openresty) worker can use to query Vault.
- **First try**: the exact `server_name` certificate, e.g. foo.acme.com
- **Second try**: the upper wildcard, e.g. *.acme.com
- **Third try**: the lower wildcard, e.g. *.foo.acme.com
*Assuming the server_name is a SAN of the lower wildcard.*
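The lookup order above boils down to deriving three candidate names from the SNI server name; a minimal Python transcription of that logic (our Lua code does the real work):

```python
def candidate_names(server_name):
    """Return the Vault lookup order: exact name, upper wildcard, lower wildcard."""
    labels = server_name.split(".")
    names = [server_name]                          # first try: foo.acme.com
    if len(labels) > 1:
        names.append("*." + ".".join(labels[1:]))  # second try: *.acme.com
    names.append("*." + server_name)               # third try: *.foo.acme.com
    return names
```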
We use three different caches, each implemented as an LRU cache:
* **certs_cache**: working cache with a `cache_expire_time` parameter
* **fallback_certs_cache**: a cache without expiration that covers the case of a domain expiring from `certs_cache` while our Vault cluster is down
* **unknown_certs_cache**: a cache for domains that don't have SSL certificates in Vault (meaning it reached the third try)
This third cache is really important as it prevents us from flooding our Vault cluster with queries to check if a certificate exists for **every incoming request**.
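The interplay of the three caches can be sketched as follows (a simplified Python model of the Lua logic: plain dicts instead of real LRU caches, and `fetch` stands in for the Vault HTTP query):

```python
import time

class CertCache:
    """Sketch of the three-tier certificate cache in front of Vault."""

    def __init__(self, fetch, cache_expire_time=3600):
        self.fetch = fetch                # callable: name -> PEM or None (Vault query)
        self.expire = cache_expire_time
        self.certs_cache = {}             # name -> (cert, stored_at), expiring
        self.fallback_certs_cache = {}    # name -> cert, never expires
        self.unknown_certs_cache = set()  # names known to have no cert in Vault

    def get(self, name, now=None):
        now = time.time() if now is None else now
        if name in self.unknown_certs_cache:
            return None                   # known miss: don't query Vault again
        hit = self.certs_cache.get(name)
        if hit and now - hit[1] < self.expire:
            return hit[0]
        try:
            cert = self.fetch(name)
        except Exception:
            # Vault unreachable: serve the last known cert, if any
            return self.fallback_certs_cache.get(name)
        if cert is None:
            self.unknown_certs_cache.add(name)
            return None
        self.certs_cache[name] = (cert, now)
        self.fallback_certs_cache[name] = cert
        return cert
```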
### Kubernetes integration
To seamlessly allow our developers to use certificates stored in our Vault cluster, we've used the terribly efficient **[vault-secrets-operator](https://github.com/ricoberger/vault-secrets-operator)** project.
It allows our developers to create [Custom Resource Definition](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) objects that **vault-secrets-operator** will use to know which SSL certificates have to be synchronized with a [Kubernetes secret](https://kubernetes.io/fr/docs/concepts/configuration/secret/).
We leverage [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) to only allow specific users to use this technique as it could be abused to retrieve all SSL certificates.
Here's what a CRD looks like:
```yaml=
---
apiVersion: ricoberger.de/v1alpha1
kind: VaultSecret
metadata:
  name: vault-star.numberly.com
  namespace: team-xxx
spec:
  keys:
    - fullchain
    - key
  path: kv-certificates/*.numberly.com
  templates:
    tls.crt: '{% .Secrets.fullchain %}'
    tls.key: '{% .Secrets.key %}'
  type: kubernetes.io/tls
```
### Monitoring and Alerting on SSL certificates
Now that all our SSL certificates are stored in a secure and central place, we can automate their monitoring easily.
Using a Gitlab CI job, we generate a YAML file containing the URLs of all our SSL certificates. We make it available to Prometheus using [file_sd_configs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config) and an external [blackbox_exporter](https://github.com/prometheus/blackbox_exporter), to be scraped by one of our Prometheus clusters.
```yaml=
- job_name: blackbox-http-static
  file_sd_configs:
    - files:
        - /etc/prometheus/blackbox/static-http-targets/*.yml
  metrics_path: /probe
  params:
    module:
      - http_2xx
```
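For illustration, the targets file consumed by `file_sd_configs` is just a list of probe URLs; a hedged sketch of the generation step that the Gitlab CI job performs (`render_targets` is a hypothetical helper):

```python
def render_targets(domains):
    """Render a Prometheus file_sd targets file (YAML) pointing the
    blackbox_exporter probes at each HTTPS endpoint."""
    lines = ["- targets:"]
    lines += [f"    - https://{d}" for d in sorted(domains)]
    return "\n".join(lines) + "\n"
```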
Below is an example of one Prometheus alert:
```yaml=
- alert: SslCertExpiringShortly7days
  expr: last_over_time(probe_ssl_earliest_cert_expiry{job="blackbox-http-static"}[2h]) - time() < 86400 * 7
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
    grafana: <grafana url>
    documentation: <documentation url to know what to do>
```
### Conclusion
After two years of using this method, we can outline some main wins:
- We never missed an SSL certificate renewal!
- No human was harmed creating or renewing an SSL certificate by hand
- No member of the team spent time reloading web servers to add or renew an SSL certificate
- The time Project Managers and developers used to spend waiting for SSL certificates was turned into focus time spent making sure our customers were serviced promptly and efficiently
- We were always alerted about certificates that would fail to renew because of some issue (DNS change, Let's Encrypt API error, etc.)
- We did countless Vault upgrades without any downtime
We could not have done this piece of engineering work without several great Open Source projects, and especially without Let's Encrypt, which has been making the Internet more secure since late 2015.
We want to thank all the developers for the time and dedication they put into all the Open Source projects and initiatives we've used :heart:. As always, our own time spent on forks or new projects was contributed back or [Open Sourced on our Github account](https://github.com/numberly).