# Operator, help, please!
Not again; my node seems to be offline. I am unsure what happened, but my node stopped, and I don't know when. :thinking_face: :slightly_frowning_face:
Your node's day-to-day operation and maintenance can be a significant overhead. Node failure always happens at inconvenient times, like while you are on vacation. What can be done to help with these failures? How can I manage my node more efficiently? Wouldn't it be nice to have someone watching your node and making intelligent decisions 24/7? Yes, but that can be expensive. The next best option is to automate the actions required to keep your node healthy (self-healing). There are a lot of open-source tools in the wild that can accomplish this goal, but I will focus on the Kubernetes Operator pattern.
## Why Kubernetes
Kubernetes (k8s) has emerged as the de facto standard for container orchestration. It can be complicated, but it handles complex infrastructure to simplify application deployment. Kubernetes abstracts the underlying infrastructure, so the application does not need to implement components like error handling, scalability, security, and redundancy itself; these common facilities are provided by the Kubernetes ecosystem.

Kubernetes' control plane comprises several components, including an API server, scheduler, etcd, and controller manager. I will not go into details on each, but more information can be found [here](https://kubernetes.io/docs/concepts/overview/components/#control-plane-components).
1. The API server is the gateway for talking to all the components in the control plane.
2. etcd is a key-value store that keeps the state of the objects in the cluster.
3. The scheduler is a process that assigns pods to run on a node.
4. The Controller Manager (CM) encapsulates the core Kubernetes logic. The CM makes sure that all the pieces work correctly; if not, it takes action to bring the system to the desired state. The CM is where a lot of the action happens.
You can configure a resource in Kubernetes in two ways: imperatively or declaratively. Imperative commands interact with cluster objects directly, such as creating a pod. The declarative approach uses manifest files along with the ``kubectl apply`` command. In the latter, Kubernetes is given a desired state, and the controller manager ensures this state is achieved.
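For example, the imperative ``kubectl run nginx --image=nginx`` and applying the manifest below with ``kubectl apply -f pod.yaml`` both create the same pod, but only the declarative form leaves Kubernetes with a desired state to keep enforcing:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
```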
Kubernetes is great at automating your containers but lacks the knowledge to understand the details of individual applications. To rectify this, Kubernetes allows its API to be extended with custom resources. An Operator is one such extension of the Kubernetes API.
## Why Kubernetes Operator
While Kubernetes excels at managing stateless applications, stateful applications such as blockchains or databases require more complex, application-specific configuration, which is where operators come in. Stateless applications can be managed by the built-in controllers in the Kubernetes control plane because all stateless applications have similar recovery and scalability steps.
More application-specific knowledge is usually required for stateful applications for recovery or scaling. For example, with the Symbol blockchain, when the client shuts down incorrectly due to a node restart, it cannot simply be restarted on a different node; the recovery tool needs to run to fix inconsistent data before the client starts. Upgrades, too, are not as simple as restarting your Symbol node; they include these steps:
1. Upgrade ConfigMaps and Secrets for resources and certs.
2. Shut down the server, broker, and MongoDB.
3. Start MongoDB, the broker, and the server.
There are other scenarios, too, such as what happens if MongoDB fails while the broker is still running. How is this handled? These are all application-specific, and this is where the Operator comes in.
The operator pattern extends the capabilities of Kubernetes by encapsulating the application's domain-specific knowledge and management logic into an Operator. An Operator automates the operation and lifecycle management of a specific application. It is a software extension that uses the Kubernetes API to monitor and control the state of an application running on the cluster. It acts like a human operator who knows the desired state of an application and takes the necessary actions to ensure the application remains in that state.

Custom resources let you store and retrieve structured data using Kubernetes APIs. When paired with your own custom controller, they provide a declarative API. This API enforces a separation of responsibilities, just like the built-in Kubernetes APIs. I can declare the desired state of my resource, and my Kubernetes controller will keep the current state of my objects in sync with the declared desired state.

By adding application-specific knowledge and automating maintenance activities, an Operator frees the end-user from manually managing their application.
The Operator contains one or more controllers. Each controller watches a specific custom resource type, queries the API server, and takes application-specific actions to make the current state match the desired state.
A controller implements the controller pattern in Kubernetes, which is a control loop: observe the current state, compare it against the desired state, and act to reconcile any difference. Some controllers built into Kubernetes run in the control plane, while those that are part of an operator run on the worker nodes.
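The reconcile step of that loop can be sketched in a few lines of Python. This is a toy illustration (not real Kubernetes or kopf code): diff the desired state against the current state and emit the actions needed to converge.

```python
def reconcile(desired, current):
    """Diff desired vs. current state and return the actions that would reconcile them."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(('create', name))
        elif current[name] != spec:
            actions.append(('update', name))
    for name in current:
        if name not in desired:
            actions.append(('delete', name))
    return actions

# One iteration of the loop; a real controller watches the API server
# and re-runs this whenever either state changes.
actions = reconcile(
    desired={'node-a': {'image': 'v2'}},
    current={'node-a': {'image': 'v1'}, 'node-b': {'image': 'v1'}},
)
print(actions)  # [('update', 'node-a'), ('delete', 'node-b')]
```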
### Creating a Kubernetes Operator
#### Set up a Kubernetes cluster
There are several ways to set up a Kubernetes cluster.
1. Docker Desktop, which most already have, ships with a standalone k8s server (https://docs.docker.com/desktop/kubernetes/).
2. There are several standalone Kubernetes distributions, and any of them is fine.
   a. MicroK8s is one of my favorites if you use Ubuntu, since it installs with Snap - https://ubuntu.com/tutorials/install-a-local-kubernetes-with-microk8s#1-overview
   b. k0s - https://k0sproject.io/
   c. k3s - https://k3s.io/
I will be using MicroK8s in the examples below.
#### Development tools
There are several frameworks, the most popular of which is the [operator framework](https://operatorframework.io/) with Golang. Since there is currently no Golang SDK for Symbol, I will be using Kopf (Kubernetes Operator Pythonic Framework) to build our first basic Symbol operator.
Apart from that, we need the following tools to get started:
* [kopf](https://kopf.readthedocs.io/en/stable/) Kubernetes Operator Pythonic Framework
* [pykube-ng](https://pykube.readthedocs.io/en/latest/index.html) (a lightweight Python client library for Kubernetes, preferred when developing an Operator)
* [symbol-shoestring](https://pypi.org/project/symbol-shoestring/) to generate the node config
* [kr8s](https://docs.kr8s.org/en/stable/client.html) might be used for self-healing tasks.
#### Custom Resource Definition(CRD)
Before any Custom Resource(CR) can be created, a Custom Resource Definition(CRD) needs to be defined. The CRD defines the CR's schema.
For the Symbol node, I extracted some of the common fields used to customize a node. These fields are added to the CRD, which the Operator will use to create a Symbol Node. If other fields are needed for your deployment, add them to the CRD.
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: symbolnodes.example.com
spec:
  group: example.com
  names:
    kind: SymbolNode
    plural: symbolnodes
    singular: symbolnode
    shortNames:
      - sn
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string
                volumeSize:
                  type: string
                hostName:
                  type: string
                friendlyName:
                  type: string
                network:
                  type: string
                name:
                  type: string
              required:
                - image
                - volumeSize
                - network
                - friendlyName
                - hostName
```
Note: I left out the account key since, to handle this correctly, it needs to be in a Secret tied to an external KMS.
To deploy this to your Kubernetes environment - ``microk8s kubectl apply -f manifest/crd.yaml -n symbol``
You can verify that the new API is available:
```
microk8s kubectl api-resources | grep -i symbolnodes
symbolnodes   sn   example.com/v1   true   SymbolNode
```
For more information on CRDs, see the [Kubernetes site](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).
#### Create Operator code
Now that there is a new Kubernetes API, we need to write code that monitors the creation of these new objects and takes the correct action. In our case, creating a new ``SymbolNode`` object will create a new node.
##### ConfigMaps and Secret
The old method of storing application configuration is to put it on disk. Kubernetes offers several options instead.
* ConfigMap stores non-sensitive configuration data.
* Secret stores sensitive information like passwords, API keys, or TLS certificates.
The Operator will need logic to convert the Symbol configuration from files to ConfigMap and Secret in the following steps:
1. Shoestring is used to generate the configuration for the node.
2. Convert the ``keys/cert`` folder to a Kubernetes Secret.
3. Convert the ``resources`` folder to a ConfigMap (note: if this were a harvesting node, config-harvesting.properties would need to be a Secret).
4. Convert the ``seed`` folder to a ConfigMap.
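In miniature, steps 2-4 map each file name to the file's contents. The only wrinkle is that a Secret's ``data`` values must be base64-encoded, while a ConfigMap stores plain text. A simplified sketch (the helper names here are hypothetical, for illustration only):

```python
import base64

def files_to_configmap_data(files):
    """Hypothetical helper: {filename: text} -> ConfigMap 'data' field (plain text)."""
    return dict(files)

def files_to_secret_data(files):
    """Hypothetical helper: {filename: text} -> Secret 'data' field (base64-encoded values)."""
    return {name: base64.b64encode(text.encode()).decode() for name, text in files.items()}

print(files_to_secret_data({'ca.crt': 'dummy-cert'}))  # {'ca.crt': 'ZHVtbXktY2VydA=='}
```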
The Bash script used to generate the Symbol node config:
```bash
#!/usr/bin/env bash
set -ex
network=$1
friendlyName=$2
host=$3
openssl genpkey -algorithm ed25519 -out ca.key.pem
mkdir -p shoestring
# build catapult configuration
# hard code to the loopback address will overwrite later
cat > shoestring/override.ini << EOL
[node.localnode]
host = 127.0.0.1
friendlyName = ${friendlyName}
EOL
python3 -m shoestring init shoestring/shoestring.ini --package "${network}"
sed -i 's/^apiHttps = true/apiHttps = false/g' shoestring/shoestring.ini
sed -i 's/^caCommonName =/caCommonName = test/g' shoestring/shoestring.ini
sed -i 's/^nodeCommonName =/nodeCommonName = test node/g' shoestring/shoestring.ini
sed -i 's/^features = API | HARVESTER | VOTER/features = PEER/g' shoestring/shoestring.ini
python3 -m shoestring setup --config shoestring/shoestring.ini --directory . --ca-key-path ca.key.pem --overrides shoestring/override.ini --package "${network}"
# k8s can switch IP address on failover, so leave it blank if you have multiple nodes
# if DNS is setup for your node then use this instead.
sed -i "s/^host = 127.0.0.1/host = ${host}/g" userconfig/resources/config-node.properties
```
Here is the code to convert all of the Symbol node's files to Kubernetes objects.
```python
import base64
import os
import shutil
import subprocess
from pathlib import Path

import kopf
import pykube


def create_symbol_node(network, friendly_name, host_name):
    """ creates a symbol node """
    node_path = Path(friendly_name)
    if node_path.exists():
        shutil.rmtree(node_path)

    node_path.mkdir(parents=True)
    config_path = node_path / 'shoestring'
    config_path.mkdir()
    cwd = os.getcwd()
    try:
        os.chdir(node_path)
        subprocess.run(
            ['bash', '/app/createSymbolNode.sh', network, friendly_name, host_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True
        )
    finally:
        os.chdir(cwd)

    return node_path.absolute()


def _read_file_data_from_directory(config_path, binary_files=False):
    """ creates config data needed for a ConfigMap or Secret from files in a folder """
    directory_path = Path(config_path)
    config_data = {}
    binary_data = {}
    for file_path in directory_path.rglob('*'):
        if file_path.is_file():
            key = file_path.name
            try:
                if binary_files:
                    # Handle binary files
                    with open(file_path, 'rb') as f:
                        content = base64.b64encode(f.read()).decode('utf-8')
                    binary_data[key] = content
                else:
                    # Handle text files
                    with open(file_path, 'rt') as f:
                        content = f.read()
                    config_data[key] = content
            except Exception as e:
                raise Exception(f'Error reading file {file_path}: {str(e)}')

    return config_data, binary_data


def _create_config_from_files(name, namespace, config_path, config_type, labels=None, binary_files=False):
    """ create a ConfigMap or Secret object from files """
    config_data, binary_data = _read_file_data_from_directory(config_path, binary_files=binary_files)

    # Prepare object
    config_dict = {
        'apiVersion': 'v1',
        'kind': config_type,
        'metadata': {
            'name': name,
            'namespace': namespace
        },
        'data': config_data
    }

    # Add binary data if present
    if binary_data:
        config_dict['binaryData'] = binary_data

    # Add labels if provided
    if labels:
        config_dict['metadata']['labels'] = labels

    # add as a child object
    kopf.adopt(config_dict)
    return config_dict


def _create_configmap_from_files(api, name, namespace, config_path, labels=None, binary_files=False):
    """ create a ConfigMap from files """
    configmap_dict = _create_config_from_files(name, namespace, config_path, 'ConfigMap', labels, binary_files)

    # Create the ConfigMap
    configmap = pykube.ConfigMap(api, configmap_dict)
    configmap.create()
    return configmap


def _create_secret_from_files(api, name, namespace, config_path, labels=None, binary_files=False):
    """ create a Secret from files """
    secret_dict = _create_config_from_files(name, namespace, config_path, 'Secret', labels, binary_files)

    # Secret values must be base64-encoded
    secret_dict['data'] = {k: base64.b64encode(v.encode()).decode() for k, v in secret_dict['data'].items()}

    # Create the Secret
    secret = pykube.Secret(api, secret_dict)
    secret.create()
    return secret
```
##### Create the Symbol Node StatefulSet
Now that the Symbol Node resources are created in Kubernetes, the Symbol Node can be created.
A [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) will be used to create a Symbol Node. This allows for easy assignment of a persistent volume to use as storage. This storage will live on even after the pod is removed or recreated, allowing for the storage to failover with the pod.
As part of the StatefulSet definition, the ConfigMaps and Secret created for the node resources will be referenced and mounted into the container. Also note the use of ``hostPort``, which allows the Symbol client to access port 7900 on the host directly.
```python
import logging
import os
import shutil

import kopf
import pykube

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize the Kubernetes client
api = pykube.HTTPClient(pykube.KubeConfig.from_env())


@kopf.on.login()
def custom_login_fn(**kwargs):
    if os.environ.get('ENVIRONMENT', 'dev') == 'prod':
        return kopf.login_with_service_account(**kwargs)
    else:
        return kopf.login_with_kubeconfig(**kwargs)


@kopf.on.create('symbolnodes')
def create_fn(spec, namespace, **kwargs):
    logging.info(f'A handler is called with spec: {spec}')

    # Get the configuration from the custom resource
    size = spec.get('replicas', 1)
    image = spec.get('image')
    volume_size = spec.get('volumeSize')
    network = spec.get('network', 'mainnet')
    friendly_name = spec.get('friendlyName', '')
    host_name = spec.get('hostName')
    name = spec.get('name', f'{host_name}-{network}'.replace('.', '-'))

    node_path = create_symbol_node(network, friendly_name, host_name)
    try:
        logging.info(f'node_path: {str(node_path)}, namespace: {namespace}, name: {name}')
        _create_secret_from_files(api, f'certificates-{name}', namespace, node_path / 'keys/cert', {'app': name})
        _create_configmap_from_files(api, f'resources-{name}', namespace, node_path / 'userconfig/resources', {'app': name})
        _create_configmap_from_files(api, f'seed-{name}', namespace, node_path / 'seed', {'app': name}, True)
    finally:
        shutil.rmtree(node_path)

    # Create StatefulSet object
    statefulset = {
        'apiVersion': 'apps/v1',
        'kind': 'StatefulSet',
        'metadata': {
            'name': f'{name}-statefulset',
            'namespace': namespace
        },
        'spec': {
            'replicas': size,
            'selector': {
                'matchLabels': {
                    'app': name
                }
            },
            'template': {
                'metadata': {
                    'labels': {
                        'app': name
                    }
                },
                'spec': {
                    'containers': [{
                        'name': 'client',
                        'image': image,
                        'command': ['/usr/catapult/bin/catapult.server', '/'],
                        'env': [{
                            'name': 'LD_LIBRARY_PATH',
                            'value': '/usr/catapult/lib:/usr/catapult/deps'
                        }],
                        'ports': [{
                            'containerPort': 7900,
                            'hostPort': 7900
                        }],
                        'resources': {
                            'requests': {
                                'cpu': '1.0',
                                'memory': '2Gi'
                            },
                            'limits': {
                                'cpu': '2.0',
                                'memory': '4Gi'
                            }
                        },
                        'volumeMounts': [
                            {
                                'name': f'data-{name}',
                                'mountPath': '/data'
                            },
                            {
                                'name': 'resources',
                                'mountPath': '/resources',
                                'readOnly': True
                            },
                            {
                                'name': 'certificates',
                                'mountPath': '/certificates',
                                'readOnly': True
                            },
                            {
                                'name': 'seed',
                                'mountPath': '/seed',
                                'readOnly': True
                            }
                        ]
                    }],
                    'volumes': [
                        {
                            'name': 'resources',
                            'configMap': {
                                'name': f'resources-{name}'
                            }
                        },
                        {
                            'name': 'certificates',
                            'secret': {
                                'secretName': f'certificates-{name}'
                            }
                        },
                        {
                            'name': 'seed',
                            'configMap': {
                                'name': f'seed-{name}',
                                'items': [
                                    {'key': 'index.dat', 'path': 'index.dat'},
                                    {'key': 'proof.index.dat', 'path': 'proof.index.dat'},
                                    {'key': 'proof.heights.dat', 'path': '00000/proof.heights.dat'},
                                    {'key': '00001.dat', 'path': '00000/00001.dat'},
                                    {'key': '00001.proof', 'path': '00000/00001.proof'},
                                    {'key': '00001.stmt', 'path': '00000/00001.stmt'},
                                    {'key': 'hashes.dat', 'path': '00000/hashes.dat'}
                                ]
                            }
                        }
                    ]
                }
            },
            'volumeClaimTemplates': [{
                'metadata': {
                    'name': f'data-{name}'
                },
                'spec': {
                    'accessModes': ['ReadWriteOnce'],
                    'resources': {
                        'requests': {
                            'storage': volume_size
                        }
                    }
                }
            }]
        }
    }

    # Create the resources
    logging.info(f'A create handler statefulSet: {statefulset}')
    kopf.adopt(statefulset)
    pykube.StatefulSet(api, statefulset).create()
    return {'node': name}


@kopf.on.update('example.com', 'v1', 'symbolnodes')
def update_fn(spec, status, namespace, logger, **kwargs):
    """ Update a symbol node. Note this only supports one node. """
    logging.info(f'A update handler is called with spec: {spec}')

    # Update the StatefulSet (named `{name}-statefulset` by create_fn)
    name = status['create_fn']['node']
    image = spec.get('image')
    deploy = pykube.StatefulSet.objects(api).filter(namespace=namespace).get(name=f'{name}-statefulset')
    deploy.obj['spec']['template']['spec']['containers'][0]['image'] = image
    deploy.update()


@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    """ Configure the operator on startup """
    settings.posting.level = logging.INFO
    settings.posting.enabled = True
    settings.watching.connect_timeout = 300
    settings.watching.server_timeout = 300
```
##### Deploy Operator
An Operator is deployed in a container just like everything else in Kubernetes. To do this, we will need a Dockerfile.
Note: All deployment is done to a ``symbol`` namespace.
```dockerfile
FROM python:3.12
RUN apt-get update && apt-get install -y build-essential python3-dev
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY operatorNode/ .
CMD ["kopf", "run", "symbolNodeOperator.py", "--verbose"]
```
requirements.txt
```
kopf>=1.35.0
pykube-ng>=22.1.0
symbol-shoestring>=0.1.3
```
We can build and push our image now that we have our basic Operator and Dockerfile. You will need to update the tag with the correct Docker image name.
```
docker build -t <symbol-operator> -f Dockerfile .
docker push <symbol-operator>
```
The Operator needs a service account to run in the Kubernetes environment. Below is the deployment YAML used.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: symbol
  name: symbolnode-operator-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: symbolnode-operator-clusterrole
rules:
  - apiGroups: [kopf.dev]
    resources: [clusterkopfpeerings]
    verbs: [list, watch, patch, get]
  - apiGroups: [apiextensions.k8s.io]
    resources: [customresourcedefinitions]
    verbs: [list, watch]
  - apiGroups: ["", apps, batch, extensions]
    resources: [namespaces, deployments, pods, services, services/proxy, events]
    verbs: [list, watch, patch, get, create, delete]
  - apiGroups: [admissionregistration.k8s.io]
    resources: [validatingwebhookconfigurations, mutatingwebhookconfigurations]
    verbs: [create, patch]
  - apiGroups: [example.com]
    resources: [symbolnodes]
    verbs: [list, watch, create, patch, delete, get]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: symbol
  name: symbolnode-operator-role
rules:
  - apiGroups: [kopf.dev]
    resources: [kopfpeerings]
    verbs: [list, watch, patch, get]
  - apiGroups: [example.com, apps, ""]
    resources: [symbolnodes, deployments, pods, services, services/proxy, events]
    verbs: [list, watch, patch, get, create, delete]
  - apiGroups: [batch, extensions]
    resources: [jobs]
    verbs: [create]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: symbolnode-operator-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: symbolnode-operator-clusterrole
subjects:
  - kind: ServiceAccount
    name: symbolnode-operator-sa
    namespace: symbol
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: symbol
  name: symbolnode-operator-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: symbolnode-operator-role
subjects:
  - kind: ServiceAccount
    name: symbolnode-operator-sa
    namespace: symbol
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: symbolnode-operator
  namespace: symbol
spec:
  selector:
    matchLabels:
      app: symbolnode-operator
  template:
    metadata:
      labels:
        app: symbolnode-operator
    spec:
      serviceAccountName: symbolnode-operator-sa
      containers:
        - name: symbolnode-operator
          imagePullPolicy: Always
          image: symbol-operator
          command:
            - kopf
          args:
            - run
            - symbolNodeOperator.py
            - --verbose
            - -n
            - symbol
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          env:
            - name: ENVIRONMENT
              value: prod
```
To deploy the Operator run - ``microk8s kubectl apply -f manifest/deployment.yaml -n symbol``
After the deployment run - ``microk8s kubectl get pod -n symbol`` to verify that it was successful.
```
symbol   symbolnode-operator-6745bc7ff4-nxvc2   1/1   Running   0   27s
```
##### Create Symbol Node
With the Symbol Node Operator running, creating a Symbol node is as simple as creating a new CR, as shown below, by running ``microk8s kubectl create -f manifest/node.yaml -n symbol``
```yaml
apiVersion: example.com/v1
kind: SymbolNode
metadata:
  name: testnode
spec:
  image: symbolplatform/symbol-server:gcc-1.0.3.7
  volumeSize: 70Gi
  hostName: <IP or Host Name>
  friendlyName: testNode
  network: sai
```
Now both the ``sai`` node and the Operator pods are running.
```
microk8s kubectl get pod -n symbol
NAME READY STATUS RESTARTS AGE
sai-statefulset-0 1/1 Running 0 7s
symbolnode-operator-6745bc7ff4-4mrqw 1/1 Running 0 2m39s
```
#### Monitoring
Where is the monitoring? It took a while since we first needed to create a node to monitor. :smiley:
##### Monitor pods for failures
To monitor the Symbol node's pod, create another Kopf handler that listens to events coming from your pods. The code below is called for every pod event and checks the status for failures. If there is a failure, it notifies someone, and that is it.
```python
@kopf.on.event('pods')
def event_fn(event, **kwargs):
    try:
        pod = pykube.Pod(api, event['object'])
        status = pod.obj['status']
        container_statuses = status.get('containerStatuses', [])
        logging.info(f'A monitor event pod: {pod.name} with status: {status} and container_statuses: {container_statuses}')
        issues = []

        # Check if pod is in a failed state
        if status.get('phase') == 'Failed':
            issues.append({
                'type': 'pod_failure',
                'message': 'Pod is in Failed state',
                'severity': 'critical'
            })

        # Check container statuses
        for container in container_statuses:
            if container.get('state', {}).get('waiting'):
                reason = container['state']['waiting'].get('reason')
                if reason in ['CrashLoopBackOff', 'Error', 'ImagePullBackOff']:
                    issues.append({
                        'type': 'container_issue',
                        'message': f'Container {container["name"]} is {reason}',
                        'severity': 'critical'
                    })

        if issues:
            logger.warning(f'Issues detected in pod {pod.name}:')
            for issue in issues:
                logger.warning(f'- {issue["type"]}: {issue["message"]} (Severity: {issue["severity"]})')
            # You can add additional actions here, such as:
            # - Sending notifications to a monitoring system
            # - Attempting to restart the pod

        return {'status': 'monitored', 'issues_found': len(issues)}
    except Exception as e:
        logger.error(f'Error monitoring pod: {str(e)}')
        raise kopf.PermanentError(f'Failed to monitor pod: {str(e)}')
```
##### Other Operator use cases
Well, you thought I would be done already...:smile:
I wanted to provide a couple more use cases for the Operator. Some of these are useful, especially if your secrets (keys) are stored in an external KMS/Vault. If you are not using a store for your keys, then just send an alert.
With Kopf, a task can be created that runs on an interval. This is useful for checking your nodes' certificates or voting files.
```python
@kopf.timer('symbolnodes', idle=10, interval=60 * 60 * 24)
def task_exec(spec, **kwargs):
    """ do something useful """
```
* Check when your node certs will expire.
    * Use OpenSSL to pull the node cert information.
    * If the cert expires in less than 30 days, either:
        * send an alert, or
        * create new certs, update the certificates Secret, and restart the Symbol node.
* Check when your voting keys are expiring.
    * Check at which block the voting key expires.
    * If that is less than 30 days away, either:
        * send an alert, or
        * create new voting files and submit the voting key link transactions.
* Check the node height against the rest of the chain.
    * If it is more than 5 blocks behind, send an alert.
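As a sketch of the certificate check, the timer body could parse the cert's ``notAfter`` timestamp (in the format produced by ``openssl x509 -enddate`` or Python's ``ssl`` module) and decide whether rotation is due. The helper names and the 30-day threshold below are illustrative assumptions, not part of the operator above:

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after: e.g. 'Jan  1 00:00:00 2030 GMT' (the cert-date format used by ssl)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

def needs_rotation(not_after, threshold_days=30):
    """True when the cert should be renewed (alert, or re-issue and restart the node)."""
    return days_until_expiry(not_after) < threshold_days

print(needs_rotation('Jan  1 00:00:00 2020 GMT'))  # True (already expired)
```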
# Conclusion
This was a lot of information on Kubernetes and how Operators work. It only scratches the surface; if you are interested, you can find more in the [Kubernetes documentation](https://kubernetes.io/docs/home/).