# Operator, help, please!

Not again; my node seems to be offline. I am unsure what happened, but my node stopped, and I don't know when. :thinking_face: :slightly_frowning_face:

Your node's day-to-day operation and maintenance can sometimes be a significant overhead. Node failure always happens at inconvenient times, like while you are on vacation. What can be done to help with these failures? How can I manage my node more efficiently? Would it not be nice to have someone watching your node and making intelligent decisions 24/7? Yes, but that can be expensive. The next best option is to automate the actions required to keep your node healthy (self-healing). There are a lot of open-source tools in the wild that can accomplish this goal, but I will go over the Kubernetes Operator.

## Why Kubernetes

Kubernetes (k8s) has emerged as the de facto standard for container orchestration. It can be complicated, but it handles complex infrastructure to simplify application deployment. Kubernetes abstracts the underlying infrastructure, so the application does not need to implement components like error handling, scalability, security, and redundancy itself. These facilities are provided by the Kubernetes ecosystem.

![image](https://hackmd.io/_uploads/rkXTsQIPp.png)

Kubernetes' control plane comprises several components, including an API server, scheduler, etcd, and controller manager. I will not go into details on each, but more information can be found [here](https://kubernetes.io/docs/concepts/overview/components/#control-plane-components).

1. The API server is the gateway to talk to all the components in the control plane.
2. etcd is a key-value store that keeps the state of the objects in the cluster.
3. The scheduler is a process that assigns pods to run on a node.
4. The Controller Manager (CM) encapsulates the core Kubernetes logic. The CM makes sure that all the pieces work correctly. If not, it takes action to bring the system to the desired state.
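The Controller Manager's behavior is the classic control loop: observe the current state, compare it with the desired state, and act on any drift. Below is a minimal Python sketch of that idea; the `observe`, `desired`, and `act` callables are hypothetical stand-ins for illustration, not Kubernetes APIs.

```
def reconcile(observe, desired, act):
    """One iteration of a generic control loop: observe the current
    state, compare it with the desired state, and act on any drift."""
    current = observe()
    target = desired()
    if current != target:
        act(current, target)

# Toy example: the desired state is 3 replicas, but only 1 is running.
state = {'replicas': 1}

reconcile(
    observe=lambda: dict(state),
    desired=lambda: {'replicas': 3},
    act=lambda current, target: state.update(target),
)
print(state)  # {'replicas': 3}
```

A real controller runs this loop continuously; Kubernetes triggers it on every relevant change to the watched objects.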
The CM is where a lot of the action happens. You can configure a resource in Kubernetes in two ways: imperatively or declaratively. Imperative commands interact with Kubernetes cluster objects directly, such as creating a pod. The declarative approach uses manifest files along with the ``kubectl apply`` command. In the latter, Kubernetes is given a desired state, and the controller manager ensures this state is achieved.

Kubernetes is great at automating your containers but lacks the knowledge to understand the details of individual applications. To rectify this, Kubernetes allows the extension of its API with custom resources. An Operator is an extension of Kubernetes.

## Why Kubernetes Operator

While Kubernetes excels at managing stateless applications, stateful applications such as blockchains or databases require more complex configuration and management, and this is where Operators are needed. Stateless applications can be managed by the built-in controllers in the Kubernetes control plane because all stateless applications have similar recovery and scalability steps. Stateful applications usually require more application-specific knowledge for recovery or scaling.

For example, with the Symbol blockchain, when the client shuts down incorrectly due to a node restart, it cannot just be restarted on a different node. The recovery tool needs to run to fix inconsistent data before the client starts. Also, upgrades are not as simple as restarting your Symbol node. An upgrade would include these steps:

1. Upgrade ConfigMaps and Secrets for resources and certs.
2. Shut down the server, broker, and MongoDB.
3. Start MongoDB, broker, and server.

There are other scenarios, too, such as what happens if MongoDB fails while the broker is still running. How is this handled? These are all application-specific, and this is where the Operator comes in. The operator pattern extends the capabilities of Kubernetes by encapsulating the application's domain-specific knowledge and management logic into an Operator.
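To make the ordering of those upgrade steps concrete, here is a small Python sketch. The `apply_manifests`, `stop`, and `start` callables are hypothetical hooks, not real Kubernetes or Symbol APIs; in a real Operator they would patch ConfigMaps/Secrets and scale workloads up or down.

```
def upgrade_symbol_node(apply_manifests, stop, start):
    """Run the Symbol upgrade steps in the required order.

    The three callables are hypothetical hooks supplied by the caller.
    """
    # 1. Upgrade ConfigMaps and Secrets for resources and certs.
    apply_manifests(['resources-configmap', 'certificates-secret'])
    # 2. Shut down the server, broker, and MongoDB (in that order).
    for component in ['server', 'broker', 'mongodb']:
        stop(component)
    # 3. Start MongoDB, broker, and server (reverse order).
    for component in ['mongodb', 'broker', 'server']:
        start(component)

# Record the call order to illustrate the sequencing.
calls = []
upgrade_symbol_node(
    apply_manifests=lambda names: calls.append(('apply', names)),
    stop=lambda c: calls.append(('stop', c)),
    start=lambda c: calls.append(('start', c)),
)
```

The point of encoding this in an Operator is that the ordering is enforced every time, rather than relying on a human to remember it.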
An Operator automates the operation and lifecycle management of a specific application. It is a software extension that uses the Kubernetes API to monitor and control the state of an application running on the cluster. It acts like a human operator who knows the desired state of an application and will take the necessary actions to ensure the application remains in the correct state.

![image](https://hackmd.io/_uploads/rkNzxjUDT.png)

Custom resources let you store and retrieve structured data using Kubernetes APIs. When paired with your own custom controller, they provide a declarative API. This API enforces a separation of responsibilities, just like the built-in Kubernetes APIs. I can declare the desired state of my resource, and my Kubernetes controller will keep the current state of my objects in sync with the declared desired state.

![image](https://hackmd.io/_uploads/B1_HjXKv6.png)

By adding application-specific knowledge and automating maintenance activities, an Operator frees the end user from manually managing their application. The Operator contains one or more controllers. Each controller watches a specific custom resource type. The controller queries the API server and takes application-specific actions to make the current state match the desired state. A controller implements the controller pattern in Kubernetes, which is a control loop. The control plane runs the controllers in a loop. Some controllers built into Kubernetes run in the control plane, while others that are part of an Operator run on the worker nodes.

### Creating a Kubernetes Operator

#### Set up a Kubernetes cluster

There are several ways to set up a Kubernetes cluster:

1. Most already have Docker Desktop, which includes a [standalone k8s server](https://docs.docker.com/desktop/kubernetes/).
2. There are several standalone Kubernetes servers, and any one is fine.
   a. MicroK8s is one of my favorites if you use Ubuntu since it just installs with Snap.
      - https://ubuntu.com/tutorials/install-a-local-kubernetes-with-microk8s#1-overview
   b. k0s - https://k0sproject.io/
   c. k3s - https://k3s.io/

I will be using MicroK8s in the examples below.

#### Development tools

There are several frameworks, the most popular of which is the [Operator Framework](https://operatorframework.io/) with Golang. Since there is currently no Golang SDK for Symbol, I will be using Kopf (Kubernetes Operator Pythonic Framework) to build our first basic Symbol Operator. Apart from that, we need the following tools to get started:

* [kopf](https://kopf.readthedocs.io/en/stable/) - Kubernetes Operator Pythonic Framework
* [pykube-ng](https://pykube.readthedocs.io/en/latest/index.html) - a lightweight Python client library for Kubernetes, which is preferred when developing an Operator
* [symbol-shoestring](https://pypi.org/project/symbol-shoestring/) - generates the node config
* [kr8s](https://docs.kr8s.org/en/stable/client.html) - might be used for self-healing tasks

#### Custom Resource Definition (CRD)

Before any Custom Resource (CR) can be created, a Custom Resource Definition (CRD) needs to be defined. The CRD defines the CR's schema. For the Symbol node, I extracted some of the common fields used to customize a node. These fields are added to the CRD, which the Operator will use to create a Symbol node. If other fields are needed for your deployment, add them to the CRD.
```
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: symbolnodes.example.com
spec:
  group: example.com
  names:
    kind: SymbolNode
    plural: symbolnodes
    singular: symbolnode
    shortNames:
    - sn
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              image:
                type: string
              volumeSize:
                type: string
              hostName:
                type: string
              friendlyName:
                type: string
              network:
                type: string
              name:
                type: string
            required:
            - image
            - volumeSize
            - network
            - friendlyName
            - hostName
```

Note: I left out the account key since, to handle this correctly, it needs to be in a Secret tied to an external KMS.

To deploy this to your Kubernetes environment, run ``microk8s kubectl apply -f manifest/crd.yaml -n symbol``.

Verify that the new API is available:

```
microk8s kubectl api-resources | grep -i symbolnodes
symbolnodes    sn    example.com/v1    true    SymbolNode
```

For more information on CRDs, see the [Kubernetes site](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).

#### Create Operator code

Now that there is a new Kubernetes API, we need to write some code that monitors the creation of these new objects and takes the correct action. In our case, creating a new ``symbolnodes`` object will create a new node.

##### ConfigMaps and Secrets

The old method of storing application configuration is to put it on disk. Kubernetes offers several options:

* ConfigMap is used to store non-sensitive configuration data.
* Secret stores sensitive information like passwords, API keys, or TLS certificates.

The Operator will need logic to convert the Symbol configuration from files to ConfigMaps and Secrets in the following steps:

1. Shoestring is used to generate the configuration for the node.
2. Convert the ``keys/cert`` folder to a Kubernetes Secret.
3. Convert the ``resources`` folder to a ConfigMap (note: if this were a harvesting node, then config-harvesting.properties would need to be a Secret).
4. Convert the ``seed`` folder to a ConfigMap.

Bash script used to generate the Symbol node config:

```
#!/usr/bin/env bash

set -ex

network=$1
friendlyName=$2
host=$3

openssl genpkey -algorithm ed25519 -out ca.key.pem

mkdir -p shoestring

# build catapult configuration
# hard-code to the loopback address; will overwrite later
cat > shoestring/override.ini << EOL
[node.localnode]

host = 127.0.0.1
friendlyName = ${friendlyName}
EOL

python3 -m shoestring init shoestring/shoestring.ini --package "${network}"

sed -i 's/^apiHttps = true/apiHttps = false/g' shoestring/shoestring.ini
sed -i 's/^caCommonName =/caCommonName = test/g' shoestring/shoestring.ini
sed -i 's/^nodeCommonName =/nodeCommonName = test node/g' shoestring/shoestring.ini
sed -i 's/^features = API | HARVESTER | VOTER/features = PEER/g' shoestring/shoestring.ini

python3 -m shoestring setup --config shoestring/shoestring.ini --directory . --ca-key-path ca.key.pem --overrides shoestring/override.ini --package "${network}"

# k8s can switch IP address on failover, so leave it blank if you have multiple nodes
# if DNS is set up for your node then use this instead
sed -i "s/^host = 127.0.0.1/host = ${host}/g" userconfig/resources/config-node.properties
```

Here is the code to convert all of Symbol's node files to Kubernetes objects.
```
def create_symbol_node(network, friendly_name, host_name):
    """ creates a symbol node """
    node_path = Path(friendly_name)
    if node_path.exists():
        shutil.rmtree(node_path)

    node_path.mkdir(parents=True)
    config_path = node_path / 'shoestring'
    config_path.mkdir()

    cwd = os.getcwd()
    try:
        os.chdir(node_path)
        subprocess.run(
            ['bash', '/app/createSymbolNode.sh', network, friendly_name, host_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    finally:
        os.chdir(cwd)

    return node_path.absolute()


def _read_file_data_from_directory(config_path, binary_files=False):
    """ creates config data needed for a ConfigMap or Secret from files in a folder """
    directory_path = Path(config_path)
    config_data = {}
    binary_data = {}
    for file_path in directory_path.rglob("*"):
        if file_path.is_file():
            key = file_path.name
            try:
                if binary_files:
                    # Handle binary files
                    with open(file_path, 'rb') as f:
                        content = base64.b64encode(f.read()).decode('utf-8')
                        binary_data[key] = content
                else:
                    # Handle text files
                    with open(file_path, 'rt') as f:
                        content = f.read()
                        config_data[key] = content
            except Exception as e:
                raise Exception(f"Error reading file {file_path}: {str(e)}")

    return config_data, binary_data


def _create_config_from_files(name, namespace, config_path, config_type, labels=None, binary_files=False):
    """ create a ConfigMap or Secret object from files """
    config_data, binary_data = _read_file_data_from_directory(config_path, binary_files=binary_files)

    # Prepare object
    config_dict = {
        'apiVersion': 'v1',
        'kind': config_type,
        'metadata': {
            'name': name,
            'namespace': namespace
        },
        'data': config_data
    }

    # Add binary data if present
    if binary_data:
        config_dict["binaryData"] = binary_data

    # Add labels if provided
    if labels:
        config_dict["metadata"]["labels"] = labels

    # add as a child object
    kopf.adopt(config_dict)
    return config_dict


def _create_configmap_from_files(api, name, namespace, config_path, labels=None, binary_files=False):
    """ create a ConfigMap from files """
    configmap_dict = _create_config_from_files(
        name,
        namespace,
        config_path,
        "ConfigMap",
        labels,
        binary_files
    )

    # Create the ConfigMap
    configmap = pykube.ConfigMap(api, configmap_dict)
    configmap.create()
    return configmap


def _create_secret_from_files(api, name, namespace, config_path, labels=None, binary_files=False):
    """ create a Secret from files """
    configmap_dict = _create_config_from_files(
        name,
        namespace,
        config_path,
        "Secret",
        labels,
        binary_files
    )

    # Create the Secret
    configmap_dict['data'] = {k: base64.b64encode(v.encode()).decode() for k, v in configmap_dict['data'].items()}
    secret = pykube.Secret(api, configmap_dict)
    secret.create()
    return secret
```

##### Create Symbol Node

Now that the Symbol node resources are created in Kubernetes, the Symbol node itself can be created. A [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) will be used to create the Symbol node. This allows for easy assignment of a persistent volume to use as storage. This storage lives on even after the pod is removed or recreated, allowing the storage to fail over with the pod. As part of the StatefulSet definition, the ConfigMaps and Secret created for the node resources are referenced and mounted into the container. Also note the use of ``hostPort``, which allows the Symbol client to access port 7900 on the host directly.
```
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize the Kubernetes client
api = pykube.HTTPClient(pykube.KubeConfig.from_env())


@kopf.on.login()
def custom_login_fn(**kwargs):
    if os.environ.get('ENVIRONMENT', 'dev') == 'prod':
        return kopf.login_with_service_account(**kwargs)
    else:
        return kopf.login_with_kubeconfig(**kwargs)


@kopf.on.create('symbolnodes')
def create_fn(spec, namespace, **kwargs):
    logging.info(f'A handler is called with spec: {spec}')

    # Get the configuration from the custom resource
    size = spec.get('replicas', 1)
    image = spec.get('image')
    volume_size = spec.get('volumeSize')
    network = spec.get('network', 'mainnet')
    friendly_name = spec.get('friendlyName', '')
    host_name = spec.get('hostName')
    name = spec.get('name', f'{host_name}-{network}'.replace('.', '-'))

    node_path = create_symbol_node(network, friendly_name, host_name)
    try:
        logging.info(f'node_path: {str(node_path)}, namespace: {namespace}, name: {name}')
        _create_secret_from_files(api, f'certificates-{name}', namespace, node_path / 'keys/cert', {'app': name})
        _create_configmap_from_files(api, f'resources-{name}', namespace, node_path / 'userconfig/resources', {'app': name})
        _create_configmap_from_files(api, f'seed-{name}', namespace, node_path / 'seed', {'app': name}, True)
    finally:
        shutil.rmtree(node_path)

    # Create StatefulSet object
    statefulset = {
        'apiVersion': 'apps/v1',
        'kind': 'StatefulSet',
        'metadata': {
            'name': f'{name}-statefulset',
            'namespace': namespace
        },
        'spec': {
            'replicas': size,
            'selector': {
                'matchLabels': {
                    'app': name
                }
            },
            'template': {
                'metadata': {
                    'labels': {
                        'app': name
                    }
                },
                'spec': {
                    'containers': [{
                        'name': 'client',
                        'image': image,
                        'command': ['/usr/catapult/bin/catapult.server', '/'],
                        'env': [{
                            'name': 'LD_LIBRARY_PATH',
                            'value': '/usr/catapult/lib:/usr/catapult/deps'
                        }],
                        'ports': [{
                            'containerPort': 7900,
                            'hostPort': 7900
                        }],
                        'resources': {
                            'requests': {
                                'cpu': '1.0',
                                'memory': '2Gi'
                            },
                            'limits': {
                                'cpu': '2.0',
                                'memory': '4Gi'
                            }
                        },
                        'volumeMounts': [{
                            'name': f'data-{name}',
                            'mountPath': '/data'
                        }, {
                            'name': 'resources',
                            'mountPath': '/resources',
                            'readOnly': True
                        }, {
                            'name': 'certificates',
                            'mountPath': '/certificates',
                            'readOnly': True
                        }, {
                            'name': 'seed',
                            'mountPath': '/seed',
                            'readOnly': True
                        }],
                    }],
                    'volumes': [{
                        'name': 'resources',
                        'configMap': {
                            'name': f'resources-{name}'
                        }
                    }, {
                        'name': 'certificates',
                        'secret': {
                            'secretName': f'certificates-{name}'
                        }
                    }, {
                        'name': 'seed',
                        'configMap': {
                            'name': f'seed-{name}',
                            'items': [
                                {'key': 'index.dat', 'path': 'index.dat'},
                                {'key': 'proof.index.dat', 'path': 'proof.index.dat'},
                                {'key': 'proof.heights.dat', 'path': '00000/proof.heights.dat'},
                                {'key': '00001.dat', 'path': '00000/00001.dat'},
                                {'key': '00001.proof', 'path': '00000/00001.proof'},
                                {'key': '00001.stmt', 'path': '00000/00001.stmt'},
                                {'key': 'hashes.dat', 'path': '00000/hashes.dat'}
                            ]
                        }
                    }]
                }
            },
            'volumeClaimTemplates': [{
                'metadata': {
                    'name': f'data-{name}'
                },
                'spec': {
                    'accessModes': ['ReadWriteOnce'],
                    'resources': {
                        'requests': {
                            'storage': volume_size
                        }
                    }
                }
            }]
        }
    }

    # Create the resources
    logging.info(f'A create handler statefulSet: {statefulset}')
    kopf.adopt(statefulset)
    pykube.StatefulSet(api, statefulset).create()
    return {'node': name}


@kopf.on.update('example.com', 'v1', 'symbolnodes')
def update_fn(spec, status, namespace, logger, **kwargs):
    """ Update a symbol node.

    Note: this only supports one node.
    """
    logging.info(f'A update handler is called with spec: {spec}')

    # Update the StatefulSet; its name carries the '-statefulset' suffix added in create_fn
    name = status['create_fn']['node']
    image = spec.get('image')
    deploy = pykube.StatefulSet.objects(api).filter(namespace=namespace).get(name=f'{name}-statefulset')
    deploy.obj["spec"]["template"]["spec"]["containers"][0]["image"] = image
    deploy.update()


@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    """ Configure the operator on startup """
    settings.posting.level = logging.INFO
    settings.posting.enabled = True
    settings.watching.connect_timeout = 300
    settings.watching.server_timeout = 300
```

##### Deploy Operator

An Operator is deployed in a container just like everything else in Kubernetes. To do this, we need a Dockerfile. Note: all deployment is done to a ``symbol`` namespace.

```
FROM python:3.12

RUN apt-get update && apt-get install -y build-essential python3-dev

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY operatorNode/ .

CMD ["kopf", "run", "symbolNodeOperator.py", "--verbose"]
```

requirements.txt:

```
kopf>=1.35.0
pykube-ng>=22.1.0
symbol-shoestring>=0.1.3
```

Now that we have our basic Operator and Dockerfile, we can build and push our image. You will need to update the tag with the correct Docker image name.

```
docker build -t <symbol-operator> -f Dockerfile .
docker push <symbol-operator>
```

A service account is needed to run the Operator in the Kubernetes environment. Below is the deployment YAML used.
```
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: symbol
  name: symbolnode-operator-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: symbolnode-operator-clusterrole
rules:
- apiGroups: [example.com]
  resources: [clusterkopfpeerings]
  verbs: [list, watch, patch, get]
- apiGroups: [apiextensions.k8s.io]
  resources: [customresourcedefinitions]
  verbs: [list, watch]
- apiGroups: ["", "apps", "batch", "extensions"]
  resources: [namespaces, deployments, pods, services, services/proxy, events]
  verbs: [list, watch, patch, get, create, delete]
- apiGroups: [admissionregistration.k8s.io/v1, admissionregistration.k8s.io/v1beta1]
  resources: [validatingwebhookconfigurations, mutatingwebhookconfigurations]
  verbs: [create, patch]
- apiGroups: [example.com]
  resources: [symbolnode, symbolnodes]
  verbs: [list, watch, create, patch, delete, get]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: symbol
  name: symbolnode-operator-role
rules:
- apiGroups: [example.com]
  resources: [kopfpeerings]
  verbs: [list, watch, patch, get]
- apiGroups: [example.com, apps, ""]
  resources: [symbolnode, symbolnodes, deployments, pods, services, services/proxy, events]
  verbs: [list, watch, patch, get, create, delete]
- apiGroups: [batch, extensions]
  resources: [jobs]
  verbs: [create]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: symbolnode-operator-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: symbolnode-operator-clusterrole
subjects:
- kind: ServiceAccount
  name: symbolnode-operator-sa
  namespace: symbol
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: symbol
  name: symbolnode-operator-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: symbolnode-operator-role
subjects:
- kind: ServiceAccount
  name: symbolnode-operator-sa
  namespace: symbol
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: symbolnode-operator
  namespace: symbol
spec:
  selector:
    matchLabels:
      app: symbolnode-operator
  template:
    metadata:
      labels:
        app: symbolnode-operator
    spec:
      serviceAccountName: symbolnode-operator-sa
      containers:
      - name: symbolnode-operator
        imagePullPolicy: Always
        image: symbol-operator
        command:
        - kopf
        args:
        - run
        - symbolNodeOperator.py
        - --verbose
        - -n
        - symbol
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        env:
        - name: ENVIRONMENT
          value: prod
```

To deploy the Operator, run ``microk8s kubectl apply -f manifest/deployment.yaml -n symbol``.

After the deployment, run ``microk8s kubectl get pod -n symbol`` to verify that it was successful.

```
symbol    symbolnode-operator-6745bc7ff4-nxvc2    1/1    Running    0    27s
```

##### Create a Symbol node

With the Symbol node Operator running, creating a Symbol node is as simple as creating a new CR, as shown below, by running ``microk8s kubectl create -f manifest/node.yaml -n symbol``

```
apiVersion: example.com/v1
kind: SymbolNode
metadata:
  name: testnode
spec:
  image: symbolplatform/symbol-server:gcc-1.0.3.7
  volumeSize: 70Gi
  hostName: <IP or Host Name>
  friendlyName: testNode
  network: sai
```

Now both the SAI node and the Operator are running.

```
microk8s kubectl get pod -n symbol
NAME                                   READY   STATUS    RESTARTS   AGE
sai-statefulset-0                      1/1     Running   0          7s
symbolnode-operator-6745bc7ff4-4mrqw   1/1     Running   0          2m39s
```

#### Monitoring

Where is the monitoring? It took a while since we first needed to create a node to monitor. :smiley:

##### Monitor pods for failures

To monitor the Symbol node's pod, create another Kopf handler that listens to events coming from your pods. The code below is called for every pod event and checks the status for failures. If there is a failure, notify someone. And that would be it.
```
@kopf.on.event('pods')
def event_fn(event, **kwargs):
    try:
        pod = pykube.Pod(api, event['object'])
        status = pod.obj['status']
        container_statuses = status.get('containerStatuses', [])
        logging.info(f'A monitor event pod: {pod.name} with status: {status} and container_statuses: {container_statuses}')

        issues = []

        # Check if pod is in a failed state
        if status.get("phase") == "Failed":
            issues.append({
                "type": "pod_failure",
                "message": "Pod is in Failed state",
                "severity": "critical"
            })

        # Check container statuses
        for container in container_statuses:
            if container.get("state", {}).get("waiting"):
                reason = container["state"]["waiting"].get("reason")
                if reason in ["CrashLoopBackOff", "Error", "ImagePullBackOff"]:
                    issues.append({
                        "type": "container_issue",
                        "message": f"Container {container['name']} is {reason}",
                        "severity": "critical"
                    })

        if issues:
            logger.warning(f"Issues detected in pod {pod.name}:")
            for issue in issues:
                logger.warning(f"- {issue['type']}: {issue['message']} (Severity: {issue['severity']})")
            # You can add additional actions here, such as:
            # - Sending notifications to a monitoring system
            # - Attempting to restart the pod

        return {"status": "monitored", "issues_found": len(issues)}
    except Exception as e:
        logger.error(f"Error monitoring pod: {str(e)}")
        raise kopf.PermanentError(f"Failed to monitor pod: {str(e)}")
```

##### Other Operator use cases

Well, you thought I would be done already... :smile: I wanted to provide a couple more use cases for the Operator. Some of these are useful, especially if your secrets (keys) are stored in an external KMS/Vault. If you are not using a store for your keys, then just send an alert.

With Kopf, a task can be created that runs on an interval. This is useful for checking your node's certificates or voting files.

```
@kopf.timer('symbolnodes', idle=10, interval=60*60*24)
def task_exec(spec, **kwargs):
    """ do something useful """
```

* Check when your node certs will expire.
  * Use OpenSSL to pull the node cert information.
  * If it expires in less than 30 days, either:
    * send an alert, or
    * create new certs, update the certs Secret, and restart the Symbol node.
* Check when your voting keys are expiring.
  * Check at what block the voting key expires.
  * If that is less than 30 days away, either:
    * send an alert, or
    * create new voting files and submit voting link transactions.
* Check the node height against the rest of the chain.
  * If it is more than 5 blocks behind, then alert.

# Conclusion

This was a lot of information on Kubernetes and how Operators work. It just touches the surface; if you are interested, you can find more in the [Kubernetes documentation](https://kubernetes.io/docs/home/).