# CHAOS Integration with ANT

This document summarizes Mobility’s requirements for chaos tooling and outlines the prototype for enabling chaos testing using ANT.

## Table of Contents

1. [Overview of ANT](#overview-of-ant)
2. [Mobility Requirements](#mobility-requirements)
3. [Scope of the Work](#scope-of-the-work)
4. [POC Work](#poc-work)
5. [Target Cluster Components](#target-cluster-components)
6. [Types of Chaos](#types-of-chaos)
7. [Client Components](#client-components)
8. [Chaos TestCase Flow](#chaos-testcase-flow)
9. [Initial Use Case : Network Jitter](#initial-use-case-:-network-jitter)
10. [References](#references)

## Overview of ANT

ANT (ATT Networking Testing) is a tool used by AT&T Mobility for all functional testing of deployed Network Functions (NFs). ANT helps with compliance and functional testing in accordance with industry standards (IETF, 3GPP, ETSI).

The latest trend is to deploy NFs as containerized services in a cloud-native way on platforms like Kubernetes (etsi-ifa029/nc reference). This makes it imperative that NFs are tested for resiliency and scalability, which calls for chaos-engineering-based tests.

The Mobility team at AT&T is looking at ways to incorporate NF chaos testing into the current testing scope using the ANT framework.

## Mobility Requirements

The following are high-level Mobility requirements for chaos testing:

* Ability to support chaos for `Pod/Node/Network` in `combination/parallel/serial` execution.
* Chaos tests should run along with other functional tests within ANT.
* Ability to inject network latency and packet loss.
* Support for node loss scenarios.

## Scope of the Work

* Chaos tooling for CNF workloads
* ANT support for chaos testing
* VNF/VM is not the focus
* All chaos tasks are delivered as self-contained Docker images

## POC Work

![](https://i.imgur.com/EeW3s4U.png)

## Target Cluster Components

#### Argo Workflow

A workflow has a specific set of actions that it executes in a predefined order.
`Argo Workflow` is used for creating the workflow.

* Argo Workflow will be pre-installed on the `target` Kubernetes cluster.
* The user's selection of `tasks` and `inputs` at the ANT Portal is packaged into an `Argo workflow` template and submitted to the `target` cluster.

#### Litmus

Litmus is a toolset for cloud-native chaos engineering. Litmus provides tools to orchestrate chaos on Kubernetes to help developers and SREs find weaknesses in their application deployments.

* Litmus will be pre-installed on the `target` Kubernetes cluster.
* Argo Workflow will trigger Litmus based on the user's input from the ANT Portal.

Litmus broadly groups Kubernetes `chaos experiments` into two categories: `application` (or `pod-level`) and `platform` (or `infra-level`) experiments.

* pod-level experiments include `pod-delete, container-kill, pod-cpu-hog, pod-network-loss, etc.`
* infra-level experiments include `node-drain, disk-loss, node-cpu-hog, etc.`

## Types of Chaos

`Chaos Tasks` are atomic events, each targeting one narrow, specific action to be executed on the target cluster.

* Pod Chaos Task - `Stop/Delete` a Pod
* Node Chaos Task - `Stop/Delete` a Node
* Network Chaos Task - `Drop Packets/Delay Packets/Throttle` the interface of a Pod or a Node

## Client Components

### ANT-Testcase

### Internals of an ANT-Testcase

A typical ANT testcase is a combination of three files:

1. Metadata (user interface and other input parameter definitions)
1. Robot (glue)
1. Python (business logic)

```mermaid
graph TB
    CLI[CLI exec]--Rest API-->Robot
    User--Test Meta-Data-->UI
    subgraph ANT TestCase
        UI[UI]-->Robot
        subgraph Core
            Robot--Triggers-->Python
        end
    end
```

#### User Interface

ANT provides a simple programmatic way to generate custom UI inputs for the testcase. The UI is then rendered based on the JSON defined in the metadata file of the testcase.

For data-driven tests, parameters play an important role. All the required parameters of a testcase are defined in the testcase's metadata file.
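As a rough illustration of how data-driven parameters might be resolved, the sketch below merges user inputs with defaults taken from a metadata definition. ANT's real metadata schema is not shown in this document, so the field names (`name`, `required`, `default`), the sample parameters, and the `resolve_parameters` helper are all hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical metadata: NOT ANT's actual schema, only an illustration.
METADATA_JSON = """
{
  "parameters": [
    {"name": "namespace", "required": true},
    {"name": "chaos_duration_sec", "required": false, "default": 60}
  ]
}
"""


def resolve_parameters(metadata: Dict[str, Any],
                       user_inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user inputs with metadata defaults; reject missing required values."""
    resolved: Dict[str, Any] = {}
    for param in metadata["parameters"]:
        name = param["name"]
        if name in user_inputs:
            # A value supplied by the user always wins.
            resolved[name] = user_inputs[name]
        elif "default" in param:
            # Fall back to the default declared in the metadata.
            resolved[name] = param["default"]
        elif param.get("required", False):
            raise ValueError(f"missing required parameter: {name}")
    return resolved


metadata = json.loads(METADATA_JSON)
params = resolve_parameters(metadata, {"namespace": "demo"})
print(params)
```

Keeping the parameter definitions in JSON, separate from the Python logic, is what lets the same testcase code run against different inputs without modification.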
#### Robot Script

Using the Robot Framework, the testcase/chaos logic implemented in Python is executed. Executing an ANT testcase triggers the Robot script within the ANT engine, which in turn executes the "Test Cases".

#### Python Script

ANT allows you to customize the business logic within a Python file and execute it for your testcase.

More Reading: [ANT Developer Guide](https://wiki.web.att.com/display/ANT/Automated+Network+Testing)

#### TestCase with Argo Workflow

```
=
├── Makefile
├── README.rst
├── antchaos
│   ├── ant
│   │   └── test_scripts
│   │       ├── __init__.py
│   │       ├── startchaos                          <-- TestCase Name
│   │       │   ├── script_meta_data                <-- User Interface
│   │       │   │   ├── Default_Filter.json
│   │       │   │   ├── UI_Columns.json
│   │       │   │   ├── namespace.json
│   │       │   │   ├── startchaos.json
│   │       │   │   └── vars
│   │       │   │       ├── chaos_test_data.json
│   │       │   │       ├── chaos_test_settings.json
│   │       │   │       ├── k8s_cluster_details.json
│   │       │   │       ├── k8s_details.json
│   │       │   │       └── logId.json
│   │       │   └── test_cases
│   │       │       ├── antchaos.robot              <-- Robot Script
│   │       │       └── lib
│   │       │           └── ChaosWorkflowStart.py   <-- create argo workflow logic
│   │       └── stopchaos                           <-- TestCase Name
│   │           ├── script_meta_data                <-- User Interface
│   │           │   ├── Default_Filter.json
│   │           │   ├── namespace.json
│   │           │   ├── stopchaos.json
│   │           │   └── vars
│   │           │       ├── k8s_cluster_details.json
│   │           │       ├── k8s_details.json
│   │           │       └── logId.json
│   │           └── test_cases
│   │               ├── antchaos.robot              <-- Robot Script
│   │               └── lib
│   │                   └── ChaosWorkflowStop.py    <-- create argo workflow logic
│   ├── common                                      <-- Common Python modules
│   │   └── __init__.py
│   └── tests
│       ├── functional
│       │   └── __init__.py
│       └── units
│           └── __init__.py
├── build
│   ├── startchaos.zip
│   └── stopchaos.zip
├── docs
│   ├── Makefile
│   ├── conf.py
│   ├── history.rst
│   ├── index.rst
│   ├── installation.rst
│   ├── make.bat
│   └── readme.rst
├── pylintrc
├── test-requirements.txt
├── tools
│   ├── README.rst
│   ├── __init__.py
│   ├── builder.py
│   ├── metadata.py
│   └── yapf-with-message.sh
└── tox.ini
```

## Chaos TestCase Flow

A testcase as part of
ANT will use the Kubernetes RESTful API for the following:

* Create & Execute Chaos
* Stop Chaos
* Retrieve Logs

### Sequence Diagram

```mermaid
sequenceDiagram
    participant ATE
    participant TestCase
    participant k8s
    participant Argo Workflow
    participant Litmus
    rect rgba(0, 0, 255, .1)
    note over ATE,Litmus: 1. Authenticate
    ATE->>+TestCase: ANT user triggers the Chaos TestCase
    TestCase->>+k8s: authenticate
    end
    par chaos actions
    note over ATE,Litmus: 2. Trigger Chaos
    TestCase->>+k8s: Submit the Argo Workflow to the k8s API
    k8s ->>+ Argo Workflow: CR starts the workflow
    Argo Workflow ->>+ Litmus: CR starts the Chaos
    Litmus->>-k8s: Creates the Chaos pods
    k8s-->>-TestCase: cont
    TestCase-->>-ATE: next steps
    rect rgba(0, 0, 255, .1)
    note over ATE,Litmus: 3. Stop Chaos
    ATE ->>+ TestCase: ANT user triggers the stop Chaos TestCase
    TestCase ->>+ k8s: Trigger Stop Chaos Action
    k8s ->>+Argo Workflow: Fetch logs and Delete CR
    k8s->>-TestCase: chaos stopped
    TestCase-->>-ATE: next steps
    end
    end
    note over ATE,Litmus: 4. Fetch Logs
    ATE ->>+ TestCase: fetch test logs
    TestCase ->>+ k8s : Fetch logs K8S API
    k8s ->>+Argo Workflow: logs
    k8s ->>- TestCase: run logs
    TestCase->>-ATE: test logs
```

## Initial Use Case : Network Jitter

Network jitter, as used here, means delaying or dropping data packets on an interface to simulate real-world congestion. Linux provides the [TC (Traffic Control)](https://man7.org/linux/man-pages/man8/tc.8.html) tool to shape network traffic.

At the container level, there are multiple tools for simulating network jitter:

* [Litmus](https://litmuschaos.io)

## References

* [Code Repository](https://gerrit.mtn5.cci.att.com/gitweb?p=nc-airship-tech-evaluation.git;a=tree;f=mobility_chaos/ant_testcases;hb=refs/heads/main) (Note: under active development.)
* [Kubernetes Restful API](https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CustomObjectsApi.md#create_namespaced_custom_object)
* [ANT](https://wiki.web.att.com/display/ANT/Automated+Network+Testing)
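
To make the "Create & Execute Chaos" step of the Chaos TestCase Flow more concrete, here is a minimal sketch of how a testcase could assemble an Argo Workflow manifest from the selected chaos tasks and submit it through the `create_namespaced_custom_object` call referenced above. It is not the code from `ChaosWorkflowStart.py`. The helper names, the task names, and the container images are hypothetical. It also assumes each chaos task runs as a plain container step, whereas the prototype may drive Litmus differently.

```python
from typing import Any, Dict, List

# Argo Workflows custom resource coordinates.
ARGO_GROUP = "argoproj.io"
ARGO_VERSION = "v1alpha1"
ARGO_PLURAL = "workflows"


def build_chaos_workflow(name_prefix: str,
                         tasks: List[Dict[str, Any]],
                         parallel: bool = False) -> Dict[str, Any]:
    """Build a Workflow manifest with one container template per chaos task."""
    task_templates = [
        {
            "name": task["name"],
            "container": {"image": task["image"], "args": task.get("args", [])},
        }
        for task in tasks
    ]
    refs = [{"name": task["name"], "template": task["name"]} for task in tasks]
    # Argo "steps": the outer list runs sequentially, each inner list in parallel.
    steps = [refs] if parallel else [[ref] for ref in refs]
    return {
        "apiVersion": f"{ARGO_GROUP}/{ARGO_VERSION}",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name_prefix}-"},
        "spec": {
            "entrypoint": "main",
            "templates": [{"name": "main", "steps": steps}] + task_templates,
        },
    }


def submit_workflow(manifest: Dict[str, Any], namespace: str) -> Dict[str, Any]:
    """Submit the manifest to the target cluster (needs the `kubernetes` package)."""
    from kubernetes import client, config  # imported lazily; not needed to build

    config.load_kube_config()  # assumes a local kubeconfig for the target cluster
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(
        group=ARGO_GROUP,
        version=ARGO_VERSION,
        namespace=namespace,
        plural=ARGO_PLURAL,
        body=manifest,
    )


# Hypothetical chaos tasks; the image names are placeholders, not real images.
chaos_tasks = [
    {"name": "pod-delete", "image": "example.invalid/chaos/pod-delete:latest"},
    {"name": "network-loss", "image": "example.invalid/chaos/network-loss:latest"},
]
serial_wf = build_chaos_workflow("ant-chaos", chaos_tasks)
parallel_wf = build_chaos_workflow("ant-chaos", chaos_tasks, parallel=True)
```

Calling `submit_workflow(serial_wf, "<namespace>")` would then create the Workflow resource on the cluster. The `parallel` flag maps onto the `serial` and `parallel` execution modes listed in the Mobility requirements.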