# Pipeline pooling

## Flow

![MicrosoftTeams-image (16).png](https://hackmd.io/_uploads/S112MTemT.jpg)

### Options for running the pipeline

1. Run the pool of machines inside an environment
    - Create a dedicated environment for each pipeline run
2. Run the pipeline inside a large environment
    - Add tags to control which machines are used for running.
3. Create release-specific pools
    - River, Dev, Main, Performance, etc.

### Creation of the machine:

- You need to know the requirements (tools/apps/software/template) needed to configure a machine properly without any failures.
    - Use the existing template for the release agent
- Don't create machines randomly; look at the demand and then configure.
    - Dynamically create based on the demand
- One generic script/tool that can adapt to any requirement.
- Static pool of machines
- On-demand machine creation, and deletion once the machine is no longer required

Approach decided:
- Have a static pool of machines, reused for running the pipelines
    - All of these machines will be in a single pool.
    - Revisit after the complete concept
- Dynamically creating based on demand
    - Next iteration

### Choosing the artifact

- Remains as it is today

Approach decided:
- If a wrong artifact is configured in the yaml pipeline, we rely on the default ADO pipeline itself for error handling

### Choosing the target machines

![Pooling.PNG](https://hackmd.io/_uploads/BJ6o66gma.png)

- Capability
- Tag Machine
    - with specific string
- Life cycle of machines in pooled env

![image.png](https://hackmd.io/_uploads/SJNnjTlQT.png)

- Environment locks
    - What level of locks is to be implemented?

Approach decided:
- Concept analysis: Explore the capabilities and suitability of environment locks and then decide the approach to be taken

### Deploying the artifact

- Dynamic backup creation
    - Only in the second iteration
- [Deployment server jobs](https://helios.healthcare.siemens.com/tfs/Projects/Numaris/_settings/agentqueues?queueId=382&view=jobs)
    - Analysis of what is actually happening on the deployment server
    - What options do we have to run in parallel?

### Running the stage in parallel

- Nothing specific foreseen, it should run as it is done today
    - Hard-coded parallel agents: 15

### Running the stage in series within the same machine

- Multiple stages run on the same machine, with reboots in between
- Adding dynamic capabilities for the agents and configuring the static demands as part of the stages
- Explore yaml dependency analysis to know if there are previous stages run on the same machine: https://stackoverflow.com/questions/62758900/how-to-use-a-single-agent-for-multiple-jobs-stages-in-azure-devops-yaml-pipeline

### Running the tests in parallel

- Nothing specific foreseen, should continue to work like today

### Reboot in between the stage

- Use capabilities such as ReleaseId+EnvId(StageId)+phaseId (in case of parallelism) = unique for every stage run
- Explore use of the environment lock mechanism to handle the reboot

### Document generation/EGA for parallel pipelines?

- Check if there are any consequences of running parallel doc generation
- EGA - 2h, doc gen - 1h --> Can it induce any delay?

## Corner cases

### Running the pipeline by selecting subset of stages

- The agents that need to be earmarked should be decided on the fly at the beginning.
- Machines need to be set prior to running the subset of stages
- There will be a cleanup stage (remove tags/capabilities) - this should be locked (not sure if possible).
- YAML knows it during runtime.

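As a rough illustration of deciding the earmarks up front for a chosen subset of stages, the sketch below derives one capability string per stage run from the ReleaseId+StageId+phaseId idea noted above, and lists what a cleanup stage would have to remove. The string format and the names `StageRun`, `earmark_capability`, `plan_earmarks` and `cleanup_list` are hypothetical and not part of any existing script; setting the capability on a real agent is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRun:
    release_id: int
    stage_id: str
    phase_id: int = 0  # only varies when a stage fans out in parallel


def earmark_capability(run: StageRun) -> str:
    """Build a capability string that is unique per stage run (format is an assumption)."""
    return f"R{run.release_id}_S{run.stage_id}_P{run.phase_id}"


def plan_earmarks(release_id, selected_stages, parallel_phases=None):
    """Return {stage: [capability, ...]} for the selected stages only.

    parallel_phases maps a stage name to its number of parallel phases;
    stages not listed there are treated as a single phase.
    """
    parallel_phases = parallel_phases or {}
    plan = {}
    for stage in selected_stages:
        phases = parallel_phases.get(stage, 1)
        plan[stage] = [
            earmark_capability(StageRun(release_id, stage, phase))
            for phase in range(phases)
        ]
    return plan


def cleanup_list(plan):
    """Flatten the plan into the capabilities the cleanup stage must remove."""
    return [cap for caps in plan.values() for cap in caps]


if __name__ == "__main__":
    plan = plan_earmarks(42, ["Deploy", "FIT"], {"FIT": 2})
    print(plan)
    print(cleanup_list(plan))
```

Computing the whole plan before the first stage starts means the cleanup stage can work from the same list, even if a stage in the middle fails.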
### Deployment/Stage runs into an error in between

- Define rules for when to retain a machine and when not
    - Define default wait time before we retire the machine
    - Export basic logs - like SaveLogs, etc.
    - Enforce a limit on the number of machines that can be retained by a release -> free the first machine, then retain the second machine
- Define whether any error in the Deployment can be taken by IDP for analysis; if any FITs fail, they should be assigned to the team.
- Need to retain the machine for analysis.
- Retrigger should select the same machine. (can be resolved by a proper combination of capabilities)
- Mechanism to remove tags/capabilities once retrigger/analysis is done (whether success or failure).
- An option to manually retain the machine for analysis is needed
- Fix a time period for how long a machine has to be retained
- How do we handle the same failures on multiple pipeline runs?

### Running a developer pipeline

Specific pipelines for running E2E tests on a feature branch
- Will there be a full Git build enabled for feature branches?
- Separate YAML files for every dev release definition
- Specify the stages for each pipeline
- Limit the allowed parallel release runs on developer releases
- Option: Dedicated pool for Dev releases?

### Hardcoded IP/Hostnames in stages

- Analyze if there are any specific stages where hardcoded hostnames or IP addresses are configured. These need to be adapted.

## Use cases

## Introduction

This document talks about the various use cases for release pipeline pooling and the things we need to keep in mind for each scenario.

## Assumptions

- A pool of machines exists. (Number of machines to be decided later)
- All release pipelines (River Release + Team release) use the same pool.

### Single release triggered on River.Night.

- Release selects and earmarks a few agents to run the tests.
- Deploys the latest package on these agents.
- Runs the FIT tests.
- Releases the agents once the release is successful.

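The earmark/release cycle in this use case can be sketched with a small in-memory model. This is only an illustration under assumptions: `AgentPool`, `earmark` and `release` are hypothetical names, and a real implementation would set and clear capabilities or tags on ADO agents rather than update a dictionary.

```python
class AgentPool:
    """In-memory stand-in for the shared pool of machines (illustrative only)."""

    def __init__(self, agent_names):
        # None means the agent is free; otherwise the value is the owning release id.
        self._owner = {name: None for name in agent_names}

    def earmark(self, release_id, count):
        """Reserve `count` free agents for a release, or none at all."""
        free = [name for name, owner in self._owner.items() if owner is None]
        if len(free) < count:
            return []  # caller decides whether to wait or park the release
        chosen = free[:count]
        for name in chosen:
            self._owner[name] = release_id
        return chosen

    def release(self, release_id):
        """Free every agent earmarked by the given release."""
        freed = [name for name, owner in self._owner.items() if owner == release_id]
        for name in freed:
            self._owner[name] = None
        return freed


if __name__ == "__main__":
    pool = AgentPool(["A1", "A2", "A3"])
    print(pool.earmark("river-1", 2))  # first release reserves two agents
    print(pool.earmark("river-2", 2))  # not enough free agents left
    print(pool.release("river-1"))     # first release hands its agents back
    print(pool.earmark("river-2", 2))  # second release can now proceed
```

Reserving all agents or none keeps a release from holding a partial set while it waits, which matters once several releases share the same pool.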
### First release triggered, second triggered after first is complete.

- First release is triggered.
- Second release is triggered after the first is complete.
- The first release should release the machines earmarked for its FIT execution.

### First release triggered, second release triggered halfway through the first.

- First release is triggered.
- Second release is triggered before the first is complete.
- The second release should not choose the agents earmarked for the first release.
- Deployment server (or servers) should be able to facilitate both releases.

### Two releases triggered at the same time.

### First release cancelled, second release triggered after first release cancellation complete.

- Explore running a "finally" block for the cancelled pipeline
- Next release shall take care of cleanup of capabilities previously earmarked

### Release triggered but no agents available.

- Park in the prep pool?

### Release runs but one or more stages fail.

### Release runs, a stage fails, stage is retriggered.

### Release runs, blue screen/black screen during deployment.

### Approach to be followed

- Goal: Use yaml as the primary pipeline and classic only on a need basis
- Select 2 drop-only FITs and create a small pipeline which can give faster FITs
- Use the yaml release pipeline machines/Git Hackathon machines/Classic pipeline machines for pipeline pooling experimentation
- Deployment server
    - Analysis of the complete flow - Data flow diagram
    - Parallel jobs through parallel agents on the same machine?
    - Multiple machines with deployment server?
    - Uploading of the same package to multiple servers in case of parallel setup - any other optimization possible?
    - Containerization of deployment server?
- Environment locking mechanism:
    - Series of stages - e.g. Ganges stages - within a stage, reboots in between
    - Impact on locking when there is an error in between
    - Has to be simulated using a small pipeline and demoed

### Clarifications with Microsoft

- How should the pipeline architecture be designed with environments?
- Can we establish a channel with Microsoft to review the design and understand what exactly their recommendation is?
- Predefined variables are not available at compile time, so we are not able to use them in environment properties (Example: Build.BuildNumber in Tags)
- How does the reboot behavior of the target machines need to be handled?

## Deployment Server

- Stages running on the deployment server are queued and queue times are high.
- In the Git world we expect a large number of pipeline runs, hence the load on the deployment server will be high.

### Multiple deployment servers based on vlan.

Assumption:
- River pipeline is a dedicated pipeline to test the quality of the dev and main branches in one dedicated vlan.

- Have multiple deployment servers in different vlans.
- Deployment servers are segregated based on priority.
    - 1 deployment server and pool for River pipelines.
    - 1 deployment server and pool for Team release pipelines.
- Pools are segregated based on vlans.

vlan Suggestion:
- River : main and integ
-

Advantages:
- Load is distributed.
- Queue time on deployment server reduced.

Disadvantage:
-

### Single deployment server with multiple agents

- Check with dev team if multiple instances of the deployment server are possible.

Not feasible, as multiple instances of the deployment server are not possible.

### Sync all packages on 1 deployment server 1 vlan

- All packages are synced.
- Strict retention policy to prevent the disk from filling up.

Advantages:
- One deployment server to maintain.
- Release pipeline setup remains the same.

Disadvantages:
- Space on the deployment server might fill up. There might not be enough space if all packages are synced.
- Bottleneck for release pipelines (depending on wait time).

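One possible shape for the strict retention policy mentioned for the single-server option is "keep the newest N synced packages, delete the rest". The sketch below assumes exactly that rule; the function name, the `(name, synced_at)` tuple layout and the choice of N are hypothetical, and the real policy may well need to be per branch or size-based instead.

```python
from datetime import datetime


def packages_to_delete(packages, keep_latest):
    """Return the names of packages to delete, oldest first.

    packages: list of (name, synced_at) tuples.
    keep_latest: how many of the most recently synced packages to keep.
    """
    if keep_latest < 0:
        raise ValueError("keep_latest must be zero or positive")
    newest_first = sorted(packages, key=lambda item: item[1], reverse=True)
    stale = newest_first[keep_latest:]
    return [name for name, _ in reversed(stale)]


if __name__ == "__main__":
    synced = [
        ("pkg_a", datetime(2024, 1, 1)),
        ("pkg_b", datetime(2024, 1, 3)),
        ("pkg_c", datetime(2024, 1, 2)),
    ]
    print(packages_to_delete(synced, keep_latest=1))
```

Returning the list instead of deleting directly makes it easy to log or review what would be removed before the policy is enforced on the deployment server.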
### Deployment server - Need Info

- Can multiple instances of the deployment server run on the same machine?
- What is the limit on the number of machines for 1 deployment server?
- Is it possible to have multiple deployment servers under 1 vlan?

### Target for first iteration of Pooling

- Remove one-to-one mapping from pipeline to machines.
- All FITs running in pipeline including Numaris and Host only FITs.
- Run 3 parallel pipelines.
- Yaml pipeline status in Dashboard.

QR Phase 1:
- Auto selection of agents - block for release pipeline.
- One run that runs on any agent selected dynamically.
- Configuration of the deployment stage is outside of FIT/Deployment. Deployment into FIT stages.

Next sprint:
- Identify 2/3 FITs with small execution time
- Create a very basic pipeline that runs on a pool using the ADO default pooling mechanism.
- Start implementation of agent selection and locking.
- Create new vlan, configure deployment server, configure and set up agents.

### Next feature backlog:

Done:
- Get rid of 2nd agent (Environment code) - Done

ToDo:
- Move prerequisite files for FIT tests to the git repo, e.g. Ganges
- Machine allocator dedicated pool with single agent
    - Move to infra machine
- WaitForAgentsOnline optimize/new solution
    - Current problem: Each deployment needs
    - Serverless jobs?
    - More agents?
- IDP releasePoolScripts: Enhanced, new Infra-machines?
    - Move to infra machines, docker?
- DeploymentServer optimization, too much traffic
- New VLAN creation and movement of the machines
    - Create machines in the new VLAN
- Support to Retain machines
    - Concept work ongoing in this QR
- Developer release - separate new dev pipeline
    - Separate pools
- Security permissions on the agent pools
    - Don't allow pipelines unless granted to run
- Custom system type based dynamic backup for hosts
- Running the pipeline by selecting subset of stages
    - The agents that need to be earmarked should be decided on the fly at the beginning.
    - Test agents name - solve C# code - 'T_' problem
        - Context: Deployment groups, test agents start with "T_"
    - Machines need to be set prior to running the subset of stages
    - There will be a cleanup stage (remove tags/capabilities) - this should be locked (not sure if possible).
    - YAML knows it during runtime.
- Deployment/Stage runs into an error in between
    - Define rules for when to retain a machine and when not
        - Define default wait time before we retire the machine
        - Export basic logs - like SaveLogs, etc.
        - Enforce a limit on the number of machines that can be retained by a release -> free the first machine, then retain the second machine
    - Define whether any error in the Deployment can be taken by IDP for analysis; if any FITs fail, they should be assigned to the team.
    - Need to retain the machine for analysis.
    - Retrigger should select the same machine. (can be resolved by a proper combination of capabilities)
    - Mechanism to remove tags/capabilities once retrigger/analysis is done (whether success or failure).
    - An option to manually retain the machine for analysis is needed
    - Fix a time period for how long a machine has to be retained
    - How do we handle the same failures on multiple pipeline runs?
- Running a developer pipeline
    - Specific pipelines for running E2E tests on a feature branch
    - Will there be a full Git build enabled for feature branches?
    - Separate YAML files for every dev release definition
    - Specify the stages for each pipeline
    - Limit the allowed parallel release runs on developer releases
    - Option: Dedicated pool for Dev releases?
- Hardcoded IP/Hostnames in stages
    - Analyze if there are any specific stages where hardcoded hostnames or IP addresses are configured. These need to be adapted.
- Mars as part of release pipeline
    - Carsten Prinz requirement, need to additionally discuss and conclude.
- Pet server as part of the release pipeline
    - Discuss and conclude with Petrowe (Somaraj)

## Feature 1057655: Completion of yaml based release pipeline pooling

### Refining

- POC of retain agent concepts
- Implement the selected concept.
- Pipeline to ensure agents are released back into pool if more than <certain days>
- Improve Wait for agents online
- Write integration tests to test pooling
- Create yaml stage for Ant FI
- Identify and execute tests to ensure pooling is working.
- Running the pipeline by selecting subset of stages
    - Based on param inputs
- Developer pipeline support
- Concept for handling a large number of requests.
- Estimation of how many machines are needed.
- Identify use cases
- Lock some stages from being edited, e.g. agent release.

IP Sprint:
- Concept: Try and mitigate deployment server bottleneck
- Concept: Dynamically decide number of agents required

### Miscellaneous stages

- Analyse pooling for these stages
    - Scaling
    - Document generation
    - Ega trace runner
    - ETP creation
    - MR protocol creation

#### Maybe

In case atlantic agrees, raise an IDP
- Create new machines and migrate to new vlan
- Set up new deployment server in new vlan
- Test and confirm it works.

## Feature 1046629: Enable Build Dashboard to view Yaml-Release-Results

## Unit test targets:

Have unit tests written to check:
* Sanctity of yml
* Dependency check
* Functionality check

## Integration/System tests for pipeline

Goal: Verify the changes in yaml and related tools before the actual pipeline runs

Aspects to be tested:
1. Scripts, yaml
    - Before rolling out the actual pipeline: allocate, run the pipeline, deallocate --> If the entire flow works without failure, we have verified that the changes aren't making the system unstable
2. Timeout on unavailability of agents
3. Script runs on a different agent pool (currently 1 agent); when the number of runs exceeds the number of available agents, what is the result?
4. Subset of stages chosen, check the number of machines allocated/de-allocated

## User Stories:

### Feature 1057655: Completion of yaml based release pipeline pooling

- As an IDP-Dev I want to write the yaml template for Ant FI so that the yaml River release pipeline can be made the primary pipeline. (Not pooling, but important)
    - 3 PDs (Suggestion: Test on Gharial, rollback on Gharial and trigger FI)
- POC 3 Introduce an Agentless wait Job
    - 3 PDs
- POC 1/2 Integration of De-allocation and Retention as Stage Steps
    - 3 PDs
- As an IDP-Developer, I want to implement the "Retain agent" concept and write appropriate tests. (Should we try a POC of the different concepts?)
    - 6 PDs
- As an IDP-Dev I want to identify test cases for release pipeline pooling.
- As an IDP-Dev I want to create a concept for automated testing of release pipeline pooling
    - (Identify test cases, concept for automation, and automating tests)
    - 8 PDs
- As an IDP-Developer, I want to create a service that will release agents that have been retained for more than "x" days. The value of "x" needs to be decided as part of this user story.
    - 4 PDs
- As an IDP-Developer, I want to implement a redesign for "Wait for Agents Online"
    - 8 PDs
- As an IDP-Developer, I want to implement dynamically identifying the number of agents needed
    - 8 PDs

Total: 43 PDs

### Feature 1046629: Enable Build Dashboard to view Yaml-Release-Results

- As an IDP-Developer, I want to update the code to display data from yaml based pipelines.
- As an IDP-Developer, I want to write unit tests for yaml based pipeline code.
- As an IDP-Developer, I want to host the pipeline statistics code on an IDP owned virtual machine.
- As an IDP-Developer, I want to automate the deployment of the pipeline statistics tool.
- As an IDP-Developer, I want to investigate migration of angular to the latest version.

### Feature 1062466: Infrastructure for Pooling

- As an IDP-Developer, I want to coordinate the creation of a new vlan.
    - 5 PDs
- As an IDP-Developer, I want to create a concept for how agent pools will be utilized in pooling and implement it. Include detailed estimations for the infrastructure required.
    - 8 PDs
    - Design of the pools - dev only, river only etc
    - Numaris and drop only
    - Revisit estimation
    - Other infra pools like "Wait for agent pool", "Agent allocation/Deallocation"
    - Document and present
- As an IDP-Developer, I want to set up a deployment server on the new vlan.
    - 5 PDs
- As an IDP-Developer, I want to create new VMs and configure agents on these machines.
    - 13 PDs
    - Configure VMs ~500 (procurement of hardware might take time)
    - Register VMs to ipam
    - Firewall rule adaptation
    - Configure agents on VMs
    - Initial Host configurator
- Test deployment server max load
    - Discuss with Indus team.
    - Decide on how many vlans, if needed.

Total: 31 PDs

Risks: Deployment server bottleneck. Hardware availability.
- Talk to Marcus and get clarity on refactoring in deployment server.
- Ask wurmloch how network segmentation can be done.
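
For the user story about releasing agents that have been retained for more than "x" days, the core check could look roughly like the sketch below. It only decides which agents are overdue; how retention timestamps are stored and how an agent is actually handed back to the pool are left open. The function name, the dictionary layout and the example agent names are hypothetical.

```python
from datetime import datetime, timedelta


def agents_to_release(retained_since, max_days, now):
    """Return the sorted names of agents retained for longer than max_days.

    retained_since: {agent_name: datetime when the agent was retained}.
    now: the reference time, passed in so the check is easy to test.
    """
    limit = timedelta(days=max_days)
    return sorted(
        name for name, since in retained_since.items() if now - since > limit
    )


if __name__ == "__main__":
    retained = {
        "T_01": datetime(2024, 1, 1),
        "T_02": datetime(2024, 1, 8),
        "T_03": datetime(2024, 1, 5),
    }
    print(agents_to_release(retained, max_days=5, now=datetime(2024, 1, 10)))
```

An agent retained for exactly `max_days` is kept in this sketch; whether the boundary should be inclusive is one of the details to settle together with the value of "x".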