# Pipeline pooling
## Flow

### Options for running the pipeline
1. Run the pool of machines inside an environment - Create a dedicated environment for each pipeline run
2. Run the pipeline inside a large environment - Add tags to control which machines are used for running.
3. Create release-specific pools - River, Dev, Main, Performance, etc.
### Creation of the machine:
- You need to know the requirements (tools, apps, software, template) for configuring a machine without failures. - Use the existing template for the release agent
- Don't create randomly, look for the demand and then configure. - Dynamically create based on the demand
- One generic script/tool that can adapt to any requirement.
- Static pool of machines
- On-demand based machine creation and deletion once it is not required
Approach decided:
- Have a static pool of machines, reuse for running the pipelines - All of these machines will be in a single pool. - Revisit after the complete concept
- Dynamically creating based on demand - Next iteration
### Choosing the artifact
- Remains like how it is today
Approach decided:
- If a wrong artifact is configured in the yaml pipeline, we rely on default ADO pipeline itself for error handling
### Choosing the target machines

- Capability
- Tag Machine - with specific string
- Life cycle of machines in pooled env

- Environment locks
- What level locks are to be implemented?
Approach decided:
- Concept analysis: Explore capabilities and suitability of environment locks and then decide the approach to be taken
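
The capability and tag based selection listed above can be sketched as follows. This is a minimal illustration under assumptions, not the pipeline's implementation: the `Agent` class, the `matching_agents` function and the example capability names are hypothetical and only model the idea of filtering a pool by demands and a tag string.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical in-memory view of a pooled machine."""
    name: str
    capabilities: dict = field(default_factory=dict)
    tags: set = field(default_factory=set)


def matching_agents(agents, demands, required_tag=None):
    """Return agents that satisfy every demand and carry the tag (if one is given)."""
    result = []
    for agent in agents:
        has_caps = all(agent.capabilities.get(k) == v for k, v in demands.items())
        has_tag = required_tag is None or required_tag in agent.tags
        if has_caps and has_tag:
            result.append(agent)
    return result


# Illustrative data only; capability names and values are assumptions.
pool = [
    Agent("T_001", {"SystemType": "Host"}, {"River"}),
    Agent("T_002", {"SystemType": "Host"}, {"Dev"}),
    Agent("T_003", {"SystemType": "Numaris"}, {"River"}),
]
print([a.name for a in matching_agents(pool, {"SystemType": "Host"}, "River")])
```

Keeping capabilities (what a machine can do) separate from tags (who is currently using it) makes it easier to remove only the tags in a cleanup step.
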
### Deploying the artifact
- Dynamic backup creation - Only in the second iteration
- [Deployment server jobs](https://helios.healthcare.siemens.com/tfs/Projects/Numaris/_settings/agentqueues?queueId=382&view=jobs)
- Analysis of what is actually happening on the deployment server
- What options do we have to run in parallel?
### Running the stage in parallel
- Nothing specific foreseen, it should run as it is done today
- Hard-coded parallel agents - 15
### Running the stage in series within the same machine
- Multiple Stages run on same machine, reboots in between
- Adding dynamic capabilities for the agents and configuring the static demands as part of the stages
- Explore yaml dependency analysis to know if there are previous stages run on the same machine: https://stackoverflow.com/questions/62758900/how-to-use-a-single-agent-for-multiple-jobs-stages-in-azure-devops-yaml-pipeline
### Running the tests in parallel
- Nothing specific foreseen, should continue to work like today
### Reboot in between the stage
- Use capabilities such as ReleaseId+EnvId(StageId)+phaseId(in case of parallelism) = unique for every stage run
- Explore use of environment lock mechanism to handle the reboot
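
The unique-key idea above can be sketched as follows. Only the three ID components come from these notes; the function name `stage_run_key`, the `-` separator and the `p` prefix for the phase are assumptions made for this example.

```python
from typing import Optional


def stage_run_key(release_id: int, env_id: int, phase_id: Optional[int] = None) -> str:
    """Build a capability value that is unique for one stage run.

    phase_id is only appended when the stage runs in parallel phases.
    """
    parts = [str(release_id), str(env_id)]
    if phase_id is not None:
        parts.append(f"p{phase_id}")
    return "-".join(parts)


print(stage_run_key(1201, 7))
print(stage_run_key(1201, 7, phase_id=3))
```

Assuming the value is stored on the agent in a way that survives a reboot, the stage can demand the same key again once the machine is back online.
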
### Document generation/EGA for parallel pipelines?
- Check if there are any consequences of running parallel doc generation?
- EGA - 2h, doc gen - 1h --> Can it induce any delay?
## Corner cases
### Running the pipeline by selecting subset of stages
- The agents that need to be earmarked should be decided on the fly at the beginning.
- Machines need to be set prior to running the subset of stages
- There will be a cleanup stage (remove tags/capabilities) - this should be locked (not sure if possible).
- YAML knows it during runtime.
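
Deciding the agents up front for a subset of stages could look roughly like the sketch below. It assumes each stage declares how many agents it needs and that the total is a plain sum; the function name, stage names and counts are hypothetical.

```python
def agents_needed(agents_per_stage, selected_stages):
    """Return how many agents must be earmarked up front for the selected stages."""
    unknown = [s for s in selected_stages if s not in agents_per_stage]
    if unknown:
        raise KeyError(f"unknown stages: {unknown}")
    # Deduplicate so a stage selected twice is only counted once.
    return sum(agents_per_stage[s] for s in set(selected_stages))


# Hypothetical stage names and agent counts, for illustration only.
AGENTS_PER_STAGE = {"Deploy": 1, "HostFITs": 4, "NumarisFITs": 10}
print(agents_needed(AGENTS_PER_STAGE, ["Deploy", "HostFITs"]))
```

If several stages run in series on the same machine, a plain sum overestimates the demand, so the real rule would need to account for shared machines.
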
### Deployment/Stage runs into an error in between
- Define rules when to retain a machine and when not
- Define default wait time before we retire the machine
- Export basic logs - like SaveLogs, etc.
- Enforce a limit on the number of machines that can be retained by a release -> free first machine, then retain second machine
- Define whether any error in the deployment can be taken by IDP for analysis; if any FITs fail, they should be assigned to the team.
- Need to retain the machine, for analysis.
- Retrigger should select the same machine. (can be resolved by proper combination of capabilities)
- Mechanism to remove tags/capabilities once retrigger/analysis is done (whether success or failure).
- An option to manually retain the machine for analysis is needed
- Fix a time period on how long machine has to be retained
- How do we handle same failures on multiple pipeline runs?
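
The retention rules above (a limit per release, "free first machine then retain second machine", and a fixed retention period) can be sketched as below. This is a minimal model under assumptions: the `RetentionTracker` class and its method names are hypothetical, and the limit and period are placeholders for values that still have to be decided.

```python
from collections import deque
from datetime import datetime, timedelta


class RetentionTracker:
    """Tracks machines retained for analysis, oldest first."""

    def __init__(self, max_retained, retain_for):
        if max_retained < 1:
            raise ValueError("max_retained must be at least 1")
        self.max_retained = max_retained
        self.retain_for = retain_for
        self._retained = deque()  # entries are (machine, retained_at)

    def retain(self, machine, now):
        """Retain a machine; return the machines freed to stay within the limit."""
        freed = []
        while len(self._retained) >= self.max_retained:
            freed.append(self._retained.popleft()[0])
        self._retained.append((machine, now))
        return freed

    def expire(self, now):
        """Free machines whose retention period has elapsed."""
        freed = []
        while self._retained and now - self._retained[0][1] >= self.retain_for:
            freed.append(self._retained.popleft()[0])
        return freed

    def retained(self):
        return [machine for machine, _ in self._retained]


start = datetime(2024, 1, 1)
tracker = RetentionTracker(max_retained=1, retain_for=timedelta(days=2))
tracker.retain("machine-1", start)
print(tracker.retain("machine-2", start + timedelta(hours=1)))
print(tracker.expire(start + timedelta(days=3)))
```

The `expire` call is the kind of check a scheduled cleanup pipeline could run to return forgotten machines to the pool.
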
### Running a developer pipeline
Specific pipelines for running E2E tests on a feature branch
- Will there be a Full git build enabled for feature branch?
- Separate YAML files for every dev release definitions
- Specify the stages for each pipeline
- Limited allowed parallel release runs on developer releases
- Option: Dedicated Pool for Dev releases?
### Hardcoded IP/Hostnames in stages
- Analyze if there are any specific stages where hardcoded hostnames or IP addresses are configured. This needs to be adapted
## Use cases
## Introduction
This document talks about the various use cases for release pipeline pooling and the things we need to keep in mind for each scenario.
## Assumptions
- A pool of machines exists. (Number of machines to be decided later)
- All release pipelines (River Release + Team release) use the same pool.
### Single release triggered on River.Night.
- Release selects and earmarks a few agents to run the tests.
- Deploys the latest package on these agents.
- Runs the FIT tests.
- Releases the agents once the release is successful.
### First release triggered, second triggered after first is complete.
- First release is triggered.
- Second release is triggered after the first is complete.
- The first release should release the machines earmarked for its FIT execution.
### First release triggered, second release triggered halfway through the first.
- First release is triggered.
- Second release is triggered before the first is complete.
- The second release should not choose the agents earmarked for the first release.
- Deployment server (or servers) should be able to facilitate both releases.
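
The overlap scenario above depends on earmarking being atomic, so that two releases never pick the same agent. A minimal in-memory sketch of that behaviour follows; the `AgentPool` class and its methods are hypothetical, and a real implementation would have to persist the marks on the agents (for example as tags or capabilities).

```python
import threading


class AgentPool:
    """In-memory stand-in for a static pool; maps agent name -> owning release (or None)."""

    def __init__(self, agent_names):
        self._owner = {name: None for name in agent_names}
        self._lock = threading.Lock()

    def earmark(self, release_id, count):
        """Atomically reserve `count` free agents for a release, or reserve none."""
        with self._lock:
            free = [n for n, owner in self._owner.items() if owner is None]
            if len(free) < count:
                return []
            chosen = free[:count]
            for name in chosen:
                self._owner[name] = release_id
            return chosen

    def release_all(self, release_id):
        """Free every agent earmarked for the given release; return how many were freed."""
        with self._lock:
            freed = [n for n, owner in self._owner.items() if owner == release_id]
            for name in freed:
                self._owner[name] = None
            return len(freed)


pool = AgentPool(["a1", "a2", "a3"])
print(pool.earmark("release-1", 2))
print(pool.earmark("release-2", 2))
print(pool.earmark("release-2", 1))
```

Reserving all requested agents or none avoids two half-allocated releases blocking each other when the pool is nearly full.
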
### Two releases triggered at the same time.
### First release cancelled, second release triggered after first release cancellation complete.
- Explore running a "finally" block for the cancelled pipeline
- Next release shall take care of cleaning up capabilities previously earmarked
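
Both cleanup paths above can be modelled in a few lines. The sketch assumes the earmarks are kept in a simple mapping of agent name to release ID; `earmarked` and `clear_stale_marks` are hypothetical names. In Azure Pipelines YAML, the closest counterpart to the `finally` block is probably a cleanup job with `condition: always()`, which would need to be verified for the cancellation case.

```python
from contextlib import contextmanager


@contextmanager
def earmarked(owners, release_id, agent_names):
    """Earmark agents for the duration of a run and always clear the marks afterwards."""
    for name in agent_names:
        owners[name] = release_id
    try:
        yield agent_names
    finally:
        # Runs on success, failure and cancellation alike.
        for name in agent_names:
            if owners.get(name) == release_id:
                owners[name] = None


def clear_stale_marks(owners, active_release_ids):
    """Fallback for the next release: drop marks left by releases that no longer run."""
    stale = [
        name
        for name, owner in owners.items()
        if owner is not None and owner not in active_release_ids
    ]
    for name in stale:
        owners[name] = None
    return stale


owners = {"a1": None, "a2": "release-0"}
try:
    with earmarked(owners, "release-1", ["a1"]):
        raise KeyboardInterrupt  # simulates a cancelled run
except KeyboardInterrupt:
    pass
print(owners["a1"])
print(clear_stale_marks(owners, {"release-1"}))
```

The fallback matters because a cancelled run may be killed before its own cleanup gets a chance to execute.
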
### Release triggered but no agents available.
- Park in the prep pool?
### Release runs but one or more stages fail.
### Release runs, a stage fails, stage is retriggered.
### Release runs, blue screen/black screen during deployment.
### Approach to be followed
- Goal: Use yaml as the primary pipeline and classic only on need basis
- Select 2 drop-only FITs and create a small pipeline that can give faster FITs
- Use the yaml release pipeline machines/Git Hackathon machines/Classic pipeline machines for pipeline pooling experimentation
- Deployment server
- Analysis of the complete flow - Data flow diagram
- Parallel jobs through parallel agents on same machine?
- Multiple machines with deployment server?
- Uploading of same package to multiple servers in case of parallel setup - any other optimization possible?
- Containerization of deployment server?
- Environment locking mechanism:
- Series of stages - e.g. Ganges stages
- within a stage, reboots in between
- Impact on locking when there is an error in between
- Has to be simulated using a small pipeline and demoed
### Clarifications with Microsoft
- How should the pipeline architecture be designed with environments?
- Can we have a channel established with Microsoft to review the design and understand what exactly is their recommendation?
- Predefined variables are not available at compile time, so we are not able to use them in environment properties (Example: Build.BuildNumber in Tags)
- How should the reboot behavior of the target machines be handled?
## Deployment Server
- Stages running on the deployment server are queued, and queue times are high.
- In the Git world we expect a large number of pipeline runs, hence the load on the deployment server will be high.
### Multiple deployment servers based on vlan.
Assumption:
- River pipeline is a dedicated pipeline to test the quality of the dev and main branches in one dedicated vlan.
- Have multiple deployment servers in different vlans.
- Deployment servers are segregated based on priority.
- 1 deployment server and pool for River pipelines.
- 1 deployment server and pool for Team release pipelines.
- Pools are segregated based on vlans.
vlan Suggestion:
- River : main and integ
-
Advantages:
- Load is distributed.
- Queue time on deployment server reduced.
Disadvantage:
-
### Single deployment server with multiple agents
- Check with dev team if multiple instances of the deployment server are possible.
Not feasible, as multiple instances of the deployment server are not possible.
### Sync all packages on 1 deployment server 1 vlan
- All packages are synced.
- Strict retention policy to prevent disk space fill up.
Advantages:
- One deployment server to maintain.
- Release Pipelines setup remains the same.
Disadvantages:
- Space on deployment server might fill up. Enough space might not be there if all packages are synced.
- Bottle neck for release pipelines (depending on wait time).
### Deployment server - Need Info
- Can multiple instances of the deployment server run on the same machine?
- What is the limit on the number of machines for 1 deployment server?
- Is it possible to have multiple deployment servers under 1 vlan?
### Target for first iteration of Pooling
- Remove one on one mapping from pipeline to machines.
- All FITs running in pipeline including Numaris and Host only FITs.
- Run 3 parallel pipelines.
- Yaml pipeline status in Dashboard.
QR Phase 1:
- Auto selection of agents - block for release pipeline.
- One run that runs on any agent selected dynamically.
- Configuration of the deployment stage is outside of FIT/Deployment. Deployment into FIT stages.
Next sprint:
- Identify 2/3 fits with small execution time - Create a very basic pipeline that runs on a pool using ADO default pooling mechanism.
- Start implementation of agent selection and locking.
- Create new vlan, configure deployment server, configure and set up agents.
### Next feature backlog:
Done:
- Get rid of 2nd agent(Environment code)
- Done
ToDo:
- Move prerequisite files for FIT tests to the git repo, e.g. Ganges
- Machine allocator dedicated pool with single agent
- Move to infra machine
- WaitForAgentsOnline optimize/new solution - Current problem: Each deployment needs
- Serverless jobs?
- More agents?
- IDP releasePoolScripts: Enhanced, new Infra-machines?
- Move to infra machines, docker?
- DeploymentServer optimization, too much traffic
- New VLAN creation and movement of the machines
- Create machines in the new VLAN
- Support to Retain machines
- Concept work ongoing in this QR
- Developer release
- separate new dev pipeline
- Separate pools
- Security permissions on the agent pools
- Don't allow pipelines unless granted to run
- Custom system type based dynamic backup for hosts
- Running the pipeline by selecting subset of stages
- The agents that need to be earmarked should be decided on the fly at the beginning.
- Test agents name - solve C# code - 'T_' problem
- Context: Deployment groups, test agents start with "T_"
- Machines need to be set prior to running the subset of stages
- There will be a cleanup stage (remove tags/capabilities) - this should be locked (not sure if possible).
- YAML knows it during runtime.
- Deployment/Stage runs into an error in between
- Define rules when to retain a machine and when not
- Define default wait time before we retire the machine
- Export basic logs - like SaveLogs, etc.
- Enforce a limit on the number of machines that can be retained by a release -> free first machine, then retain second machine
- Define whether any error in the deployment can be taken by IDP for analysis; if any FITs fail, they should be assigned to the team.
- Need to retain the machine, for analysis.
- Retrigger should select the same machine. (can be resolved by proper combination of capabilities)
- Mechanism to remove tags/capabilities once retrigger/analysis is done (whether success or failure).
- An option to manually retain the machine for analysis is needed
- Fix a time period on how long machine has to be retained
- How do we handle same failures on multiple pipeline runs?
- Running a developer pipeline
- Specific pipelines for running E2E tests on a feature branch
- Will there be a Full git build enabled for feature branch?
- Separate YAML files for every dev release definitions
- Specify the stages for each pipeline
- Limited allowed parallel release runs on developer releases
- Option: Dedicated Pool for Dev releases?
- Hardcoded IP/Hostnames in stages
- Analyze if there are any specific stages where hardcoded hostnames or IP addresses are configured. This needs to be adapted
- Mars as part of release pipeline
- Carsten Prinz requirement, need to additionally discuss and conclude.
- Pet server as part of the release pipeline
- Discuss and conclude with Petrowe (Somaraj)
## Feature 1057655: Completion of yaml based release pipeline pooling
### Refining
- POC of retain agent concepts
- Implement the selected concept.
- Pipeline to ensure agents are released back into pool if more than <certain days>
- Improve Wait for agents online
- Write integration tests to test pooling
- Create yaml stage for Ant FI
- Identify and execute tests to ensure pooling is working.
- Running the pipeline by selecting subset of stages
- Based on param inputs
- Developer pipeline support
- Concept for handling large number of requests.
- Estimation of how many machines needed.
- Identify Use cases
- Lock some stages from being edited e.g: agent release.
IP Sprint:
- Concept: Try and mitigate deployment server bottleneck
- Concept: Dynamically decide number of agents required
### Miscellaneous stages - Analyse pooling for these stages - Scaling
- Document generation,
- Ega trace runner,
- ETP creation,
- MR protocol creation
#### Maybe
In case Atlantic agrees, raise an IDP
- Create new machines and migrate to new vlan
- Set up new deployment server in new vlan - Test and confirm it works.
## Feature 1046629: Enable Build Dashboard to view Yaml-Release-Results
## Unit test targets:
Have unit tests written to check:
* Sanctity of yml
* Dependency check
* Functionality check
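
A dependency check of this kind could be sketched as below. The sketch assumes the stage names and their `dependsOn` lists have already been extracted from the parsed yaml into a plain dictionary; the function name and the example stages are hypothetical.

```python
def check_stage_dependencies(stages):
    """Return a list of problems: unknown dependencies and dependency cycles.

    `stages` maps a stage name to the list of stage names it depends on.
    """
    problems = []
    for name, deps in stages.items():
        for dep in deps:
            if dep not in stages:
                problems.append(f"{name}: unknown dependency '{dep}'")

    state = {}  # stage name -> "visiting" or "done"

    def visit(name):
        """Depth-first search; returns True if a cycle is reachable from `name`."""
        if state.get(name) == "done":
            return False
        if state.get(name) == "visiting":
            return True
        state[name] = "visiting"
        cyclic = False
        for dep in stages[name]:
            if dep in stages and visit(dep):
                cyclic = True
        state[name] = "done"
        return cyclic

    for name in stages:
        if name not in state and visit(name):
            problems.append(f"cycle reachable from '{name}'")
    return problems


# Illustrative stage graph; the names are assumptions.
print(check_stage_dependencies({
    "Allocate": [],
    "Deploy": ["Allocate"],
    "FITs": ["Deploy"],
    "Release": ["FITs", "Cleanup"],
}))
```

A unit test can then assert that the returned list is empty for every pipeline yaml in the repository.
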
## Integration/System tests for pipeline
Goal: Verify the changes in yaml and related tools before the actual pipeline runs
Aspects to be tested:
1. Scripts, yaml - Before rolling out the actual pipeline: allocate, run the pipeline, deallocate --> If the entire flow works without failure, we have verified that the changes aren't making the system unstable
2. Timeout on unavailability of agents
3. Script runs on a different agent pool (currently 1 agent); when the number of runs exceeds the number of available agents, what is the result?
4. Subset of stages chosen, check the number of machines allocated/de-allocated
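
For the timeout aspect, the behaviour under test can be modelled as a polling loop with a deadline. The sketch below is an assumption about how such a wait could work, not the existing allocation script; `wait_for_agents` and its parameters are hypothetical, and the clock and sleep functions are injectable so a test does not have to wait in real time.

```python
import time


def wait_for_agents(get_free_count, needed, timeout_s, poll_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until `needed` agents are free; return True on success, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_free_count() >= needed:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Usage with a fake clock, so the example finishes immediately.
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    fake_now[0] += seconds


free_counts = iter([0, 1, 3])
print(wait_for_agents(lambda: next(free_counts), needed=3, timeout_s=10,
                      poll_s=1, clock=fake_clock, sleep=fake_sleep))
```

An integration test can use the same injection points to check both outcomes: enough agents appearing in time, and a clean failure when they never do.
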
## User Stories:
### Feature 1057655: Completion of yaml based release pipeline pooling
- As an IDP-Dev I want to write the yaml template for Ant FI so that the yaml River release pipeline can be made the primary pipeline.
(Not pooling, but important) - 3 PD
(Suggestion: Test on Gharial, rollback on Gharial and trigger FI)
- POC 3 Introduce an Agentless wait Job - 3 PDs
- POC 1/2 Integration of De-allocation and Retention as Stage Steps - 3 PDs
- As an IDP-Developer, I want to implement "Retain agent" concept and write appropriate tests. (Should we try poc of the different concepts?) - 6 PDs
- As an IDP-Dev I want to identify test cases for release pipeline pooling. - As an IDP-Dev I want to create concept for automated testing release pipeline pooling - (Identify test cases, concept for automation, and automating tests) - 8 PDs
- As an IDP-Developer, I want to create a service that will release agents that have been retained for more than "x" days. The value of "x" needs to be decided as part of this user story. - 4 PDs
- As an IDP-Developer, I want to implement a redesign for "Wait for Agents Online" - 8 PDs
- As an IDP-Developer, I want to implement dynamically identifying the number of agents needed - 8 PDs
Total: 43 PDs
### Feature 1046629: Enable Build Dashboard to view Yaml-Release-Results
- As an IDP-Developer, I want to update the code to display data from yaml based pipelines.
- As an IDP-Developer, I want to write unit tests for yaml based pipeline code.
- As an IDP-Developer, I want to host the pipeline statistics code on an IDP owned virtual machine.
- As an IDP-Developer, I want to automate the deployment of the pipeline statistics tool.
- As an IDP-Developer, I want to investigate migration of angular to latest version.
### Feature 1062466: Infrastructure for Pooling
- As an IDP-Developer, I want to co-ordinate creation of new vlan. - 5 PDs
- As an IDP-Developer, I want to create a concept and implement on how agents pools will be utilized in pooling. Include detailed estimations for the infrastructure required. - 8 PDs
- Design of the pools - dev only, river only etc
- Numaris and drop only
- Revisit estimation
- Other infra pools like "Wait for agent pool", "Agent allocation/Deallocation"
- Document and present
- As an IDP-Developer, I want to set up deployment server on the new vlan. - 5 PDs
- As an IDP-Developer, I want to create new VMs and configure agents on these machines. - 13 PDs
- Configure VMs ~500 (procurement of hardware might take time)
- Register VMs to ipam
- Firewall rule adaptation
- Configure agents on vms - Initial Host configurator
- Test deployment server max load
- Discuss with Indus team.
- Decide on how many vlans if needed.
Total: 31 PDs
Risks:
Deployment server bottleneck.
Hardware availability.
- Talk to Marcus and get clarity on refactoring in deployment server.
- Ask wurmloch how network segmentation can be done.