# Promoter Current status

###### tags: `Design`

## About Promoter

- [ ] https://docs.openstack.org/tripleo-docs/latest/ci/chasing_promotions.html (out-of-date, but provides good info on how promotions work)
- [ ] https://drive.google.com/file/d/15z9lmOaDMD7EXn-wbe_CXAIwewXOHHuj/view?usp=sharing
- [ ] https://drive.google.com/file/d/1e7h67ytwB3Au_Q3T3RcTfKoQFjWbVXhW/view?usp=sharing
- [ ] https://drive.google.com/file/d/1TykL8mPLT5nacYjMW3Pu8qiZYjd7qRPE/view?usp=sharing

## Testing

Refer to https://hackmd.io/kJqHSTWWRMOIfIhvDMGFLg for molecule and zuul jobs design.

## Promoter server instances

We have three instances.

See: https://docs.google.com/document/d/1bHVPN-lcXyY7imLdvj2J7I55s-vtF3ffPG-y2Cj8u38/edit#heading=h.xdi7lcpdgdn9

## Document the manual activities

### Git pull the promoter code in rdo

```bash
# Don't forget to stop the promoter service
sudo systemctl stop dlrnapi-promoter
# The second step is to save all the changes that were made directly on the server
git stash
# Pull the latest version of the promoter code
git pull
# Apply the saved changes to the latest version of the promoter code
git stash pop
# (you may need to resolve the conflicts)
```

### Git pull the promoter code in internal

Updating the code on the internal server is extremely risky. The code in the internal promoter has diverged a lot from the promoter master: there are changes all over the place to make upstream code work with downstream configuration. You can use the same steps as for the RDO server, but be aware that you may need to manually fix A LOT of conflicts.

### Git pull the promoter code in vexxhost

The promoter code on the vexxhost promoter was never updated after the first deployment, and there's really no need to update it until vexxhost becomes the only working promoter server, unless of course there's a fix that is needed for the promotion of the non-component pipeline. In that case the same procedure can be used as for RDO.
The code there is not heavily modified and the only changes I'm expecting are in the release config and in the criteria.

Q: why are we not running master here ^^^
* can't think of a reason ... zbr concerns about performance?
* needs reprovisioning... so we are blocked on getting the deployment playbook work finished.

### Selecting only a portion of the promotion workflow

In a regular promotion workflow, three pieces are promoted:
* containers
* qcow images
* dlrn hashes

These three parts are handled by three different clients in the code, respectively:
* containers_client
* qcow_client
* dlrn_client

By default, all three clients are invoked, but it's possible to modify a release configuration to run only some of those clients by adding the variable allowed_clients. For example:

```ini
allowed_clients: qcow_client,dlrn_client
```

The variable accepts a comma-separated list of clients. In this case, the containers promotion will be skipped and only qcow and dlrn promotions will run.

### Manually forcing a promotion

The promoter command has two subcommands: promote-all and force-promote.

Regular promoter runs call the promote-all subcommand, which will do all the promotions for the specified distro/release pair.

The force-promote subcommand may be used to skip:
* the candidate hash selection from the candidate label
* the verification of the criteria (successful jobs) for the selected candidate hash

So effectively the force-promote subcommand unconditionally promotes whatever hash is passed as argument. The subcommand contains detailed help.
Here's an example:

```bash
python dlrnapi_promoter.py \
    --config-file /home/centos/ci-config/ci-scripts/dlrnapi_promoter/config/CentOS-8/master.ini \
    force-promote \
    --commit-hash f8c9720499600baad488b956210e4bbe3b5da5bc \
    --distro-hash 8c0e50eaa860d4e4b316ba0f7cba8ef6142b5e09 \
    --aggregate-hash ad7d4b111c58647200378977824cba7a \
    tripleo-ci-testing \
    current-tripleo
```

This command will force the promotion of the specified hashes from the tripleo-ci-testing label to the current-tripleo label, using the configuration in CentOS-8/master.ini.

PLEASE NOTE that this operation is quite risky, as it overrides the checks on the hashes. It should always be done with a second person watching (four-eyes principle).

## Workflow in the promoter

* the dlrn-promoter service in systemd calls /home/centos/ci-config/ci-scripts/dlrnapi_promoter/dlrn-promoter-service.sh
* the dlrn-promoter-service.sh script loops forever on these tasks:
    * [CURRENTLY DISABLED] updating the code to the latest version
    * calling the dlrn-promoter.sh script
    * pausing for a specified amount of time (currently 12 hours)
* the dlrn-promoter.sh script will:
    * specify a list of distro/release pairs to promote
    * set arguments for the promoter code call
    * loop over the distro/release list
        * launching, with a soft and a hard timeout, the promoter promote-all subcommand with all the arguments needed

### Locking

TWO PROMOTER INSTANCES CANNOT RUN AT THE SAME TIME.

To ensure we are not running two promoters at the same time, there's a locking mechanism in place that uses abstract sockets (so nobody has to deal with lock files and stale executions). If we run a promoter subcommand while another one is executing, the command will bail out (and yes, there's a test for this particular part to ensure this works correctly).

### Logging

* The logging subsystem logs to file and to console.
* The console logging is automatically activated when the promoter command is run from the console; it's disabled otherwise.
* The log file is defined in the configuration.
* The general idea for the current logging format is to be able to understand in each line:
    * at which step of the promotion the code is
    * what hash is currently being evaluated.

Unfortunately that means that every line has a lot of information in it. A typical logging line is as follows:

```
promoter Candidate hash 'aggregate: f2248610c021a5d77a0bfd73afd01707, commit: 13aa52f00ad3a5c5f8986761e803bf58ffffa9e9, distro: 408e9aac933209728a79ffab270d6d3bf672b10b, component: tripleo, timestamp: 1594599171': missing jobs ['periodic-tripleo-ci-centos-8-scenario004-standalone-master']
```

The promoter will always try to tell you which hash is being considered and which action it is performing. So in this example, the last part tells us that a particular hash is missing successful jobs, and the first part contains the details of the hash that has missing jobs.

#### Container promotion log

Unfortunately the container promotion logs are produced by ansible, and the python code just copies them into the log output. Debugging a container promotion error may be really difficult, as the output is oddly formatted. One suggestion is to search for the keyword: promoter fail to get the task that failed in the promotion process.

## Work already planned for the short term, to be merged before the 10th of August

#### Promoter deployment

https://hackmd.io/kJqHSTWWRMOIfIhvDMGFLg#WIP

#### Promoter Code

* implementation of qcow images promotion in python
    * STATUS: all changes are ready to be reviewed and merged.
    * Changes in order:
        * ~~https://review.rdoproject.org/r/28429~~
        * ~~https://review.rdoproject.org/r/27626~~
        * https://review.rdoproject.org/r/28142
* implementation of containers promotion in python
    * STATUS: still POC
    * I have a POC in https://review.rdoproject.org/r/26010.
      It's a huge change with 1000 lines of code. Lots of variables are considered there, and various aspects of the promotion (multiarch, multiple engines docker/podman, multiple registry APIs docker.io/quay.io).
    * Amol is already planning to move this task forward, starting from the ashes of the POC, proposing patches step by step.
* implementation of the new configuration engine to make the internal promoter instance easily maintainable.
    * STATUS: lots of changes ready to be reviewed, but some rebase and rework needed for the latest modifications in the current configuration code
    * PANDA working to deliver this by 03 August 2020 (panda out 03-12)
    * Changes in order:
        * ~~https://review.rdoproject.org/r/27648~~
        * ~~https://review.rdoproject.org/r/27985~~
        * ~~https://review.rdoproject.org/r/28017~~
        * ~~https://review.rdoproject.org/r/28018~~
        * https://review.rdoproject.org/r/28081
            * (sync 29 Jul): up to here doesn't wire up - still needs some work - should merge the rest up to this one at least this week
            * (sync 18 Aug): Arx is trying to fix this patch
        * https://review.rdoproject.org/r/28019
        * https://review.rdoproject.org/r/28020
        * https://review.rdoproject.org/r/28014

### New config engine

A new configuration system is needed to enable the promoter code to run in the same way both upstream and downstream, by changing only the configuration needed. The new config engine is described here: https://docs.google.com/presentation/d/1rERVavpiKJrvtDpNj25nFiv4tWw51OmSIJFSOzoUWyc/edit#slide=id.g732f445e13_0_0

I will prepare a demo to show how to deal with the new configuration format.

------

MARIOS NOTES.

panda: 1 vex, 1 rdo, 1 internal

* vex 2 months ago static code from 6 months ago... mostly for stable/old pipeline...
* rdo most updated... (master or not? manual update)... no continuous deployment... recent version updated last week. centos8 master, ussuri, train running there... everything that needs the md5 stuff.
* internal deployed 1 month ago, needs necessary modifications cos of hardcoded things that must be changed, like image promotion; needed to modify some bash

Q: are these in review? different/branched/internal project?
document it - status - how to update - how to manually run

Q: plan? running master on centos8 things? how do we get there? is there a clear path to that ^^^ with the things you have in progress already?

we have to keep two versions running to support legacy workflow... does the new code support legacy workflow?

playbooks that deploy promoter not idempotent yet, e.g. removing docker packages for d/stream
let's modify it so it can deal with both up/dstream, then deploy it internally

### Issue in this patch: https://review.rdoproject.org/r/#/c/28081/

* Issue 1: --release-config is a required parameter here https://review.rdoproject.org/r/#/c/28081/15/ci-scripts/dlrnapi_promoter/stage.py@80 but it is not defined in ci-scripts/dlrnapi_promoter/config_environments/global_defaults.yaml, which causes this error: `stage.py setup: error: the following arguments are required: --release-config`
    * temporarily fix this issue by defining the release_config var in global_defaults.yaml
* Issue 2: `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/promoter-staging/containers/file:///tmp/promoter-staging/containers/build-containers-main.yaml'`
    * this extra path "/tmp/promoter-staging/containers/" is getting added here https://review.rdoproject.org/r/#/c/28081/15/ci-scripts/dlrnapi_promoter/stage_config.py@83
    * IMO, `containers_list_exclude_config: file:///tmp/promoter-staging/containers/build-containers-main.yaml` should be `containers_list_exclude_config: build-containers-main.yaml`
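
To illustrate Issue 2, here is a minimal Python sketch of how a doubled path like the one in the error can appear, and one possible way to tolerate both forms of the value. It assumes the staging code joins the containers directory and the configured value with something equivalent to a plain path join; that has not been checked against stage_config.py. `naive_join` and `resolve_config_path` are hypothetical names used only for this illustration, they are not part of the promoter code.

```python
import posixpath
from urllib.parse import urlparse

BASE_DIR = "/tmp/promoter-staging/containers/"
URL_VALUE = "file:///tmp/promoter-staging/containers/build-containers-main.yaml"


def naive_join(base_dir, value):
    # Assumption: the staging code joins directory and value roughly like this.
    # A file:// URL does not start with "/", so it is treated as a relative
    # name and simply appended to base_dir.
    return posixpath.join(base_dir, value)


def resolve_config_path(base_dir, value):
    """Hypothetical helper: accept a bare name, an absolute path or a file:// URL."""
    parsed = urlparse(value)
    if parsed.scheme == "file":
        # Strip the scheme: file:///tmp/x.yaml -> /tmp/x.yaml
        return parsed.path
    # posixpath.join returns `value` unchanged when it is already absolute
    return posixpath.join(base_dir, value)


if __name__ == "__main__":
    print(naive_join(BASE_DIR, URL_VALUE))           # doubled path, as in the error
    print(resolve_config_path(BASE_DIR, URL_VALUE))  # plain filesystem path
    print(resolve_config_path(BASE_DIR, "build-containers-main.yaml"))
```

Whether the fix belongs in the configuration value (as suggested above) or in the code that builds the path is a design choice for the patch author; the sketch only shows that both forms can resolve to the same file.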
