---
title: 'Reproduce Upstream CI failure on my machine aka Lab deployment with libvirt reproducer'
disqus: tripleo
---

###### tags: Reproducer

Reproduce Upstream CI failure on my machine aka Lab deployment with libvirt reproducer
===

## Table of Contents

[TOC]

## Requirements

1. Hardware machine
    a. 8-core CPU, 32 GB memory, 60 GB free disk space
    b. CentOS-8

:::warning
RHEL-8 is not supported by the reproducer for now and requires some work due to the podman <> docker conflict.
:::

## Hardware Prepare

> Access your testbox

```Shell=
ssh -A testbox
```

:::info
You can use the root or a non-root user. A non-root user should have sudo access.
:::

> Update packages to the latest versions

```Shell=
sudo dnf -y update
```

> Install packages

```Shell=
sudo dnf -y install gcc git libguestfs-tools libvirt tmux tuned virt-install qemu-kvm
```

> Install Ansible

```Shell=
sudo dnf install epel-release
sudo dnf makecache
sudo dnf install ansible
```

> Configure KSM and tuned to enable overcommitment of RAM

```Shell=
sudo systemctl enable ksm --now
sudo systemctl enable ksmtuned --now
```

> Enable tuning for a virtual host

```Shell=
sudo systemctl enable tuned --now
sudo tuned-adm profile virtual-host
```

> Install dnf-utils and enable the docker-ce repository

```Shell=
sudo dnf install -y dnf-utils
sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
```

> Install Docker

:::info
By default RHEL-8 comes with *runc.x86_64*, which is required for podman. In order to make Docker work we need to install and use containerd instead.
:::

```Shell=
sudo dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
sudo dnf install -y docker-ce docker-ce-cli
sudo systemctl start docker
```

> Check Docker by running

```Shell=
docker ps
```

## Software prepare for reproducer

### Prepare ssh keys

> The reproducer script interacts with https://review.opendev.org and https://review.rdoproject.org a lot.
> To be able to build packages and download patches and their dependencies, we need to create ssh keys.

```Shell=
ssh-keygen -q -b 4096 -t rsa -f ~/.ssh/id_rsa -N "" -C "Reproducer_CI"
cat ~/.ssh/id_rsa.pub
```

> Add the keys to https://review.opendev.org and https://review.rdoproject.org using the [OpenStack First Timers guide](https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html)

:::warning
Test access by running the following, substituting your own username:

```Shell=
ssh -p 29418 holser@review.opendev.org gerrit ls-projects
ssh -p 29418 holser@review.rdoproject.org gerrit ls-projects
```
:::

## Prepare images

> Please download the images that will be used by the reproducer. You can fetch images from:

* https://nb01.opendev.org/images
* https://nb02.opendev.org/images
* https://nb04.opendev.org/images

```Shell=
pushd /var/lib/libvirt/images
curl -4SL -O https://nb01.opendev.org/images/centos-8-0000078534.qcow2
md5sum centos-8-0000078534.qcow2
popd
```

:::info
Some images are ~10 GB and some are ~5 GB. The smaller ones may not have python or yum installed. You may need to add those packages using:

```Shell=
sudo virt-customize -a centos-8-0000070956.qcow2 --run-command \
    'dnf -y install python3 yum screen'
```

I recommend reading the [Modify Images Guide](https://docs.openstack.org/image-guide/modify-images.html), which is very useful if you need to customize an image for experiments.
:::

## Reproducing job

> Find a job you want to reproduce.
> In my case it's **tripleo-ci-centos-8-scenario004-standalone** of https://review.opendev.org/#/c/725782/
>
> Open the Zuul build of that job, https://zuul.opendev.org/t/openstack/build/6ea638ff55504bc4be15416af3b181ac, and download:

* install-deps.sh
* launcher-env-setup-playbook.yaml
* launcher-playbook.yaml
* reproducer-zuul-based-quickstart.sh
* reproducer-zuul-based-quickstart.tar

```Shell=
mkdir reproduce_job
cd reproduce_job
wget -r -np -nd -R "index.html*" https://d964d012afab0e138249-be2db655edae902b1f8d9628c9b7e990.ssl.cf2.rackcdn.com/751861/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/4faf003/logs/reproducer-quickstart/
```

> Create extra.yaml

```yaml=
mirror_path: mirror.regionone.rdo-cloud.rdoproject.org
custom_nameserver: 10.38.5.26
deploy_timeout: 360
compute_memory: 14096
compute_vcpu: 1
control_memory: 18192
control_vcpu: 8
undercloud_vcpu: 2
undercloud_memory: 18192
force_cached_images: true
image_cache_expire_days: 30
vxlan_networking: false
toci_vxlan_networking: false
modify_image_vc_root_password: r00tme
mergers: 2
ansible_python_interpreter: /usr/bin/python3
mirror_fqdn: mirror.regionone.rdo-cloud.rdoproject.org
pypi_fqdn: mirror01.ord.rax.opendev.org
images:
  - name: undercloud
    url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
    md5sum: d90a6fa7188653ad0eae68bb3b7b9461
    type: qcow2
  - name: overcloud
    url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
    md5sum: d90a6fa7188653ad0eae68bb3b7b9461
    type: qcow2
```

> Run the reproducer

```Shell=
bash ./reproducer-zuul-based-quickstart.sh -w /var/tmp/reproduce -l \
    -e @extra.yaml \
    -e os_autohold_node=true \
    -e zuul_build_sshkey_cleanup=false \
    -e container_mode=docker \
    -e upstream_gerrit_user=holser \
    -e rdo_gerrit_user=holser
```

## Cleanup

```Shell=
docker rm -f tripleo-ci-reproducer_logs_1 tripleo-ci-reproducer_fingergw_1 \
    tripleo-ci-reproducer_executor_1 tripleo-ci-reproducer_web_1 \
    tripleo-ci-reproducer_merger1_1 tripleo-ci-reproducer_merger0_1 \
    tripleo-ci-reproducer_scheduler_1 \
    tripleo-ci-reproducer_launcher_1 \
    tripleo-ci-reproducer_mysql_1 tripleo-ci-reproducer_zk_1 \
    tripleo-ci-reproducer_gerrit_1 \
    tripleo-ci-reproducer_gerritconfig_1
rm -rf /var/cache/tripleo-quickstart/
rm -rf /var/tmp/reproduce/
rm -rf ~/tripleo-ci-reproducer
```

## Debugging

There are a lot of possible issues with the reproducer, and I am not going to describe them all; an engineer with good debugging skills will be able to track them down. Going back to my issue: the run failed with

```
TASK [ansible-role-tripleo-ci-reproducer : Wait for job to start] ***********************************************************
task path: /var/tmp/reproduce/roles/ansible-role-tripleo-ci-reproducer/tasks/launch-job.yaml:63
FAILED - RETRYING: Wait for job to start (30 retries left).
FAILED - RETRYING: Wait for job to start (29 retries left).
...
FAILED - RETRYING: Wait for job to start (10 retries left).
FAILED - RETRYING: Wait for job to start (9 retries left).
...
FAILED - RETRYING: Wait for job to start (1 retries left).
fatal: [localhost]: FAILED! => {"access_control_allow_origin": "*", "attempts": 30, "cache_control": "public, max-age=1", "changed": false, "connection": "close", "content": "[]", "content_length": "2", "content_type": "application/json; charset=utf-8", "cookies": {}, "cookies_string": "", "date": "Thu, 06 Aug 2020 18:20:24 GMT", "elapsed": 0, "json": [], "last_modified": "Thu, 06 Aug 2020 18:20:24 GMT", "msg": "OK (2 bytes)", "redirected": false, "server": "CherryPy/18.6.0", "status": 200, "url": "http://localhost:9000/api/tenant/tripleo-ci-reproducer/status/change/1001,1"}
```

In this case we need to check the logs of the zuul-scheduler container:

```Shell=
docker logs $(docker ps | awk '/zuul-scheduler/ {print $1}')
```

There I see:

```
[root@ ~]# docker logs e4d48be6e8ba | tail -30
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available.
Note that the implicit localhost does not match 'all'
# review.opendev.org:29418 SSH-2.0-GerritCodeReview_2.13.12-11-g1707fec (SSHD-CORE-1.2.0)
# review.rdoproject.org:29418 SSH-2.0-GerritCodeReview_2.14.7-sf (SSHD-CORE-1.4.0)
# gerrit:29418 SSH-2.0-GerritCodeReview_2.16.7 (SSHD-CORE-2.0.0)
2020-08-06 18:25:22,342 - paramiko.transport - ERROR - raise ValueError("q must be exactly 160, 224, or 256 bits long")
2020-08-06 18:25:22,342 - paramiko.transport - ERROR - ValueError: q must be exactly 160, 224, or 256 bits long
2020-08-06 18:25:22,342 - paramiko.transport - ERROR -
2020-08-06 18:25:22,342 - gerrit.GerritWatcher - ERROR - Exception on ssh event stream with opendev.org:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zuul/driver/gerrit/gerritconnection.py", line 341, in _run
    key_filename=self.keyfile)
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 446, in connect
    passphrase,
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 680, in _auth
    self._transport.auth_publickey(username, key)
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 1580, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 236, in wait_for_response
    raise e
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 2109, in run
    handler(self.auth_handler, m)
  File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 298, in _parse_service_accept
    sig = self.private_key.sign_ssh_data(blob)
  File "/usr/local/lib/python3.7/site-packages/paramiko/dsskey.py", line 116, in sign_ssh_data
    ).private_key(backend=default_backend())
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 244, in private_key
    return backend.load_dsa_private_numbers(self)
  File
"/usr/local/lib/python3.7/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 772, in load_dsa_private_numbers
    dsa._check_dsa_private_numbers(numbers)
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 144, in _check_dsa_private_numbers
    _check_dsa_parameters(parameters)
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 136, in _check_dsa_parameters
    raise ValueError("q must be exactly 160, 224, or 256 bits long")
ValueError: q must be exactly 160, 224, or 256 bits long
```

So in this particular case the ssh key was not added to https://review.opendev.org. Once I added the public part of the ssh key to https://review.opendev.org, the job went through without any issues.
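Since zuul-scheduler authenticates against Gerrit with the key configured in the reproducer, it helps to pre-check that key before launching a run. Below is a minimal sketch of such a pre-flight check; the `check_gerrit_key` function is hypothetical (not part of the reproducer) and simply asserts that the public key exists and is RSA, since the paramiko traceback above came from a DSA code path:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: confirm the keypair Zuul will use exists
# and is an RSA key before pointing the reproducer at Gerrit.
check_gerrit_key() {
    local key="${1:-$HOME/.ssh/id_rsa}"
    if [ ! -f "${key}.pub" ]; then
        echo "missing: generate a key with ssh-keygen first"
        return 1
    fi
    # ssh-keygen -l prints: "<bits> <fingerprint> <comment> (<TYPE>)"
    local fingerprint
    fingerprint=$(ssh-keygen -l -f "${key}.pub")
    case "$fingerprint" in
        *"(RSA)"*) echo "ok: RSA key, $fingerprint" ;;
        *)         echo "warning: not an RSA key: $fingerprint"; return 1 ;;
    esac
}
```

Running `check_gerrit_key` together with the two `gerrit ls-projects` commands from the warning earlier in this guide would have caught my misconfiguration before the 30 "Wait for job to start" retries burned half an hour.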