changed 5 years ago
tags: Reproducer

Reproduce Upstream CI failure on my machine aka Lab deployment with libvirt reproducer

Requirements

  1. Hardware machine
    a. 8-core CPU, 32 GB of memory, 60 GB of free disk space
    b. CentOS-8

    RHEL-8 is not currently supported by the reproducer; it needs extra work because of the podman/docker conflict.
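
The hardware minimums above can be checked before installing anything. The following is a minimal sketch, not part of the reproducer: the function name `meets_minimum` is made up for illustration, and checking free space under `/var` is my assumption (the images and the work directory used later both live there).

```shell
# Preflight sketch: compare this host against the minimums listed above.
REQ_CPUS=8        # cores
REQ_MEM_GB=32     # RAM
REQ_DISK_GB=60    # free disk space

# meets_minimum LABEL ACTUAL REQUIRED (hypothetical helper)
# Prints a verdict line and returns non-zero when ACTUAL is below REQUIRED.
meets_minimum() {
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (need >= $3)"
    else
        echo "FAIL $1: $2 (need >= $3)"
        return 1
    fi
}

cpus=$(nproc)
# MemTotal is reported in kB and is always a little below the installed RAM,
# so round up to the next whole GiB.
mem_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo)
# Assumption: images and the reproducer work directory live under /var.
disk_gb=$(df -BG --output=avail /var | tail -n 1 | tr -dc '0-9')

failures=0
meets_minimum "CPU cores" "$cpus" "$REQ_CPUS" || failures=$((failures + 1))
meets_minimum "memory (GiB)" "$mem_gb" "$REQ_MEM_GB" || failures=$((failures + 1))
meets_minimum "free space under /var (GiB)" "$disk_gb" "$REQ_DISK_GB" || failures=$((failures + 1))
echo "Preflight finished with $failures failed check(s)"
```

The script only reports; it does not stop you from continuing on a smaller machine.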

Hardware preparation

Access your testbox

ssh -A testbox

You can use the root or non-root user.
Non-root users should have sudo access.

Update packages to latest

sudo dnf -y update

Install packages

sudo dnf -y install gcc git libguestfs-tools libvirt tmux tuned virt-install qemu-kvm
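
After the install it may be worth confirming that the tools are actually on the PATH. This is a sketch only: `missing_commands` is a made-up helper, and the mapping from package to binary (for example `virsh` for libvirt, `virt-customize` for libguestfs-tools) is my assumption.

```shell
# missing_commands CMD... (hypothetical helper)
# Prints every CMD that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# One representative binary per package from the install step above
# (assumed mapping, adjust as needed).
missing=$(missing_commands gcc git virsh virt-customize tmux tuned-adm virt-install)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All expected commands are available"
fi
```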

Install Ansible

sudo dnf install epel-release
sudo dnf makecache
sudo dnf install ansible

Configure KSM and tuned to enable overcommitment of RAM

sudo systemctl enable ksm --now
sudo systemctl enable ksmtuned --now

Enable tuning for a virtual host

sudo systemctl enable tuned --now
sudo tuned-adm profile virtual-host
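
To see whether the KSM and tuned settings took effect, something like the following can help. It is a sketch under assumptions: `ksm_state` is a made-up helper that reads the kernel's KSM flag from `/sys/kernel/mm/ksm/run` (1 means running; 0 and 2 mean stopped), and its optional file argument exists only so it can be tried against a scratch file.

```shell
# ksm_state [FILE] (hypothetical helper)
# Translates the kernel's KSM run flag into a word.
# FILE defaults to the real sysfs path.
ksm_state() {
    case "$(cat "${1:-/sys/kernel/mm/ksm/run}" 2>/dev/null)" in
        1) echo "running" ;;
        0|2) echo "stopped" ;;
        *) echo "unknown" ;;
    esac
}

echo "KSM: $(ksm_state)"
# tuned-adm prints the active profile; it should mention virtual-host.
if command -v tuned-adm >/dev/null 2>&1; then
    tuned-adm active || echo "tuned does not seem to be running"
fi
```

As far as I know, ksmtuned switches KSM on only once memory use crosses its threshold, so "stopped" on an idle host is not necessarily a problem.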

Install dnf-utils and enable docker-ce repository

sudo yum install -y dnf-utils
sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

Install Docker

By default, RHEL-8 ships runc.x86_64, which podman requires. To get Docker working, we need to install the containerd.io package and use it instead.

sudo dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
sudo dnf install -y docker-ce docker-ce-cli
sudo systemctl start docker

Check Docker by running

docker ps

Software preparation for the reproducer

Prepare ssh keys

The reproducer script interacts heavily with https://review.opendev.org and https://review.rdoproject.org. To build packages and download patches and their dependencies, we need to create SSH keys.

ssh-keygen -q -b 4096 -t rsa -f ~/.ssh/id_rsa -N "" -C "Reproducer_CI"
cat ~/.ssh/id_rsa.pub

Add the public key to https://review.opendev.org and https://review.rdoproject.org
by following the OpenStack First Timers guide.

Test access by running the following with your own username:

ssh -p 29418 holser@review.opendev.org gerrit ls-projects
ssh -p 29418 holser@review.rdoproject.org gerrit ls-projects
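
If you would rather not repeat the port and username each time, the same settings can live in `~/.ssh/config`. The sketch below only prints the stanzas; `gerrit_ssh_config` and the placeholder `your-username` are made up for illustration, and I am assuming the key generated above (`~/.ssh/id_rsa`) is the one you uploaded.

```shell
# gerrit_ssh_config USER (hypothetical helper)
# Prints ssh_config stanzas for both Gerrit servers used in this guide.
gerrit_ssh_config() {
    for host in review.opendev.org review.rdoproject.org; do
        printf 'Host %s\n    Port 29418\n    User %s\n    IdentityFile ~/.ssh/id_rsa\n\n' "$host" "$1"
    done
}

# Preview only; append to ~/.ssh/config once it looks right:
#   gerrit_ssh_config your-username >> ~/.ssh/config
gerrit_ssh_config your-username
```

With those stanzas in place, `ssh review.opendev.org gerrit ls-projects` should behave like the longer form above.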

Prepare images

Download the images that the reproducer will use. You can fetch them like this:

pushd /var/lib/libvirt/images
curl -4SL -O https://nb01.opendev.org/images/centos-8-0000078534.qcow2
md5sum centos-8-0000078534.qcow2
popd
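
The MD5 digest printed here is needed again in extra.yaml later in this guide, so it can be handy to emit the image entry directly. This is a sketch: `image_md5` and `image_entry` are made-up helpers, and the entry layout simply mirrors the `images:` items in that file.

```shell
# image_md5 FILE (hypothetical helper)
# Prints only the MD5 digest of FILE.
image_md5() {
    md5sum "$1" | awk '{print $1}'
}

# image_entry NAME FILE (hypothetical helper)
# Prints one item for an "images:" list in the shape extra.yaml uses.
image_entry() {
    printf '  - name: %s\n    url: file://%s\n    md5sum: %s\n    type: qcow2\n' \
        "$1" "$2" "$(image_md5 "$2")"
}

img=/var/lib/libvirt/images/centos-8-0000078534.qcow2
if [ -f "$img" ]; then
    image_entry undercloud "$img"
else
    echo "$img not found, download it first"
fi
```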

Some images are ~10 GB and some are ~5 GB. The smaller ones may not have python or yum installed. You may need to add those packages using

sudo virt-customize -a centos-8-0000070956.qcow2 --run-command \
    'dnf -y install python3 yum screen'

I recommend reading the Modify Images Guide, which is very useful if you need to customize an image for experiments.

Reproducing job

Find a job you want to reproduce. In my case it's tripleo-ci-centos-8-scenario004-standalone of https://review.opendev.org/#/c/725782/

Open the Zuul build of that job, https://zuul.opendev.org/t/openstack/build/6ea638ff55504bc4be15416af3b181ac,
and download install-deps.sh, launcher-env-setup-playbook.yaml, launcher-playbook.yaml, reproducer-zuul-based-quickstart.sh and reproducer-zuul-based-quickstart.tar

mkdir reproduce_job
cd reproduce_job
wget -r -np -nd -R "index.html*" https://d964d012afab0e138249-be2db655edae902b1f8d9628c9b7e990.ssl.cf2.rackcdn.com/751861/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/4faf003/logs/reproducer-quickstart/

Create extra.yaml

```yaml
mirror_path: mirror.regionone.rdo-cloud.rdoproject.org
custom_nameserver: 10.38.5.26
deploy_timeout: 360
compute_memory: 14096
compute_vcpu: 1
control_memory: 18192
control_vcpu: 8
undercloud_vcpu: 2
undercloud_memory: 18192
force_cached_images: true
image_cache_expire_days: 30
vxlan_networking: false
toci_vxlan_networking: false
modify_image_vc_root_password: r00tme
mergers: 2
ansible_python_interpreter: /usr/bin/python3
mirror_fqdn: mirror.regionone.rdo-cloud.rdoproject.org
pypi_fqdn: mirror01.ord.rax.opendev.org
images:
  - name: undercloud
    url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
    md5sum: d90a6fa7188653ad0eae68bb3b7b9461
    type: qcow2
  - name: overcloud
    url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
    md5sum: d90a6fa7188653ad0eae68bb3b7b9461
    type: qcow2
```

Run reproducer

bash ./reproducer-zuul-based-quickstart.sh -w /var/tmp/reproduce -l -e @extra.yaml -e os_autohold_node=true -e zuul_build_sshkey_cleanup=false -e container_mode=docker -e upstream_gerrit_user=holser -e rdo_gerrit_user=holser
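
The command above hard-codes my Gerrit username in two places. A small wrapper can fill it in from a variable instead. This is only a sketch: `GERRIT_USER`, `WORKDIR` and `reproducer_command` are names I made up, not options of the reproducer script, and the flags are copied unchanged from the command above.

```shell
# Assumed environment variables (not part of the reproducer itself).
GERRIT_USER=${GERRIT_USER:-your-username}
WORKDIR=${WORKDIR:-/var/tmp/reproduce}

# reproducer_command (hypothetical helper)
# Prints the launch command with the variables filled in.
reproducer_command() {
    echo "bash ./reproducer-zuul-based-quickstart.sh" \
        "-w $WORKDIR -l -e @extra.yaml" \
        "-e os_autohold_node=true -e zuul_build_sshkey_cleanup=false" \
        "-e container_mode=docker" \
        "-e upstream_gerrit_user=$GERRIT_USER -e rdo_gerrit_user=$GERRIT_USER"
}

reproducer_command
# To run it inside tmux so a dropped SSH session does not kill it
# (start this from the reproduce_job directory):
#   tmux new-session -d -s reproducer "$(reproducer_command)"
```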

Cleanup

docker rm -f tripleo-ci-reproducer_logs_1 tripleo-ci-reproducer_fingergw_1 \
    tripleo-ci-reproducer_executor_1 tripleo-ci-reproducer_web_1 \
    tripleo-ci-reproducer_merger1_1 tripleo-ci-reproducer_merger0_1 \
    tripleo-ci-reproducer_scheduler_1 tripleo-ci-reproducer_launcher_1 \
    tripleo-ci-reproducer_mysql_1 tripleo-ci-reproducer_zk_1 \
    tripleo-ci-reproducer_gerrit_1 tripleo-ci-reproducer_gerritconfig_1
rm -rf /var/cache/tripleo-quickstart/
rm -rf /var/tmp/reproduce/
rm -rf ~/tripleo-ci-reproducer
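
Since every container in that list shares the `tripleo-ci-reproducer_` prefix, the names can also be selected by pattern rather than typed out. A minimal sketch, assuming the prefix holds on your machine; `reproducer_containers` is a made-up helper, and the snippet only lists the containers rather than removing them.

```shell
# reproducer_containers (hypothetical helper)
# Reads container names on stdin and keeps only those that carry the
# tripleo-ci-reproducer_ prefix.
reproducer_containers() {
    grep '^tripleo-ci-reproducer_' || true
}

if command -v docker >/dev/null 2>&1; then
    # List first; pipe into "xargs -r docker rm -f" to actually remove them.
    docker ps -a --format '{{.Names}}' 2>/dev/null | reproducer_containers
else
    echo "docker not found, nothing to clean up"
fi
```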

Debugging

Many things can go wrong with the reproducer, and I am not going to describe them all; an engineer with good debugging skills will be able to track them down. Going back to my issue: after I ran the reproducer, it failed with the following output

TASK [ansible-role-tripleo-ci-reproducer : Wait for job to start] ***********************************************************
task path: /var/tmp/reproduce/roles/ansible-role-tripleo-ci-reproducer/tasks/launch-job.yaml:63
FAILED - RETRYING: Wait for job to start (30 retries left).
FAILED - RETRYING: Wait for job to start (29 retries left).
FAILED - RETRYING: Wait for job to start (28 retries left).
FAILED - RETRYING: Wait for job to start (27 retries left).
FAILED - RETRYING: Wait for job to start (26 retries left).
FAILED - RETRYING: Wait for job to start (25 retries left).
FAILED - RETRYING: Wait for job to start (24 retries left).
FAILED - RETRYING: Wait for job to start (23 retries left).
FAILED - RETRYING: Wait for job to start (22 retries left).
FAILED - RETRYING: Wait for job to start (21 retries left).
FAILED - RETRYING: Wait for job to start (20 retries left).
FAILED - RETRYING: Wait for job to start (19 retries left).
FAILED - RETRYING: Wait for job to start (18 retries left).
FAILED - RETRYING: Wait for job to start (17 retries left).
FAILED - RETRYING: Wait for job to start (16 retries left).
FAILED - RETRYING: Wait for job to start (15 retries left).
FAILED - RETRYING: Wait for job to start (14 retries left).
FAILED - RETRYING: Wait for job to start (13 retries left).
FAILED - RETRYING: Wait for job to start (12 retries left).
FAILED - RETRYING: Wait for job to start (11 retries left).
FAILED - RETRYING: Wait for job to start (10 retries left).
FAILED - RETRYING: Wait for job to start (9 retries left).
FAILED - RETRYING: Wait for job to start (8 retries left).
FAILED - RETRYING: Wait for job to start (7 retries left).
FAILED - RETRYING: Wait for job to start (6 retries left).
FAILED - RETRYING: Wait for job to start (5 retries left).
FAILED - RETRYING: Wait for job to start (4 retries left).
FAILED - RETRYING: Wait for job to start (3 retries left).
FAILED - RETRYING: Wait for job to start (2 retries left).
FAILED - RETRYING: Wait for job to start (1 retries left).
fatal: [localhost]: FAILED! => {"access_control_allow_origin": "*", "attempts": 30, "cache_control": "public, max-age=1", "changed": false, "connection": "close", "content": "[]", "content_length": "2", "content_type": "application/json; charset=utf-8", "cookies": {}, "cookies_string": "", "date": "Thu, 06 Aug 2020 18:20:24 GMT", "elapsed": 0, "json": [], "last_modified": "Thu, 06 Aug 2020 18:20:24 GMT", "msg": "OK (2 bytes)", "redirected": false, "server": "CherryPy/18.6.0", "status": 200, "url": "http://localhost:9000/api/tenant/tripleo-ci-reproducer/status/change/1001,1"}

In this case we need to check the logs of the zuul-scheduler container

docker logs $(docker ps | awk '/zuul-scheduler/ {print $1}')

So, I see

[root@ ~]# docker logs e4d48be6e8ba | tail -30
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'
# review.opendev.org:29418 SSH-2.0-GerritCodeReview_2.13.12-11-g1707fec (SSHD-CORE-1.2.0)
# review.rdoproject.org:29418 SSH-2.0-GerritCodeReview_2.14.7-sf (SSHD-CORE-1.4.0)
# gerrit:29418 SSH-2.0-GerritCodeReview_2.16.7 (SSHD-CORE-2.0.0)
2020-08-06 18:25:22,342 - paramiko.transport - ERROR -     raise ValueError("q must be exactly 160, 224, or 256 bits long")
2020-08-06 18:25:22,342 - paramiko.transport - ERROR - ValueError: q must be exactly 160, 224, or 256 bits long
2020-08-06 18:25:22,342 - paramiko.transport - ERROR -
2020-08-06 18:25:22,342 - gerrit.GerritWatcher - ERROR - Exception on ssh event stream with opendev.org:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zuul/driver/gerrit/gerritconnection.py", line 341, in _run
    key_filename=self.keyfile)
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 446, in connect
    passphrase,
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 680, in _auth
    self._transport.auth_publickey(username, key)
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 1580, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 236, in wait_for_response
    raise e
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 2109, in run
    handler(self.auth_handler, m)
  File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 298, in _parse_service_accept
    sig = self.private_key.sign_ssh_data(blob)
  File "/usr/local/lib/python3.7/site-packages/paramiko/dsskey.py", line 116, in sign_ssh_data
    ).private_key(backend=default_backend())
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 244, in private_key
    return backend.load_dsa_private_numbers(self)
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 772, in load_dsa_private_numbers
    dsa._check_dsa_private_numbers(numbers)
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 144, in _check_dsa_private_numbers
    _check_dsa_parameters(parameters)
  File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 136, in _check_dsa_parameters
    raise ValueError("q must be exactly 160, 224, or 256 bits long")
ValueError: q must be exactly 160, 224, or 256 bits long

So, in this particular case the SSH key had not been added to https://review.opendev.org. Once I added the public part of the SSH key to https://review.opendev.org, the job went through without any issues.
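
Scheduler logs like the one above are long, and the same error tends to repeat. The sketch below boils them down to the distinct ERROR messages. `unique_errors` is a made-up helper, and it assumes the `date time - logger - ERROR - message` layout seen in the trace above.

```shell
# unique_errors (hypothetical helper)
# Reads log text on stdin and prints each distinct ERROR message once,
# with the timestamp stripped, in order of first appearance.
unique_errors() {
    grep ' - ERROR - ' | sed 's/^[^ ]* [^ ]* - //' | awk '!seen[$0]++'
}

if command -v docker >/dev/null 2>&1; then
    scheduler=$(docker ps 2>/dev/null | awk '/zuul-scheduler/ {print $1}')
    if [ -n "$scheduler" ]; then
        docker logs "$scheduler" 2>&1 | unique_errors
    fi
fi
```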