---
title: 'Reproduce Upstream CI failure on my machine aka Lab deployment with libvirt reproducer'
disqus: tripleo
---
###### tags: Reproducer
Reproduce Upstream CI failure on my machine aka Lab deployment with libvirt reproducer
===
## Table of Contents
[TOC]
## Requirements
1. Hardware machine
   a. 8-core CPU, 32 GB memory, 60 GB free space
   b. CentOS-8
:::warning
RHEL-8 is not supported by the reproducer for now and requires some work due to the podman <> docker conflict
:::
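Before preparing the box it is also worth confirming that the CPU exposes hardware virtualization extensions, since the reproducer runs everything in KVM guests (a quick check I find useful, not part of the original requirements):

```shell
# A non-zero count means the CPU advertises vmx (Intel) or svm (AMD),
# so qemu-kvm can use hardware-accelerated KVM instead of slow emulation
grep -cE 'vmx|svm' /proc/cpuinfo || echo "no hardware virtualization flags found"
```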
## Hardware Prepare
> Access your testbox
```Shell=
ssh -A testbox
```
:::info
You can run as root or as a non-root user.
A non-root user must have sudo access.
:::
> Update packages to latest
```Shell=
sudo dnf -y update
```
> Install package
```Shell=
sudo dnf -y install gcc git libguestfs-tools libvirt tmux tuned virt-install qemu-kvm
```
> Install Ansible
```Shell=
sudo dnf install epel-release
sudo dnf makecache
sudo dnf install ansible
```
> Configure KSM to enable overcommitment of RAM
```Shell=
sudo systemctl enable ksm --now
sudo systemctl enable ksmtuned --now
```
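To confirm KSM is actually running and later merging pages, the kernel exposes its state under sysfs (a quick sanity check, not part of the original steps):

```shell
# "1" means the KSM daemon is scanning and merging identical pages;
# pages_sharing grows as the libvirt guests boot
cat /sys/kernel/mm/ksm/run 2>/dev/null || echo "KSM not available on this kernel"
cat /sys/kernel/mm/ksm/pages_sharing 2>/dev/null || true
```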
> Enable tuning for a virtual host
```Shell=
sudo systemctl enable tuned --now
sudo tuned-adm profile virtual-host
```
> Install dnf-utils and enable docker-ce repository
```Shell=
sudo dnf install -y dnf-utils
sudo dnf config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
```
> Install Docker
:::info
By default CentOS/RHEL-8 ships with *runc.x86_64*, which podman requires. To make Docker work we need to install and use containerd instead.
:::
```Shell=
sudo dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
sudo dnf install -y docker-ce docker-ce-cli
sudo systemctl enable docker --now
```
> Check Docker by running
```Shell=
docker ps
```
## Software prepare for reproducer
### Prepare ssh keys
> The reproducer script interacts heavily with https://review.opendev.org and https://review.rdoproject.org. To be able to build packages and download patches and their dependencies, we need to create SSH keys
```Shell=
ssh-keygen -q -b 4096 -t rsa -f ~/.ssh/id_rsa -N "" -C "Reproducer_CI"
cat ~/.ssh/id_rsa.pub
```
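Gerrit's settings page shows the fingerprint of each registered key, so printing the local fingerprint makes it easy to verify the paste went through correctly:

```shell
# Print the fingerprint of the public key generated above
ssh-keygen -lf ~/.ssh/id_rsa.pub
```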
> Add keys to https://review.opendev.org and https://review.rdoproject.org
using the [OpenStack First Timers guide](https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html)
:::warning
Test access by running the following, substituting your username
```Shell=
ssh -p 29418 holser@review.opendev.org gerrit ls-projects
ssh -p 29418 holser@review.rdoproject.org gerrit ls-projects
```
:::
## Prepare images
> Download the images that will be used by the reproducer. You can fetch them from:
* https://nb01.opendev.org/images
* https://nb02.opendev.org/images
* https://nb04.opendev.org/images
```Shell=
pushd /var/lib/libvirt/images
curl -4SL -O https://nb01.opendev.org/images/centos-8-0000078534.qcow2
md5sum centos-8-0000078534.qcow2
popd
```
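Rather than eyeballing the md5sum output, the check can be scripted against the checksum published alongside the image (the value below is the one referenced later in extra.yaml; adjust it for whichever image you download):

```shell
# Verify the downloaded image against its expected md5sum
expected="d90a6fa7188653ad0eae68bb3b7b9461"
actual=$(md5sum centos-8-0000078534.qcow2 2>/dev/null | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "checksum OK"
else
    echo "checksum MISMATCH - re-download the image"
fi
```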
:::info
Some images are ~10 GB and some are ~5 GB. The smaller ones may not have Python or yum installed. You may need to add those packages using
```Shell=
sudo virt-customize -a centos-8-0000070956.qcow2 --run-command \
'dnf -y install python3 yum screen'
```
I recommend reading the [Modify Images Guide](https://docs.openstack.org/image-guide/modify-images.html), which is very useful if you need to customize an image for experiments
:::
## Reproducing job
> Find a job you want to reproduce. In my case it's **tripleo-ci-centos-8-scenario004-standalone** of https://review.opendev.org/#/c/725782/
>
> Open Zuul Build of that job https://zuul.opendev.org/t/openstack/build/6ea638ff55504bc4be15416af3b181ac
> and download install-deps.sh, launcher-env-setup-playbook.yaml, launcher-playbook.yaml, reproducer-zuul-based-quickstart.sh and reproducer-zuul-based-quickstart.tar
```Shell=
mkdir reproduce_job
cd reproduce_job
wget -r -np -nd -R "index.html*" https://d964d012afab0e138249-be2db655edae902b1f8d9628c9b7e990.ssl.cf2.rackcdn.com/751861/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/4faf003/logs/reproducer-quickstart/
```
> Create extra.yaml
```yaml=
mirror_path: mirror.regionone.rdo-cloud.rdoproject.org
custom_nameserver: 10.38.5.26
deploy_timeout: 360
compute_memory: 14096
compute_vcpu: 1
control_memory: 18192
control_vcpu: 8
undercloud_vcpu: 2
undercloud_memory: 18192
force_cached_images: true
image_cache_expire_days: 30
vxlan_networking: false
toci_vxlan_networking: false
modify_image_vc_root_password: r00tme
mergers: 2
ansible_python_interpreter: /usr/bin/python3
mirror_fqdn: mirror.regionone.rdo-cloud.rdoproject.org
pypi_fqdn: mirror01.ord.rax.opendev.org
images:
- name: undercloud
url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
md5sum: d90a6fa7188653ad0eae68bb3b7b9461
type: qcow2
- name: overcloud
url: file:///var/lib/libvirt/images/centos-8-0000078534.qcow2
md5sum: d90a6fa7188653ad0eae68bb3b7b9461
type: qcow2
```
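A YAML indentation mistake in extra.yaml only surfaces deep inside the quickstart run, so a quick parse check before launching can save time (a minimal sketch, assuming PyYAML is available, which Ansible pulls in as a dependency):

```shell
python3 - <<'EOF'
# Parse extra.yaml and report syntax problems before the reproducer does
try:
    import yaml
    try:
        yaml.safe_load(open("extra.yaml"))
        print("extra.yaml parses OK")
    except FileNotFoundError:
        print("extra.yaml not found in the current directory")
    except yaml.YAMLError as exc:
        print("extra.yaml is invalid:", exc)
except ImportError:
    print("PyYAML not installed; skipping check")
EOF
```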
> Run reproducer
```Shell=
bash ./reproducer-zuul-based-quickstart.sh -w /var/tmp/reproduce -l \
    -e @extra.yaml \
    -e os_autohold_node=true \
    -e zuul_build_sshkey_cleanup=false \
    -e container_mode=docker \
    -e upstream_gerrit_user=holser \
    -e rdo_gerrit_user=holser
```
## Cleanup
```Shell=
docker rm -f tripleo-ci-reproducer_logs_1 tripleo-ci-reproducer_fingergw_1 \
tripleo-ci-reproducer_executor_1 tripleo-ci-reproducer_web_1 \
tripleo-ci-reproducer_merger1_1 tripleo-ci-reproducer_merger0_1 \
tripleo-ci-reproducer_scheduler_1 tripleo-ci-reproducer_launcher_1 \
tripleo-ci-reproducer_mysql_1 tripleo-ci-reproducer_zk_1 \
tripleo-ci-reproducer_gerrit_1 tripleo-ci-reproducer_gerritconfig_1
rm -rf /var/cache/tripleo-quickstart/
rm -rf /var/tmp/reproduce/
rm -rf ~/tripleo-ci-reproducer
```
## Debugging
There are many possible issues with the reproducer and I am not going to describe them all; an engineer with good debugging skills will be able to track them down. Back to my issue: the run failed with
```
TASK [ansible-role-tripleo-ci-reproducer : Wait for job to start] ***********************************************************
task path: /var/tmp/reproduce/roles/ansible-role-tripleo-ci-reproducer/tasks/launch-job.yaml:63
FAILED - RETRYING: Wait for job to start (30 retries left).
[... 28 similar retry lines elided ...]
FAILED - RETRYING: Wait for job to start (1 retries left).
fatal: [localhost]: FAILED! => {"access_control_allow_origin": "*", "attempts": 30, "cache_control": "public, max-age=1", "changed": false, "connection": "close", "content": "[]", "content_length": "2", "content_type": "application/json; charset=utf-8", "cookies": {}, "cookies_string": "", "date": "Thu, 06 Aug 2020 18:20:24 GMT", "elapsed": 0, "json": [], "last_modified": "Thu, 06 Aug 2020 18:20:24 GMT", "msg": "OK (2 bytes)", "redirected": false, "server": "CherryPy/18.6.0", "status": 200, "url": "http://localhost:9000/api/tenant/tripleo-ci-reproducer/status/change/1001,1"}
```
In this case we need to check the logs of the zuul-scheduler container
```Shell=
docker logs $(docker ps | awk '/zuul-scheduler/ {print $1}')
```
So I see:
```
[root@ ~]# docker logs e4d48be6e8ba | tail -30
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'
# review.opendev.org:29418 SSH-2.0-GerritCodeReview_2.13.12-11-g1707fec (SSHD-CORE-1.2.0)
# review.rdoproject.org:29418 SSH-2.0-GerritCodeReview_2.14.7-sf (SSHD-CORE-1.4.0)
# gerrit:29418 SSH-2.0-GerritCodeReview_2.16.7 (SSHD-CORE-2.0.0)
2020-08-06 18:25:22,342 - paramiko.transport - ERROR - raise ValueError("q must be exactly 160, 224, or 256 bits long")
2020-08-06 18:25:22,342 - paramiko.transport - ERROR - ValueError: q must be exactly 160, 224, or 256 bits long
2020-08-06 18:25:22,342 - paramiko.transport - ERROR -
2020-08-06 18:25:22,342 - gerrit.GerritWatcher - ERROR - Exception on ssh event stream with opendev.org:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/zuul/driver/gerrit/gerritconnection.py", line 341, in _run
key_filename=self.keyfile)
File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 446, in connect
passphrase,
File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 680, in _auth
self._transport.auth_publickey(username, key)
File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 1580, in auth_publickey
return self.auth_handler.wait_for_response(my_event)
File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 236, in wait_for_response
raise e
File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 2109, in run
handler(self.auth_handler, m)
File "/usr/local/lib/python3.7/site-packages/paramiko/auth_handler.py", line 298, in _parse_service_accept
sig = self.private_key.sign_ssh_data(blob)
File "/usr/local/lib/python3.7/site-packages/paramiko/dsskey.py", line 116, in sign_ssh_data
).private_key(backend=default_backend())
File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 244, in private_key
return backend.load_dsa_private_numbers(self)
File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 772, in load_dsa_private_numbers
dsa._check_dsa_private_numbers(numbers)
File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 144, in _check_dsa_private_numbers
_check_dsa_parameters(parameters)
File "/usr/local/lib/python3.7/site-packages/cryptography/hazmat/primitives/asymmetric/dsa.py", line 136, in _check_dsa_parameters
raise ValueError("q must be exactly 160, 224, or 256 bits long")
ValueError: q must be exactly 160, 224, or 256 bits long
```
So in this particular case the SSH key had not been added to https://review.opendev.org. Once I added the public part of the key there, the job went through without any issues.