# Ruck and Rover notes #29
###### tags: `ruck_rover`
:::info
Important links for ruck/rovers: [ruck/rover links to help](https://hackmd.io/07z0xroHTFi2IbX93P5ZfQ)
**Ruck Rover - Unified Sprint #29**
Dates: June 18 - July 8th
Tripleo CI team ruck|rover: Soniya Vyas, Sagi Shnaidman, Ronelle Landy
OSP CI team ruck|rover: psedlak, tkorol?
Previous notes: https://hackmd.io/YAqFJrKMThGghTW4P2tabA
**Next #30 notes: https://hackmd.io/6Bx0FXwlRNCc75l39NSKvg**
:::
[TOC]
---
## on-going issues
:::danger
## TripleO
### zuul patch
https://review.opendev.org/#/c/738668/
- https://trello.com/c/ZrPzcm1x/1558-cixlp1883977tripleociproa-spike-in-node-failures-in-rdo-3rd-party-zuul - again 7 failures today
Bug: [ spike in node failures in rdo 3rd party zuul ](https://bugs.launchpad.net/tripleo/+bug/1883977)
https://review.rdoproject.org/zuul/builds?result=NODE_FAILURE
- [1884518 - OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud](https://bugs.launchpad.net/tripleo/+bug/1884518)
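To see which jobs account for most of the node failures, the build list can be tallied per job. A minimal sketch, assuming the build records have already been fetched and parsed into a list of dicts with `job_name` and `result` keys (the key names and the sample records are assumptions for illustration, not data from the Zuul instance above):

```python
from collections import Counter


def count_node_failures(builds):
    """Count NODE_FAILURE results per job name.

    `builds` is assumed to be a list of dicts with "job_name" and "result"
    keys; both key names are assumptions made for this sketch.
    """
    return Counter(
        build["job_name"]
        for build in builds
        if build.get("result") == "NODE_FAILURE"
    )


# Hypothetical sample records, not real build data.
sample_builds = [
    {"job_name": "job-a", "result": "NODE_FAILURE"},
    {"job_name": "job-a", "result": "SUCCESS"},
    {"job_name": "job-b", "result": "NODE_FAILURE"},
    {"job_name": "job-a", "result": "NODE_FAILURE"},
]

# Print the jobs with the most node failures first.
for job, failures in count_node_failures(sample_builds).most_common():
    print(f"{job}: {failures}")
```

Sorting by count makes it easier to tell whether the spike is spread evenly or concentrated in a few jobs.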
### gate
Gate failures are at 28 - make that 32 now:
* looks like we have a number of failures on RAX - asked on openstack-infra - pls follow up
### periodic / 3rd party
## OSP
### osp15 0624.n.2
* first failed in phase1 due to missing container images
* after rebuild it passed phase1
* FAILED in phase2 - ceph-jobs fail OC deploy
* once rhcs4beta is out of the way (ceph-external or IR revert see below)
* jobs are also affected by ansible-2.9.10
* cloned as https://bugzilla.redhat.com/show_bug.cgi?id=1850978
* mistake solved by rhos-release fix, it should have been using ansible 2.8
* **EnableRhcs4Beta: true** is missing due to bug in infrared
* while need for this should be removed in z3, not yet in this CVE compose
* https://bugzilla.redhat.com/show_bug.cgi?id=1794530
* https://bugzilla.redhat.com/show_bug.cgi?id=1790906
* there is a delivery ticket for considering shipping the drop-Beta change https://trello.com/c/MAT790lU
* this seems to have been rejected, as such a change (non-beta ceph4) carries too much risk/impact
* see newer/latest comments in jira ticket for which the breaking IR change was done
* https://projects.engineering.redhat.com/browse/RHOSINFRA-3277
* **revert/fix of infrared issue in https://review.gerrithub.io/c/redhat-openstack/infrared/+/496293**
* OSP15 + rhos-release/ansible fixed
* https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase2-15_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph-ssl/5/
* tested in 16.1 phase1 - all ok, no change in ceph config
* https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/83/
:::
### Reviews / Fixes
::: spoiler PATCHES
**Remove docker_host configuration from env files**
https://review.opendev.org/737235 for bug [1884371](https://bugs.launchpad.net/tripleo/+bug/1884371)
**Configure docker host for local container build**
https://review.opendev.org/737234 for bug [1884371](https://bugs.launchpad.net/tripleo/+bug/1884371)
### RETRY_LIMIT
https://bugs.launchpad.net/tripleo/+bug/1885701
https://bugs.launchpad.net/tripleo/+bug/1885697
https://bugs.launchpad.net/tripleo/+bug/1885286
### SegFaults
https://bugs.launchpad.net/tripleo/+bug/1885728
### node failures
https://bugs.launchpad.net/tripleo/+bug/1885715
### image builds
need to land ( image builds )
https://review.opendev.org/#/c/738434/
https://review.opendev.org/#/c/738469/
:::
### Bugs reported
::: spoiler BUGS
[1884371 - periodic master - queens jobs using docker.io in container prep](https://bugs.launchpad.net/tripleo/+bug/1884371)
[1884518 - OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud ](https://bugs.launchpad.net/tripleo/+bug/1884518)
[1885279 -
TestVolumeBootPattern.test_volume_boot_pattern tests on master are failing on updating to cirros-0.5.1 image](https://bugs.launchpad.net/tripleo/+bug/1885279)
[1885286 - Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate](https://bugs.launchpad.net/tripleo/+bug/1885286)
[1884287 -
ipa-server install error: 2020-06-19T13:10:53Z DEBUG The ipa-server-install command failed, exception: DNSZoneAlreadyExists: DNS zone](https://bugs.launchpad.net/tripleo/+bug/1884287)
[1885314 -
OVB master job running on vexxhost show some nodes failing introspection step](https://bugs.launchpad.net/tripleo/+bug/1885314)
[1885315 - tripleo-buildimage-overcloud-full-centos-8 is failing on the update of libnghttp2 package](https://bugs.launchpad.net/tripleo/+bug/1885315)
[1886068 - multinode-ipa tests are failing standalone deployment - 'regsubst' parameter 'target' expects a value of type Array or String, got Undef- ](https://bugs.launchpad.net/tripleo/+bug/1886068)
:::
---
:::info
Add dates in descending order so the latest date is at the top. Break out TripleO and OSP sections.
:::
## July 2nd
### TripleO
scenario010-ovn-provider-standalone fails
## July 1st
### TripleO
reported:
- ~~1885865~~ Periodic C8 Ceph Integration/ Ceph ansible integration jobs are failing Error: msg": "The conditional check 'release is search(\"queens|rocky|stein|train\")' failed - sandeep proposed patch
- ~~1885911~~: C8 Ceph Ansible integration train/ussuri jobs pulling master bits - sandeep proposed patch
### OSP
* auto-promote of passed phase compose symlinks has been broken for the last few days
* FIXED today July 1st
* `bash: /home/boston/lhh/puddle-promote: No such file or directory`
* this means that once phase2 passes, the symlinks are not updated, and phase3, when started, will consume the wrong (previous) compose instead
* 16.0 compose promoted manually (p3 aborted as it was container grade test compose)
* 16.1 and 13 in progress - if auto-promote not fixed, it will trigger p3 with wrong compose
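The failure mode above comes down to a "last passed" symlink not being repointed after a phase passes. A minimal sketch of such a promotion step, assuming a hypothetical layout in which each compose is a directory and one symlink per phase points at the last compose that passed it (the paths and names are illustrative only; this is not the `puddle-promote` script):

```python
import os
import tempfile


def promote(compose_dir, link_path):
    """Point `link_path` at `compose_dir`, replacing any existing link.

    The new symlink is created under a temporary name and then renamed over
    the old one, so a consumer never observes a missing link.
    """
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(compose_dir, tmp_link)
    os.replace(tmp_link, link_path)


# Demonstration in a scratch directory with hypothetical compose names.
root = tempfile.mkdtemp()
old = os.path.join(root, "compose-old")
new = os.path.join(root, "compose-new")
os.mkdir(old)
os.mkdir(new)
link = os.path.join(root, "passed_phase2")

promote(old, link)   # an earlier compose passed phase2
promote(new, link)   # a newer compose passed phase2
print(os.readlink(link))
```

If the second `promote` call is skipped, as happens when the promotion script is missing, the link keeps pointing at the older compose and the next phase picks that one up.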
## June 29th
### TripleO
reported:
1885642 tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades queens fails on pip install contextlib2 between 0.5 < 0.6 when queens
1885637 tripleo-upgrade needs to update yum repos prior to yum updates for upgrade jobs.
## June 28th
### TripleO
<sshnaidm>
- rocky push containers job is timing out consistently;
suspect a trunk registry problem, need to wait for infra folks to look tomorrow
</sshnaidm>
## June 26th
### TripleO
<rlandy>
* rdocloud promoter - had a restart ... need panda's help to fix it
up and running now.
* https://review.opendev.org/#/c/738025/ - need to get through - revert cirros change on master
https://bugs.launchpad.net/tripleo/+bug/1885279
* https://bugs.launchpad.net/tripleo/+bug/1885286 -
Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate
</rlandy>
### OSP
* psedlak on pto friday 26. 6. 2020 (and tkorol/tlv not on friday)
* afazekas/wznoinsk to provide cover fire
#### 16.1
* **new compose for RC 0625.n.0**
* passed phase1
* phase2 in progress - check it on friday morning
* seems most passed, two jobs left ...
* previous compose **0623.n.0** failed in phase1
* failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
* caused by the ansible-2.9.10 update that arrived in rhel-8.2
* a fix should be on the way, but the issue possibly affects other parts too, not just the first one identified
## June 25th
### TripleO
<soniya>We are not facing stack delete failure currently</soniya>
## June 24th
### TripleO
<soniya>No more stacks failed to delete issues
~~too many Post failures and retry_limits issues in Upstream gate jobs~~
Most of the above issues are resolved and patches have been merged</soniya>
<rlandy>Multiple failures on each release - tracking here
Also stuck stacks are still an issue:
https://bugs.launchpad.net/tripleo/+bug/1884845
Soniya, please see if this is still an issue tomorrow - ie: are we getting more stacks failing to delete?
</rlandy>
<rlandy>
TODO: Component job failures - especially tripleo
TODO for tomorrow - promote train, queens
</rlandy>
### OSP
* 16.1
* 0622.n.2 fails in phase1 in UC install
* `puppet-user: Error: Evaluation Error: Error while evaluating a Function Call, Could not find class ::panko::client for undercloud-0.redhat.local (file: /var/lib/tripleo-config/puppet_step_config.pp, line: 51, column: 1) on node undercloud-0.redhat.local"], ...`
* new compose **0623.n.0** in phase1
* passed the UC stage so previous puppet issue is resolved
* failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
* rhos-slave-00..03 (nodes in rdu2 rhev-ci-vms) are having issues with dhcp not providing dns info, without dns they are broken (no git cloning etc)
* ~~discovered yesterday, they are off in jenkins now~~
* manually injected rdu2 nameservers in their resolv.conf
* ~~but that will not survive with NetworkManager updating it according to dhcp info~~
* they do not use dhcp at all, but static config, so simply dns entry is missing there
* manual fix still works; 3 of 4 are back online for now
* 4th (the 00) one to be used for debug/testing the issue and PnT ticket to be filed then
* resolved, details in https://projects.engineering.redhat.com/browse/RHOSINFRA-3513
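A quick way to spot a node in this state is to check whether its resolver configuration lists any nameserver at all. A minimal sketch, assuming the file content is passed in as text (the sample contents and the `192.0.2.53` address are placeholders, not the real rdu2 configuration):

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file contents before and after the manual fix.
broken = "# static config\nsearch example.test\n"
fixed = broken + "nameserver 192.0.2.53\n"

for label, text in (("broken", broken), ("fixed", fixed)):
    found = nameservers(text)
    print(label, found if found else "no nameserver configured")
```

On a real node the text would come from reading `/etc/resolv.conf`.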
## June 22nd
### TripleO
- ~~<soniya> noticed multinode fs10-stein failed: https://bugs.launchpad.net/tripleo/+bug/1884487
log: https://logserver.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-stein/9138146/logs/undercloud/home/zuul/undercloud_install.log.txt.gz</soniya>~~
dup of 1884371
-
<soniya>failed centos8-multinode job once in the Upstream gate jobs - https://review.opendev.org/#/c/736089/3
recheck is given</soniya>
<soniya> failed centos-8-standalone-on-multinode-ipa job once in the Upstream gate jobs - https://review.opendev.org/#/c/736521/</soniya>
:warning: <sshnaidm>[1884518 - OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud ](https://bugs.launchpad.net/tripleo/+bug/1884518)</sshnaidm>
<sshnaidm>running https://review.rdoproject.org/r/#/c/28004 for reproducing</sshnaidm>
## June 21st
### TripleO
master pipeline failing on container registry settings
https://bugs.launchpad.net/tripleo/+bug/1884371
### OSP
* mostly everything ok (no new compose with issues or major outage)
* still a bit of fallout after last week's shutdown of tlv and jenkins
* based on https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/ there are a few jobs running (= hanging) for days
* likely last case of the job which lost its node while running https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/job/DFG-all-unified-16_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-native-default/201/
* qe-ciosp-03 was moved to tlv2 jenkins but this job depends on it https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/job/gate-infrared-openstack-ci-osp7/528/
* via gate-infrared-openstack node label; unknown to me at this moment why the label exists and why it would have just a single slave
* also, at the end of the week it was discovered that regular users on rhos-qe-jenkins were allowed to manually create and modify jobs (this should be possible only via jjb in git); this was corrected (noted here in case anyone hits issues with that)
## June 19th
### TripleO
[14:38:37] <weshay|ruck> unit test / DLRN FAIL fix https://review.opendev.org/#/c/736816/2
~~[14:38:52] <weshay|ruck> preventative action https://review.opendev.org/#/c/736823/3~~ - merging
0/<soniya>looking at periodic jobs... </soniya>
<wes>noticed ovn-fs010 failed.. debugging and waiting for another failure before raising an LP. https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-ovn-provider-standalone-master/2b62fd0/</wes>
<rlandy>pinged kforde re: stacks in rdocloud</rlandy>
<rlandy>rdocloud bmc-template image updated per sagi's email</rlandy>
<rlandy> do we need to w+ ~~https://review.opendev.org/#/c/736816/ [DONE]~~ and https://review.opendev.org/#/c/736823/ ?</rlandy>
<rlandy>https://logserver.rdoproject.org/openstack-component-security/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-master-component-security-promote-consistent-to-component-ci-testing/3082b35/job-output.txt
2020-06-19 15:05:00.530944 | primary | /home/zuul/workspace/dlrnapi_venv/bin/activate: line 31: $1: unbound variable
Testing: https://review.rdoproject.org/r/28170 Add VIRTUAL_ENV_DISABLE_PROMPT to avoid unbound error
^^ didn't work - revert did ... https://review.rdoproject.org/r/#/c/28171/
</rlandy>
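The `$1: unbound variable` message is what bash prints when a script running with `set -u` (nounset) reads a parameter that was never set. A minimal sketch of that failure class and the usual guard, assuming `bash` is available on the path (the one-line scripts are illustrative and are not the contents of the venv `activate` script):

```python
import os
import subprocess


def run_strict(script):
    """Run `script` under `bash -u` and return (exit code, stderr text)."""
    proc = subprocess.run(
        ["bash", "-u", "-c", script],
        capture_output=True,
        text=True,
        # Force untranslated messages so the error text is predictable.
        env={**os.environ, "LC_ALL": "C"},
    )
    return proc.returncode, proc.stderr.strip()


# No positional arguments are passed, so $1 is unset in both runs.
failing = run_strict('echo "arg: $1"')
guarded = run_strict('echo "arg: ${1:-}"')

print("unguarded:", failing)
print("guarded:  ", guarded)
```

The `${1:-}` form expands to an empty string when the parameter is unset, which is why the guarded run exits cleanly under nounset.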
<rlandy> multinode-ipa failures:
* https://bugs.launchpad.net/tripleo/+bug/1884287
* Failed to download packages: Cannot download Packages/python3-qrcode-core-5.1-12.module_el8.2.0+370+b142e101.noarch.rpm: All mirrors were tried
https://bugs.launchpad.net/tripleo/+bug/1884570
</rlandy>
<rlandy>
https://review.rdoproject.org/r/28173 Reduce where OVB jobs are run - overloaded clouds
</rlandy>
### OSP
## June 18th
### TripleO
* https://bugs.launchpad.net/tripleo/+bug/1884115 FTR
<openstack> Launchpad bug 1884115 in tripleo "AH00534: httpd: Configuration error: More than one MPM loaded." [High,Triaged] - Assigned to Emilien Macchi (emilienm)
* https://bugs.launchpad.net/tripleo/+bug/1884138
tripleoclient fails to build on delorean: ModuleNotFoundError: No module named 'tripleo_common.actions.config'
https://review.opendev.org/#/c/736816/
https://review.opendev.org/#/c/736823/
see Ravi's comment "I think it would need https://review.opendev.org/#/c/736944/"
* Stacks in RDO cloud are VERY stuck - will talk to infra tomorrow (rlandy)
### OSP