# Ruck and Rover notes #29
###### tags: `ruck_rover`

:::info
Important links for ruck rovers:
[ruck/rover links to help](https://hackmd.io/07z0xroHTFi2IbX93P5ZfQ)

**Ruck Rover - Unified Sprint #29**
Dates: June 18 - July 8th
Tripleo CI team ruck|rover: Soniya Vyas, Sagi Shnaidman, Ronelle Landy
OSP CI team ruck|rover: psedlak, tkorol?
Previous notes: https://hackmd.io/YAqFJrKMThGghTW4P2tabA
**Next #30 notes: https://hackmd.io/6Bx0FXwlRNCc75l39NSKvg**
:::

[TOC]

---

## on-going issues

:::danger
## TripleO
### zuul patch
https://review.opendev.org/#/c/738668/

- https://trello.com/c/ZrPzcm1x/1558-cixlp1883977tripleociproa-spike-in-node-failures-in-rdo-3rd-party-zuul
  - again 7 failures today
  - Bug: [spike in node failures in rdo 3rd party zuul](https://bugs.launchpad.net/tripleo/+bug/1883977)
  - https://review.rdoproject.org/zuul/builds?result=NODE_FAILURE
- [1884518 - OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud](https://bugs.launchpad.net/tripleo/+bug/1884518)

### gate
Gate failures are at 28 - um make that 32 now:
* looks like we have a number of failures on RAX - asked on openstack-infra - pls follow up

### periodic / 3rd party

## OSP
### osp15 0624.n.2
* first failed in phase1 due to missing container images
* after rebuild it passed phase1
* FAILED in phase2 - ceph-jobs fail OC deploy
* once rhcs4beta is out of the way (ceph-external or IR revert see below)
  * jobs are also affected by ansible-2.9.10
  * cloned as https://bugzilla.redhat.com/show_bug.cgi?id=1850978
  * mistake solved by rhos-release fix, it should have been using ansible 2.8
* **EnableRhcs4Beta: true** is missing due to bug in infrared
  * while need for this should be removed in z3, not yet in this CVE compose
  * https://bugzilla.redhat.com/show_bug.cgi?id=1794530
  * https://bugzilla.redhat.com/show_bug.cgi?id=1790906
  * there is delivery ticket for considering shipping the drop-Beta change https://trello.com/c/MAT790lU
  * this seems to be
rejected as such change (non-beta ceph4) has too big risk/impact
  * see newer/latest comments in jira ticket for which the breaking IR change was done
  * https://projects.engineering.redhat.com/browse/RHOSINFRA-3277
  * **revert/fix of infrared issue in https://review.gerrithub.io/c/redhat-openstack/infrared/+/496293**
* OSP15 + rhos-release/ansible fixed
  * https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase2-15_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph-ssl/5/
* tested in 16.1 phase1 - all ok, no change in ceph config
  * https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/83/
:::

### Reviews / Fixes
::: spoiler PATCHES
**Remove docker_host configuration from env files**
https://review.opendev.org/737235
for bug [1884371](https://bugs.launchpad.net/tripleo/+bug/1884371)

**Configure docker host for local container build**
https://review.opendev.org/737234
for bug [1884371](https://bugs.launchpad.net/tripleo/+bug/1884371)

### RETRY_LIMIT
https://bugs.launchpad.net/tripleo/+bug/1885701
https://bugs.launchpad.net/tripleo/+bug/1885697
https://bugs.launchpad.net/tripleo/+bug/1885286

### SegFaults
https://bugs.launchpad.net/tripleo/+bug/1885728

### node failures
https://bugs.launchpad.net/tripleo/+bug/1885715

### image builds need to land ( image builds )
https://review.opendev.org/#/c/738434/
https://review.opendev.org/#/c/738469/
:::

### Bugs reported
::: spoiler BUGS
[1884371 - periodic master - queens jobs using docker.io in container prep](https://bugs.launchpad.net/tripleo/+bug/1884371)
[1884518 - OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ...
is locked by host undercloud](https://bugs.launchpad.net/tripleo/+bug/1884518)
[1885279 - TestVolumeBootPattern.test_volume_boot_pattern tests on master are failing on updating to cirros-0.5.1 image](https://bugs.launchpad.net/tripleo/+bug/1885279)
[1885286 - Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate](https://bugs.launchpad.net/tripleo/+bug/1885286)
[1884287 - ipa-server install error: 2020-06-19T13:10:53Z DEBUG The ipa-server-install command failed, exception: DNSZoneAlreadyExists: DNS zone](https://bugs.launchpad.net/tripleo/+bug/1884287)
[1885314 - OVB master job running on vexxhost show some nodes failing introspection step](https://bugs.launchpad.net/tripleo/+bug/1885314)
[1885315 - tripleo-buildimage-overcloud-full-centos-8 is failing on the update of libnghttp2 package](https://bugs.launchpad.net/tripleo/+bug/1885315)
[1886068 - multinode-ipa tests are failing standalone deployment - 'regsubst' parameter 'target' expects a value of type Array or String, got Undef-](https://bugs.launchpad.net/tripleo/+bug/1886068)
:::

---

[TOC]

---

### Reviews / Fixes
::: spoiler PATCHES
:::

### Bugs reported
::: spoiler PATCHES
:::

---

:::info
add dates in descending order so the latest date is at the top.
Break out TripleO and OSP sections.
:::

## July 2nd
### TripleO
scenario010-ovn-provider-standalone fails

## July 1st
### TripleO
reported:
- ~~1885865~~ Periodic C8 Ceph Integration/ Ceph ansible integration jobs are failing Error: msg": "The conditional check 'release is search(\"queens|rocky|stein|train\")' failed - sandeep proposed patch
- ~~1885911~~: C8 Ceph Ansible integration train/ussuri jobs pulling master bits - sandeep proposed patch

### OSP
* auto-promote of passed phase compose symlinks is broken, already for last few days
  * FIXED today July 1st
  * `bash: /home/boston/lhh/puddle-promote: No such file or directory`
  * this means once phase2 passes, symlinks are not updated and phase3 when started will consume wrong - previous - compose instead
  * 16.0 compose promoted manually (p3 aborted as it was container grade test compose)
  * 16.1 and 13 in progress - if auto-promote not fixed, it will trigger p3 with wrong compose

## June 29th
### TripleO
reported:
1885642 tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades queens fails on pip install contextlib2 between 0.5 < 0.6 when queens
1885637 tripleo-upgrade needs to update yum repos prior to yum updates for upgrade jobs.

## June 28th
### TripleO
<sshnaidm>

- rocky push containers job is timing out consistently, suspect of trunk registry problem, need to wait for infra folks to look tomorrow

</sshnaidm>

## June 26th
### TripleO
<rlandy>

* rdocloud promoter - had a restart ... need panda's help to fix it up and running now.
* https://review.opendev.org/#/c/738025/ - need to get through - revert cirros change on master https://bugs.launchpad.net/tripleo/+bug/1885279
* https://bugs.launchpad.net/tripleo/+bug/1885286 - Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate

</rlandy>

### OSP
* psedlak on pto friday 26. 6.
2020 (and tkorol/tlv not on friday)
* afazekas/wznoinsk to provide cover fire

#### 16.1
* **new compose for RC 0625.n.0**
  * passed phase1
  * phase2 in progress - check it on friday morning
  * seems most passed, two jobs left ...
* previous compose **0623.n.0** failed in phase1
  * failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
  * caused by update of ansible-2.9.10 arrived in rhel-8.2
  * fix should be on the way but it is possibly affecting also other parts, not just the first one identified

## June 25th
### TripleO
<soniya>We are not facing stack delete failure currently</soniya>

## June 24th
### TripleO
<soniya>No more stacks failed to delete issues
~~too many Post failures and retry_limits issues in Upstream gate jobs~~
Most of above issues are resolved and patches have been merged</soniya>

<rlandy>
Multiple failures on each release - tracking here
Also stuck stacks are still an issue: https://bugs.launchpad.net/tripleo/+bug/1884845
Soniya, please see if this is still an issue tomorrow - ie: are we getting more stacks failing to delete?
</rlandy>

<rlandy>
TODO: Component job failures - especially tripleo
TODO for tomorrow - promote train, queens
</rlandy>

### OSP
* 16.1
  * 0622.n.2 fails in phase1 in UC install
    * `puppet-user: Error: Evaluation Error: Error while evaluating a Function Call, Could not find class ::panko::client for undercloud-0.redhat.local (file: /var/lib/tripleo-config/puppet_step_config.pp, line: 51, column: 1) on node undercloud-0.redhat.local"], ...`
  * new compose **0623.n.0** in phase1
    * passed the UC stage so previous puppet issue is resolved
    * failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
* rhos-slave-00..03 (nodes in rdu2 rhev-ci-vms) are having issues with dhcp not providing dns info, without dns they are broken (no git cloning etc)
  * ~~discovered yesterday, they are off in jenkins now~~
  * manually injected rdu2 nameservers in their resolv.conf
    * ~~but that will not survive with NetworkManager updating it according to dhcp info~~
    * they do not use dhcp at all, but static config, so simply dns entry is missing there
    * manual fix still works, 3 of 4 are back online for now
    * 4th (the 00) one to be used for debug/testing the issue and PnT ticket to be filed then
  * resolved, details in https://projects.engineering.redhat.com/browse/RHOSINFRA-3513

## June 22nd
### TripleO
- ~~<soniya> noticed multinode fs10-stein failed: https://bugs.launchpad.net/tripleo/+bug/1884487 log: https://logserver.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-stein/9138146/logs/undercloud/home/zuul/undercloud_install.log.txt.gz</soniya>~~ dup of 1884371
- <soniya>failed centos8-multinode job once in the Upstream gate jobs - https://review.opendev.org/#/c/736089/3 recheck is given</soniya>
- <soniya> failed centos-8-standalone-on-multinode-ipa job once in the Upstream gate jobs - https://review.opendev.org/#/c/736521/</soniya>

:warning: <sshnaidm>[1884518 - OVB
metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud](https://bugs.launchpad.net/tripleo/+bug/1884518)</sshnaidm>
<sshnaidm>running https://review.rdoproject.org/r/#/c/28004 for reproducing</sshnaidm>

## June 21st
### TripleO
master pipeline failing on container registry settings https://bugs.launchpad.net/tripleo/+bug/1884371

### OSP
* mostly everything ok (no new compose with issues or major outage)
* still a bit of fallout after last week's shutdown of tlv and jenkins
  * based on https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/ there are few jobs running(=hanging) for days
    * likely last case of the job which lost its node while running https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/job/DFG-all-unified-16_director-rhel-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph-native-default/201/
    * qe-ciosp-03 was moved to tlv2 jenkins but this job depends on it https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/running/job/gate-infrared-openstack-ci-osp7/528/
      * via gate-infrared-openstack node label, unknown to me at this moment why the label exists and why it would have just single slave
* also at end of week it was discovered that on rhos-qe-jenkins it was allowed for regular users to manually create&modify jobs (should be only via jjb in git), it was corrected (noted here in case anyone hits issues with that)

## June 19th
### TripleO
[14:38:37] <weshay|ruck> unit test / DLRN FAIL fix https://review.opendev.org/#/c/736816/2
~~[14:38:52] <weshay|ruck> preventative action https://review.opendev.org/#/c/736823/3~~ - merging

0/<soniya>looking at periodic jobs... </soniya>
<wes>noticed ovn-fs010 failed.. debugging and waiting another failure to raise a lp.
https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-ovn-provider-standalone-master/2b62fd0/</wes>
<rlandy>pinged kforde re: stacks in rdocloud</rlandy>
<rlandy>rdocloud bmc-template image updated per sagi's email</rlandy>
<rlandy> do we need to w+ ~~https://review.opendev.org/#/c/736816/ [DONE]~~ and https://review.opendev.org/#/c/736823/ ?</rlandy>
<rlandy>https://logserver.rdoproject.org/openstack-component-security/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-master-component-security-promote-consistent-to-component-ci-testing/3082b35/job-output.txt
`2020-06-19 15:05:00.530944 | primary | /home/zuul/workspace/dlrnapi_venv/bin/activate: line 31: $1: unbound variable`
Testing: https://review.rdoproject.org/r/28170 Add VIRTUAL_ENV_DISABLE_PROMPT to avoid unbound error
^^ didn't work - revert did ... https://review.rdoproject.org/r/#/c/28171/
</rlandy>

<rlandy>

multinode-ipa failures:
* https://bugs.launchpad.net/tripleo/+bug/1884287
* Failed to download packages: Cannot download Packages/python3-qrcode-core-5.1-12.module_el8.2.0+370+b142e101.noarch.rpm: All mirrors were tried https://bugs.launchpad.net/tripleo/+bug/1884570

</rlandy>

<rlandy>
https://review.rdoproject.org/r/28173 Reduce where OVB jobs are run - overloaded clouds
</rlandy>

### OSP

## June 18th
### TripleO
* https://bugs.launchpad.net/tripleo/+bug/1884115 FTR
  <openstack> Launchpad bug 1884115 in tripleo "AH00534: httpd: Configuration error: More than one MPM loaded."
[High,Triaged] - Assigned to Emilien Macchi (emilienm)
* https://bugs.launchpad.net/tripleo/+bug/1884138 tripleoclient fails to build on delorean: ModuleNotFoundError: No module named 'tripleo_common.actions.config'
  * https://review.opendev.org/#/c/736816/
  * https://review.opendev.org/#/c/736823/
  * see Ravi's comment "I think it would need https://review.opendev.org/#/c/736944/"
* Stacks in RDO cloud are VERY stuck - will talk to infra tomorrow (rlandy)

### OSP