
Ruck and Rover notes #29

tags: ruck_rover

Important links for ruck rover's ruck/rover links to help
Ruck Rover - Unified Sprint #29
Dates: June 18 - July 8th

Tripleo CI team ruck|rover: Soniya Vyas, Sagi Shnaidman, Ronelle Landy
OSP CI team ruck|rover: psedlak, tkorol?

Previous notes: https://hackmd.io/YAqFJrKMThGghTW4P2tabA
Next #30 notes: https://hackmd.io/6Bx0FXwlRNCc75l39NSKvg


on-going issues

TripleO

zuul patch

https://review.opendev.org/#/c/738668/

gate

Gate failures are at 28 - um make that 32 now:

  • looks like we have a number of failures on RAX - asked on openstack-infra - pls follow up

periodic / 3rd party

OSP

osp15 0624.n.2

Reviews / Fixes

PATCHES

Remove docker_host configuration from env files
https://review.opendev.org/737235 for bug 1884371
Configure docker host for local container build
https://review.opendev.org/737234 for bug 1884371

RETRY_LIMIT

https://bugs.launchpad.net/tripleo/+bug/1885701
https://bugs.launchpad.net/tripleo/+bug/1885697
https://bugs.launchpad.net/tripleo/+bug/1885286

SegFaults

https://bugs.launchpad.net/tripleo/+bug/1885728

node failures

https://bugs.launchpad.net/tripleo/+bug/1885715

image builds

need to land ( image builds )
https://review.opendev.org/#/c/738434/
https://review.opendev.org/#/c/738469/

Bugs reported

BUGS

1884371 - periodic master - queens jobs using docker.io in container prep
1884518 - OVB metalsmith deployment fails: Failed to attach VIF to bare metal node, Node is locked by host undercloud
1885279 - TestVolumeBootPattern.test_volume_boot_pattern tests on master are failing on updating to cirros-0.5.1 image
1885286 - Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate
1884287 - ipa-server install error: 2020-06-19T13:10:53Z DEBUG The ipa-server-install command failed, exception: DNSZoneAlreadyExists: DNS zone
1885314 - OVB master job running on vexxhost show some nodes failing introspection step
1885315 - tripleo-buildimage-overcloud-full-centos-8 is failing on the update of libnghttp2 package
1886068 - multinode-ipa tests are failing standalone deployment - 'regsubst' parameter 'target' expects a value of type Array or String, got Undef-



Add dates in descending order so the latest date is at the top. Break out TripleO and OSP sections.

July 2nd

TripleO

scenario010-ovn-provider-standalone fails

July 1st

TripleO

reported:

  • 1885865 Periodic C8 Ceph Integration / Ceph ansible integration jobs are failing with error "msg": "The conditional check 'release is search("queens|rocky|stein|train")' failed" - sandeep proposed patch
  • 1885911: C8 Ceph Ansible integration train/ussuri jobs pulling master bits - sandeep proposed patch
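For readers less familiar with the conditional quoted in 1885865: Ansible's `search` test is a regular-expression search, so the check asks whether `release` contains any of the listed release names. A minimal Python sketch of that matching logic follows. The helper name `is_older_release` is hypothetical, and the notes do not record why the conditional failed in the job, so this only illustrates what the expression tests.

```python
import re

# Pattern copied from the conditional quoted in bug 1885865.
OLDER_RELEASES = "queens|rocky|stein|train"


def is_older_release(release: str) -> bool:
    """Approximate `release is search("queens|rocky|stein|train")`.

    Hypothetical helper: re.search matches anywhere in the string,
    so "stable/train" matches as well as "train".
    """
    return re.search(OLDER_RELEASES, release) is not None


if __name__ == "__main__":
    for name in ("train", "ussuri", "master"):
        print(name, is_older_release(name))
```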

OSP

  • auto-promote of passed phase compose symlinks has been broken for the last few days
    • FIXED today July 1st
    • bash: /home/boston/lhh/puddle-promote: No such file or directory
    • this means once phase2 passes, symlinks are not updated, and phase3, when started, will consume the wrong (previous) compose instead
    • 16.0 compose promoted manually (p3 aborted as it was container grade test compose)
    • 16.1 and 13 in progress - if auto-promote is not fixed, it will trigger p3 with the wrong compose

June 29th

TripleO

reported:
1885642 tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades queens fails on pip install contextlib2 between 0.5 < 0.6 when queens

1885637 tripleo-upgrade needs to update yum repos prior to yum updates for upgrade jobs.

June 28th

TripleO

  • rocky push containers job is timing out consistently;
    suspect a trunk registry problem, need to wait for infra folks to look tomorrow

June 26th

TripleO

OSP

  • psedlak on pto friday 26. 6. 2020 (and tkorol/tlv not on friday)
    • afazekas/wznoinsk to provide cover fire

16.1

  • new compose for RC 0625.n.0
    • passed phase1
    • phase2 in progress - check it on friday morning
      • seems most passed, two jobs left
  • previous compose 0623.n.0 failed in phase1
    • failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
    • caused by the ansible-2.9.10 update that arrived in rhel-8.2
    • fix should be on the way but it is possibly affecting also other parts, not just the first one identified

June 25th

TripleO

<soniya>We are not facing stack delete failure currently</soniya>

June 24th

TripleO

<soniya>No more "stack failed to delete" issues
too many POST failures and RETRY_LIMIT issues in upstream gate jobs
Most of the above issues are resolved and patches have been merged</soniya>
<rlandy>Multiple failures on each release - tracking here

Also stuck stacks are still an issue:
https://bugs.launchpad.net/tripleo/+bug/1884845
Soniya, please see if this is still an issue tomorrow - ie: are we getting more stacks failing to delete?
</rlandy>

OSP

  • 16.1

    • 0622.n.2 fails in phase1 in UC install
      • puppet-user: Error: Evaluation Error: Error while evaluating a Function Call, Could not find class ::panko::client for undercloud-0.redhat.local (file: /var/lib/tripleo-config/puppet_step_config.pp, line: 51, column: 1) on node undercloud-0.redhat.local"], ...
    • new compose 0623.n.0 in phase1
      • passed the UC stage so previous puppet issue is resolved
      • failed in ceph-ansible - failed to create temporary directory, cix https://trello.com/c/zDFpdGiy
  • rhos-slave-00..03 (nodes in rdu2 rhev-ci-vms) are having issues with dhcp not providing dns info, without dns they are broken (no git cloning etc)

    • discovered yesterday, they are off in jenkins now
    • manually injected rdu2 nameservers in their resolv.conf
    • but that will not survive with NetworkManager updating it according to dhcp info
      • they do not use dhcp at all, but static config, so simply dns entry is missing there
    • manual fix still works; 3 of 4 are back online for now
    • 4th (the 00) one to be used for debug/testing the issue and PnT ticket to be filed then
    • resolved, details in https://projects.engineering.redhat.com/browse/RHOSINFRA-3513
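A quick way to spot a node in the state described above (static config with no DNS entry) is to check whether its resolv.conf lists any nameserver. The sketch below is an illustration only, not the check used for RHOSINFRA-3513. The function names are hypothetical and the default path `/etc/resolv.conf` is an assumption about the nodes.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_dns(path: str = "/etc/resolv.conf") -> bool:
    """True if the file at `path` (assumed location) lists a nameserver."""
    with open(path) as handle:
        return bool(nameservers(handle.read()))


if __name__ == "__main__":
    sample = "search example.com\nnameserver 192.0.2.1\n"
    print(nameservers(sample))
```

Running `has_dns()` on a slave before putting it back online in jenkins would flag the missing-entry case without waiting for a git clone to fail.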

June 22nd

TripleO

<soniya>failed centos8-multinode job once in the Upstream gate jobs - https://review.opendev.org/#/c/736089/3
recheck is given</soniya>

<soniya> failed centos-8-standalone-on-multinode-ipa job once in the Upstream gate jobs - https://review.opendev.org/#/c/736521/</soniya>

<sshnaidm>1884518 - OVB metalsmith deployment fails: Failed to attach VIF to bare metal node, Node is locked by host undercloud </sshnaidm>
<sshnaidm>running https://review.rdoproject.org/r/#/c/28004 for reproducing</sshnaidm>

June 21st

TripleO

master pipeline failing on container registry settings
https://bugs.launchpad.net/tripleo/+bug/1884371

OSP

June 19th

TripleO

[14:38:37] <weshay|ruck> unit test / DLRN FAIL fix https://review.opendev.org/#/c/736816/2

[14:38:52] <weshay|ruck> preventative action https://review.opendev.org/#/c/736823/3 - merging

<soniya>looking at periodic jobs</soniya>

<wes>noticed ovn-fs010 failed.. debugging and waiting for another failure before raising an LP. https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-ovn-provider-standalone-master/2b62fd0/</wes>

<rlandy>pinged kforde re: stacks in rdocloud</rlandy>

<rlandy>rdocloud bmc-template image updated per sagi's email</rlandy>

<rlandy>do we need to w+ https://review.opendev.org/#/c/736816/ [DONE] and https://review.opendev.org/#/c/736823/ ?</rlandy>

<rlandy>https://logserver.rdoproject.org/openstack-component-security/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-master-component-security-promote-consistent-to-component-ci-testing/3082b35/job-output.txt
2020-06-19 15:05:00.530944 | primary | /home/zuul/workspace/dlrnapi_venv/bin/activate: line 31: $1: unbound variable
Testing: https://review.rdoproject.org/r/28170 Add VIRTUAL_ENV_DISABLE_PROMPT to avoid unbound error

^^ didn't work - revert did https://review.rdoproject.org/r/#/c/28171/
</rlandy>
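For context on the `$1: unbound variable` message above: bash prints it when a script running with `set -u` (nounset) expands a positional parameter that was never passed. The sketch below reproduces that class of error and shows the usual `${1:-}` guard. It assumes `bash` is on the PATH, the helper `run_bash` is hypothetical, and it does not represent the fix applied here (the notes record that the revert in 28171 is what worked).

```python
import subprocess


def run_bash(script: str) -> subprocess.CompletedProcess:
    """Run a bash snippet with no positional arguments (hypothetical helper)."""
    # "demo" becomes $0, so $1 stays unset inside the snippet.
    return subprocess.run(
        ["bash", "-c", script, "demo"],
        capture_output=True,
        text=True,
    )


# Under nounset, expanding the unset $1 aborts the script.
strict = run_bash('set -u; echo "arg=$1"')

# ${1:-} substitutes an empty string when $1 is unset, so the script continues.
guarded = run_bash('set -u; echo "arg=${1:-}"')

if __name__ == "__main__":
    print(strict.returncode, strict.stderr.strip())
    print(guarded.returncode, guarded.stdout.strip())
```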

<rlandy>multinode-ip failures:</rlandy>

OSP

June 18th

TripleO

OSP
