
Ruck Rover 2022-10-14 - 2022-10-20

tags: ruck_rover
Next RR notes: https://hackmd.io/wtT4lbOSSeuLcRS2aPTQAQ
Previous RR notes: https://hackmd.io/J4_ZyTvITtS51Wvmd5feRw
ruck & rover: marios & dasm

RDO Cockpit / RHOS Cockpit

RDO Promoter / RHOS Promoter

OpenStack Program Meeting 2022

Zuul Status:

Active bugs


Oct 21

New/Transient/No bug yet:

d/stream

rhel8/16.2 - still hitting registry issues https://bugzilla.redhat.com/show_bug.cgi?id=2135432#c6; manually rekicked openstack-periodic-integration-rhos-16.2, 1 currently running
centos8 components (ibm cloud) are stuck and holding component lines, e.g. https://review.rdoproject.org/zuul/buildset/e41b7c8fbba142b0b0be4d5929ca6739 (15 hours in progress)
https://bugs.launchpad.net/tripleo/+bug/1984237 -> hitting check and also periodic integration, e.g. https://review.rdoproject.org/zuul/build/e2c88a92218c4f1f98b4e03010d13b3f
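One of the buildsets above has been in progress for 15 hours. A quick way to spot stuck jobs is to pull the buildset from Zuul's REST API and count builds by result. A minimal sketch (the tenant name and endpoint layout are assumptions based on upstream Zuul's API, not verified against this deployment):

```python
# Sketch: count a Zuul buildset's builds by result to spot stuck jobs.
# A build whose "result" is still null is queued or running.
import json
import urllib.request


def summarize_builds(builds):
    """Count builds by result; a missing/null result means still running."""
    counts = {}
    for build in builds:
        key = build.get("result") or "IN_PROGRESS"
        counts[key] = counts.get(key, 0) + 1
    return counts


def fetch_buildset(zuul_root, tenant, uuid):
    """Fetch one buildset from Zuul's REST API (layout per upstream Zuul)."""
    url = f"{zuul_root}/api/tenant/{tenant}/buildset/{uuid}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


# Example (tenant name is an assumption, UUID from the buildset URL above):
# bs = fetch_buildset("https://review.rdoproject.org/zuul",
#                     "rdoproject.org", "e41b7c8fbba142b0b0be4d5929ca6739")
# print(summarize_builds(bs["builds"]))
```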

Oct 20


Oct 19

pinged on rhos-ops:

~~~
​​​​     <bhagyashris> Hi Team, we are still hitting retry limit issue and that is causing promtion blocker at downstream'
​​​​    <bhagyashris> 2022-10-19 05:42:29.219305 | primary |   "msg": "Failure downloading http://download.devel.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm, Request failed: <urlopen error [Errno -2] Name or service not known>",
​​​​    
​​​​    <bhagyashris> fbo, wznoinsk|ruck ^
​​​​    <bhagyashris> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-cloudops/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-9-scenario002-standalone-cloudops-rhos-17.1/ab8ba26/job-output.txt

​​​​    <dpawlik> cc kforde ^^
​​​​    <dpawlik> do we have some network issues?

​​​​    <dpawlik> only one outage topic is related to network: https://groups.google.com/u/0/a/redhat.com/g/outage-list/c/h8-ZkLuspxk 
​​​​    <dpawlik> and its not related
​​​​    <dpawlik> bhagyashris: did you hold the node and check if its reachable?

​​​​    <bhagyashris> dpawlik, currently in the running integration line job we are hitting this issue 
​​​​    <bhagyashris> 2022-10-18 18:03:17.050564 | TASK [get_hash : get md5 file]
​​​​    <bhagyashris> 2022-10-18 18:03:37.609246 | primary | ERROR
​​​​    <bhagyashris> 2022-10-18 18:03:37.609628 | primary | {
​​​​    <bhagyashris> 2022-10-18 18:03:37.609678 | primary |   "dest": "/home/zuul/workspace/delorean.repo.md5",
​​​​    <bhagyashris> 2022-10-18 18:03:37.609705 | primary |   "elapsed": 20,
​​​​    <bhagyashris> 2022-10-18 18:03:37.609732 | primary |   "msg": "Request failed: <urlopen error [Errno -2] Name or service not known>",
​​​​    <bhagyashris> 2022-10-18 18:03:37.609781 | primary |   "url": "https://osp-trunk.hosted.upshift.rdu2.redhat.com/rhel8-osp17-1/promoted-components/delorean.repo.md5"
​​​​    <bhagyashris> 2022-10-18 18:03:37.609805 | primary | }
​​​​    <bhagyashris> locally it's accessible "https://osp-trunk.hosted.upshift.rdu2.redhat.com/rhel8-osp17-1/promoted-components/delorean.repo.md5
​​​​    <bhagyashris> not sure why it's causing issue on job node 
​​​​    * evallesp (~evallesp@10.39.194.108) has joined
​​​​    <bhagyashris> dpawlik, added this job https://code.engineering.redhat.com/gerrit/c/testproject/+/431169/6/.zuul.yaml on node hold
​​​​    <bhagyashris> https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/status/change/431169,6
​​​​    <bhjf> Title: Zuul (at sf.hosted.upshift.rdu2.redhat.com)
​​​​    <dpawlik> bhagyashris: on vexxhost we have partially same issue: on some host it can not reach trunk.rdoproject.org server
​​​​    <dpawlik> they fix that, it was something wrong with the host
~~~
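The "[Errno -2] Name or service not known" failures in the transcript above are DNS resolution errors, not HTTP errors. From a held node, a small probe like this can tell the two apart (a sketch; the URL in the example is the one from the job output, and the classification logic is our assumption about where the failure sits):

```python
# Sketch: distinguish a DNS failure ("Name or service not known") from an
# HTTP-level failure when a job node cannot fetch a file.
import socket
import urllib.error
import urllib.parse
import urllib.request


def classify_fetch_failure(url, timeout=20):
    host = urllib.parse.urlsplit(url).hostname
    try:
        socket.getaddrinfo(host, 443)  # DNS lookup only, no connection
    except socket.gaierror:
        return "dns"  # matches "[Errno -2] Name or service not known"
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except urllib.error.URLError:
        return "http"
    return "ok"


# Example with the URL from the job output above:
# classify_fetch_failure("https://osp-trunk.hosted.upshift.rdu2.redhat.com"
#                        "/rhel8-osp17-1/promoted-components/delorean.repo.md5")
```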
  • List of hashes that we can promote:
    • rhos16-2 on rhel8:

      • ac7a781ab85cfc2c9b1a1b6aad4a50ab:
        • Missing Jobs:
          • periodic-tripleo-ci-rhel-8-bm_envD-3ctlr_1comp-featureset035-rhos-16.2
          • periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-16.2
          • periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-internal-rhos-16.2
          • periodic-tripleo-ci-rhel-8-ovb-1ctlr_2comp-featureset020-internal-rhos-16.2
    • rhos17-1 on rhel9:

      • 85b7a0a2481df9e73096a6bc88dc71f7
        • Missing Jobs:
          • periodic-tripleo-ci-rhel-9-ovb-3ctlr_1comp-featureset001-internal-rhos-17.1
          • periodic-tripleo-ci-rhel-9-ovb-3ctlr_1comp-featureset035-internal-rhos-17.1
          • periodic-tripleo-ci-rhel-9-ovb-1ctlr_2comp-featureset020-rbac-internal-rhos-17.1
          • periodic-tripleo-ci-rhel-9-ovb-1ctlr_2comp-featureset020-internal-rhos-17.1
    • Component line:
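A hash becomes promotable once every criteria job has passed, so the "Missing Jobs" lists above are simply the set difference between the criteria jobs and the jobs that succeeded for that hash. A toy sketch (the criteria set here is an illustrative subset, not the real promotion config):

```python
# Illustrative sketch: "Missing Jobs" = criteria jobs minus succeeded jobs.
# The job names below are a made-up subset, not the real promotion criteria.
def missing_jobs(criteria, succeeded):
    """Return the criteria jobs that have not yet succeeded, sorted."""
    return sorted(set(criteria) - set(succeeded))


criteria = {
    "periodic-tripleo-ci-rhel-9-standalone-rhos-17.1",
    "periodic-tripleo-ci-rhel-9-ovb-3ctlr_1comp-featureset001-internal-rhos-17.1",
}
succeeded = {"periodic-tripleo-ci-rhel-9-standalone-rhos-17.1"}
print(missing_jobs(criteria, succeeded))
# -> ['periodic-tripleo-ci-rhel-9-ovb-3ctlr_1comp-featureset001-internal-rhos-17.1']
```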


Oct 18

pinged on rhos-ops:

~~~

​​​​        <bhagyashris> evallesp, wznoinsk|ruck hey currently we are facing this issue for ovb jobs https://bugzilla.redhat.com/show_bug.cgi?id=2135616

​​​​        <bhagyashris> and this one https://bugzilla.redhat.com/show_bug.cgi?id=2135432  we hit on friday and yesterday on container build push job looks like it's intermittent but some how feeling like infra is not stable

​​​​        <bhagyashris> and one more is "Could not resolve host: download.devel.redhat.com" is also coming intermittently 
​​​​        <bhagyashris> could you please check

​​​​        <dpawlik> bhagyashris: did you check outage list
​​​​        <dpawlik> if there are some DNS maintenance? 
​​​​        <apevec> for upshift registry, I pinged internal pnt infra gchat there where rlandy reported registry issues last week, no new replies yet

​​​​        <evallesp> Yesterday I found some DNS errors as well... I though it was similar the internal SSO.
​​​​        <apevec> bhagyashris (IRC): which nameservers do we have now in resolve.conf ?
​​​​        <apevec> there's other thread in pnt-infra gchat about some nameservers not working
​​​​        <apevec> > 10.11.142.1  seems to not work
​​​​        <apevec> > These are the resolvers within RDU2 near RHOS-D:
​​​​        <apevec> nameserver 10.11.5.160
​​​​        <apevec> nameserver 10.11.5.19

​​​​        <apevec> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-internal-clients-rhos-16.2/a2166e8/logs/hostvars-variables.yaml
​​​​        <apevec>     ansible_dns:
​​​​        <apevec>         nameservers:
​​​​        <apevec>         - 10.11.5.19
​​​​        <apevec>         - 10.5.30.45
​​​​        <bhagyashris> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/fe9fbf1/logs/undercloud/etc/resolv.conf
​​​​        <apevec> nameserver 10.11.5.19
​​​​        <apevec> nameserver 10.5.30.45
​​​​        <apevec> ok so first one is what pnt-infra said, but what is the other one

​​​​        <dpawlik> if someone is wondering why upstream zuul does not take any new request: "2022-10-18 07:29:32,336 DEBUG zuul.GithubRateLimitHandler: GitHub API rate limit (ansible-collections/community.digitalocean, 20166502) resource: core, remaining: 12500, reset: 1666081772"

​​​​        <apevec> ah opendev doesn't get some free unlimited account?
​​​​        <dpawlik> dunno
​​​​        <dpawlik> I don't think they are using GH a lot
​​​​        <dpawlik> just a mirror, most things are on opendev side
​​​​        <apevec> bhagyashris (IRC): so in which tasks Failed to discover available identity versions happens, can you point to the code and how we can reproduce outside CI job?

​​​​        <bhagyashris> apevec, here is the log https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-internal-clients-rhos-16.2/a2166e8/job-output.txt
​​​​        <bhagyashris> let me pass the taskwhere it failed

​​​​        <bhagyashris> some where in ovb-manage: Create stack it failed
​​​​        <apevec> is ovb-manage not producing more debug info?
​​​​        <bhagyashris> https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/roles/ovb-manage/tasks/ovb-create-stack.yml#L43
​​​​        <bhagyashris> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-internal-clients-rhos-16.2/a2166e8/logs/bmc-console.log
​​​​        <bhagyashris> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-internal-clients-rhos-16.2/a2166e8/logs/failed_ovb_stack.log

​​​​        <marios> apevec: https://bugzilla.redhat.com/show_bug.cgi?id=2135616#c3 keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13000/v3/auth/tokens
​​​​        <bhjf> Bug 2135616: urgent, unspecified, ---, ---, rhos-maint, distribution, NEW , Failed to discover available identity versions when contacting https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13000/v3. Attempting to parse version from URL.

​​​​        <apevec> bhagyashris (IRC): marios (IRC) https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13000/v3/auth/tokens is reachable from my laptop on VPN, was it temp failure then, is it working now?
​​​​        <apevec> if still failing, can we hold the node ?
​​​​        <apevec> but not sure how we do that with an OVB node?
​​​​        <apevec> this is failing on OC nodes?

​​​​        <apevec> <bhagyashris> "https://sf.hosted.upshift.rdu2...." <- hmm in this case cloud-init failed b/c > [  224.941268] cloud-init[1292]: Failed to start openstack-bmc-baremetal-81610_3.service: Unit not found.
​​​​        <apevec> marios (IRC): which machine's console is what we see bmc-console.log ? It's CentOS 7 ??
​​​​        <apevec> CentOS Linux 7 (Core)
​​​​        <apevec> Kernel 3.10.0-1127.10.1.el7.x86_64 on an x86_64
​​​​        <apevec> and using public centos mirrors: bmc-81610 login: [   54.231122] cloud-init[1292]: * base: centos.mirrors.hoobly.com
​​​​        <apevec> [   54.232969] cloud-init[1292]: * centos-ceph-nautilus: mirror.steadfastnet.com
​​​​        <apevec> [   54.233245] cloud-init[1292]: * centos-nfs-ganesha28: mirror.siena.edu
​​​​        <apevec> [   54.234583] cloud-init[1292]: * centos-openstack-stein: centos.hivelocity.net
​​​​        <apevec> [   54.235488] cloud-init[1292]: * centos-qemu-ev: mirror.umd.edu
​​​​        <apevec> [   54.236472] cloud-init[1292]: * epel: forksystems.mm.fcix.net
​​​​        <apevec> [   54.238592] cloud-init[1292]: * extras: mirror.umd.edu
​​​​        <apevec> [   54.239339] cloud-init[1292]: * updates: mirror.datto.com
​​​​        <apevec> then using https://trunk.rdoproject.org/centos7/current/

​​​​        <apevec> after this keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13000/v3/auth/tokens: ('Connection aborted.', error(104, 'Connection reset by peer'))
​​​​        <apevec> it continues like error didn't happen, should probably stop, are those systemd unit files generated on the fly?
​​​​        <apevec> [  224.790606] cloud-init[1292]: + systemctl daemon-reload
​​​​        <apevec> [  224.887689] cloud-init[1292]: + systemctl enable config-bmc-ips
​​​​        <apevec> [  224.901780] cloud-init[1292]: Failed to execute operation: No such file or directory
​​​​        <apevec> [  224.902855] cloud-init[1292]: + systemctl start config-bmc-ips
​​​​        <apevec> [  224.907713] cloud-init[1292]: Failed to start config-bmc-ips.service: Unit not found.
​​​​        <marios|call> apevec: yeah the bmc is still in c7 
​​​​        <apevec> sigh
​​​​        <apevec> that's unsupported ;)
​​​​        <apevec> I mean really, OSC must be old, also it should retry few times
​​​​        <apevec> https://trunk.rdoproject.org/centos7/current/ is 2020-04-13
​​​​        <apevec> in any case, bhagyashris (IRC) do we still see that failure or is intermittent ?
​​​​        <apevec>  * in any case, bhagyashris (IRC) do we still see that failure or is it random ?
​​​​        <apevec> I still don't have a clear case to report to PSI ops

​​​​        <apevec> before I start looking deeper into OVB code, is stable/2.0 the branch currently in use, based on C7 ?
​​​​        <apevec> and new dev is in master, based on CS9 ?
​​​​        <apevec> (while at it, what are the current blockers to move OVB to CS9  ?)
~~~
  • Component line:
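The resolv.conf comparison in the transcript above can be scripted: parse the nameservers out of the node's resolv.conf and flag any that are not in the resolver list pnt-infra quoted. A sketch; the known-good set below is just the two RDU2 resolvers mentioned in the chat, which may not be authoritative:

```python
# Sketch: flag nameservers outside the RDU2 resolver list quoted above.
# KNOWN_GOOD is an assumption based on the pnt-infra thread, not authoritative.
KNOWN_GOOD = {"10.11.5.160", "10.11.5.19"}


def parse_nameservers(resolv_conf_text):
    """Extract nameserver IPs from resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


# resolv.conf contents seen in the job log above:
sample = "nameserver 10.11.5.19\nnameserver 10.5.30.45\n"
suspect = [s for s in parse_nameservers(sample) if s not in KNOWN_GOOD]
print(suspect)  # ['10.5.30.45']
```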

Oct 17


Oct 14

