
Ruck and rover notes #28

tags: ruck_rover

Important links for ruck/rovers: ruck/rover links to help
Ruck Rover - Unified Sprint 28
Dates: May 28 - June 17

Tripleo CI team ruck|rover: Folco (rfolco) / Pooja (pojadhav)
OSP CI team ruck|rover: Vadim (vgriner), Waldemar (wznoinsk)

Previous notes: https://hackmd.io/2MdkNAUuT7aBcM0Yck4xnw
Next #29 notes: https://hackmd.io/XcuH2OIVTMiuxyrqSF6ocw


on-going issues

TripleO

https://bugs.launchpad.net/tripleo/+bug/1883430
https://bugs.launchpad.net/tripleo/+bug/1883439

gate

https://bugs.launchpad.net/tripleo/+bug/1883909
https://review.opendev.org/#/c/736183/

https://bugs.launchpad.net/tripleo/+bug/1883910

RDO CI

  • ussuri:
    • full-tempest-scenario: failing in different tests
    • all ovb: timeout/node failure
  • master keeps failing same jobs, see history

OSP

  • OSP17 still without attention (!) because of fires in OSP<16
  • outage of tlv labs is over
    • jenkins back online
    • queue is big again after not running since yesterday (expected)
    • there are issues with jenkins failing to connect to tlv located slaves
      • affects qe-generic-tlv-01..03 slaves
      • affects seal slaves (eg seal47 used as extra hw for phase1, non-critical)
      • it shows abort/cancellation + ioexceptions in log as http://pastebin.test.redhat.com/876490
      • seems that the issue is network related, although our tests (ping/mtu/stability) have so far come up empty (all seems to be working as usual)
      • wiping jar cache, as reconfiguring slaves in jenkins had no effect

Add dates in descending order so the latest date is at the top. Break out TripleO and OSP sections.

Reviews / Fixes

PATCHES
  1. https://review.opendev.org/734112 Fix image_sanity check
  2. https://review.opendev.org/733699 Fix periodic condition - sanity
  3. https://review.opendev.org/#/c/730763/ train image build nv
  4. https://review.opendev.org/733676 cirros 0.5.1 by default
  5. https://review.opendev.org/#/c/733170 enable networksecgrouptest
  6. https://review.opendev.org/732420 ipv6 skip list
  7. https://review.opendev.org/#/c/733114 pin dib
  8. https://review.opendev.org/732618 fix c8 image builds
  9. https://review.rdoproject.org/r/27724 fix fs035 train timeouts
  10. https://review.rdoproject.org/r/#/c/27901 scen10 == fs062
  11. https://review.opendev.org/#/c/732464 scenario010 nv
  12. ~~https://review.rdoproject.org/r/#/c/27845/ fix image sanity in ~~
  13. https://review.opendev.org/#/c/733659/ py3 c7

Launchpad Bugs Reported

BUGS
| Bugzilla | Name | Status | Review |
| --- | --- | --- | --- |
| 1878190 | periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master job is consistently failing because some tempest tests are failing | Triaged | 727192 |

Bugs w/ CI tags (ci, alert, promotion-blocker)
https://tinyurl.com/ycnkznfh

June 15th

TripleO

  • train

    • scen10
    • fs020
  • master

    • scen10-ovn
    • tempest-skipped
    • fs020

June 12th

OSP

  • rhos-qe-jenkins queue is too big (>200 jobs)
  • OSP16
    • RHOS-16.1-RHEL-8-20200610.n.0 promoted phase2, phase3 started
    • there is new RHOS-16.1-RHEL-8-20200611.n.0
      • two DFG-octavia jobs failed, tvignaud already retriggered them
        • they failed somewhere in OC deploy (not investigated, not doing so now/today-friday)

June 11th

Tripleo

OSP

  • OSP13z12 some p3 still in progress (seems some reruns too)
  • OSP16.1 20200610.n.0 compose passed p1, has multiple failures in p2 (single job so far)
    • https://projects.engineering.redhat.com/browse/RHOSINFRA-3315 (rare flake; we likely understand it now and will attempt a fix)
    • https://projects.engineering.redhat.com/browse/RHOSINFRA-3266 (long-standing flake, expected to be solved by the rhel-8.2 upgrade of slaves)
    • psedlak: all phase2 jobs passed
      • after individual rerun due to the issues above
      • so manual rerun of phase2-multijob with REEVALUATE+PROMOTE option is needed to promote/trigger p3
      • but holding back with promotion:
        • p3 multijobs atm have a throttling limit of 36 hours (can run again ~1am Friday UTC; this will also be dropped/changed in future)
        • a lot of p3 is still in progress for the previous compose (and at least some rely on the passed_phase2 symlink atm, to be fixed by improving how UMB triggering works, RHOSINFRA-3485)
        • also there are still 150 jobs in the queue (it currently affects gates and such too)
        • I plan to trigger promoting+reevaluation on Friday morning BRQ time
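
The 36-hour throttle noted above is plain date arithmetic. A minimal sketch, assuming a hypothetical helper `next_allowed_run` and a last trigger time back-calculated from the "~1am Friday UTC" estimate (the real trigger time is not recorded in these notes):

```python
from datetime import datetime, timedelta, timezone

# Throttling window for p3 multijobs, as stated in the notes.
THROTTLE = timedelta(hours=36)


def next_allowed_run(last_trigger: datetime) -> datetime:
    """Return the earliest time a throttled multijob may run again."""
    return last_trigger + THROTTLE


# Assumption: last p3 trigger was around Wed 2020-06-10 13:00 UTC
# (back-calculated from the estimate in the notes, not a recorded value).
last = datetime(2020, 6, 10, 13, 0, tzinfo=timezone.utc)
print(next_allowed_run(last).isoformat())
```

If the throttle is dropped or changed as planned, only `THROTTLE` would need to change.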

June 10th

TripleO

June 9th

Tripleo

OSP

  • new composes for OSP13 and 16.1: p1/p2 in progress, check results on Wednesday

June 8th

Tripleo

June 5th

Tripleo

c7 py2 jobs broken >> https://review.opendev.org/#/c/726579 REVERTED
ussuri container build >> https://review.opendev.org/#/c/733790

master

ussuri

  • image build
  • container build
details
Failed to open connection to "system" message bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
....
2020-06-05 09:28:47.339 |   Installing       : ansible-pacemaker-1.0.4-0.20200526160932.5847167   304/317Error unpacking rpm package ansible-pacemaker-1.0.4-0.20200526160932.5847167.el8.noarch
2020-06-05 09:28:47.345 |
2020-06-05 09:28:47.346 |   Installing       : crudini-0.9.3-1.el8.noarch                         305/317
2020-06-05 09:28:47.346 | error: unpacking of archive failed on file /usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eda101c: cpio: open failed - Inappropriate ioctl for device
2020-06-05 09:28:47.346 | error: ansible-pacemaker-1.0.4-0.20200526160932.5847167.el8.noarch: install failed
2020-06-05 09:28:47.346 |
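
To spot which package broke a build in a long log like the one above, the rpm "install failed" lines can be pulled out with a regular expression. This is only a sketch: `failed_packages` is a hypothetical helper, and the pattern is an assumption based on the two error lines quoted in these notes.

```python
import re
from typing import List

# Matches rpm lines of the form:
#   error: <name-version-release.arch>: install failed
INSTALL_FAILED = re.compile(r"error: (?P<pkg>\S+): install failed")


def failed_packages(log_text: str) -> List[str]:
    """Return packages reported as 'install failed', in order of appearance."""
    return [m.group("pkg") for m in INSTALL_FAILED.finditer(log_text)]


# Sample taken from the log excerpt above.
sample = (
    "2020-06-05 09:28:47.346 | error: unpacking of archive failed on file "
    "/usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eda101c: "
    "cpio: open failed - Inappropriate ioctl for device\n"
    "2020-06-05 09:28:47.346 | error: ansible-pacemaker-1.0.4-0.20200526160932."
    "5847167.el8.noarch: install failed\n"
)
print(failed_packages(sample))
```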

OSP

  • OSP17 still without attention (!) because of fires in OSP<16
  • foreign jobs still invading p1/p2 views
  • OSP16.1
  • still to follow up from yesterday:
    • osp13 two red (one packstack one ospd)
    • osp15 RED phase1, RED/Yellow octavia in phase2
      • latest 15 build seems old RHOS_TRUNK-15.0-RHEL-8-20200520.n.0
      • so maybe not new issues, but i do not see these in CIX board
      • job statuses are 15 or 6 days old, so these are probably just safety reruns exposing an infra issue and not a product one (but i do not see them passed for this puddle in history)
      • investigation/rerun definitely needed (but priority of other osp?)

June 4th

Tripleo

OSP

  • OSP17 still without attention (!) because of fires in OSP<16
  • foreign jobs still invading p1/p2 views
  • psedlak: what is the overall status? (once we sync up keep just the one with issues)
    • osp10 all blue
    • osp12 tab is empty (should be removed?)
    • tkorol: osp13 two red (one packstack one ospd)
    • osp14 empty p1/p2 section
    • psedlak: osp15 RED phase1, RED/Yellow octavia in phase2
      • latest 15 build seems old RHOS_TRUNK-15.0-RHEL-8-20200520.n.0
      • so maybe not new issues, but i do not see these in CIX board
      • job statuses are 15 or 6 days old, so these are probably just safety reruns exposing an infra issue and not a product one (but i do not see them passed for this puddle in history)
      • investigation/rerun definitely needed (but priority of other osp?)
    • osp16.0 all blue (>2weeks)
    • osp16.1 blue
    • osp17 phase1 is RED, p2 not run yet
    • puddle-status indicates only 16.1-p2 and 17 not promoted?
    • infra-monitor-job is having quite a few issues, but they are not new and seem to have been there for some time (maybe related to pre-testing tlv2 slaves move?)

June 3rd

Tripleo

RDO

promoted train and ussuri

master

June 2nd

Tripleo

June 1st

Tripleo

scenario004 / 001 master busted by
https://bugs.launchpad.net/tripleo/+bug/1881670

  • revert up

scenario010 https://review.opendev.org/#/c/732464/1

  • moving to non-voting until fixed

manually promoted ussuri.. -> taking it out of loop on promoter server as it's busted.


rfolco notes:

testproject

master

train

ussuri
(mostly green except for)


also see
https://logserver.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-scenario010-standalone-train/7b60b68/logs/undercloud/var/log/containers/neutron/server.log.txt.gz

-06-01 05:53:11.663 32 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=4): PgAclAddCommand(direction=to-lport, log=False, name=[], may_exist=False, entity=pg_d8d3c626_ef73_4b41_9f6a_4503278c5312, priority=1002, action=allow-related, external_ids={'neutron:security_group_rule_id': 'c7029504-a1a9-42d4-9247-a460fdfdb4cf'}, match=outport == @pg_d8d3c626_ef73_4b41_9f6a_4503278c5312 && ip4 && ip4.src == $pg_d8d3c626_ef73_4b41_9f6a_4503278c5312_ip4, severity=[]) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-06-01 05:53:11.668 36 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node 8689956c-4f66-404a-ad4a-11ec99f1fcd5 (host: standalone.localdomain) handling event "create" for row ac4a1a42-ae6d-4708-9a8b-7e9655ff3000 (table: ACL) notify /usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:462
2020-06-01 05:53:11.669 36 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node 8689956c-4f66-404a-ad4a-11ec99f1fcd5 (host: standalone.localdomain) handling event "create" for row f9302631-b1c2-4473-9f52-e98bc5660ace (table: ACL) notify /usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:462
2020-06-01 05:53:11.669 34 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node e5210224-234a-4070-a5a3-282594bdc96e (host: standalone.localdomain) handling event "create" for row 8a9c4091-4aff-439f-8aa9-fc32d9d28cf7 (table: ACL) notify /usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:462
2020-06-01 05:53:11.670 36 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node 8689956c-4f66-404a-ad4a-11ec99f1fcd5 (host: standalone.localdomain) handling event "create" for row 10004d08-787e-4e30-a623-74e8a5c2394d (table: ACL) notify /usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:462
2020-06-01 05:53:11.671 36 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node 8689956c-4f66-404a-ad4a-11ec99f1fcd5 (host: standalone.localdomain) handling event "create" for row 7ca26059-bbc4-4f57-9b0e-e8e6c257466c (table: Port_Group) notify /usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:462
2020-06-01 05:53:11.690 32 INFO networking_ovn.db.revision [req-40f346c8-bfeb-4e3f-b42f-96540da554f3 3c550cf5718d489e899d2b974d076c59 c3bac0775f3f4f709b305f72cf217853 - default default] Successfully bumped revision number for resource d8d3c626-ef73-4b41-9f6a-4503278c5312 (type: security_groups) to 1
2020-06-01 05:53:11.704 32 DEBUG oslo_concurrency.lockutils [req-5ef76d30-94e8-46aa-82d3-08631918685e 3c550cf5718d489e899d2b974d076c59 c3bac0775f3f4f709b305f72cf217853 - - -] Lock "event-dispatch" acquired by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:327
2020-06-01 05:53:11.746 32 INFO neutron.pecan_wsgi.hooks.translation [req-40f346c8-bfeb-4e3f-b42f-96540da554f3 3c550cf5718d489e899d2b974d076c59 c3bac0775f3f4f709b305f72cf217853 - default default] POST failed (client error): There was a conflict when trying to complete your request.

source: https://opendev.org/openstack/openstack-ansible-os_tempest/src/branch/master/tasks/tempest_resources.yml#L146

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-7-scenario010-standalone-train&pipeline=openstack-periodic-24hr

Issue reported in Launchpad: https://bugs.launchpad.net/tripleo/+bug/1881584

May 30th

Tripleo

tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053 failing continuously
https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053

2020-05-30 06:22:04.768735 | primary | TASK [repo-setup : Get DLRN hash - passed tag - component-based] ***************
2020-05-30 06:22:04.768795 | primary | Saturday 30 May 2020  06:22:04 +0000 (0:00:00.083)       0:00:19.015 **********
2020-05-30 06:22:05.556357 | primary | fatal: [undercloud]: FAILED! => {
2020-05-30 06:22:05.558103 | primary |     "changed": true,
2020-05-30 06:22:05.558392 | primary |     "cmd": "set -euo pipefail\ndlrn_base=https://trunk.rdoproject.org/centos7-master\nif [ -e /etc/ci/mirror_info.sh ]; then\n  source /etc/ci/mirror_info.sh\n  NODEPOOL_RDO_PROXY=${NODEPOOL_RDO_PROXY:-https://trunk.rdoproject.org}\n  dlrn_base=${dlrn_base/https:\\/\\/trunk.rdoproject.org/$NODEPOOL_RDO_PROXY}\nfi\ncurl -s --fail --show-error ${dlrn_base}/current-tripleo/delorean.repo.md5\n",
2020-05-30 06:22:05.558433 | primary |     "delta": "0:00:00.318829",
2020-05-30 06:22:05.558497 | primary |     "end": "2020-05-30 06:22:05.541197",
2020-05-30 06:22:05.558536 | primary |     "rc": 22,
2020-05-30 06:22:05.558579 | primary |     "start": "2020-05-30 06:22:05.222368"
2020-05-30 06:22:05.558589 | primary | }
2020-05-30 06:22:05.558599 | primary |
2020-05-30 06:22:05.558613 | primary | STDERR:
2020-05-30 06:22:05.558622 | primary |
2020-05-30 06:22:05.558663 | primary | curl: (22) The requested URL returned error: 404 Not Found
2020-05-30 06:22:05.558673 | primary |
2020-05-30 06:22:05.558682 | primary |
2020-05-30 06:22:05.558693 | primary | MSG:
2020-05-30 06:22:05.558703 | primary |
2020-05-30 06:22:05.558723 | primary | non-zero return code
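
The failing task builds its URL from `dlrn_base` and an optional proxy before calling curl. A minimal sketch of that URL construction, with no network request; `dlrn_md5_url` and the example proxy host are hypothetical names used for illustration only. Which URL the job actually hit depends on the node's mirror configuration, which the log does not show.

```python
from typing import Optional

# Host that the task's shell snippet swaps out when a proxy is configured.
DEFAULT_HOST = "https://trunk.rdoproject.org"


def dlrn_md5_url(dlrn_base: str, proxy: Optional[str] = None,
                 tag: str = "current-tripleo") -> str:
    """Build the delorean.repo.md5 URL the way the task's shell command does."""
    if proxy:
        # Like bash ${dlrn_base/pattern/replacement}: first occurrence only.
        dlrn_base = dlrn_base.replace(DEFAULT_HOST, proxy, 1)
    return f"{dlrn_base}/{tag}/delorean.repo.md5"


# Without a proxy, this is the URL that would have returned the 404.
print(dlrn_md5_url("https://trunk.rdoproject.org/centos7-master"))
# Hypothetical proxy host, for illustration only.
print(dlrn_md5_url("https://trunk.rdoproject.org/centos7-master",
                   proxy="http://mirror.example"))
```

Requesting the printed URL by hand is a quick way to check whether the `current-tripleo` tag exists for that release.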

gate issue solved

May 29th

Tripleo

build image issue (fs002):
fix https://review.opendev.org/#/c/731823
test https://review.rdoproject.org/r/27845 Test 731823

fs035 ussuri 3rd party:
https://review.rdoproject.org/r/27846 Add fs035 (ussuri) 3rd party job to layout

  • Gate:

    • tripleo-ci-centos-7-standalone-upgrade-train failed twice with the same error:
    2020-05-29 05:26:40 | 2020-05-29 05:26:40.206 139137 INFO osc_lib.shell [-] command: tripleo upgrade -> tripleoclient.v1.tripleo_upgrade.Upgrade (auth=False)
    2020-05-29 05:26:40 | 2020-05-29 05:26:40.209 139137 ERROR tripleoclient.v1.tripleo_upgrade.Upgrade [-] User interaction required, cannot confirm.
    2020-05-29 05:26:40 | 2020-05-29 05:26:40.210 139137 ERROR openstack [-] User did not confirm upgrade, so exiting. Consider using the --yes parameter if you prefer to skip this warning in the future: UndercloudUpgradeNotConfirmed: User did not confirm upgrade, so exiting. Consider using the --yes parameter if you prefer to skip this warning in the future
    2020-05-29 05:26:40 | 2020-05-29 05:26:40.210 139137 INFO osc_lib.shell [-] END return value: 1
    

    https://zuul.openstack.org/builds?pipeline=gate&job_name=tripleo-ci-centos-7-standalone-upgrade-train

    https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_597/727889/1/gate/tripleo-ci-centos-7-standalone-upgrade-train/597b47b/logs/undercloud/home/zuul/standalone_upgrade.log

    Bug reported: https://bugs.launchpad.net/tripleo/+bug/1881306

    Fix: https://review.opendev.org/#/c/731782/

  • RDO CI Failures:

    • Ussuri - periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-ussuri consistently failing with the error below
    2020-05-28 17:26:34.276657 | primary | libguestfs: trace: set_verbose true
    2020-05-28 17:26:34.276695 | primary | libguestfs: trace: set_verbose = 0
    2020-05-28 17:26:34.276733 | primary | libguestfs: trace: set_memsize 2048
    2020-05-28 17:26:34.276770 | primary | libguestfs: trace: set_memsize = 0
    2020-05-28 17:26:34.276808 | primary | libguestfs: trace: set_smp 2
    2020-05-28 17:26:34.276844 | primary | libguestfs: trace: set_smp = 0
    2020-05-28 17:26:34.277414 | primary | libguestfs: trace: set_network true
    2020-05-28 17:26:34.277479 | primary | libguestfs: trace: set_network = 0
    2020-05-28 17:26:34.277564 | primary | libguestfs: trace: add_drive    "overcloud-full.qcow2" "readonly:false" "protocol:file" "discard:besteffort"
    2020-05-28 17:26:34.277618 | primary | libguestfs: trace: add_drive = -1 (error)
    2020-05-28 17:26:34.278384 | primary | virt-customize: error: libguestfs error: overcloud-full.qcow2: No such file
    2020-05-28 17:26:34.278417 | primary | or directory
    2020-05-28 17:26:34.278430 | primary | libguestfs: trace: close
    2020-05-28 17:26:34.278904 | primary | libguestfs: closing guestfs handle 0x55d953ea8070 (state 0)
    2020-05-28 17:26:34.278920 | primary | /bin/virt-copy-out: access: overcloud-full.qcow2: No such file or directory
    

    https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-latest-released&job_name=periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-ussuri

    https://review.opendev.org/#/c/731498/ this fix is up for the issue
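
The libguestfs trace above fails because overcloud-full.qcow2 is not present where virt-customize looks for it. A pre-flight check along these lines would surface the missing image before libguestfs is invoked. This is only a sketch: `missing_images` is a hypothetical helper, not part of tripleo-ci, and the actual fix is in the review linked above.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def missing_images(workdir: str,
                   names: Sequence[str] = ("overcloud-full.qcow2",)) -> List[str]:
    """Return the expected image files that are absent from workdir."""
    base = Path(workdir)
    return [name for name in names if not (base / name).is_file()]


# Demonstration against an empty temporary directory, where the image
# is absent by construction.
with tempfile.TemporaryDirectory() as workdir:
    absent = missing_images(workdir)
    if absent:
        print("missing before virt-customize:", ", ".join(absent))
```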

OSP

May 28th (handoff)

Tripleo

OSP

Completed Items
