
Ruck and rover notes #26

tags: ruck_rover

ruck/rover primer: https://docs.openstack.org/tripleo-docs/latest/ci/ruck_rover_primer.html

Infrared gerrit: https://review.gerrithub.io/q/project:redhat-openstack/infrared

Infrared doc: https://infrared.readthedocs.io/en/latest/

Cockpit: http://tripleo-cockpit.usersys.redhat.com/d/9DmvErfZz/cockpit?orgId=1

Internal Cockpit (WIP) http://tripleo-cockpit.usersys.redhat.com/?orgId=1
http://cistatus.tripleo.org/
https://trello.com/b/j4IcIomh/production-chain-escalation
http://rhos-release.virt.bos.redhat.com:3030/rhosp

Debugging Tools https://docs.google.com/document/d/1VZhje7ZN9sk4E31fYVrPxpqMJGz5ZhHRfhte_RYMXxg/edit#

Review.rdoproject.org dashboard: https://review.rdoproject.org/grafana/?orgId=1&var-datasource=default&var-server=registry.rdoproject.org.rdocloud&var-inter=$__auto_interval_inter

CentOS pre-release rpm updates for minor releases http://mirror.centos.org/centos/7/cr/x86_64/Packages/

hackmd.io rh-openstack-dev
https://hackmd.io/team/rh-openstack-ci?nav=overview

Internal software factory: https://sf.hosted.upshift.rdu2.redhat.com

upstream rsync mirror logs: files.openstack.org/mirror/logs/rsync-mirrors/centos.log

TRELLO RETROSPECTIVE https://trello.com/b/0VFswmht/rdo-infra-retrospective?menu=filter&filter=label:UniSprint21

Internal Dashboard - https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16/ OSP-10 - OSP-16

RHOS INFRA INFRARED ISSUES https://projects.engineering.redhat.com/issues/?filter=34183

CIX escalation https://mojo.redhat.com/docs/DOC-1098748#jive_content_id_CIX_Escalation_Automation_and_email_format

CIX board https://trello.com/b/j4IcIomh/production-chain-escalation

Nodepool image logs: https://softwarefactory-project.io/nodepool-log/

We may want to move this etherpad to something internal at this point

please add your (colored) name here: time to move to hackmd WDYT? +1 (either now - start of the sprint/rr - or in 3 weeks)
marios (baby blue) fhubik("green lantern") wznoinsk (orange)Amnon(Marrooned)

POST BELOW THIS

Dates: April 16 - May 7
Tripleo CI team ruck|rover: Gabriele (panda) && Amol (akahat)
OSP CI team ruck|rover (April 24 - May 15): Filip (fhubik), Vadim (vgriner)

Previous notes: link

Issues to track on-going

put these issues in the spoiler.

tripleo

@akahat FYI: @arxcruz is investigating the tempest failures in stein.

@TheG Please work with the networking team to bring https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-scenario010-ovn-provider-standalone online.

CentOS-7 OVB jobs are RED fs001
https://bugs.launchpad.net/tripleo/+bug/1875731
https://bugs.launchpad.net/tripleo/+bug/1876972
TRAIN: GREEN
STEIN: Tempest fail ( arx is looking at it )
ROCKY: Tempest fail ( @arxcruz FYI)
QUEENS: Tempest fail ( @arxcruz FYI)

Thank you!

OSP

Bugzillas Reported

| Bugzilla | Name | Status | Review |
|---|---|---|---|
| 1873770 | OVB fs001 in centos8 master fails to push certificates contents to controllers | Incomplete | |
| 1873892 | Non root login prevented on overcloud machines | Fix Released | |
| 1874019 | scenario009-multinode.yaml and openshift.yaml is missing | In Progress | |
| 1875352 | keystone container failed to start in scenario000 | Triaged | |
| 1875871 | periodic rocky jobs failing with missing name argument for pcs | Triaged | |
| 1875846 | Overcloud stack creation failed because of failed dependencies. | Closed | |
| 1875833 | The WebSocket timed out before the Workflow completed in rocky/stein jobs | New | |
| 1876087 | Queens, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps failing. Timeout | Triaged | |
| 1876096 | Queens: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern tests failed | Triaged | |
| 1876672 | Python 2 - AttributeError: 'module' object has no attribute 'get_makefile_name' | Fix Released | |
| 1876893 | Error: error removing container - device or resource busy | In Progress | |
| 1877031 | queens tripleo-ci-centos-7-undercloud-upgrades broken for ansible version | | |

8 May

TripleO

OSP

holiday in CZ (fhubik), off day in TLV (vgriner)

7 May

TripleO

OSP

6 May

TripleO

  • Tripleo
    • 1877031: queens tripleo-ci-centos-7-undercloud-upgrades broken for ansible version

OSP

  • New OSP16+16.1 contents going through CI
  • still digging out reports from older puddles, especially p2 results
  • UMB issues, manually triggering, manual promoting
    • conditions for p2->p3 triggering not clear still
  • current passed_phase2 links: 16: 20200427.n.0 / 16.1: 20200428.n.0

5 May

TripleO

OSP

4 May

TripleO

  • 1876672: Python 2 - AttributeError: 'module' object has no attribute 'get_makefile_name'
    • Affects CentOS7 jobs.

OSP

1 May

TripleO

Pacemaker patch for queens, rocky
https://review.rdoproject.org/r/#/c/27035/

Latest patch for ssh deployment failures:
https://review.opendev.org/#/c/723824/

containers-multinode Featureset010 is failing in queens, stein and rocky jobs. There is only one tempest test configured in that job.
Since there is only one tempest test, instead of skipping it I have to turn tempest off completely.
Arx, Soniya: I need your help to diagnose the test failure and file bugs for queens, rocky and stein if it appears to be unique per release.
We have queens: https://bugs.launchpad.net/tripleo/+bug/1876087
We need bugs on:
Stein: https://logserver.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-stein/05b9be9/logs/tempest.html.gz
Rocky: https://logserver.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-rocky/e5f9368/logs/tempest.html.gz

These all could be related, but we need to confirm that of course.  Thank you for interrupting your schedule to help check these out.

1st May

Tripleo

Investigating periodic queens failures

  • Validate tempest failures:
    • periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset030-queens(1/1 ssh test failing)
    • periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-queens (22/48 test failing)
    • periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens (1/1 ssh test failing)
    • periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens (1/1 ssh test failing)
  • Validate tempest timeout
    • periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-queens
  • Deployment failed
    • periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset037-updates-queens (failure running haproxy_init_bundle container)

reviewing to unblock queens
https://review.opendev.org/724703 (still failing, network hiccups)

OSP

holiday in CZ (fhubik), off day in TLV (vgriner)

30 April

Tripleo

  • Promotion blocker
    • 1876087: Network tests are failing on queens jobs
    • 1876073: Zuul CI is giving false positive on role-addition and molecule consistently
    • 1876096: Queens: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern tests failed

OSP

  • octavia issues - another patch being tested
  • Octavia meetings to explain the situation are ongoing
  • OSP16.1 passed_phase2 link finally created, report sent
    • release status clarified
  • OSP16.0 multijobs finally passed as well
    • TODO mjob still broken before
  • still no work on 17

29th April

Tripleo

OSP

  • resolving situation around 16 and 16.1 related to Octavia jobs
  • product platform meeting doc updated with details
    • gathering data from Jenkins not trivial
  • Still no time for 17
  • more details below, under the previous day

28th April

Tripleo

OSP

fhubik dealing with foreign jobs invading p2 for no reason

27th April

Tripleo

OSP

  • Phase 3 of OSP16.1 was triggered with puddle RHOS-16.1-RHEL-8-20200424.n.0, which passed phase2.
  • many OSP13 jobs failed on a patch for OSP16.1 that changed the product version from int to float.
    • had to revert the change.
    • still need to resolve the problem on float vs int product version in Infrared.
  • 15 fine, 16 fine, 16.1 octavia issues - retriggering
  • asked RelDel to look at 17 p1 issues
  • maybe help needed in osp16.1 l3 ha esca debug?
  • R&R handover info afazekas -> fhubik
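The int vs float product version problem noted above can be illustrated with a minimal sketch. It assumes versions arrive as ints, floats, or strings such as 13, 16.1, or "16.1"; the helper name `parse_version` is hypothetical and is not Infrared's actual code.

```python
def parse_version(value):
    """Turn 13, "16", 16.1 or "16.1" into a comparable (major, minor) tuple.

    Hypothetical helper for illustration only, not part of Infrared.
    """
    parts = str(value).split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


# Integer and float inputs normalise to the same shape, so a job written
# for OSP13 (int) and one written for OSP16.1 (float) can share one check.
print(parse_version(13))      # int input
print(parse_version(16.1))    # float input
print(parse_version("16.1"))  # string input

# A float cannot tell "16.10" from "16.1"; a tuple of ints can.
print(float("16.10") == float("16.1"))
print(parse_version("16.10") > parse_version("16.2"))
```

Comparing tuples of ints keeps existing integer-only releases working while still allowing a minor component, which is one possible way to avoid the kind of breakage seen on the OSP13 jobs.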

24th April

Tripleo

  • master centos8 failing to build containers because of a network issue while downloading repo info for openvswitch

  • tempest component still failing on the IP allocation error. Tempest guys are aware of the CIX bug there but are joining the investigation

  • Promoter raises an error during promotion, but the promotion is not affected. Investigating promotion code

    • Root cause is a mismatch between api request and response expectation. Promotion actually takes place but we identify the response as an error.
  • Stein is 2 days behind. looking at logs.

  • Need to watch

    • periodic-tripleo-ci-centos-8-scenario010-ovn-provider-standalone-master
    • Check whether this job fails again on the image upload.
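The promoter root cause noted above (the promotion succeeds but the response is read as an error) could arise in several ways. The sketch below shows one assumed failure mode, where the response carries a field the request left empty, and a comparison restricted to identifying fields. The field list, the shortened hash values, and the response shape are all assumptions for illustration, not the promoter's actual code or the DLRN API's documented output.

```python
# Fields assumed to identify a promoted hash; a hypothetical choice.
IDENTITY_FIELDS = ("aggregate", "commit", "distro", "component")


def same_hash(candidate, promoted, fields=IDENTITY_FIELDS):
    """Return True when both dicts agree on every identifying field."""
    return all(candidate.get(f) == promoted.get(f) for f in fields)


# Placeholder values, shortened for readability.
candidate = {"aggregate": "b372", "commit": "5083", "distro": "99ac",
             "component": "ui", "timestamp": None}
# Assumed response: same identity, but the server filled in a timestamp.
promoted = dict(candidate, timestamp=1587575687)

# A whole-dict comparison reports a mismatch although the promotion worked.
print(candidate == promoted)
# Comparing only the identifying fields accepts the response.
print(same_hash(candidate, promoted))
```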

23rd April

Tripleo

  • Waived master featureset20 job to get a promotion. It fails in three known tempest tests

22nd April

Tripleo

  • Jobs:

    • periodic-tripleo-ceph-integration-rhel-8-scenario{001,004}-standalone

    Replaced with

    Jobs need to watch:

    • periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master
      (Tempest failure. Failed with exception: "Request Timeout")

    Bugs:

promotion issues

2020-04-22 17:14:47,863 25243 ERROR promoter Candidate hash 'aggregate: b3720367a6a0349abcfb06939bed3101, commit: 50837618bdbc4ee18ba25da00a4d98cae9744d68, distro: 99ace58fa85ff53a3de0c282131df46336f81d66, component: ui, timestamp: None': client dlrn_client FAILED promotion attempt to current-tripleo
2020-04-22 17:14:47,863 25243 ERROR promoter API returned different promoted hash
Traceback (most recent call last):
  File "/home/centos/ci-config-refactored/ci-scripts/dlrnapi_promoter/logic.py", line 140, in promote
    candidate_label=candidate_label)
  File "/home/centos/ci-config-refactored/ci-scripts/dlrnapi_promoter/dlrn_client.py", line 359, in promote
    candidate_label=candidate_label)
  File "/home/centos/ci-config-refactored/ci-scripts/dlrnapi_promoter/dlrn_client.py", line 550, in promote_hash
    raise PromotionError("API returned different promoted hash")
PromotionError: API returned different promoted hash
2020-04-22 17:14:47,866 25243 ERROR promoter Error while trying to promote tripleo-ci-testing to current-tripleo
2020-04-22 17:14:47,866 25243 WARNING promoter Candidate label 'tripleo-ci-testing': NO candidate hash promoted to current-tripleo
2020-04-22 17:14:47,866 25243 INFO promoter Candidate label 'current-tripleo': Attempting promotion to 'current-tripleo-rdo'
2020-04-22 17:14:48,810 25243 INFO promoter Candidate label 'current-tripleo': Fetched 10 hashes
2020-04-22 17:14:49,527 25243 WARNING promoter Target label 'current-tripleo-rdo': No hashes fetched. This could mean that the target label is new or it's the wrong label
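When scanning a promoter log like the one above, it can help to pull out only the ERROR and WARNING records and skip traceback lines. The following is a minimal sketch, assuming every record follows the `timestamp pid LEVEL logger message` layout seen in this excerpt; the names `LOG_LINE` and `problems` are hypothetical.

```python
import re

# Layout assumed from the excerpt: "YYYY-MM-DD HH:MM:SS,mmm PID LEVEL logger msg"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) (?P<msg>.*)$"
)


def problems(lines, levels=("ERROR", "WARNING")):
    """Yield (timestamp, level, message) for records at the given levels."""
    for line in lines:
        match = LOG_LINE.match(line)
        # Traceback and other continuation lines do not match and are skipped.
        if match and match.group("level") in levels:
            yield match.group("ts"), match.group("level"), match.group("msg")


sample = [
    "2020-04-22 17:14:47,863 25243 ERROR promoter API returned different promoted hash",
    "Traceback (most recent call last):",
    "2020-04-22 17:14:47,866 25243 INFO promoter Candidate label 'current-tripleo': "
    "Attempting promotion to 'current-tripleo-rdo'",
]
for ts, level, msg in problems(sample):
    print(ts, level, msg)
```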

21st April

Tripleo

Gate jobs failing:

Periodic jobs:

20th April

Tripleo

17th April

Tripleo

  • noticing the latest patches for glance https://review.opendev.org/#/c/712533/ are not consistently resolving previous scenario01/02 issues.. watching

  • @TheG tripleo-ci-centos-7-containerized-undercloud-upgrades should be voting for everything that is not master. Let's take a look.

  • @akahat Amol, please watch periodic-openstack-master and openstack-periodic-latest-released in review.rdoproject.org

    • periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-train (Known failure)
    • periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-train (Known failure)
    • periodic-tripleo-rhel-8-train-containers-build-push failure (non-voting)