# Ruck and rover notes #25
###### tags: `ruck_rover`
## INFO AND LINKS
:::spoiler
ruck/rover primer: https://docs.openstack.org/tripleo-docs/latest/ci/ruck_rover_primer.html
Infrared gerrit: https://review.gerrithub.io/q/project:redhat-openstack/infrared
Infrared doc: https://infrared.readthedocs.io/en/latest/
Cockpit: http://tripleo-cockpit.usersys.redhat.com/d/9DmvErfZz/cockpit?orgId=1
Internal Cockpit (WIP) http://tripleo-cockpit.usersys.redhat.com/?orgId=1
http://cistatus.tripleo.org/
https://trello.com/b/j4IcIomh/production-chain-escalation
http://rhos-release.virt.bos.redhat.com:3030/rhosp
Debugging Tools https://docs.google.com/document/d/1VZhje7ZN9sk4E31fYVrPxpqMJGz5ZhHRfhte_RYMXxg/edit#
Review.rdoproject.org dashboard: https://review.rdoproject.org/grafana/?orgId=1&var-datasource=default&var-server=registry.rdoproject.org.rdocloud&var-inter=$__auto_interval_inter
CentOS pre-release rpm updates for minor releases http://mirror.centos.org/centos/7/cr/x86_64/Packages/
hackmd.io rh-openstack-dev
https://hackmd.io/team/rh-openstack-ci?nav=overview
Internal software factory: https://sf.hosted.upshift.rdu2.redhat.com
upstream rsync mirror logs: files.openstack.org/mirror/logs/rsync-mirrors/centos.log
TRELLO RETROSPECTIVE https://trello.com/b/0VFswmht/rdo-infra-retrospective?menu=filter&filter=label:UniSprint21
Internal Dashboard - https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16/ OSP-10 - OSP-16
RHOS INFRA INFRARED ISSUES https://projects.engineering.redhat.com/issues/?filter=34183
CIX escalation https://mojo.redhat.com/docs/DOC-1098748#jive_content_id_CIX_Escalation_Automation_and_email_format
CIX board https://trello.com/b/j4IcIomh/production-chain-escalation
Nodepool image logs: https://softwarefactory-project.io/nodepool-log/
We may want to move this etherpad to something internal at this point
please add your (colored) name here: time to move to hackmd WDYT? +1 (either now - start of the sprint/rr - or in 3 weeks)
marios (baby blue) fhubik("green lantern") wznoinsk (orange)Amnon(Marrooned)
:::
## POST BELOW THIS
:::warning
Dates: March 26 - April 15th
Tripleo CI team ruck|rover: Wes (weshay) && Sandeep ysandeep
OSP CI team ruck|rover: Attila Fazekas and Ariel Opincaru
Previous notes: [link](https://hackmd.io/7MBqFHurTA2e5H8kYRwgag?view)
:::
### Issues to track on-going
put these issues in the spoiler.
:::danger
#### tripleo
https://bugs.launchpad.net/tripleo/+bug/1872881 - Cinder volume failed to build and went to ERROR state - No valid backend was found ( Stderr: ' Volume group "cinder-volumes" not found\n)
:::
:::danger
#### OSP
- 16.0 update
this job showed up this week and is failing:
* https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16/job/phase2-16-update-from-ga-HA_no_ceph-ipv4/
- 16.x net/metadata issue
* https://trello.com/c/efUFNGmO/1435-cixbz1822201ospphase2neutronosp16networkingmetadata-temporary-outages
- 13.0 update
this job showed up this week and is failing:
* https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP13/job/phase2-13-update-from-ga-HA_no_ceph-ipv4/
* https://projects.engineering.redhat.com/browse/RHOSINFRA-3173
* https://projects.engineering.redhat.com/browse/RHOSINFRA-3174
Looks like the tests are failing on the pre-update (GA) path
- 16.1 (rhel-8.2)
- https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16.1/job/phase2-16.1_director-rhel-8.2-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph/
- are all the issues explained by: https://trello.com/c/JSaOEzLG/1437-cixbz1822305ospphase2cephradososp16osp161glance-related-test-random-failures ?
- today's puddle: more wtf -> expect more bugs
:::
### 16th April
#### Tripleo
* https://bugs.launchpad.net/tripleo/+bug/1873249 - "[master] scenario 001/002 Deployments failing with:- Failed to add dependency: Unit file tripleo_ceilometer_gnocchi_upgrade.service does not exist."
https://review.opendev.org/#/c/718545/19 caused it; we see a patch which cleans up healthchecks, https://review.opendev.org/#/c/720061/, but we're not sure if that will fix the issue - we have tagged Emilien on #tripleo to confirm
### 15th April
#### Tripleo
* **HOT PROMOTION BLOCKER affecting check/gate** https://bugs.launchpad.net/tripleo/+bug/1872881 - Cinder volume failed to build and went to ERROR state - No valid backend was found ( Stderr: ' Volume group "cinder-volumes" not found\n)
https://review.opendev.org/#/c/720132/ - Patch is up and awaiting merge, but we need to investigate the RCA to understand why we started hitting this issue now.
### 14th April
#### Tripleo
* FYI.. the RDO team has updated puppet in centos8-ussuri to puppet-6 (a major release update). They have tested it so they don't expect issues, but just in case.
* https://review.rdoproject.org/r/#/c/26418/ - we tested the revert of "only use tripleo-ansible in required-projects for train+" and it's working; testproject passed https://review.rdoproject.org/r/#/c/26415/
### 13th April
#### Tripleo
* **Promotion Blocker - Compute component promotion pipeline affected**
https://bugs.launchpad.net/tripleo/+bug/1872399 - Deployment failed because "nova_wait_for_api_service" container failed to start (nova_api_wsgi_error - ModuleNotFoundError: No module named 'dataclasses')
Patch is up - https://review.rdoproject.org/r/#/c/26402/
~~~
The dataclasses library was recently added to requirements [1] and nova is its first user [2] - so this new dep needs to be added in RDO; once added, it also needs to be added to the nova rpm spec file (workflow details here [3]).
It is only needed for Python 3.6; dataclasses has been part of the standard library since Python 3.7 (see the sketch after this block).
[1] https://github.com/openstack/requirements/commit/e7c7dbfc8e09f07ba19cb4474b13f98470ae16b7
[2] https://review.opendev.org/#/c/704643
[3] https://www.rdoproject.org/documentation/requirements/#adding-a-new-requirement-to-rdo
~~~
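For context, a minimal sketch of why the dependency is 3.6-only (hypothetical snippet, not the actual nova code; the requirements entry uses an environment marker along the lines of `dataclasses;python_version=='3.6'`):
~~~python
# On Python 3.6 this import is satisfied by the backported "dataclasses"
# package from PyPI (and, once packaged, from RDO); on 3.7+ it comes from
# the standard library, so no extra package is needed there.
import sys
from dataclasses import dataclass


@dataclass
class PromotionJob:
    name: str
    release: str = "master"


print(sys.version_info, PromotionJob(name="periodic-tripleo-ci-centos-8-standalone"))
~~~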
* Stable/stein patches for tripleo projects failing without logs:-
example: https://review.opendev.org/#/c/718728/ - RDO Third Party CI check (1 rechecks) - ERROR Unable to find role in /var/lib/zuul/builds/bccb53b5c55944b0b856dfe135ea802a/ansible/pre_playbook_7/role_1/tripleo-ansible
No bz but a patch is already up: https://review.opendev.org/#/c/718468/ merged, but it seems like we need release team involvement for deleting the stein branch - https://opendev.org/openstack/tripleo-ansible/src/branch/stable/stein - I have pinged #openstack-release (smcginnis suggested requesting the infra guys to delete that branch).
* @sandeep @wes is trying https://review.rdoproject.org/r/#/c/26415/
Chatter with smcginnis:-
~~~
ysandeep|rover> #openstack-release Hello! Need help with patch https://review.opendev.org/#/c/718468/ - this patch was regarding removal of stein branch from tripleo-ansible, patch got merged but we still see https://opendev.org/openstack/tripleo-ansible/src/branch/stable/stein - do we need any manual step needed for the cleanup?
<smcginnis> ysandeep|rover: Correct. It was noted in that commit, but not super clear. You will now need to request someone from infra delete the branch. It needed to be removed from the release deliverable first to make sure it didn't get accidentally re-added after manual deletion by the release automation.
~~~
pinged on #openstack-infra - **awaiting response from infra guys.**
* Master pcsd service not starting on overcloud nodes. @ysandeep have you seen this one?
* https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/5c2765d/logs/overcloud-controller-0/var/log/extra/failed_services.txt.gz
* https://bugs.launchpad.net/tripleo/+bug/1867602
### 9th April
#### Tripleo
* https://bugs.launchpad.net/tripleo/+bug/1871809 - periodic-tripleo-ci-rhel-8-standalone-train job failing with "Failed to parse dlrn hash"
The last successful run of this job was on 16th March; since then it's been failing --> @weshayutin do you have history on this?
~~~
http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org:8080/rdo/rhel8-train/9a/07/9a07da081ab55116e871add699d18371aeaed356_c0bb2d14/ - missing delorean.repo
~~~
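As a rough illustration of what "parse dlrn hash" amounts to here (a minimal sketch, not the tripleo-ci code; the repo URL below is illustrative and assumes the job derives the `<commit>_<distro>` hash from the `baseurl` in `delorean.repo`, which is why the missing file breaks it):
~~~python
# Hypothetical sketch: fetch delorean.repo and extract the
# "<40-char commit hash>_<8-char distro hash>" segment from its baseurl.
import re
import urllib.request

# Illustrative URL only - the actual job builds this from the mirror + release.
REPO_URL = ("http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org:8080"
            "/rdo/rhel8-train/current-tripleo/delorean.repo")


def dlrn_hash_from_repo(url):
    body = urllib.request.urlopen(url, timeout=30).read().decode()
    match = re.search(r"baseurl=.*/([0-9a-f]{40}_[0-9a-f]{8})", body)
    if not match:
        raise RuntimeError("Failed to parse dlrn hash from %s" % url)
    return match.group(1)


print(dlrn_hash_from_repo(REPO_URL))
~~~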
* https://bugs.launchpad.net/tripleo/+bug/1871818 - Intermittently tempest run fails because SSH connection to instance fails - "ERROR ovsdbapp.backend.ovs_idl.transaction - RevisionConflict: OVN revision number for * (type: ports) is equal or higher than the given resource"
~~~
Suspecting the port didn't transition to the up state and ssh to the instance failed; found one weird OVN error and it could be an OVN issue (details in the bz).
Pinged #ovn to ask if they have any pointers about the OVN error; jlibosva is checking, but we need to confirm what exactly is failing from the tempest side.
~~~
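If useful for reproducing, a rough debugging sketch (assumptions: `python-openstackclient` is available where the cloud is reachable and `ovn-nbctl` is on the controller; this is just a helper idea, not part of tempest) to compare neutron's view of a port with the `neutron:revision_number` that OVN NB stores, which is what the RevisionConflict message is about:
~~~python
# Hypothetical debugging aid: print the port status and revision_number from
# neutron next to the external_ids of the matching OVN Logical_Switch_Port.
import json
import subprocess
import sys

port_id = sys.argv[1]

neutron = json.loads(subprocess.check_output(
    ["openstack", "port", "show", port_id, "-f", "json"]).decode())
print("neutron: status=%s revision_number=%s"
      % (neutron["status"], neutron["revision_number"]))

# May need --db=<NB connection string> when run on a controller.
ovn = subprocess.check_output(
    ["ovn-nbctl", "--bare", "--columns=external_ids",
     "find", "Logical_Switch_Port", "name=%s" % port_id]).decode()
print("OVN NB external_ids:", ovn.strip())
~~~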
### 7th April 2020
#### Tripleo
* Train centos8 image build wip:
https://review.rdoproject.org/r/#/c/26285/
https://review.rdoproject.org/r/#/c/26287/
* https://bugs.launchpad.net/tripleo/+bug/1871291 - Introspection failing for OVB jobs - No nodes are manageable at this time. - **fixed**
~~~
On further checking found a metadata issue, detailed logs [1].
The last run of periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-master was on 2020-04-06. This issue seems to be a duplicate of bug [2], which was already fixed yesterday.
[ 137.351520] cloud-init[857]: 2020-04-06 07:21:38,783 - url_helper.py[WARNING]: Calling 'http://192.168.100.1/latest/meta-data/instance-id' failed [0/120s]: request error
[1] https://logserver.rdoproject.org/openstack-component-common/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-master/04e70b0/logs/bmc-console.log
[2] https://bugs.launchpad.net/tripleo/+bug/1871076
~~~
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871338 - "overcloud deployment failing with msg: 'argument parameters is of type <class ''str''> and we were unable to convert to dict: unable to evaluate string as dictionary'
**Issue determined** - https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/playbooks/cli-update-params.yaml has an issue. The issue started when the "update parameters mistral workflows" were removed 3 days earlier in https://review.opendev.org/#/c/716286/. The bz has details. Need help from DFG:DF - https://review.opendev.org/#/c/717865/ patch is up (see the sketch below).
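As a side note on the error message itself, a minimal illustration (hypothetical, not the playbook fix) of the failure mode: the `parameters` value arrives as a JSON/YAML string instead of a dict, so it has to be loaded before it can be treated as one:
~~~python
# Hypothetical illustration of "argument parameters is of type <class 'str'>
# ... unable to convert to dict". Requires PyYAML.
import yaml

raw = '{"ComputeCount": 1, "ControllerCount": 3}'  # a string, not a dict

print(type(raw))  # <class 'str'> -> this is what triggers the error

parameters = yaml.safe_load(raw)  # YAML is a superset of JSON
print(type(parameters), parameters["ControllerCount"])
~~~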
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871346 - "Ironic nodes registration failing with error - ironicclient.common.apiclient.exceptions.InternalServerError: 'NoneType' object has no attribute 'keys'"
**Suspected issue** - To me, recent changes in ironic/api/controllers/v1/port.py seem to be related - https://review.opendev.org/#/c/715312/ - pinged hjensas for pointers as he proposed that patch.
~~~
Error is coming from here:- ironic/api/controllers/v1/port.py
File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/port.py", line 449, in _check_allowed_port_fields
{}).keys()):
AttributeError: 'NoneType' object has no attribute 'keys
~~~
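For what it's worth, a tiny hypothetical reproduction of that traceback pattern (illustrative only, not the actual ironic code or field name): a default in `.get()` only applies when the key is missing, not when its value is explicitly None:
~~~python
# Hypothetical illustration of the AttributeError in the traceback above.
port_dict = {"some_field": None}  # key present, value explicitly None

try:
    set(port_dict.get("some_field", {}).keys())  # the {} default never kicks in
except AttributeError as exc:
    print("same failure mode:", exc)

# A defensive variant the eventual fix might resemble:
print(set((port_dict.get("some_field") or {}).keys()))
~~~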
### 6th April 2020
#### Tripleo
~~~
* https://bugs.launchpad.net/tripleo/+bug/1871033 - RDO Third Party CI check failing with ERROR! the role 'tripleo-bootstrap' was not found - for the stable/train branch
Chandan gave some pointers, need to work on it further:-
* the tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 job is meant for master; not sure why it is running on the train branch
* https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/zuul.d/tripleo.yaml#L791 - it needs to be updated with centos-8 based jobs and replaced with the train job
* BAH should be fixed w/ https://review.rdoproject.org/r/#/c/26275/2/zuul.d/tripleo-rdo-base.yaml
~~~
~~~
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871076 - OVB jobs failing on the "Prepare the overcloud images" task - Pinged #rhos-ops (kforde and mporrato) to check for any issue on RDO cloud, but they are busy right now with another ongoing PSI outage.
* HELD NODE FOR DEBUG
~~~
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871086 tripleo-ci-centos-8-scenario001-standalone jobs failing with Container(s) with bad ExitCode: [''container-puppet-collectd'']
A backward-incompatible commit [1] was pushed to puppet-collectd and it's affecting check, promotion
and gate jobs. Until it's fixed in puppet-collectd, we have pushed a patch [2] to pin puppet-collectd to a known-good hash.
We are trying to make upstream backwards compatible; PR [3] sent.
[1] https://github.com/voxpupuli/puppet-collectd/commit/d7b79c
[2] https://review.rdoproject.org/r/#/c/26267/
[3] https://github.com/voxpupuli/puppet-collectd/pull/933
### 5th April 2020
#### TripleO
https://bugs.launchpad.net/tripleo/+bug/1871010 Validation packaging error blocking master periodic @ysandeep - it was a transient issue and got cleared in the next run; debugging ongoing in the bz for RCA - will need help from jpena
#### OSP
### 3rd April 2020
#### tripleo
* **HOT** Periodic jobs are failing with ERROR! the role 'tripleo-podman' was not found - https://bugs.launchpad.net/tripleo/+bug/1870481
The role tripleo_ansible/roles/tripleo-podman was removed in https://review.opendev.org/#/c/703477/, which seems to be causing the issue.
Patches up:-
https://review.rdoproject.org/r/26240
https://review.rdoproject.org/r/#/c/26241/
### 2nd April 2020
#### tripleo
* [Bug 1870257] [NEW] puppet-neutron-tripleo-standalone is continuously failing/timing out
https://review.opendev.org/716823 - removed from voting for now
https://review.rdoproject.org/r/#/c/26213/ - let's run the same config (the puppet-neutron tempest whitelist is 'network') in the component pipeline
Takashi mentioned a pain point that the tempest scope is too big; he mentioned we can move the job to non-voting, but it's still taking really long because it runs 3 hours x 2 times.
https://review.opendev.org/#/c/716952 - Takashi proposed a patch to reduce the tempest scope. He thinks we should review the test scope because it doesn't make sense to test a wider scope in puppet than in tripleo.
### 1st April 2020
#### tripleo
* centos-8-containers-multinode is failing recently due to mirror rpm download miss.. no bug required yet.
* master: hang tight.. promotions are coming :) We need a few patches to get promoted through the component pipeline.
* 'current' repo is being used instead of 'current-tripleo' for non-tripleo packages during rpmbuild in tripleo jobs
https://bugs.launchpad.net/tripleo/+bug/1870026 - Yatin working
* centos-7 queens containers build promotion job
the last run as per this https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-7-queens-containers-build in testproject was good; keeping an eye on the periodic job in case it fails again
* stein fs001 latest run is green :) https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-stein, https://review.opendev.org/#/c/715698/ fixed fs001 stein
* tripleo "keystone based standalone deployment failed with No such file or directory: ''/usr/sbin/nft'" https://bugs.launchpad.net/tripleo/+bug/1870095 - Chandan working on the fix.
* fix is here: https://review.opendev.org/#/c/716615/1/deployment/tripleo-firewall/tripleo-firewall-baremetal-ansible.yaml
### 31 March 2020
#### tripleo
* ovb jobs are oddly failing on image upload
* https://bugs.launchpad.net/tripleo/+bug/1869997
* only seeing this in integration jobs, component jobs are all green. INTERESTING...
* https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-baremetal-master/157fd37/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz
* Sandeep found this earlier w/
* **Hot** - periodic promotion jobs failed - overcloud images prepare failing with 'function' object has no attribute 'list'
Found an existing bug: https://bugs.launchpad.net/tripleo/+bug/1869736
* FIXED w/ https://review.opendev.org/#/c/716277/
* tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001 failed with NTP error - Minor issue (faced once but behavior was not expected)
~~https://bugs.launchpad.net/tripleo/+bug/1869842~~ - the all-nodes.sh script should not run if repo_setup.sh fails (details in bz)
* openstack-periodic-latest-released pipeline tripleo-ci-centos-7* jobs failed with ImageNotFoundException
https://bugs.launchpad.net/tripleo/+bug/1869871 - It looks like the container push job pushed containers with one hash while the deploy jobs are using a different hash
https://review.rdoproject.org/r/#/c/26180/ https://review.rdoproject.org/r/#/c/26181/ - patches are up
#### osp
### 30 March 2020
#### tripleo
* giving train centos-8 container builds a shot w/
* https://review.rdoproject.org/r/26172
* debug stein fs001, using held nodes w/ https://review.rdoproject.org/r/#/c/26170/2/zuul.yaml
* x.x.33.185 @ysandeep your pubkey is on the zuul user
* train upgrade promotion blocked by missing stein containers. <wes>
* these containers were deleted as part of the rdo container server migration.
* train promotion looks clear other than upgrade.. needs a stein promotion https://trunk.rdoproject.org/api-centos-train/api/civotes_detail.html?commit_hash=987489a97b5eb083199a432098b8176e7a185d4d&distro_hash=f724bc71ac39075f8b1e9b99f7a4b5978ff7032a
* 2020-03-30 10:22:18 | Exception: Not found image: docker://trunk.registry.rdoproject.org/tripleostein/centos-binary-cinder-api:c1c1d6ca8c2e4187286a61c960a47335bb21357f_dabe06cc
* These do exist in docker.io
* https://hub.docker.com/layers/tripleostein/centos-binary-base/c1c1d6ca8c2e4187286a61c960a47335bb21357f_dabe06cc/images/sha256-5c5ff85993dff7a20900ec60f92c6e5398211d24d6fa9b88611a55344b598e20?context=explore
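A quick way to confirm whether a given tag exists on a registry (minimal sketch; assumes the registry answers the Docker Registry v2 API anonymously, which trunk.registry.rdoproject.org should, while docker.io needs a bearer token):
~~~python
# Hypothetical helper: probe the v2 manifests endpoint for a tag.
# 200 -> tag exists, 404 -> missing (the "Not found image" case above).
import urllib.error
import urllib.request


def tag_exists(registry, image, tag):
    url = "https://%s/v2/%s/manifests/%s" % (registry, image, tag)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return False
        raise  # auth required or registry problem


print(tag_exists("trunk.registry.rdoproject.org",
                 "tripleostein/centos-binary-cinder-api",
                 "c1c1d6ca8c2e4187286a61c960a47335bb21357f_dabe06cc"))
~~~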
* https://review.opendev.org/#/c/715461/1 -
Revert "set tq branchful jobs to non-voting" Recheck failed, Analysis here http://paste.openstack.org/show/791315/ (Seems like a transient issue),
* Again posted recheck on the patch(Awaiting results)
* https://bugs.launchpad.net/tripleo/+bug/1869698
Stein/Rocky promotion jobs failing with error "ImportError: No module named os_ken.tests.integrated.common"
* https://review.rdoproject.org/r/#/c/25085/ updated neutron-tempest-plugin to 0.9.0 in Train/Stein/Rocky, but Stein and Rocky had separate tempest plugins (neutron-dynamic-routing, bgpvpn, fwaas, etc.), so those needed to be handled in the neutron-tempest-plugin package. The following patches should clear the issue:- Rocky: https://review.rdoproject.org/r/#/c/26160/,
Stein: https://review.rdoproject.org/r/#/c/26157/
* https://bugs.launchpad.net/tripleo/+bug/1869701
Periodic jobs failing with ImportError: No module named os_ken.tests.integrated.common
* Seems like the job has an issue (it's using unpromoted content)
#### osp
### Friday 27 March 2020
#### tripleo
* overcloud image build, permission denied - Fixed (https://bugs.launchpad.net/tripleo/+bug/1869119 - findings and fix patch URLs in the bz)
https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-buildimage-overcloud-full-centos-8 -> Is back to green :)
Requested jpena on #rhos-ops to trigger nodepool image builds so that we will not need the above ^^ fix in RDO. (We need virtualenv-20.0.15, https://pypi.org/project/virtualenv/#history) to avoid the https://review.opendev.org/#/c/715333/ workaround in RDO jobs
I think we can clear [CIX][LP:1869119][tripleoci][proa] permission denied error in diskimage_builder/lib/common-functions causing overcloud image build failures
* pip is pulling the latest version from pypi, which does not work on py27
https://bugs.launchpad.net/tripleo/+bug/1869161 (tripleo-ci-centos-7-containers-multinode-train/stein/stein/queens failing with ERROR: Package 'python-heatclient' requires a different Python: 2.7.5 not in '>=3.6'.)
It looks like a mirror issue - we also spoke with the #rhos-infra guys (see https://bugs.launchpad.net/tripleo/+bug/1869161/comments/7)
State of proposed patches:-
https://review.opendev.org/#/c/715324/ --Merged
https://review.opendev.org/#/c/715287/ - Failed in gate; posted recheck
https://review.opendev.org/#/c/715321/ - Probably does not need to be fixed in DLRN
https://softwarefactory-project.io/r/#/q/topic:tripleo-ci-py27-fix+(status:open+OR+status:merged) --> Merged
#### osp
No Update
### Thursday 26 March 2020
#### tripleo
:::danger
* pip is pulling the latest version from pypi, which does not work on py27
* https://bugs.launchpad.net/tripleo/+bug/1869161 (tripleo-ci-centos-7-containers-multinode-train/stein/stein/queens failing with ERROR: Package 'python-heatclient' requires a different Python: 2.7.5 not in '>=3.6'.)
:::
https://review.opendev.org/#/c/715179/ is up for testing, but it looks like a mirror or other issue (findings in the bz) and may need more work here.
* https://bugs.launchpad.net/tripleo/+bug/1869174 (tripleo-common stable/train openstack-tox-py27 job failing with ERROR: Package 'Pygments' requires a different Python: 2.7.17 not in '>=3.5')
https://review.opendev.org/715168 is up
* overcloud image build, permission denied https://bugs.launchpad.net/tripleo/+bug/1869119 - bz updated with findings
I would want to monitor that job for a while. If it's still failing with setuptools >= 46.1.3 in CI, then it's a different issue.
* rhel8 container build is failing again https://bugs.launchpad.net/tripleo/+bug/1869188
#### osp
* no update
### Wednesday 25 March 2020
#### tripleo
* fak, rdo container registry debacle
* https://review.opendev.org/#/c/715021/
* https://review.rdoproject.org/r/#/c/26106/
* https://review.rdoproject.org/r/#/c/26105/
* several gate failures today
* mostly tempest failures that appear to be transient
* http://tripleo-cockpit.usersys.redhat.com/d/9DmvErfZz/cockpit?orgId=1&fullscreen&panelId=61
* overcloud image build, permission denied https://bugs.launchpad.net/tripleo/+bug/1869119
* tempest component [link](http://tripleo-cockpit.usersys.redhat.com/d/2tivP9BWz/component-pipeline?orgId=1&fullscreen&panelId=431)
* starting to see green jobs again
* comparing integration and tempest component
* https://bugs.launchpad.net/tripleo/+bug/1869077 @sf9mAPkTSTexOvfiCGHboA FYI
* https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-scenario-master/4366d79/logs/undercloud/var/log/tempest/stestr_results.html.gz
* https://logserver.rdoproject.org/openstack-component-tempest/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-scenario-tempest-master/76b0247/logs/undercloud/var/log/tempest/stestr_results.html.gz
* upstream stable/stein results are low due to two patches that can be ignored
* https://review.opendev.org/#/c/656935/
* https://review.opendev.org/#/c/714940/
#### osp
* no updates