Ruck and rover notes #25

tags: ruck_rover

ruck/rover primer: https://docs.openstack.org/tripleo-docs/latest/ci/ruck_rover_primer.html

Infrared gerrit: https://review.gerrithub.io/q/project:redhat-openstack/infrared

Infrared doc: https://infrared.readthedocs.io/en/latest/

Cockpit: http://tripleo-cockpit.usersys.redhat.com/d/9DmvErfZz/cockpit?orgId=1

Internal Cockpit (WIP) http://tripleo-cockpit.usersys.redhat.com/?orgId=1
http://cistatus.tripleo.org/
https://trello.com/b/j4IcIomh/production-chain-escalation
http://rhos-release.virt.bos.redhat.com:3030/rhosp

Debugging Tools https://docs.google.com/document/d/1VZhje7ZN9sk4E31fYVrPxpqMJGz5ZhHRfhte_RYMXxg/edit#

Review.rdoproject.org dashboard: https://review.rdoproject.org/grafana/?orgId=1&var-datasource=default&var-server=registry.rdoproject.org.rdocloud&var-inter=$__auto_interval_inter

CentOS pre-release rpm updates for minor releases http://mirror.centos.org/centos/7/cr/x86_64/Packages/

hackmd.io rh-openstack-dev
https://hackmd.io/team/rh-openstack-ci?nav=overview

Internal software factory: https://sf.hosted.upshift.rdu2.redhat.com

upstream rsync mirror logs: files.openstack.org/mirror/logs/rsync-mirrors/centos.log

TRELLO RETROSPECTIVE https://trello.com/b/0VFswmht/rdo-infra-retrospective?menu=filter&filter=label:UniSprint21

Internal Dashboard - https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16/ OSP-10 - OSP-16

RHOS INFRA INFRARED ISSUES https://projects.engineering.redhat.com/issues/?filter=34183

CIX escalation https://mojo.redhat.com/docs/DOC-1098748#jive_content_id_CIX_Escalation_Automation_and_email_format

CIX board https://trello.com/b/j4IcIomh/production-chain-escalation

Nodepool image logs: https://softwarefactory-project.io/nodepool-log/

We may want to move this etherpad to something internal at this point

please add your (colored) name here: time to move to hackmd WDYT? +1 (either now - start of the sprint/rr - or in 3 weeks)
marios (baby blue) fhubik("green lantern") wznoinsk (orange)Amnon(Marrooned)

POST BELOW THIS

Dates: March 26 - April 15th
Tripleo CI team ruck|rover: Wes (weshay) && Sandeep ysandeep
OSP CI team ruck|rover: Attila Fazekas and Ariel Opincaru

Previous notes: link

Issues to track on-going

put these issues in the spoiler.

tripleo

https://bugs.launchpad.net/tripleo/+bug/1872881 - Cinder volume failed to build and went to ERROR state - No valid backend was found ( Stderr: ' Volume group "cinder-volumes" not found\n)

### 16th April
#### Tripleo

https://review.opendev.org/#/c/718545/19 caused it. There is a patch that cleans up healthchecks, https://review.opendev.org/#/c/720061/, but we are not sure it will fix the issue. We have tagged Emilien on #tripleo to confirm.

### 15th April
#### Tripleo

### 14th April
#### Tripleo

### 13th April
#### Tripleo

  • Promotion Blocker - Compute component promotion pipeline affected
    https://bugs.launchpad.net/tripleo/+bug/1872399 - Deployment failed because "nova_wait_for_api_service" container failed to start (nova_api_wsgi_error - ModuleNotFoundError: No module named 'dataclasses')

Patch is up - https://review.rdoproject.org/r/#/c/26402/

The dataclasses library was recently added to requirements [1] and nova is its first user [2], so this new dep needs to be added in RDO; once added, it needs to be added to the nova rpm spec file (workflow details here [3]).

It is only needed for Python 3.6; the dataclasses library was added to the standard library in Python 3.7.
[1] https://github.com/openstack/requirements/commit/e7c7dbfc8e09f07ba19cb4474b13f98470ae16b7
[2] https://review.opendev.org/#/c/704643
[3] https://www.rdoproject.org/documentation/requirements/#adding-a-new-requirement-to-rdo
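
As a rough illustration of why the dependency is conditional, the sketch below decides from the interpreter version whether the backport package is needed. The function name `needs_dataclasses_backport` is hypothetical; the actual fix is the RDO packaging change above, not this code.

```python
import sys


def needs_dataclasses_backport(version_info=sys.version_info):
    """Return True when the interpreter predates stdlib dataclasses (added in 3.7).

    Hypothetical helper for illustration only.
    """
    return tuple(version_info[:2]) < (3, 7)


if __name__ == "__main__":
    # Python 3.6 needs the separate 'dataclasses' package; 3.7+ does not.
    print(needs_dataclasses_backport((3, 6)))
    print(needs_dataclasses_backport((3, 7)))
```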

Chatter with smcginnis:-

ysandeep|rover> #openstack-release Hello! Need help with patch https://review.opendev.org/#/c/718468/ - this patch was regarding removal of stein branch from tripleo-ansible, patch got merged but we still see https://opendev.org/openstack/tripleo-ansible/src/branch/stable/stein - do we need any manual step needed for the cleanup?
<smcginnis> ysandeep|rover: Correct. It was noted in that commit, but not super clear. You will now need to request someone from infra delete the branch. It needed to be removed from the release deliverable first to make sure it didn't get accidentally re-added after manual deletion by the release automation.

pinged on #openstack-infra - awaiting response from infra guys.


### 9th April
#### Tripleo

* https://bugs.launchpad.net/tripleo/+bug/1871809 - periodic-tripleo-ci-rhel-8-standalone-train job failing with "Failed to parse dlrn hash"
Last successful run for this job was on 16th March; it has been failing since then --> @weshayutin do you have any history on this?

http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org:8080/rdo/rhel8-train/9a/07/9a07da081ab55116e871add699d18371aeaed356_c0bb2d14/ - missing delorean.repo



* https://bugs.launchpad.net/tripleo/+bug/1871818 - Intermittently tempest run fails because SSH connection to instance fails - "ERROR ovsdbapp.backend.ovs_idl.transaction - RevisionConflict: OVN revision number for * (type: ports) is equal or higher than the given resource" 

Suspecting that the port didn't transition to the up state, so ssh to the instance failed. Found one weird OVN error, so it could be an OVN issue (details on bz).
Pinged #ovn if they have any pointers about ovn error, jlibosva is checking but we need to confirm what exactly is failing from tempest side.
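
For reference when reading the "Failed to parse dlrn hash" bug, the mirror URL noted above follows a `<aa>/<bb>/<commit>_<short distro hash>` layout. The sketch below builds and splits such a path. The layout is inferred from the URL in these notes, and the function names `dlrn_repo_path` and `split_dlrn_path` are hypothetical; this is not the parser the job actually uses.

```python
def dlrn_repo_path(commit_hash, distro_hash):
    """Build '<aa>/<bb>/<commit>_<distro[:8]>' from two hashes (assumed layout)."""
    short_distro = distro_hash[:8]
    return "{}/{}/{}_{}".format(
        commit_hash[:2], commit_hash[2:4], commit_hash, short_distro
    )


def split_dlrn_path(path):
    """Recover (commit_hash, short_distro_hash) from the last path component."""
    leaf = path.rstrip("/").split("/")[-1]
    commit_hash, _, short_distro = leaf.partition("_")
    # A 40-character commit hash and 8-character distro hash are assumed here.
    if len(commit_hash) != 40 or len(short_distro) != 8:
        raise ValueError("not a DLRN-style hash pair: %r" % leaf)
    return commit_hash, short_distro


if __name__ == "__main__":
    print(dlrn_repo_path("9a07da081ab55116e871add699d18371aeaed356", "c0bb2d14"))
```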



### 7th April 2020
#### Tripleo

* Train centos8 image build wip:
https://review.rdoproject.org/r/#/c/26285/
https://review.rdoproject.org/r/#/c/26287/

* https://bugs.launchpad.net/tripleo/+bug/1871291 - Introspection failing for OVB jobs - No nodes are manageable at this time. - **fixed**

Further checking found a metadata issue; detailed logs are in [1].

periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-master last ran on 2020-04-06. This issue seems to be a duplicate of bug [2], which was already fixed yesterday.

[ 137.351520] cloud-init[857]: 2020-04-06 07:21:38,783 - url_helper.py[WARNING]: Calling 'http://192.168.100.1/latest/meta-data/instance-id' failed [0/120s]: request error

[1] https://logserver.rdoproject.org/openstack-component-common/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-master/04e70b0/logs/bmc-console.log
[2] https://bugs.launchpad.net/tripleo/+bug/1871076


* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871338 - "overcloud deployment failing with msg: 'argument parameters is of type <class ''str''> and we were unable to convert to dict: unable to evaluate string as dictionary'"
**Issue determined** - https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/playbooks/cli-update-params.yaml has an issue. It started when "update parameters mistral workflows" was removed 3 days earlier in https://review.opendev.org/#/c/716286/. The bz has details. Need help from DFG:DF; patch https://review.opendev.org/#/c/717865/ is up.
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871346 - "Ironic nodes registration failing with error - ironicclient.common.apiclient.exceptions.InternalServerError: 'NoneType' object has no attribute 'keys'"
**Suspected Issue** - Recent changes in ironic/api/controllers/v1/port.py seem to be related - https://review.opendev.org/#/c/715312/ - pinged hjensas for pointers as he proposed this patch.
~~~

Error is coming from here:- ironic/api/controllers/v1/port.py 

File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/port.py", line 449, in _check_allowed_port_fields
    {}).keys()):

AttributeError: 'NoneType' object has no attribute 'keys'
~~~
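
One common way to hit this kind of error, not confirmed as the cause here, is calling `.get(key, {})` on a dict where the key is present but set to `None`: the default only applies when the key is missing. A minimal sketch, with the field name `some_field` and both function names being hypothetical:

```python
def keys_unsafe(port):
    # .get() falls back to {} only when the key is absent, not when it is None.
    return list(port.get("some_field", {}).keys())


def keys_safe(port):
    # "or {}" also covers a key that is present but explicitly set to None.
    return list((port.get("some_field") or {}).keys())


if __name__ == "__main__":
    port = {"some_field": None}
    print(keys_safe(port))
    try:
        keys_unsafe(port)
    except AttributeError as exc:
        print(exc)
```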

### 6th April 2020
#### Tripleo
~~~
* https://bugs.launchpad.net/tripleo/+bug/1871033 - RDO Third Party CI check failing with ERROR! the role 'tripleo-bootstrap' was not found - For Stable/ train branch
Chandan gave some pointers, Need to work further:-

    * tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 job is meant for master not sure why it is running on train branch
    * https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/zuul.d/tripleo.yaml#L791 - it needs to be updated with centos-8 based jobs and replace it with train job
    
  * BAH should be fixed w/ https://review.rdoproject.org/r/#/c/26275/2/zuul.d/tripleo-rdo-base.yaml
~~~

~~~
* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871076 - OVB jobs failing on "Prepare the overcloud images" task - Pinged #rhos-ops (kforde and mporrato) to check for any issue on rdo cloud, but they are busy right now with another ongoing PSI outage.
    * HELD NODE FOR DEBUG
~~~

* **HOT Promotion blocker** https://bugs.launchpad.net/tripleo/+bug/1871086 tripleo-ci-centos-8-scenario001-standalone jobs failing with Container(s) with bad ExitCode: [''container-puppet-collectd'']
A backward-incompatible commit [1] was pushed to puppet-collectd and it is affecting check, promotion
and gate jobs. Until it is fixed in puppet-collectd, we have pushed a patch [2] to pin puppet-collectd to a good hash.
We are trying to make upstream backward compatible; PR [3] sent.
[1] https://github.com/voxpupuli/puppet-collectd/commit/d7b79c
[2] https://review.rdoproject.org/r/#/c/26267/
[3] https://github.com/voxpupuli/puppet-collectd/pull/933



### 5th April 2020
#### TripleO
https://bugs.launchpad.net/tripleo/+bug/1871010 Validation packaging error blocking master periodic @ysandeep - it was a transient issue and got cleared in next run, debugging ongoing on bz for RCA. - will need help from jpena 
 


#### OSP

### 3rd April 2020
#### tripleo
* **HOT** Periodic jobs are failing with ERROR! the role 'tripleo-podman' was not found - https://bugs.launchpad.net/tripleo/+bug/1870481
Role tripleo_ansible/roles/tripleo-podman was removed in https://review.opendev.org/#/c/703477/, which seems to be causing the issue.
Patches up:-
 https://review.rdoproject.org/r/26240
 https://review.rdoproject.org/r/#/c/26241/

### 2nd April 2020
#### tripleo

* [Bug 1870257] [NEW] puppet-neutron-tripleo-standalone is continuously failing/timing out
https://review.opendev.org/716823 - removed from voting for now
https://review.rdoproject.org/r/#/c/26213/ - let's run the same config(puppet-neutron tempest white list is 'network') in the component pipeline
Takashi mentioned a pain point: the tempest scope is too big. He said we can move the job to non-voting, but it still takes really long because it runs 3 hours x 2 times.
https://review.opendev.org/#/c/716952 - Takashi proposed a patch to reduce the tempest scope. He thinks we should review the test scope because it doesn't make sense to test a wider scope in puppet than in tripleo.


### 1st April 2020
#### tripleo

* centos-8-containers-multinode is failing recently due to mirror rpm download miss.. no bug required yet.

* master: hang tight.. promotions are coming :)  We need a few patches to get promoted through the component pipeline.

* 'current' repo is being used instead of 'current-tripleo' for non tripleo packages during rpmbuild in tripleo jobs
    https://bugs.launchpad.net/tripleo/+bug/1870026 - Yatin working
    
* centos-7 queens containers build promotion job 
    last run as per https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-7-queens-containers-build in the test project was good; keeping an eye on the periodic job in case it fails again
    
* stein fs001 latest run green :)  https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-stein, https://review.opendev.org/#/c/715698/ fixed fs001 stein

* tripleo "keystone based standalone deployment failed with No such file or directory: ''/usr/sbin/nft'" https://bugs.launchpad.net/tripleo/+bug/1870095 - Chandan working on the fix.
  * fix is here: https://review.opendev.org/#/c/716615/1/deployment/tripleo-firewall/tripleo-firewall-baremetal-ansible.yaml

### 31 March 2020
####   tripleo
* ovb jobs are oddly failing on image upload
  * https://bugs.launchpad.net/tripleo/+bug/1869997
  * only seeing this in integration jobs, component jobs
    are all green. INTERESTING... 
     * https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-baremetal-master/157fd37/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz
  * Sandeep found this earlier w/
    * **Hot** - periodic promotion jobs failed - overcloud images prepare failing with 'function' object has no attribute 'list'
Found an existing bug: https://bugs.launchpad.net/tripleo/+bug/1869736
  * FIXED w/ https://review.opendev.org/#/c/716277/

* tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001 failed with NTP error - Minor issue (faced once but behavior was not expected)
~~https://bugs.launchpad.net/tripleo/+bug/1869842~~ - all-nodes.sh script should not run if repo_setup.sh fails(Details in bz)

* openstack-periodic-latest-released pipeline tripleo-ci-centos-7* jobs failed with ImageNotFoundException 
https://bugs.launchpad.net/tripleo/+bug/1869871 - It looks like the container images were pushed with one hash while the deploy jobs are using a different hash
https://review.rdoproject.org/r/#/c/26180/ https://review.rdoproject.org/r/#/c/26181/ - patches are up






#### osp





### 30 March 2020
#### tripleo

* giving train centos-8 container builds a shot w/
  * https://review.rdoproject.org/r/26172

* debug stein fs001, using held nodes w/ https://review.rdoproject.org/r/#/c/26170/2/zuul.yaml
    * x.x.33.185 @ysandeep your pubkey is on the zuul user

* train upgrade promotion blocked by missing stein containers. <wes>
  * these containers were deleted as part of the rdo container server migration.
  * train promotion looks clear other than upgrade.. needs a stein promotion https://trunk.rdoproject.org/api-centos-train/api/civotes_detail.html?commit_hash=987489a97b5eb083199a432098b8176e7a185d4d&distro_hash=f724bc71ac39075f8b1e9b99f7a4b5978ff7032a
  
     * 2020-03-30 10:22:18 | Exception: Not found image: docker://trunk.registry.rdoproject.org/tripleostein/centos-binary-cinder-api:c1c1d6ca8c2e4187286a61c960a47335bb21357f_dabe06cc
     * These do exist in docker.io
         * https://hub.docker.com/layers/tripleostein/centos-binary-base/c1c1d6ca8c2e4187286a61c960a47335bb21357f_dabe06cc/images/sha256-5c5ff85993dff7a20900ec60f92c6e5398211d24d6fa9b88611a55344b598e20?context=explore

 
* https://review.opendev.org/#/c/715461/1 - Revert "set tq branchful jobs to non-voting": recheck failed; analysis at http://paste.openstack.org/show/791315/ (seems like a transient issue).
    * Posted another recheck on the patch (awaiting results)


* https://bugs.launchpad.net/tripleo/+bug/1869698
Stein/ Rocky Promotion jobs failure with Error "ImportError: No module named os_ken.tests.integrated.common"
    * https://review.rdoproject.org/r/#/c/25085/ updated neutron-tempest-plugin to 0.9.0 in Train/Stein/Rocky, but Stein and Rocky had separate tempest plugins (neutron-dynamic-routing, bgpvpn, fwaas etc.), so those needed to be handled in the neutron-tempest-plugin package. The following patches should clear the issue:- Rocky:- https://review.rdoproject.org/r/#/c/26160/ ,
Stein:- https://review.rdoproject.org/r/#/c/26157/

* https://bugs.launchpad.net/tripleo/+bug/1869701 
Periodic jobs failing with ImportError: No module named os_ken.tests.integrated.common
    * Seems like the job has an issue (it is using unpromoted content)






#### osp



### Friday 27 March 2020 
#### tripleo


* overcloud image build, permission denied - Fixed (https://bugs.launchpad.net/tripleo/+bug/1869119 - findings and URLs of the fix patches are in the bz)

    https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-buildimage-overcloud-full-centos-8 -> Is back to green :)

    Requested jpena on #rhos-ops to trigger nodepool image builds so that we will not need the above fix in RDO. (We need virtualenv-20.0.15, https://pypi.org/project/virtualenv/#history, to avoid the https://review.opendev.org/#/c/715333/ workaround in RDO jobs.)
    
I think we can clear [CIX][LP:1869119][tripleoci][proa] permission denied error in diskimage_builder/lib/common-functions causing overcloud image build failures

* pip is pulling the latest version from pypi, which does not work on py27

    https://bugs.launchpad.net/tripleo/+bug/1869161 (tripleo-ci-centos-7-containers-multinode-train/stein/stein/queens failing with ERROR: Package ‘python-heatclient’ requires a different Python: 2.7.5 not in ‘>=3.6’.)

    It looks like a mirror issue - We also spoke with the #rhos-infra folks (see https://bugs.launchpad.net/tripleo/+bug/1869161/comments/7)
    
    State of proposed patches:-    
    https://review.opendev.org/#/c/715324/ --Merged
    https://review.opendev.org/#/c/715287/ - Failed in gate, posted recheck
    https://review.opendev.org/#/c/715321/ - Probably not needed, fixed in DLRN
    https://softwarefactory-project.io/r/#/q/topic:tripleo-ci-py27-fix+(status:open+OR+status:merged) --> Merged
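
The "requires a different Python: 2.7.5 not in '>=3.6'" message comes from comparing the running interpreter against the package's declared minimum. A minimal sketch of that comparison, handling only the simple `>=X.Y` form seen in the error; the function names are hypothetical and this is not pip's actual implementation:

```python
def parse_version(text):
    """Turn '2.7.5' into (2, 7, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_minimum(python_version, requires_python):
    """Check a Python version against a '>=X.Y' specifier (simplified sketch)."""
    if not requires_python.startswith(">="):
        raise ValueError("only '>=X.Y' specifiers are handled in this sketch")
    minimum = parse_version(requires_python[2:])
    current = parse_version(python_version)
    # Pad with zeros so that '3.6' and '3.6.0' compare as equal.
    width = max(len(minimum), len(current))
    minimum += (0,) * (width - len(minimum))
    current += (0,) * (width - len(current))
    return current >= minimum


if __name__ == "__main__":
    print(satisfies_minimum("2.7.5", ">=3.6"))
    print(satisfies_minimum("3.6.8", ">=3.6"))
```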
    
#### osp
No Update
    

### Thursday 26 March 2020 
#### tripleo

:::danger

* pip is pulling the latest version from pypi, which does not work on py27

    * https://bugs.launchpad.net/tripleo/+bug/1869161 (tripleo-ci-centos-7-containers-multinode-train/stein/stein/queens failing with ERROR: Package 'python-heatclient' requires a different Python: 2.7.5 not in '>=3.6'.)
:::

    https://review.opendev.org/#/c/715179/ is up for testing, but it looks like a mirror issue or some other issue (findings in bz) and may need more work.

    * https://bugs.launchpad.net/tripleo/+bug/1869174 (tripleo-common-stable/train openstack-tox-py27 job failing with ERROR: Package 'Pygments' requires a different Python: 2.7.17 not in '>=3.5')

    https://review.opendev.org/715168 is up

* overcloud image build, permission denied https://bugs.launchpad.net/tripleo/+bug/1869119 - bz updated with findings

I want to monitor that job for a while. If it's still failing with setuptools >= 46.1.3 in CI, then it's a different issue.

 * rhel8 container build is failing again - https://bugs.launchpad.net/tripleo/+bug/1869188



#### osp
* no update


### Wednesday 25 March 2020 
#### tripleo
* fak, rdo container registry debacle
    * https://review.opendev.org/#/c/715021/
    * https://review.rdoproject.org/r/#/c/26106/
    * https://review.rdoproject.org/r/#/c/26105/

* several gate failures today
  * mostly tempest failures that appear to be transient
  * http://tripleo-cockpit.usersys.redhat.com/d/9DmvErfZz/cockpit?orgId=1&fullscreen&panelId=61
  * overcloud image build, permission denied https://bugs.launchpad.net/tripleo/+bug/1869119

* tempest component [link](http://tripleo-cockpit.usersys.redhat.com/d/2tivP9BWz/component-pipeline?orgId=1&fullscreen&panelId=431)
  * starting to see green jobs again
  * comparing integration and tempest component
    * https://bugs.launchpad.net/tripleo/+bug/1869077 @sf9mAPkTSTexOvfiCGHboA FYI
    * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-scenario-master/4366d79/logs/undercloud/var/log/tempest/stestr_results.html.gz
    * https://logserver.rdoproject.org/openstack-component-tempest/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-scenario-tempest-master/76b0247/logs/undercloud/var/log/tempest/stestr_results.html.gz

* upstream stable/stein results are low due to two patches that can be ignored 
    * https://review.opendev.org/#/c/656935/
    * https://review.opendev.org/#/c/714940/
    
#### osp
* no updates