Ruck and Rover notes #41

tags: ruck_rover

Important links for ruck/rover: ruck/rover links to help

Ruck Rover - Unified Sprint #<fix>

Dates: Feb 4 - Feb 25

Tripleo CI team ruck|rover: arxcruz, ysandeep

OSP CI team ruck|rover: <fix>Names</fix>

Previous notes: link

Issues to track on-going

put these issues in the spoiler.

tripleo

check/Gate:

Stein branch check/gate jobs are failing because of missing container images (Error: ImageNotFoundException): https://bugs.launchpad.net/tripleo/+bug/1915921

promotions:

Master: 17th Feb (Yellow)

sc01/02 only failed once, passed in testproject

We have a bug for fs039 on master. It is fixed now, but the fix still needs to hit the integration line.

We also need to talk with the Security DFG about fs039: this job needs to be dropped or migrated.

  • Victoria -
  • Ussuri - Green
  • C8 train -
  • C7 train - Red - 23rd Jan: [CIX][LP:1915519][tripleoci][proa][Train][CentOS7][scenario004] Failing with Error: 'ip-192.168.24.3' already exists. Too many tries https://bugs.launchpad.net/tripleo/+bug/1915519
  • Stein - Green - Promoted on 24th Feb
  • Rocky - Red: periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-rocky is failing with NeutronError: "Invalid input for operation: segmentation_id requires physical_network for VLAN provider network" https://bugs.launchpad.net/tripleo/+bug/1916695
  • Queens - Promoted on 11th Feb

Add dates in descending order so the latest date is at the top. Break out TripleO and OSP sections.

March 3rd

Gate

RDO

Feb 25th

Gate

RDO

26th Feb

RDO

1st March

Tripleo upstream Gate:

RDO:

  • Promotion status

    • Master - 1st March
    • Victoria - 1st March
    • Ussuri - 27th Feb
    • C8 train - 1st March
  • NODE_FAILURES: https://review.rdoproject.org/zuul/builds?result=NODE_FAILURE

    <ykarel> is there some known issue with vexxhost, our jobs using nodepool-rdo tenant in vexxhost are failing, on looking i see vms are not spawning, Build of instance 2806cd94-f395-4f2e-a803-2de297c45749 aborted: Failed to allocate the network(s), not rescheduling
    <ykarel> ohhk other tenants also have too many NODE_FAILURE https://review.rdoproject.org/zuul/builds?result=NODE_FAILURE
    <ykarel> so likely it's outage on vexxhost side, checking launcer logs
    <ykarel> hmm it's same on zuul side too:- Detailed node error: Build of instance 0eee0560-1304-4f7a-ad60-57885077d066 aborted: Failed to allocate the network(s), not rescheduling
    <ykarel> pinged on #vexxhost, but doubt if someone is around at this time, will wait
    <bhagyashris> ykarel, ack thanks :)
    <ykarel> mnaser fixed ^, now vms are being created successfully
    <ykarel> bhagyashris, fyi ^
    <bhagyashris> ykarel, ack thanks :)
    
  • Note:

    • There are two distinct causes of NODE_FAILURE:
      1. Detailed node error: Build of instance 0eee0560-1304-4f7a-ad60-57885077d066 aborted: Failed to allocate the network(s), not rescheduling
      2. nodepool.exceptions.LaunchNetworkException: Unable to find public IP of server : https://review.rdoproject.org/r/#/c/32123/
    • The first one is resolved, but the second one is still happening randomly; it was discussed on the #rhos-ops channel.
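When going through launcher or Zuul error output, it can help to bucket lines by which of the two errors they match. A minimal sketch, assuming the error text shows up verbatim in the lines being scanned; the bucket names and function names are hypothetical, only the two message strings come from the notes.

```python
import re
from collections import Counter

# Signatures copied from the two errors quoted in the note.
SIGNATURES = {
    "network_allocation": re.compile(
        r"Failed to allocate the network\(s\), not rescheduling"),
    "no_public_ip": re.compile(r"Unable to find public IP of server"),
}


def classify(line):
    """Return the bucket name for a log line, or 'other' if nothing matches."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count how many lines fall into each bucket."""
    return Counter(classify(line) for line in lines)
```

A rising `no_public_ip` count with `network_allocation` at zero would match the situation described here, where the first issue is fixed and the second keeps recurring.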
    <bhagyashris> ykarel, still seeing some node failure on triggered openstack-periodic-integration-stable3 pipeline https://review.rdoproject.org/zuul/status

    <ykarel> bhagyashris, looking
    <ykarel> bhagyashris, okk this is different one: nodepool.exceptions.LaunchNetworkException: Unable to find public IP of server
    <ykarel> it is known already, we hitting this randomly when large build requests are at a point
    <ykarel> see spikes in https://softwarefactory-project.io/grafana/d/lu6loudWz/provider-vexxhost-nodepool-tripleo?orgId=1&from=now-1h&to=now

    <ykarel> around 08:03 UTC
    <ykarel> dpawlik, did we get something from vexxhost for ^?
    <ykarel> also fyi there was one more issue since saturday, can see discussion on #vexxhost for that
    <dpawlik> ykarel: so the issue is related to the "Unable to find public IP of server" right?
    <ykarel> the other issue since saturday was "Failed to allocate the network(s), not rescheduling", which is now fixed. the random one for "Unable to find public IP of server" is still happening
    <dpawlik> near 8 was some pick of FIPs https://prometheus.monitoring.softwarefactory-project.io/prometheus/graph?g0.expr=floating_ip&g0.tab=0&g0.stacked=0&g0.range_input=1w

    <ykarel> yeap but iiuc we are not hitting quota iirc which is 125, so something wrong on server side
    <dpawlik> ykarel: not really. We are calculating the fips base on what the user can get from Neutron API.
    <ykarel> how we can get from admin side?
    <dpawlik> ykarel: we have a task for it. I will try to put it higher in priority
    <ykarel> dpawlik, okk Thanks
    <ykarel> dpawlik++
    <bhagyashris> ykarel, ok thanks for info and dpawlik thanks :)
    <dpawlik> ykarel: seems that via horizon the calculation is ok
    <dpawlik> ykarel: so prometheus says that we now have 34 floating ips in use, where in horizon for nodepool-tripleo project is 52
    <dpawlik> ykarel: I will try to dig a littlebit if we can do something in our script to fix that calculations
    <ykarel> dpawlik, okk
    <ykarel> dpawlik, those 52 ips are in-use state?
    <dpawlik> yup

    <dpawlik> ykarel|lunch: thats strange, all floating ips are from subnet 38.102.83.0/24 where it should also use 38.129.56.0/24
    <dpawlik> and it can be possible that the first network is out of ips
    <dpawlik> ykarel|lunch: maybe I found where is an issue

    <dpawlik> ykarel|lunch: https://review.rdoproject.org/r/#/c/32123/

    <dpawlik> slaweq: Hey. If we have network "public" and in that network, there are two subnets: 38.102.83.0/24 and 38.129.56.0/24 . Is it possible, that neutron is taking ip address just from one subnet and it does not touch second subnet until the first is finished or it will "touch" also the second subnet?
    <slaweq> dpawlik: let me check in the code
    <slaweq> I don't remember exactly
    <slaweq> dpawlik: it seems for me that it can get IP from any subnet
    <dpawlik> slaweq++
    <dpawlik> thanks
    <slaweq> look here https://github.com/openstack/neutron/blob/482d0fe2bf0b078ced598aae4059862981550cae/neutron/db/ipam_pluggable_backend.py#L257
    <slaweq> it makes list of available IPs from all subnets in the network
    <dpawlik> cc jpena ^^
    <slaweq> or wait
    <slaweq> it seems it will be like that, in https://github.com/openstack/neutron/blob/482d0fe2bf0b078ced598aae4059862981550cae/neutron/ipam/drivers/neutrondb_ipam/driver.py#L174 it iterates over allocation pools and getting IP from the pool
    <slaweq> but to be sure I would need to test that :)
    <dpawlik> slaweq: k. Thanks
    <ykarel> dpawlik, Thanks, will check post meeting
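One way to check the observation that all floating IPs come from 38.102.83.0/24 is to count in-use addresses per subnet. The sketch below uses only the Python standard library; the two subnets are taken from the discussion above, while the function name and the way the address list is obtained are assumptions.

```python
import ipaddress
from collections import Counter

# The two subnets of the "public" network mentioned in the chat.
SUBNETS = [
    ipaddress.ip_network("38.102.83.0/24"),
    ipaddress.ip_network("38.129.56.0/24"),
]


def count_per_subnet(addresses, subnets=SUBNETS):
    """Count how many addresses fall into each subnet (or 'other')."""
    counts = Counter({str(net): 0 for net in subnets})
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        for net in subnets:
            if ip in net:
                counts[str(net)] += 1
                break
        else:
            # Address is outside both known subnets.
            counts["other"] += 1
    return counts
```

The address list could come from something like `openstack floating ip list -f value -c "Floating IP Address"` run against the nodepool-tripleo project. If the count for 38.129.56.0/24 stays at zero while the first subnet is close to full, that would be consistent with the first subnet running out of IPs.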
    

2nd March

Tripleo upstream Gate:

RDO periodic :

3rd March

Tripleo upstream Gate:

RDO periodic :

4th March

Tripleo upstream Gate:

RDO periodic

5th March

Tripleo upstream Gate:

RDO periodic :

8th March

Tripleo upstream Gate:

RDO periodic :

9th March

Tripleo upstream Gate:

RDO periodic :

10th March

Tripleo upstream Gate:

RDO periodic :

11th March

Tripleo upstream Gate:

RDO periodic :

12th March

Tripleo upstream Gate:

RDO periodic :

15th March

Tripleo upstream Gate:

RDO periodic :

16th March

Tripleo upstream Gate:

RDO periodic :
