
OVB Status: Increasing robustness of our OVB jobs

tags: ruck_rover

Current situation with OVB jobs

Dozens of our OVB jobs fail due to intermittent issues every day! Rerunning and debugging jobs dozens of times a day is exhausting our rucks and rovers.

To get an idea of how bad the situation really is, refer to these ruck/rover (rr) notes:

For example, it sometimes took up to 20 reruns of the same job to get a promotion of c9 master or c9 wallaby. Typical culprits during that period were c9 master/wallaby fs35 and fs64.

Ideas for improvement

  1. Do OVB jobs provide value?
  2. How can we utilize the IBM cloud in a better way?
  3. Why are OVB jobs failing?
  4. Can we reduce OVB testing in check?
  5. Results/pointers to previous efforts (we investigated memory usage and a tempest split?)
  6. Can we make our promotion smarter: pass if the corresponding internal featureset passes?
  7. Are we triggering the OVB check job on the correct file changes? (See the Zuul sketch after this list.)
  8. Do the next-gen resources make the situation worse?
  9. Use elastic-recheck to report back on patches for rdo-check job failures? (See the query sketch after this list.)
  10. Find a partner from DF to help with debugging?
  11. Doug's work on running all nodes on one compute.
  12. Are we running OVB as the right test, i.e. hardware provisioning?
  13. What are the actual errors?
  14. [jm1] Most intermittent failures lead us to conclude that RDO/RHOS/TripleO is very sensitive to load or latency (CPU? network? disk?) of the underlying systems. We need help from DFGs to make RDO/RHOS/TripleO more robust. Customers will likely run into some of these issues as well, but they do not know that a simple rerun could fix them, since no customer-facing document states this.
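
For idea 7, a minimal sketch of what the trigger filter could look like in Zuul. The job name and file patterns below are placeholders, not our actual configuration; the point is that Zuul's `irrelevant-files` (or `files`) matchers let us skip the expensive OVB check job for changes that cannot affect an OVB deployment, e.g. docs-only patches.

```yaml
# Hypothetical Zuul job definition; job name and path patterns are examples only.
# irrelevant-files skips the job when a change touches only the listed paths.
- job:
    name: tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
    parent: tripleo-ci-base-ovb
    irrelevant-files:
      - ^doc/.*$
      - ^releasenotes/.*$
      - ^.*\.md$
      - ^.*\.rst$
```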
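
For idea 9, elastic-recheck matches known failures by running saved Elasticsearch queries against indexed job logs and commenting back on the affected patch. A query file could look like the sketch below; the bug number, error message, and job name are placeholders, and whether the RDO logstash/Elasticsearch setup indexes the same fields as upstream would need to be confirmed.

```yaml
# queries/9999999.yaml -- placeholder bug number and error string.
# On a match, elastic-recheck leaves a comment on the failed patch
# pointing at the known bug instead of leaving the failure unexplained.
query: >-
  message:"Stack overcloud CREATE_FAILED"
  AND tags:"console"
  AND build_name:"tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001"
```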