# OVB Status: Increasing robustness of our OVB jobs
###### tags: `ruck_rover`
## Current situation with OVB jobs
Dozens of our OVB jobs fail due to intermittent issues each day! Rerunning and debugging jobs dozens of times a day is exhausting our rucks and rovers.
To get an idea of how bad the situation really is, refer to these ruck/rover (rr) notes:
* [2022-09-13 - 2022-09-15](https://hackmd.io/dKeK6zo9R66heikGyCb4NA)
* [2022-09-09 - 2022-09-12](https://hackmd.io/s4TgnCY-QQGKv2ONxTjOZA)
* [2022-09-01 - 2022-09-01](https://hackmd.io/94uNoMlnQgegrgy1iXV1kQ)
* [2022-08-26 - 2022-08-31](https://hackmd.io/7qAKWCiCQA6IEdE9WXYn4Q)
For example, it sometimes took up to 20 reruns of the same job to get a promotion of c9 master or c9 wallaby. Typical culprits during that period were c9 master/wallaby fs35 and fs64.
## Ideas for improvements
1. Do OVB jobs provide value?
2. How can we make better use of the IBM cloud?
3. Why are OVB jobs failing?
4. Can we reduce OVB testing in the check pipeline?
5. Gather results/pointers from previous efforts (we investigated memory usage and a tempest split?)
6. Can we make our promotion criteria smarter, e.g. pass if the corresponding internal featureset (fs) job passes? (See the sketch after this list.)
7. Are we triggering the OVB check job on the correct file changes?
8. Do the next-gen resources make the situation worse?
9. Use elastic-recheck to report back on patches for rdo-check job failures?
10. Find a partner from DF to help with debugging?
11. Doug's work on running all nodes on a single compute host
12. Are we running OVB as the right kind of test, i.e. hardware provisioning?
13. What are the actual errors?
14. [jm1] Most intermittent failures lead us to conclude that RDO/RHOS/TripleO is very sensitive to load or latency (CPU? network? disk?) of the underlying systems. We need help from the DFGs to make RDO/RHOS/TripleO more robust. Customers will likely run into some of these issues as well, but they will not know that a simple rerun could fix them, since no customer-facing document states this.
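
To make idea 6 a bit more concrete, below is a minimal Python sketch (not the actual promoter code or criteria configuration) of how a required OVB job could be waived when its corresponding internal featureset job passed. The job names and the OVB-to-internal mapping are made up for illustration.

```python
# Sketch of idea 6 (smarter promotion criteria): treat a required OVB job as
# satisfied if its corresponding internal featureset job passed.
# NOTE: job names, the mapping and the function below are hypothetical
# illustrations, not the real promoter code.

# Hypothetical mapping from an OVB job to its internal counterpart.
OVB_TO_INTERNAL = {
    "periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master":
        "periodic-tripleo-ci-centos-9-internal-featureset001-master",
}


def criteria_met(required_jobs, passed_jobs):
    """Return True if every required job passed, either directly or via a
    passing internal counterpart (OVB jobs only)."""
    for job in required_jobs:
        if job in passed_jobs:
            continue
        internal = OVB_TO_INTERNAL.get(job)
        if internal and internal in passed_jobs:
            # The OVB job failed (or did not report), but the equivalent
            # internal featureset passed, so count it as satisfied.
            continue
        return False
    return True


if __name__ == "__main__":
    required = {"periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master"}
    passed = {"periodic-tripleo-ci-centos-9-internal-featureset001-master"}
    print(criteria_met(required, passed))  # True: internal counterpart passed
```

This would only change the promotion decision, not the job results themselves, so failing OVB runs would still be visible for debugging.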