
OVB Status: Increasing robustness of our OVB jobs

tags: ruck_rover

Current situation with OVB jobs

Dozens of our OVB jobs fail due to intermittent issues every day! Rerunning and debugging jobs dozens of times a day is exhausting our rucks and rovers.

To get an idea of how bad the situation really is, refer to these ruck/rover (rr) notes:

For example, it sometimes took up to 20 reruns of the same job to get a promotion of c9 master or c9 wallaby. Typical culprits during that period were c9 master/wallaby fs35 and fs64.

Ideas for improvement

  1. Do OVB jobs provide value?
  2. How can we utilize the IBM cloud in a better way?
  3. Why are OVB jobs failing?
  4. Can we reduce OVB testing in check?
  5. Results/pointers to previous efforts (we investigated memory usage and a tempest split?)
  6. Can we make our promotion smarter: pass if the corresponding internal featureset passes?
  7. Are we triggering the OVB check job on the correct file changes? (See the Zuul sketch after this list.)
  8. Do the next-gen resources make the situation worse?
  9. Use elastic-recheck to report back on patches for rdo-check job failures? (See the query sketch after this list.)
  10. Find a partner from DF to help with debugging?
  11. Doug's work on running all nodes on one compute.
  12. Are we running OVB as the right test, i.e. hardware provisioning?
  13. What are the actual errors?
  14. [jm1] Most intermittent failures lead us to conclude that RDO/RHOS/TripleO is very sensitive to load or latency (CPU? network? disk?) of the underlying systems. We need help from DFGs to make RDO/RHOS/TripleO more robust. Customers will likely run into some of these issues as well, but they do not know that a simple rerun could fix them, since no customer-facing document states this.
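
For idea 7, a minimal sketch of what the trigger filter could look like in Zuul. The job name and file patterns below are placeholders, not our actual configuration; the point is that Zuul's `irrelevant-files` (or `files`) matchers let us skip the expensive OVB check job for changes that cannot affect an OVB deployment, e.g. docs-only patches.

```yaml
# Hypothetical Zuul job definition; job name and path patterns are examples only.
# irrelevant-files skips the job when a change touches only the listed paths.
- job:
    name: tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
    parent: tripleo-ci-base-ovb
    irrelevant-files:
      - ^doc/.*$
      - ^releasenotes/.*$
      - ^.*\.md$
      - ^.*\.rst$
```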
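
For idea 9, elastic-recheck matches known failures by running saved Elasticsearch queries against indexed job logs and commenting back on the affected patch. A query file could look like the sketch below; the bug number, error message, and job name are placeholders, and whether the RDO logstash/Elasticsearch setup indexes the same fields as upstream would need to be confirmed.

```yaml
# queries/9999999.yaml -- placeholder bug number and error string.
# On a match, elastic-recheck leaves a comment on the failed patch
# pointing at the known bug instead of leaving the failure unexplained.
query: >-
  message:"Stack overcloud CREATE_FAILED"
  AND tags:"console"
  AND build_name:"tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001"
```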