
CI Job Diff

Problem

In TripleO CI, we have multiple CI jobs run using featureset files, defined in
tripleo-quickstart/config/general_config.

Once the CI job finishes, we collect lots of data from the environment using
ansible-role-collect-logs.

The CI job runs on:

  • Periodic, check and gate pipelines in rdo-cloud as third-party jobs
    • on CentOS 7 and RHEL 8
  • Upstream check and gate jobs
    • on CentOS

The same job has started failing multiple times for various reasons:

  • Time out
    • Reasons:
      • Time taken in pulling containers
      • individual tasks taking a long time during execution
  • undercloud/standalone/overcloud deploy failures
    • Reasons:
      • Differences in the RPM/pip packages/containers used
        • common examples: podman, ceph-ansible and Ansible
        • use of an older version of a package
        • container tag differences coming from other places
      • Environment vars
        • mix of python2 and python3
        • mix of pip and rpm
      • Deployment configuration difference
  • Variance in passing/skipping of tempest tests
  • Variance in the config used downstream and upstream

Since the deployment is complex, it is very hard to find out the actual reason for a failure.
In day-to-day debugging in TripleO CI, we manually compare a passed and a failed job of the same featureset.
We compare:

  • the yum repo files and rpm package versions
  • the containers used and where they come from
  • the Ansible run time taken
  • the failed log file, to find out what went wrong

Solution

The aim here is to compare passed and failed fs01 or standalone jobs and extract meaningful information that makes debugging easier.

So the comparison consists of:

  • RHEL vs RHEL standalone/FS01 passed & failed jobs
  • RHEL vs CentOS standalone/FS01 passed & failed jobs

List of things that need to be compared:

  • rpm
  • rpms installed within containers
  • tempest results
  • systemd services status
  • pip results
  • ARA output measuring task times
    • might be useful for finding timeouts

Using logreduce

logreduce can help compare log files between a passed and a failed job and find the error easily,
but there are other things that cannot be achieved with it.

Work proposal

Initial goal: build an MVP that compares a passed and a failed FS01 job.

Example implementation as Zuul jobs:

# zuul.yaml
---
- job:
    name: compare-rhel-centos
    dependencies:
      - fs01-centos
      - fs01-rhel
    run: compare.yaml

- project:
    check:
      jobs:
        - fs01-centos
        - fs01-rhel
        - compare-rhel-centos

# compare.yaml
---
- hosts: localhost
  tasks:
    - name: Fetch results from the parent jobs    # placeholder task
    - name: Generate a comparison report          # placeholder task

See the phoronix-merge-result example implemented in: https://review.opendev.org/#/c/679082/
For the child job to fetch results from a parent job, the parent job needs to indicate its logs using zuul_return artifacts, for example:
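A minimal sketch, assuming the parent job exposes its collected logs as an artifact in a post-run playbook (the artifact name and url below are placeholders, not the actual values used by the fs01 jobs):

# post.yaml of the parent job (e.g. fs01-centos)
---
- hosts: localhost
  tasks:
    - name: Expose collected logs to dependent jobs
      zuul_return:
        data:
          zuul:
            artifacts:
              # url is relative to the job's log root; name/url are placeholders
              - name: "fs01 logs"
                url: "logs/"

The compare job should then be able to read the zuul.artifacts variable for its dependencies to locate the files to download.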

Task breakdown

  • Get the passed and failed fs01 job
  • Download the required files from logs
    • Extend this tool to parse the task where the job failed, then navigate to the required file and
      use logreduce to compare the log files and show the exact error or issue.
  • Run a script to compare rpms (see the playbook sketch after this list)
    • The script will print rpm packages with the same version and those with different versions
    • Extend the rpm version comparison script to find what reviews got merged between the two versions.
  • For tempest results:
    • Check the list of tempest tests passed and failed
  • Include the container-diff tool as part of the collect-logs tool?
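For the rpm comparison item above, a small Ansible playbook could serve as a first pass. The sketch below is only illustrative: the installed-rpms file names and locations are assumptions about what ansible-role-collect-logs gathers, not confirmed paths, and a package whose version differs will simply show up in both "only present in" lists.

# rpm_compare.yaml - illustrative sketch, not an existing tool
---
- hosts: localhost
  vars:
    # assumed locations of the `rpm -qa` dumps collected from each job
    passed_rpms_file: passed-job/installed-rpms.txt
    failed_rpms_file: failed-job/installed-rpms.txt
  tasks:
    - name: Load the rpm lists from both jobs
      set_fact:
        passed_rpms: "{{ lookup('file', passed_rpms_file).splitlines() }}"
        failed_rpms: "{{ lookup('file', failed_rpms_file).splitlines() }}"

    - name: Show rpms (name-version-release) only present in the failed job
      debug:
        msg: "{{ failed_rpms | difference(passed_rpms) }}"

    - name: Show rpms (name-version-release) only present in the passed job
      debug:
        msg: "{{ passed_rpms | difference(failed_rpms) }}"

This could later be extended to map version differences to the reviews merged between the two versions, as noted above.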

Questions

Once we have the scripts available, how should they be consumed?

  • As a service
  • Running manually on demand
  • Consumed as a part of zuul job

Proposer

  • Ronelle Landy

Consumer

  • TripleO CI team

Available Tools

Notes from meeting [11/09/2019]
