
CI Job Diff

Problem

In TripleO CI, we have multiple CI jobs run using featureset files, defined in
tripleo-quickstart/config/general_config.

Once the CI job finishes, we collect lots of data from the environment using
ansible-role-collect-logs.

The CI job runs on:

  • Periodic, check and gate pipelines in rdo-cloud as third-party jobs
    • on CentOS 7 and RHEL 8
  • Upstream check and gate jobs
    • on CentOS

The same job has started failing multiple times for various reasons:

  • Time out
    • Reasons:
      • Time taken in pulling containers
      • individual tasks taking a long time during execution
  • undercloud/standalone/overcloud deploy failures
    • Reasons:
      • Differences in the RPM/pip packages/containers used
        • common examples: podman, ceph-ansible and Ansible
        • use of an older version of a package
        • container tag differences coming from other places
      • Environment vars
        • mix of python2 and python3
        • mix of pip and rpm
      • Deployment configuration difference
  • Variance in passing/skipping of tempest tests
  • Variance in the config used downstream and upstream

Since the deployment is complex, it is very hard to find out the actual reason for a failure.
In day-to-day debugging in TripleO CI, we manually compare a passed and a failed job of the same featureset.
We compare:

  • the yum repo files and rpm package versions
  • the containers used and where they come from
  • the Ansible run time taken
  • the failed log file, to find out what went wrong

Solution

The aim here is to compare passed and failed fs01 or standalone jobs and extract meaningful information that makes debugging easier.

So the comparison consists of:

  • RHEL vs RHEL standalone/FS01 passed & failed jobs
  • RHEL vs CentOS standalone/FS01 passed & failed jobs

List of things that need to be compared:

  • rpm
  • rpms installed within containers
  • tempest results
  • systemd services status
  • pip results
  • ARA output measuring task times
    • might be useful for finding timeouts

Using logreduce

logreduce can help compare log files between a passed and a failed job and find the error easily,
but there are other things that cannot be achieved with it.

Work proposal

Initial goal: build an MVP that compares a passed and a failed FS01 job.

Example implementation as Zuul jobs:

# zuul.yaml
---
- job:
    name: compare-rhel-centos
    dependencies:
      - fs01-centos
      - fs01-rhel
    run: compare.yaml

- project:
    check:
      jobs:
        - fs01-centos
        - fs01-rhel
        - compare-rhel-centos

# compare.yaml
---
- hosts: localhost
  tasks:
    - name: Fetch results from the parent jobs    # placeholder task
    - name: Generate a comparison report          # placeholder task

See the phoronix-merge-result example implemented in: https://review.opendev.org/#/c/679082/
For the child job to fetch results from a parent job, the parent job needs to indicate its logs using zuul_return artifacts, for example:
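A minimal sketch, assuming the parent job exposes its collected logs as an artifact in a post-run playbook (the artifact name and url below are placeholders, not the actual values used by the fs01 jobs):

# post.yaml of the parent job (e.g. fs01-centos)
---
- hosts: localhost
  tasks:
    - name: Expose collected logs to dependent jobs
      zuul_return:
        data:
          zuul:
            artifacts:
              # url is relative to the job's log root; name/url are placeholders
              - name: "fs01 logs"
                url: "logs/"

The compare job should then be able to read the zuul.artifacts variable for its dependencies to locate the files to download.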

Task breakdown

  • Get the passed and failed fs01 job
  • Download the required files from logs
    • Extend this tool to parse the task where the job failed, then navigate to the required file and
      use logreduce to compare the log files and show the exact error or issue.
  • Run a script to compare rpms (see the playbook sketch after this list)
    • The script will print rpm packages with the same version and those with different versions
    • Extend the rpm version comparison script to find what reviews got merged between the two versions.
  • For tempest results:
    • Check the list of tempest tests passed and failed
  • Include the container-diff tool as part of the collect-logs tool?
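For the rpm comparison item above, a small Ansible playbook could serve as a first pass. The sketch below is only illustrative: the installed-rpms file names and locations are assumptions about what ansible-role-collect-logs gathers, not confirmed paths, and a package whose version differs will simply show up in both "only present in" lists.

# rpm_compare.yaml - illustrative sketch, not an existing tool
---
- hosts: localhost
  vars:
    # assumed locations of the `rpm -qa` dumps collected from each job
    passed_rpms_file: passed-job/installed-rpms.txt
    failed_rpms_file: failed-job/installed-rpms.txt
  tasks:
    - name: Load the rpm lists from both jobs
      set_fact:
        passed_rpms: "{{ lookup('file', passed_rpms_file).splitlines() }}"
        failed_rpms: "{{ lookup('file', failed_rpms_file).splitlines() }}"

    - name: Show rpms (name-version-release) only present in the failed job
      debug:
        msg: "{{ failed_rpms | difference(passed_rpms) }}"

    - name: Show rpms (name-version-release) only present in the passed job
      debug:
        msg: "{{ passed_rpms | difference(failed_rpms) }}"

This could later be extended to map version differences to the reviews merged between the two versions, as noted above.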

Questions

Once we have the scripts available, how should they be consumed?

  • As a service
  • Running manually on demand
  • Consumed as a part of zuul job

Proposer

  • Ronelle Landy

Consumer

  • TripleO CI team

Available Tools

Notes from meeting [11/09/2019]
