Elastic Recheck - OpenStack Health for TripleO

tags: `Design`, `Elastic`

Background

Elastic Recheck Documentation: https://docs.openstack.org/infra/elastic-recheck/

Elastic Recheck Dashboard: http://status.openstack.org/elastic-recheck/

Elastic Recheck is available upstream, but it has been very difficult to get queries merged into the upstream database. Sorin Sbarnea is focused on revitalizing the project and getting core rights.

Goal: Make Elastic Recheck provide meaningful feedback for TripleO across all of our use cases. To provide recheck services across all of the use cases below, the first step is to containerize the service to make it more portable and maintainable.

Architecture

ER is made out of several small services:

dashboard: Apache httpd serving static pages and a JSON file with the graph data.
watch-bot: polls Gerrit stream events to detect Zuul comments about finished builds, and notifies IRC channels about the matches found (not a priority for us, yet)
crons
- elastic-recheck-all @30min - calls elastic-recheck-graph and produces json file
- elastic-recheck-gate @30min - same
- elastic-recheck-uncat @30min - calls elastic-recheck-uncategorized command, produces some html

Elastic recheck is organized as a python package that installs several CLI tools:

elastic-recheck – that is the irc bot!
elastic-recheck-graph - builds the json file used to display the graphs
elastic-recheck-uncategorized - similar to graph
elastic-recheck-query - cli to perform a query manually
elastic-recheck-cleanup
elastic-recheck-success

At this moment the deployment is still done using the puppet-elastic-recheck repository, but we are working to replace it with ansible/container/compose. My goal is to make it easy to start a private instance locally with just "make run".



graph LR
  gr -- load-config --> ./queries/*.yaml
  gr -- http --> logstash.openstack.org
  gr -- write --> output.json

  gr(elastic-recheck-graph)
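
The flow in the diagram can be sketched in Python as follows. This is a minimal illustration, not the actual elastic-recheck-graph implementation: `load_queries`, `build_graph_data`, `write_output`, and the `parse` / `search` callables are hypothetical names, and the sketch assumes each file under `queries/` is named after its bug number and holds a `query` key. In practice `parse` could be something like `yaml.safe_load`, and `search` a call against the logstash endpoint.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def load_queries(directory: str, parse: Callable[[str], dict]) -> Dict[str, str]:
    """Map bug id (file stem) to query string for every *.yaml file.

    `parse` turns the file text into a dict (assumption: it has a "query" key).
    """
    queries = {}
    for path in sorted(Path(directory).glob("*.yaml")):
        data = parse(path.read_text())
        queries[path.stem] = data["query"].strip()
    return queries


def build_graph_data(
    queries: Dict[str, str], search: Callable[[str], List]
) -> List[dict]:
    """Run each query through `search` and record the hit count per bug."""
    return [
        {"bug": bug, "query": query, "hits": len(search(query))}
        for bug, query in queries.items()
    ]


def write_output(data: List[dict], output_path: str) -> None:
    """Write the collected data as JSON (the output.json step in the diagram)."""
    Path(output_path).write_text(json.dumps(data, indent=2))
```

Keeping `parse` and `search` injectable means the same flow can be exercised locally with fake data before pointing it at a real log backend.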

Containerization

https://review.opendev.org/#/c/729623/
Host the service on a public Red Hat OpenShift platform

Use Case #1, Upstream

Create a repo of queries specific to TripleO, allow recheck to comment on upstream gerrit reviews w/ known errors
Ensure dashboard works

Use Case #2, RDO Software Factory

RDO Software Factory stores job logs in kibana

Using the same query repo as in use case #1, ensure elastic recheck can read the kibana data and post to RDO Software Factory gerrit changes for TripleO

Use Case #3, Internal Software Factory

Using a synced query database from #1,#2
Work w/ internal infra to ensure log data is available
Provide feedback on internal gerrit jobs and provide dashboards

Use Case #4, Internal Jenkins

Using the query repo from #3, provide feedback on gerrithub patches

Use Case #5, Queries from multiple sources

At the moment, each query maps one-to-one to a bug in Launchpad. Expand that to Jira and Bugzilla.

Jira queries
Bugzilla queries
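
One way to extend the one-to-one bug mapping to several trackers is a small reference type that carries the tracker name alongside the bug id. The sketch below is only an illustration: `BugRef`, `parse_bug_ref`, the `tracker:id` notation, and the Bugzilla and Jira base URLs are assumptions, not something elastic-recheck defines today.

```python
from dataclasses import dataclass

# Base URLs are assumptions for illustration; adjust to the trackers in use.
TRACKER_URLS = {
    "launchpad": "https://bugs.launchpad.net/bugs/{id}",
    "bugzilla": "https://bugzilla.redhat.com/show_bug.cgi?id={id}",
    "jira": "https://issues.redhat.com/browse/{id}",
}


@dataclass(frozen=True)
class BugRef:
    """A bug reference that knows which tracker it belongs to."""

    tracker: str
    bug_id: str

    @property
    def url(self) -> str:
        return TRACKER_URLS[self.tracker].format(id=self.bug_id)


def parse_bug_ref(ref: str) -> BugRef:
    """Parse 'tracker:id'; a bare id is treated as a Launchpad bug."""
    tracker, sep, bug_id = ref.partition(":")
    if not sep:
        # No prefix: keep today's behaviour, where every query is a Launchpad bug.
        tracker, bug_id = "launchpad", ref
    if tracker not in TRACKER_URLS:
        raise ValueError(f"unknown tracker: {tracker}")
    return BugRef(tracker, bug_id)
```

Treating a bare id as Launchpad keeps existing query files valid while new ones can opt in to `jira:` or `bugzilla:` prefixes.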

Summary

Elastic recheck is a valuable service that can automatically inform developers of known issues that have caused their OpenStack builds and jobs to fail, via a comment on their gerrit review or via a dashboard.

This is useful at any level and in any of the above use cases.

If all five use cases can be covered we have a single location to express known issues / bugs and have developers informed automatically regarding why deployments / jobs are failing.

Demo #1

We are picking up elastic-recheck as a project due to the lack of community at the moment; we see value in this tool not dying.
Emphasize the delineation between the upstream running instance and the query repo: hands off.
Goals:
- a tripleo specific instance
- a tripleo specific query repo / config
- sova / elastic recheck work together from the same query repo.
  - sova is a good tool that is local to the instance
    - http://dashboard-ci.tripleo.org/d/wb8HBhrWk/cockpit?orgId=1&fullscreen&panelId=61
  - elastic-recheck is a very nice service
    - reports to gerrit
    - dashboard

Upstream Usecase steps:

[] Note to team: In no way are we using the production deployment at http://status.openstack.org/openstack-health/#/
- [] we will have our own deployment oh, using the same code as upstream production, w/ our own query db.
[] openstack ( tripleo owned ) repo for queries
[] run openstack-health container in rdo infra
[] consolidate sova / openstack health query format
- transform queries to json and health compatible format
[] new query repo needs to have a single format ( I think )
[] test test test
- [] define the criteria to turn on elastic recheck for tripleo
- [] inspect

Upstream infra proposing to drop elastic-search and health

https://docs.google.com/document/d/1s5v43HNwRy8X9CFeLhAXaOpakiK_XpJTaD_WB18KwTs/edit

Sorin and Daniel sync on common reply to opendev, open gdoc for a draft, have Wes&Alan review before sending
Alan to explore Vexxhost private cloud pricing based on current opendev infra allocation
Sorin to explore Logging-aaS pricing for Datadog, and Daniel for logz.io

Query params required.

To help ensure our results and hits are meaningful

build_name
- e.g. build_name:"tripleo-*"
- e.g. AND NOT build_name:"openstack-*"
build_status ( SUCCESS / FAILURE )
- e.g. build_status:FAILURE
  - help reduce noise.
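
The required parameters listed above could be enforced with a small check before a query is accepted into the repo. This is a hedged sketch: `missing_fields` and `scope_query` are hypothetical helpers, the default values (`tripleo-*`, `FAILURE`) are assumptions taken from the examples in this section, and the regex check is deliberately naive (it does not skip field names that appear inside a quoted message).

```python
import re
from typing import List

# Fields every TripleO query is expected to constrain.
REQUIRED_FIELDS = ("build_name", "build_status")


def missing_fields(query: str) -> List[str]:
    """Return the required fields that the query string never mentions."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not re.search(rf"\b{field}:", query)
    ]


def scope_query(
    query: str, build_name: str = "tripleo-*", build_status: str = "FAILURE"
) -> str:
    """Append default filters for any required field the query lacks."""
    defaults = {"build_name": build_name, "build_status": build_status}
    missing = missing_fields(query)
    if not missing:
        return query
    # Parenthesize the original so an OR inside it keeps its meaning.
    parts = [f"({query})"] + [f"{field}:{defaults[field]}" for field in missing]
    return " AND ".join(parts)
```

For example, `scope_query('message:"ReadTimeoutError"')` adds both filters, while a query that already sets `build_name` and `build_status` is returned unchanged.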

message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND build_name:tripleo-*


message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND NOT build_name:neutron*

"must_not": [
  {
    "fquery": {
      "query": {
        "query_string": {
          "query": "build_name:neutron*"
        }
      },
      "_cache": true
    }
  }
]

How to use TripleO-Health to track tempest

It's important to know how often certain tempest
tests are failing. There are a couple of ways to
do this, and I need to sync w/ Arx on this topic.

Add tempest run log to logstash

https://review.opendev.org/c/openstack/ansible-role-collect-logs/+/794664
the openstack-health way..
- http://status.openstack.org/openstack-health/#/
We could also build queries w/
- build_status:FAILURE
- and match the text of the tempest test name.
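
The query-based approach could look roughly like the sketch below. It assumes the tempest run log is indexed so that test names show up in the `message` field, and that failed builds carry `build_status:FAILURE`; `tempest_failure_query` is a hypothetical helper and the test name in the usage note is only an example.

```python
def tempest_failure_query(test_name: str, build_name: str = "tripleo-*") -> str:
    """Build a query string matching a tempest test name in failed builds."""
    # Escape backslashes and double quotes so the name stays one quoted phrase.
    escaped = test_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f'message:"{escaped}" '
        f"AND build_status:FAILURE "
        f"AND build_name:{build_name}"
    )
```

Calling `tempest_failure_query("tempest.api.compute.servers.test_create_server")` yields a query that can be saved in the query repo like any other, so hit counts per test appear on the same dashboard.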
