Elastic Recheck - OpenStack Health for TripleO

tags: Design, Elastic

Background

Elastic Recheck Documentation: https://docs.openstack.org/infra/elastic-recheck/

Elastic Recheck Dashboard: http://status.openstack.org/elastic-recheck/

Elastic Recheck is available upstream, but it has been very difficult to get queries merged into the upstream database. Sorin Sbarnea is focused on revitalizing the project and getting core rights.

Goal: Make elastic recheck provide meaningful feedback for tripleo across all of our use cases. To provide recheck services across all of the use cases below, the first step is to containerize the service to make it more portable and maintainable.

Architecture

ER is made out of several small services:

  • dashboard: apache httpd serving static pages and a json file with the graphs.

  • watch-bot: notifies irc-channels about various matches found (not a priority for us, yet)

  • crons

    • elastic-recheck-all @30min - calls elastic-recheck-graph and produces json file
    • elastic-recheck-gate @30min - same
    • elastic-recheck-uncat @30min - calls elastic-recheck-uncategorized command, produces some html
  • watchbot: Polls gerrit using the event stream for changes in order to detect zuul comments about finished builds.
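
A minimal sketch of the watchbot idea, not the actual elastic-recheck code: take one JSON line from the gerrit event stream, keep only comments left by the CI user, and pull out the jobs that did not report SUCCESS. The comment line format matched by JOB_LINE, the "zuul" username and the helper name unsuccessful_jobs are assumptions for illustration.

```python
import json
import re

# Assumed shape of a zuul result line inside a gerrit comment, e.g.
# "- some-job https://logs.example.org/1/ : FAILURE in 5m 01s"
JOB_LINE = re.compile(
    r"^- (?P<job>\S+) (?P<url>\S+) : (?P<result>[A-Z_]+)", re.MULTILINE
)


def unsuccessful_jobs(raw_event, ci_username="zuul"):
    """Return (job, log_url) pairs for jobs that did not report SUCCESS.

    raw_event is a single JSON document as emitted by the gerrit event
    stream. Events that are not comments from the CI user yield [].
    """
    event = json.loads(raw_event)
    if event.get("type") != "comment-added":
        return []
    if event.get("author", {}).get("username") != ci_username:
        return []
    comment = event.get("comment", "")
    return [
        (match.group("job"), match.group("url"))
        for match in JOB_LINE.finditer(comment)
        if match.group("result") != "SUCCESS"
    ]
```

The returned log URLs are what a bot would then hand to the query matching step.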

Elastic recheck is organized as a python package that installs several CLI tools:

  • elastic-recheck - the irc bot
  • elastic-recheck-graph - builds the json file used to display the graphs
  • elastic-recheck-uncategorized - similar to graph
  • elastic-recheck-query - cli to perform a query manually
  • elastic-recheck-cleanup
  • elastic-recheck-success

At this moment the deployment is still done using the puppet-elastic-recheck repository, but we are working to replace it with ansible/container/compose. My goal is to make it easy to start a private instance locally with just "make run".



graph LR
  gr -- load-config --> ./queries/*.yaml
  gr -- http --> logstash.openstack.org
  gr -- write --> output.json

  gr(elastic-recheck-graph)
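
The flow in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the real elastic-recheck-graph implementation: the queries dict stands in for the parsed ./queries/*.yaml files, the search callable stands in for the http call to logstash, and the output layout is hypothetical.

```python
import json
from collections import Counter
from datetime import datetime


def build_graph_data(queries, search):
    """Count hits per hour for each bug's query.

    queries: dict mapping a bug id to its query string (hypothetical
             stand-in for the loaded query files).
    search:  callable taking a query string and returning an iterable of
             ISO-8601 timestamps, one per matching log line.
    """
    report = []
    for bug_id, query in sorted(queries.items()):
        # Bucket every hit into the hour it occurred in.
        per_hour = Counter(
            datetime.fromisoformat(stamp).strftime("%Y-%m-%dT%H:00")
            for stamp in search(query)
        )
        report.append({
            "bug": bug_id,
            "query": query,
            "hits": sum(per_hour.values()),
            "data": sorted(per_hour.items()),
        })
    return report


def write_graph_file(report, path):
    """Serialize the report to the json file the dashboard would read."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```

Keeping the search step as an injected callable means the same code could be pointed at upstream logstash or at a private instance without changes.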


Containerization

Use Case #1, Upstream

  • Create a repo of queries specific to TripleO, allow recheck to comment on upstream gerrit reviews w/ known errors
  • Ensure dashboard works

Use Case #2, RDO Software Factory

RDO Software Factory stores job logs in kibana

  • Using the same query repo as in use case #1, ensure elastic recheck can read the kibana data and post to RDO Software Factory gerrit changes for TripleO

Use Case #3, Internal Software Factory

  • Using a synced query database from #1,#2
  • Work w/ internal infra to ensure log data is available
  • Provide feedback on internal gerrit jobs and provide dashboards

Use Case #4, Internal Jenkins

  • Using the query repo from #3, provide feedback on gerrithub patches

Use Case #5, Queries from multiple sources

At the moment each query maps 1-1 to a bug in launchpad. Expand that mapping to jira and bugzilla:

  • Jira queries
  • Bugzilla queries
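
One possible way to extend the 1-1 mapping, as a sketch only: let each query carry a "tracker:id" reference and treat a bare id as launchpad, so existing queries keep working. The BugRef class, the reference syntax and the jira and bugzilla base URLs are assumptions for illustration, not an agreed format.

```python
from dataclasses import dataclass

# Base URLs are assumptions; point them at the trackers actually in use.
TRACKER_URLS = {
    "launchpad": "https://bugs.launchpad.net/bugs/{ident}",
    "bugzilla": "https://bugzilla.redhat.com/show_bug.cgi?id={ident}",
    "jira": "https://issues.redhat.com/browse/{ident}",
}


@dataclass(frozen=True)
class BugRef:
    tracker: str
    ident: str

    @classmethod
    def parse(cls, text):
        """Parse 'tracker:id'; a bare id keeps today's launchpad meaning."""
        tracker, sep, ident = text.partition(":")
        if not sep:
            return cls("launchpad", text)
        if tracker not in TRACKER_URLS:
            raise ValueError(f"unknown tracker: {tracker}")
        return cls(tracker, ident)

    @property
    def url(self):
        """Link that a gerrit comment or the dashboard could point at."""
        return TRACKER_URLS[self.tracker].format(ident=self.ident)
```

With this shape, supporting another tracker is one more entry in TRACKER_URLS rather than a change to the query files.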

Summary

Elastic recheck is a valuable service that can automatically inform developers of known issues that caused their OpenStack builds and jobs to fail, either via a comment on their gerrit review or via a dashboard.

This is useful at any level and in any of the above use cases.

If all five use cases can be covered we have a single location to express known issues / bugs and have developers informed automatically regarding why deployments / jobs are failing.

Demo #1

  • picking up elastic-recheck as a project due to lack of community atm; we see value in this tool not dying.

  • emphasize the delineation between the upstream running instance and the query repo.. hands off..

  • Goals:

Upstream Usecase steps:

  • [] Note to team: In no way are we using the production deployment at http://status.openstack.org/openstack-health/#/
    • [] we will have our own deployment of openstack-health, using the same code as upstream production, w/ our own query db.
  • [] openstack ( tripleo owned ) repo for queries
  • [] run openstack-health container in rdo infra
  • [] consolidate sova / openstack health query format
    • transform queries to json and health compatible format
  • [] new query repo needs to have a single format ( I think )
  • [] test test test
    • [] define the criteria to turn on elastic recheck for tripleo
    • [] inspect

Upstream infra proposing to drop elastic-search and health

https://docs.google.com/document/d/1s5v43HNwRy8X9CFeLhAXaOpakiK_XpJTaD_WB18KwTs/edit

  1. Sorin and Daniel sync on common reply to opendev, open gdoc for a draft, have Wes&Alan review before sending
  2. Alan to explore Vexxhost private cloud pricing based on current opendev infra allocation
  3. Sorin to explore Logging-aaS pricing for Datadog, and Daniel for logz.io

Query params required.

To help ensure our results and hits are meaningful

  • build_name

    • e.g. build_name:"tripleo-*"
    • e.g. AND NOT build_name:"openstack-*"
  • build_status ( SUCCESS / FAILURE )

    • e.g. build_status:FAILURE
      • help reduce noise.
message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND build_name:tripleo-*


message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND NOT build_name:neutron*
"must_not": [
                    {
                      "fquery": {
                        "query": {
                          "query_string": {
                            "query": "build_name:neutron*"
                          }
                        },
                        "_cache": true
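
To keep these params consistent across queries, the filters could be appended by a small helper instead of by hand. A minimal sketch, assuming the field names build_name and build_status from the list above; scope_query is a hypothetical name, not part of elastic-recheck.

```python
def scope_query(base, include_build=None, exclude_build=None, status=None):
    """Append build_status / build_name filters to a query string.

    The base query is wrapped in parentheses so an OR inside it cannot
    swallow the appended AND clauses.
    """
    parts = [f"({base})"]
    if status:
        parts.append(f"build_status:{status}")
    if include_build:
        parts.append(f"build_name:{include_build}")
    if exclude_build:
        parts.append(f"NOT build_name:{exclude_build}")
    return " AND ".join(parts)


tripleo_only = scope_query(
    'message:"ReadTimeoutError: HTTPConnectionPool" AND tags:"console"',
    include_build="tripleo-*",
    status="FAILURE",
)
```

Here tripleo_only holds the base query followed by the build_status and build_name clauses, in that order.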

How to use TripleO-Health to track tempest

It's important to know how often certain tempest
tests are failing. There are a couple of ways to
do this, and I need to sync w/ Arx on this topic.

Add tempest run log to logstash

Select a repo