
# Elastic Recheck - OpenStack Health for TripleO

###### tags: `Design`, `Elastic`

## Background

Elastic Recheck Documentation: https://docs.openstack.org/infra/elastic-recheck/

Elastic Recheck Dashboard: http://status.openstack.org/elastic-recheck/

Elastic Recheck is available [upstream](http://status.openstack.org/elastic-recheck), but it has been very difficult to get [queries merged](https://review.opendev.org/#/q/project:opendev/elastic-recheck) into the upstream database. Sorin Sbarnea is focused on revitalizing the project and getting core rights.

Goal: Make Elastic Recheck provide meaningful feedback for TripleO across all of our use cases.

To provide recheck services across all of the following use cases, the first step will be to containerize the service to make it more portable and maintainable.

## Architecture

ER is made up of several small services:

* **dashboard**: Apache httpd serving static pages and a JSON file with the graphs.
* **watch-bot**: notifies IRC channels about various matches found (not a priority for us, yet)
* **crons**
    * `elastic-recheck-all` @30min - calls elastic-recheck-graph and produces a JSON file
    * `elastic-recheck-gate` @30min - same
    * `elastic-recheck-uncat` @30min - calls the elastic-recheck-uncategorized command, produces some HTML
* **watchbot**: Polls Gerrit using stream events for changes in order to detect Zuul comments about finished builds.

Elastic Recheck is organized as a Python package that installs several CLI tools:

* elastic-recheck - that is the IRC bot!
* elastic-recheck-graph - builds the JSON file used to display the graphs
* elastic-recheck-uncategorized - similar to graph
* elastic-recheck-query - CLI to perform a query manually
* elastic-recheck-cleanup
* elastic-recheck-success

At this moment the deployment part is still done using the [puppet-elastic_recheck](https://opendev.org/opendev/puppet-elastic_recheck) repository, but we are working to replace it with ansible/container/compose.
My goal is to make starting a private instance locally as easy as running "make run".

```mermaid
graph LR
gr -- load-config --> ./queries/*.yaml
gr -- http --> logstash.openstack.org
gr -- write --> output.json
gr(elastic-recheck-graph)
```

## Containerization

- [ ] https://review.opendev.org/#/c/729623/
- [ ] Host the service in a public Red Hat OpenShift Platform

## Use Case #1: Upstream

- [ ] Create a repo of queries specific to TripleO, allow recheck to comment on upstream Gerrit reviews w/ known errors
- [ ] Ensure dashboard works

## Use Case #2: RDO Software Factory

RDO Software Factory stores job logs in Kibana.

- [ ] Using the same query repo as in UC #1, ensure Elastic Recheck can read Kibana data and post to RDO Software Factory Gerrit changes for TripleO

## Use Case #3: Internal Software Factory

- [ ] Using a synced query database from #1, #2
- [ ] Work w/ internal infra to ensure log data is available
- [ ] Provide feedback on internal Gerrit jobs and provide dashboards

## Use Case #4: Internal Jenkins

- [ ] Using the query repo from #3, provide feedback on GerritHub patches

## Use Case #5: Queries from multiple sources

ATM queries have a 1-1 connection with bugs in Launchpad. Expand that to Jira and Bugzilla.

- [ ] Jira queries
- [ ] Bugzilla queries

## Summary

Elastic Recheck is a valuable service that can automatically inform developers, via a comment on their Gerrit review or via a dashboard, of known issues that have caused their OpenStack builds and jobs to fail. This is useful at any level and in any of the above use cases. If all five use cases can be covered, we have a single location to express known issues / bugs and have developers informed automatically about why deployments / jobs are failing.

# Demo #1

* picking up elastic-recheck as a project due to lack of community atm... we see value in this tool not dying.
* emphasize the delineation between the upstream running instance and query repo.. hands off..
* Goals:
    * a tripleo specific instance
    * a tripleo specific query repo / config
    * sova / elastic recheck work together from the same query repo.
        * sova is a good tool that is local to the instance
            * http://dashboard-ci.tripleo.org/d/wb8HBhrWk/cockpit?orgId=1&fullscreen&panelId=61
        * elastic-recheck is a very nice service
            * reports to gerrit
            * dashboard

## Upstream Usecase steps:

- [ ] Note to team: In no way are we using the production deployment at http://status.openstack.org/openstack-health/#/
- [ ] We will have our own deployment, using the same code as upstream production, w/ our own query db.
- [ ] openstack ( tripleo owned ) repo for queries
- [ ] run openstack-health container in rdo infra
- [ ] consolidate sova / openstack health query format - transform queries to json and health compatible format
- [ ] new query repo needs to have a single format ( I think )
- [ ] test test test
- [ ] define the criteria to turn on elastic recheck for tripleo
- [ ] inspect

## Upstream infra proposing to drop elastic-search and health

https://docs.google.com/document/d/1s5v43HNwRy8X9CFeLhAXaOpakiK_XpJTaD_WB18KwTs/edit

1) Sorin and Daniel sync on a common reply to opendev, open a gdoc for a draft, have Wes & Alan review before sending
2) Alan to explore Vexxhost private cloud pricing based on current opendev infra allocation
3) Sorin to explore Logging-aaS pricing for Datadog, and Daniel for logz.io

## Query params required.

To help ensure our results and hits are meaningful, queries should include:

* build_name
    * e.g. build_name:"tripleo-*"
    * e.g. AND NOT build_name:"openstack-*"
* build_status ( SUCCESS / FAILURE )
    * e.g. build_status:FAILURE
    * helps reduce noise.
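The filters listed above can also be appended programmatically, which keeps every query in a repo scoped the same way. The following is a minimal sketch only: `scope_query` and its parameters are hypothetical names introduced for illustration and are not part of elastic-recheck.

```python
def scope_query(base, include_build=None, exclude_build=None, build_status=None):
    """Append build_status / build_name filters to a Lucene-style query string.

    Hypothetical helper for illustration; not an elastic-recheck API.
    """
    clauses = [base]
    if build_status:
        clauses.append(f"build_status:{build_status}")
    if include_build:
        clauses.append(f"build_name:{include_build}")
    if exclude_build:
        clauses.append(f"NOT build_name:{exclude_build}")
    # Every clause must match, so join them with AND.
    return " AND ".join(clauses)


tripleo_only = scope_query(
    'message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console")',
    include_build="tripleo-*",
    build_status="FAILURE",
)
print(tripleo_only)
```

Keeping the scoping in one helper means a change to the job-name pattern only has to be made in one place, rather than in every stored query.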
```
message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND build_name:tripleo-*
message:"ReadTimeoutError: HTTPConnectionPool" AND (tags:"console") AND voting:1 AND build_status:SUCCESS AND NOT build_name:neutron*
```

```
"must_not": [
  {
    "fquery": {
      "query": {
        "query_string": {
          "query": "build_name:neutron*"
        }
      },
      "_cache": true
```

## How to use TripleO-Health to track tempest

It's important to know how often certain tempest tests are failing. There are a couple of ways to do this, and I need to sync w/ Arx on this topic.

Add tempest run log to logstash

* https://review.opendev.org/c/openstack/ansible-role-collect-logs/+/794664
* the openstack-health way..
    * http://status.openstack.org/openstack-health/#/
* We could also build queries w/
    * build_status:failed
    * and match the text of the tempest test name.
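As a rough sketch of the last option, a query for one tempest test could be generated from its name. This assumes the test name appears in the indexed `message` field once the tempest run log is in logstash, and that the status value follows the SUCCESS / FAILURE values listed under the query params. `tempest_failure_query` and the example test identifier are hypothetical.

```python
def tempest_failure_query(test_name, build_name="tripleo-*", build_status="FAILURE"):
    """Build a Lucene-style query matching failed builds that mention a test.

    Hypothetical helper; the `message` field is an assumption about the index.
    """
    # Escape backslashes and double quotes so the name stays a single phrase.
    phrase = test_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"build_status:{build_status} AND build_name:{build_name} "
        f'AND message:"{phrase}"'
    )


# Example test identifier, used only for illustration.
query = tempest_failure_query(
    "tempest.api.compute.servers.test_create_server.ServersTestJSON"
)
print(query)
```

Counting the hits for such a query over time would give a rough failure frequency per test, which is the number this section is after.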