Very Search Log Analysis
===

## Preparation

1. Download one day of the adept-search log (2019-07-09) as an example for the investigation. The original log files can be found [here](https://s3.console.aws.amazon.com/s3/buckets/adept-very/adept-search/2019/07/09/?region=us-west-2&tab=overview).
2. Build an Elasticsearch index as it is on the box:

   ```
   http://192.168.0.215:9200/adept-search-2019-07-09/_search
   ```

## Checking the Number of Events

Run the following Elasticsearch query:

```
GET adept-search-2019-07-09/_search
{
  "aggs": {
    "events": {
      "terms": {
        "field": "event.keyword",
        "size": 100
      }
    }
  },
  "size": 0
}
```

It returns the number of each event as follows:

```
"aggregations": {
  "events": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "QUERY_EXECUTION_RESPONSE_TIME", "doc_count": 366115 },
      { "key": "ADEPT_SEARCH_SEARCH", "doc_count": 362279 },
      { "key": "ADEPT_SEARCH_CACHED_MATCH", "doc_count": 358457 },
      { "key": "QUERY_UNDERSTANDING_DEBUG", "doc_count": 3827 },
      { "key": "ADEPT_SEARCH_CACHED_MISS", "doc_count": 3459 },
      { "key": "QUERY_UNDERSTANDING_RESPONSE_TIME", "doc_count": 3321 },
      { "key": "ADEPT_SEARCH_CACHED_UPDATE_SUCCEEDED", "doc_count": 3320 },
      { "key": "ADEPT_SEARCH_ES_SUGGEST_AUTOCORRECT", "doc_count": 3134 },
      { "key": "QUERY_UNDERSTANDING_EXCEPTION", "doc_count": 584 },
      { "key": "ADEPT_SEARCH_CACHED_NETWORKING", "doc_count": 359 },
      { "key": "ADEPT_SEARCH_ES_SUGGEST_AUTOCORRECT_SKIPPED", "doc_count": 128 },
      { "key": "ADEPT_SEARCH_CACHED_EXPIRED", "doc_count": 9 },
      { "key": "ADEPT_SEARCH_AUTOCOMPLETE", "doc_count": 1 }
    ]
  }
}
```

These event counts raise a few questions:

- cache lookup
  - `ADEPT_SEARCH_CACHED_MATCH (358457) + ADEPT_SEARCH_CACHED_MISS (3459) + ADEPT_SEARCH_CACHED_NETWORKING (359) + ADEPT_SEARCH_CACHED_EXPIRED (9) = 362284 > ADEPT_SEARCH_SEARCH (362279)`
- cache update
  - `ADEPT_SEARCH_CACHED_UPDATE_SUCCEEDED (3320) < ADEPT_SEARCH_CACHED_MISS (3459)`
- `QUERY_EXECUTION_RESPONSE_TIME (366115) > ADEPT_SEARCH_SEARCH (362279)`
- query understanding
  - `QUERY_UNDERSTANDING_DEBUG (3827) < QUERY_UNDERSTANDING_RESPONSE_TIME (3321) + QUERY_UNDERSTANDING_EXCEPTION (584)`
  - `QUERY_UNDERSTANDING_DEBUG (3827) == ADEPT_SEARCH_CACHED_MISS (3459) + ADEPT_SEARCH_CACHED_NETWORKING (359) + ADEPT_SEARCH_CACHED_EXPIRED (9)`: this is a good one!

## QU Exceptions

Look at the QU exceptions:

```
GET adept-search-2019-07-09/_search
{
  "aggs": {
    "errors": {
      "terms": {
        "field": "properties.error.keyword",
        "size": 100
      }
    }
  },
  "query": {
    "term": {
      "event.keyword": "QUERY_UNDERSTANDING_EXCEPTION"
    }
  },
  "size": 0
}
```

Response (partial):

```
"aggregations": {
  "errors": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "HTTPConnectionPool(host='shopreader.machine-learning', port=7081): Read timed out. (read timeout=0.3)",
        "doc_count": 503
      },
      {
        "key": "HTTPConnectionPool(host='localhost', port=7081): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1115eac50>: Failed to establish a new connection: [Errno 61] Connection refused',))",
        "doc_count": 1
      }
    ]
  }
}
```

It seems that most of the issues are connection timeouts, and we should raise `QUERY_UNDERSTANDING_TIMEOUT` for them (we now have `QUERY_UNDERSTANDING_TIMEOUT` for timeouts and `QUERY_UNDERSTANDING_EXCEPTION` for any other exceptions).

## Other Issues

- Many events are missing the `query` field.
- Events should have a `search_id` field so that all the events generated from one search can easily be aggregated for analysis.
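The event-count cross-checks above can be scripted so they run against any day's index. A minimal sketch of the checks, assuming the intended invariants are "every search does one cache lookup", "every miss produces a cache update", and "QU runs exactly when the cache is not hit" (the `check_event_counts` helper and these invariant names are mine, not part of the codebase):

```python
from collections import Counter

def check_event_counts(buckets):
    # buckets: list of {"key": ..., "doc_count": ...} from the `events`
    # terms aggregation; Counter returns 0 for any missing event type.
    counts = Counter({b["key"]: b["doc_count"] for b in buckets})
    lookups = (counts["ADEPT_SEARCH_CACHED_MATCH"]
               + counts["ADEPT_SEARCH_CACHED_MISS"]
               + counts["ADEPT_SEARCH_CACHED_NETWORKING"]
               + counts["ADEPT_SEARCH_CACHED_EXPIRED"])
    return {
        # assumed invariant: every search triggers exactly one cache lookup
        "lookups_match_searches": lookups == counts["ADEPT_SEARCH_SEARCH"],
        # assumed invariant: every cache miss leads to one cache update
        "updates_match_misses": counts["ADEPT_SEARCH_CACHED_UPDATE_SUCCEEDED"]
                                == counts["ADEPT_SEARCH_CACHED_MISS"],
        # assumed invariant: QU runs for every non-hit lookup
        "qu_matches_non_hits": counts["QUERY_UNDERSTANDING_DEBUG"]
                               == lookups - counts["ADEPT_SEARCH_CACHED_MATCH"],
    }

# The 2019-07-09 counts from the aggregation response above.
buckets = [
    {"key": "ADEPT_SEARCH_SEARCH", "doc_count": 362279},
    {"key": "ADEPT_SEARCH_CACHED_MATCH", "doc_count": 358457},
    {"key": "ADEPT_SEARCH_CACHED_MISS", "doc_count": 3459},
    {"key": "ADEPT_SEARCH_CACHED_NETWORKING", "doc_count": 359},
    {"key": "ADEPT_SEARCH_CACHED_EXPIRED", "doc_count": 9},
    {"key": "ADEPT_SEARCH_CACHED_UPDATE_SUCCEEDED", "doc_count": 3320},
    {"key": "QUERY_UNDERSTANDING_DEBUG", "doc_count": 3827},
]
print(check_event_counts(buckets))
```

On this day's data, only the QU check holds: lookups exceed searches by 5, updates trail misses by 139, and `QUERY_UNDERSTANDING_DEBUG` exactly equals the non-hit lookups.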
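Separating `QUERY_UNDERSTANDING_TIMEOUT` from `QUERY_UNDERSTANDING_EXCEPTION` amounts to catching the timeout class before the generic one at the QU call site. A hedged sketch with `requests` (the `/understand` endpoint path, the `log_event` helper, and the `EVENTS` sink are my stand-ins; the host and 0.3 s timeout come from the logged errors above):

```python
import requests

EVENTS = []  # stand-in for the real event logger

def log_event(event, **properties):
    EVENTS.append((event, properties))

def call_query_understanding(query, timeout=0.3):
    try:
        resp = requests.get(
            "http://shopreader.machine-learning:7081/understand",  # assumed path
            params={"q": query},
            timeout=timeout,  # 0.3s read timeout, as seen in the error buckets
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout as exc:
        # Timeouts dominate the error buckets (503 of 584), so they get
        # their own event type.
        log_event("QUERY_UNDERSTANDING_TIMEOUT", error=str(exc), query=query)
    except requests.exceptions.RequestException as exc:
        # Everything else (connection refused, HTTP errors, ...) stays generic.
        log_event("QUERY_UNDERSTANDING_EXCEPTION", error=str(exc), query=query)
    return None
```

Note the ordering: `Timeout` must be caught before `RequestException`, since it is a subclass of it.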
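The `search_id` suggestion can be as simple as minting one id per incoming search and stamping it onto every event that search emits; all events for a search are then recoverable with a single `term` filter on `search_id`. A sketch under that assumption (`new_search_context` and `make_event` are hypothetical names, not the actual codebase):

```python
import uuid

def new_search_context(query):
    # One id per search, created when the request arrives and threaded
    # through to every event emitter.
    return {"search_id": uuid.uuid4().hex, "query": query}

def make_event(ctx, event, **properties):
    # Every event carries search_id (and query, fixing the missing-field
    # issue) alongside its own properties.
    return {"event": event, **ctx, "properties": properties}

ctx = new_search_context("running shoes")
e1 = make_event(ctx, "ADEPT_SEARCH_SEARCH")
e2 = make_event(ctx, "ADEPT_SEARCH_CACHED_MISS")
assert e1["search_id"] == e2["search_id"]
```

This also addresses the first issue above for free: since the context carries `query`, no event can be emitted without it.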