### CCS-725
Problem: Suppose we have data like "tp-link" and "t-shirts". The query returns no results for the concatenated forms "tshirt" and "tplink".
Cause: The index analyzer splits these terms, so they are indexed as "tp link" and "t shirt". That is why Elasticsearch returns no results for their concatenated forms ("tplink", "tshirt").
Existing Analyzer:

```json
"search_analyzer": {
  "filter": [
    "lowercase",
    "es_synonyms",
    "spanish_stopfilter",
    "spanish_keywords",
    "ascii_folding"
  ],
  "char_filter": [
    "discardChars"
  ],
  "type": "custom",
  "tokenizer": "standard"
}
```
First, the "char_filter" ("discardChars") is applied, which simply removes underscores: "first_Second" becomes "firstSecond".
Then the "standard" tokenizer splits the input stream into tokens, so a term like "tp-link" is separated into "tp" and "link".
Finally, the filters are applied to the resulting tokens in the order listed.
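The mismatch above can be sketched in a few lines of Python. This is a rough stand-in for the real analyzer (synonyms, stop words, and ASCII folding are omitted, and the underscore-stripping behavior of "discardChars" is an assumption based on the description above), but it shows why the concatenated query term never overlaps the indexed tokens:

```python
import re

def standard_tokenize(text):
    # Rough stand-in for the "standard" tokenizer: it splits on
    # punctuation and whitespace, so "tp-link" becomes ["tp", "link"]
    # while "tplink" stays a single token.
    return re.findall(r"[0-9A-Za-z]+", text)

def existing_analyze(text):
    # "discardChars" char_filter (assumed to strip underscores),
    # then the standard tokenizer, then the lowercase filter.
    text = text.replace("_", "")
    return [t.lower() for t in standard_tokenize(text)]

print(existing_analyze("tp-link"))  # ['tp', 'link']
print(existing_analyze("tplink"))   # ['tplink'] -- no shared token, no match
```

The indexed tokens for "tp-link" are `tp` and `link`, while the query "tplink" produces only `tplink`, so the two token sets never intersect.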
Proposed Analyzer:
"new_search_analyzer": {
"filter": [
"concatenate_words_with_delimiter",
"lowercase",
"es_synonyms",
"spanish_stopfilter",
"spanish_keywords",
"ascii_folding",
"remove_duplicates",
],
"type": "custom",
"tokenizer": "whitespace"
}
We changed the tokenizer to "whitespace" and added new filters ("concatenate_words_with_delimiter", "remove_duplicates") to handle these cases.
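The tokenizer change matters because "whitespace" splits only on spaces, so delimited terms reach the filter chain intact, where the word_delimiter filter can both split them and emit concatenated variants. A one-line sketch of the difference:

```python
def whitespace_tokenize(text):
    # The "whitespace" tokenizer splits only on whitespace, so
    # "tp-link" and "multi_antena" survive as whole tokens, unlike
    # with the "standard" tokenizer, which would break them apart.
    return text.split()

print(whitespace_tokenize("tp-link multi_antena"))
# ['tp-link', 'multi_antena']
```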
"concatenate_words_with_delimiter": {
"type": "word_delimiter",
"split_on_case_change": false,
"split_on_numerics": false,
"stem_english_possessive": true,
"catenate_words": true,
"preserve_original": true
}
This filter works as follows:

"tp-link multi_antena" ==> {"tp", "link", "tplink", "tp-link", "multi", "antena", "multiantena", "multi_antena"}

That is, it covers all the combinations we need: the split sub-words, the concatenated form ("catenate_words"), and the original token ("preserve_original"). It behaves the same way for all delimiters. The remaining filters are then applied as usual.
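The combined effect of the whitespace tokenizer plus the word_delimiter filter can be sketched in Python. This is a simplified model (real word_delimiter also assigns token positions, which are omitted here), but it reproduces the token set shown above:

```python
import re

def word_delimiter(token):
    # Sketch of word_delimiter with catenate_words=true and
    # preserve_original=true: emit the sub-words, their
    # concatenation, and the original token.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if len(parts) <= 1:
        return [token]              # nothing to split
    out = list(parts)               # "tp", "link"
    out.append("".join(parts))      # "tplink"  (catenate_words)
    out.append(token)               # "tp-link" (preserve_original)
    return out

def new_analyze(text):
    tokens = []
    for tok in text.split():        # whitespace tokenizer
        tokens.extend(word_delimiter(tok))
    return tokens

print(new_analyze("tp-link multi_antena"))
# ['tp', 'link', 'tplink', 'tp-link', 'multi', 'antena', 'multiantena', 'multi_antena']
```

Because both "tplink" and "tp-link" are now indexed, queries for either form find the document.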