Hi,

I have a performance problem and a scoring problem with phrase queries:

1. Performance: phrase queries involving frequent terms are very slow because large position posting lists have to be read.
2. Scoring: I want to control the boost of phrase matches and of entity (gazetteer) matches.

Indexing all terms as both unigrams and bigrams is out of the question in my use case, so I plan to index only the useful bigrams. Part of this will be achieved by the CommonGrams filter, to which I will give the frequent words. I am thinking of going one step further, though, and also indexing every phrase query extracted from my query log and every entity from my gazetteers. To the latter (the phrases and entities, which are n-grams) I will also add a payload to control the boost.

An example MappingCharFilter.txt would be:

    #phrase-query
    term1 term2 term3 => term1_term2_term3|1
    #entity
    firstName lastName => firstName_lastName|2

One issue is that I have 100k-1M such phrases and entities (depending on frequency). I saw that MappingCharFilter is implemented with an FST, but I am still concerned that iterating over the char buffer for long documents might cause problems.

Has anyone faced a similar issue? Is this mapping approach reasonable performance-wise at query time?

Thanks in advance,
Manuel
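
P.S. To make the intended rewrite concrete, here is a rough Python sketch of what I expect the mapping to do. It is only a token-level toy, not MappingCharFilter itself (which works on characters): a plain dict-based trie stands in for the FST, and the names build_trie and rewrite are made up for illustration. At each position it takes the longest known phrase and emits it as one underscore-joined token with the boost after a "|".

```python
END = object()  # sentinel key marking the end of a phrase; cannot clash with a token


def build_trie(phrases):
    """Build a token-level trie from {phrase: boost}; a toy stand-in for the FST."""
    root = {}
    for phrase, boost in phrases.items():
        node = root
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node[END] = boost
    return root


def rewrite(text, trie):
    """Replace the longest known phrase at each position with 'a_b_c|boost'."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        best_end, best_boost = None, None
        j = i
        # Walk the trie as far as the tokens allow, remembering the last full match.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best_end, best_boost = j, node[END]
        if best_end is None:
            out.append(tokens[i])  # no phrase starts here; keep the token as-is
            i += 1
        else:
            out.append("_".join(tokens[i:best_end]) + "|" + str(best_boost))
            i = best_end
    return " ".join(out)


if __name__ == "__main__":
    # Hypothetical phrases and boosts, just for the demo.
    trie = build_trie({"new york": 2, "new york times": 1, "john smith": 2})
    print(rewrite("i read the new york times with john smith", trie))
```

In this toy, the work per starting position is bounded by the length of the longest phrase, so the total cost grows with document length rather than with the number of phrases. Whether the real char filter behaves similarly on long documents is exactly what I am unsure about.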