Hi, I'm trying to build an index for technical documents that basically works like "grep", i.e. the user gives an arbitrary substring of a line of a document and the exact matches are returned. I specifically want no stemming etc., and I want to keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive.
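To pin down the semantics I mean, here is a rough sketch in plain Python (nothing to do with Solr; `grep_like` is just a name made up for this illustration, and the two sample documents are the same ones used further down):

```python
def grep_like(query, documents):
    """Return the ids of documents in which some line contains `query`.

    Matching is case-insensitive; whitespace and punctuation are kept as-is.
    """
    needle = query.lower()
    return [
        doc_id
        for doc_id, content in documents.items()
        if any(needle in line.lower() for line in content.splitlines())
    ]


docs = {
    "test1": " encryption 10.0.100.22 description ",
    "test2": " 10.100.0.22 description ",
}

print(grep_like("encryption", docs))    # ['test1']
print(grep_like(".100.22", docs))       # ['test1']
print(grep_like("DESCRIPTION ", docs))  # ['test1', 'test2']
```

The index should give the same answers as this brute-force scan, just without reading every document on every query.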
I tried to achieve this by tokenizing on line breaks and then building trigrams of the individual lines:

    <fieldType name="configtext_trigram" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Then in the search, I use the edismax parser with mm=100%, so given the documents

    {"id":"test1","content":" encryption 10.0.100.22 description "}
    {"id":"test2","content":" 10.100.0.22 description "}

and the query content:encryption, this turns into

    "parsedquery_toString": "+((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8)",

and returns only the first document. All fine and dandy.

But I have a problem with possible false positives. If the search is e.g. content:.100.22, then the generated query is

    "parsedquery_toString": "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",

and because all of these tokens are also generated for document test2 (the ~5 only requires that all five tokens match somewhere in the field, not that they are adjacent), both documents are wrongly returned.

So somehow I'd need to express the query "content:.10 content:100 content:00. content:0.2 content:.22" with *the tokens exactly in this order and nothing in between*. Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible? Are there other good ideas for how to achieve this kind of behaviour?

Thanks
Christian
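P.S. In case it helps to reproduce the problem outside Solr, here is a small Python sketch of the two matching behaviours. It is only a model of my setup, not Solr code: the helper names are made up, `matches_as_bag` merely approximates what the mm=100% query does (every query trigram has to occur somewhere in the line), and `matches_as_phrase` is the behaviour I would like to get (the same trigrams at consecutive positions).

```python
def trigrams(text):
    """Lower-cased character trigrams of `text`, in order of occurrence."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def matches_as_bag(query, line):
    """Approximates mm=100%: each query trigram occurs anywhere in the line."""
    return set(trigrams(query)) <= set(trigrams(line))


def matches_as_phrase(query, line):
    """Desired behaviour: query trigrams occur in order with nothing in between."""
    q, d = trigrams(query), trigrams(line)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = {
    "test1": " encryption 10.0.100.22 description ",
    "test2": " 10.100.0.22 description ",
}

for doc_id, content in docs.items():
    print(doc_id,
          matches_as_bag(".100.22", content),
          matches_as_phrase(".100.22", content))
# test1 True True
# test2 True False
```

For test2 the trigrams ".10", "100" and "00." come from "10.100." while "0.2" and ".22" come from ".0.22", so the bag check passes although the substring ".100.22" never occurs in that line.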