Hmmm, this may sound like a nonsensical question, but what do you mean by "arbitrary substring"?
Because if your substrings consist of whole _tokens_, then ngramming is totally unnecessary (and gets in the way); phrase queries with no slop fulfill this requirement. But let's assume you need to match _within_ tokens, i.e. if the doc contains "my dog has fleas", you need to match input like "as fle". In this case ngramming is an option.

You have substantially different index-time and query-time chains. The result, in the quick experiment I tried, is that the positions for all the grams at index time are the same -- all were 1. But at query time, each gram had an incremented position. I'd start by using the query-time analysis chain for indexing also. Next, I'd try enclosing multiple words in double quotes at query time and go from there.

What you have now is an anti-pattern: having substantially different index-time and query-time analysis chains is not likely to be very predictable unless you know _exactly_ what the consequences are. The admin/analysis page is your friend; in this case, check the "verbose" checkbox to see what I mean.

Best,
Erick

On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
> Hi
>
> I'm trying to build an index for technical documents that basically
> works like "grep", i.e. the user gives an arbitrary substring somewhere
> in a line of a document and the exact matches will be returned. I
> specifically want no stemming etc. and want to keep all whitespace,
> parentheses etc. because they might be significant. The only
> normalization is that the search should be case-insensitive.
>
> I tried to achieve this by tokenizing on line breaks, and then building
> trigrams of the individual lines:
>
> <fieldType name="configtext_trigram" class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory"
>                pattern="\R" group="-1"/>
>     <filter class="solr.NGramFilterFactory"
>             minGramSize="3" maxGramSize="3"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.NGramTokenizerFactory"
>                minGramSize="3" maxGramSize="3"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Then in the search, I use the edismax parser with mm=100%, so given the
> documents
>
> {"id":"test1","content":"
> encryption
> 10.0.100.22
> description
> "}
>
> {"id":"test2","content":"
> 10.100.0.22
> description
> "}
>
> and the query content:encryption, this will turn into
>
> "parsedquery_toString":
> "+((content:enc content:ncr content:cry content:ryp
> content:ypt content:pti content:tio content:ion)~8)",
>
> and return only the first document. All fine and dandy. But I have a
> problem with possible false positives. If the search is e.g.
>
> content:.100.22
>
> then the generated query will be
>
> "parsedquery_toString":
> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>
> and because all of these tokens are also generated for document test2,
> within a proximity of 5, both documents will wrongly be returned.
>
> So somehow I'd need to express the query "content:.10 content:100
> content:00. content:0.2 content:.22" with *the tokens exactly in this
> order and nothing in between*. Is this somehow possible, maybe by using
> the termvectors/termpositions stuff? Or am I trying to do something
> that's fundamentally impossible? Any other good ideas on how to achieve
> this kind of behaviour?
>
> Thanks
> Christian
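[Editor's note: Erick's suggestion to "use the query time analysis chain for indexing also" could look like the sketch below -- an untested fragment, with names taken from the schema above. A single <analyzer> element with no type attribute applies to both index and query time:]

```xml
<fieldType name="configtext_trigram" class="solr.TextField">
  <!-- One analyzer for both index and query: the NGramTokenizerFactory
       from the original query-time chain replaces the
       PatternTokenizerFactory + NGramFilterFactory pair. -->
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
               minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this drops the per-line PatternTokenizerFactory, so at index time grams can now span line breaks; whether that matters depends on how significant line boundaries are for the data.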
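[Editor's note: the false positive described in the quoted question can be reproduced outside Solr. The sketch below is plain Python, not Solr code; it mirrors the 3-gram analysis from the schema above and contrasts "every query gram is present" (roughly what the ~5 proximity with mm=100% requires) with the grep-style exact-substring behaviour Christian actually wants.]

```python
def trigrams(s):
    """All 3-character grams of a string, lowercased --
    mirroring a minGramSize=3 / maxGramSize=3 ngram chain."""
    s = s.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

# The two example documents from the question.
docs = {
    "test1": "encryption\n10.0.100.22\ndescription\n",
    "test2": "10.100.0.22\ndescription\n",
}

query = ".100.22"
q_grams = trigrams(query)  # ['.10', '100', '00.', '0.2', '.22']

for doc_id, content in docs.items():
    doc_grams = set()
    for line in content.splitlines():
        doc_grams.update(trigrams(line))
    # Bag-of-grams match: every query gram occurs somewhere in the
    # document; order and adjacency are NOT enforced.
    bag_match = all(g in doc_grams for g in q_grams)
    # Grep-like match: the grams in exactly this order with nothing
    # in between, i.e. an exact substring match on some line.
    exact_match = any(query in line.lower() for line in content.splitlines())
    print(doc_id, bag_match, exact_match)
# -> test1 True True
# -> test2 True False   (the false positive: all grams present,
#                        but ".100.22" is not a substring)
```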