Hmm, this may sound like a nonsensical question, but what do you mean
by "arbitrary substring"?

Because if your substrings consist of whole _tokens_, then ngramming
is totally unnecessary (and gets in the way). Phrase queries with no slop
fulfill this requirement.
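For instance (a sketch, assuming a plain text field named `content` with
ordinary word tokenization), a phrase query is just the terms in double
quotes; with no slop specified it defaults to slop 0, i.e. the terms must
be adjacent and in order:

```
q=content:"my dog has fleas"
```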

But let's assume you need to match _within_ tokens, i.e. if the doc
contains "my dog has fleas", you need to match input like "as fle". In that
case ngramming is an option.

You have substantially different index-time and query-time chains. The result
is that the positions of all the grams at index time are the same (in the quick
experiment I tried, all were 1), but at query time each gram got an incremented
position.

I'd start by using the query time analysis chain for indexing also. Next, I'd
try enclosing multiple words in double quotes at query time and go from there.
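To see why the ordering matters, here's a toy model in plain Python (my
sketch, not Solr code): `bag_match` mimics what your edismax mm=100% query
does today (every query gram must occur somewhere), while `phrase_match`
mimics a zero-slop phrase over the grams (they must occur consecutively and
in order, which requires sensible positions at index time):

```python
def trigrams(s):
    """All overlapping 3-character grams of s, lowercased."""
    s = s.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def bag_match(doc, query):
    # mm=100%: every query gram must appear somewhere in the doc
    grams = trigrams(doc)
    return all(g in grams for g in trigrams(query))

def phrase_match(doc, query):
    # zero-slop phrase: query grams must appear consecutively, in order
    d, q = trigrams(doc), trigrams(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# the two example lines from the thread, queried with ".100.22"
print(bag_match("10.0.100.22", ".100.22"))     # True
print(bag_match("10.100.0.22", ".100.22"))     # True  (the false positive)
print(phrase_match("10.0.100.22", ".100.22"))  # True
print(phrase_match("10.100.0.22", ".100.22"))  # False (false positive gone)
```

The last two lines show that ordered, adjacent gram matching is exactly the
"tokens in this order and nothing in between" behavior being asked for.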
What you have now is an anti-pattern: substantially different index-time and
query-time analysis chains are unlikely to behave predictably unless you know
_exactly_ what the consequences are.
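A minimal sketch of what I mean, using your query-time chain on both sides
(note this drops the per-line PatternTokenizer, so grams could span line
breaks; whether that matters for your data is something you'd have to check):

```xml
<fieldType name="configtext_trigram" class="solr.TextField">
    <!-- one analyzer, applied at both index and query time -->
    <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory"
            minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```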

The admin/analysis page is your friend here; check the "verbose" checkbox
to see what I mean.

Best,
Erick

On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
> Hi
>
> I'm trying to build an index for technical documents that basically
> works like "grep", i.e. the user gives an arbitrary substring somewhere
> in a line of a document and the exact matches will be returned. I
> specifically want no stemming etc. and keep all whitespace, parentheses
> etc. because they might be significant. The only normalization is that
> the search should be case-insensitive.
>
> I tried to achieve this by tokenizing on line breaks, and then building
> trigrams of the individual lines:
>
> <fieldType name="configtext_trigram" class="solr.TextField" >
>
>     <analyzer type="index">
>
>         <tokenizer class="solr.PatternTokenizerFactory"
>             pattern="\R" group="-1"/>
>
>         <filter class="solr.NGramFilterFactory"
>             minGramSize="3" maxGramSize="3"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>
>     </analyzer>
>
>     <analyzer type="query">
>
>         <tokenizer class="solr.NGramTokenizerFactory"
>             minGramSize="3" maxGramSize="3"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>
>     </analyzer>
> </fieldType>
>
> Then in the search, I use the edismax parser with mm=100%, so given the
> documents
>
>
> {"id":"test1","content":"
> encryption
> 10.0.100.22
> description
> "}
>
> {"id":"test2","content":"
> 10.100.0.22
> description
> "}
>
> and the query content:encryption, this will turn into
>
> "parsedquery_toString":
>
> "+((content:enc content:ncr content:cry content:ryp
> content:ypt content:pti content:tio content:ion)~8)",
>
> and return only the first document. All fine and dandy. But I have a
> problem with possible false positives. If the search is e.g.
>
> content:.100.22
>
> then the generated query will be
>
> "parsedquery_toString":
> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>
> and because all of these tokens are also generated for document test2
> within a proximity of 5, both documents will wrongly be returned.
>
> So somehow I'd need to express the query "content:.10 content:100
> content:00. content:0.2 content:.22" with *the tokens exactly in this
> order and nothing in between*. Is this somehow possible, maybe by using
> the termvectors/termpositions stuff? Or am I trying to do something
> that's fundamentally impossible? Other good ideas how to achieve this
> kind of behaviour?
>
> Thanks
> Christian
>
>
>
