Exact substring search with ngrams

Christian Ramseyer Tue, 25 Aug 2015 15:00:57 -0700

Hi

I'm trying to build an index for technical documents that basically
works like "grep", i.e. the user gives an arbitrary substring somewhere
in a line of a document and the exact matches will be returned. I
specifically want no stemming etc. and keep all whitespace, parentheses
etc. because they might be significant. The only normalization is that
the search should be case-insensitive.


I tried to achieve this by tokenizing on line breaks, and then building
trigrams of the individual lines:

<fieldType name="configtext_trigram" class="solr.TextField" >

    <analyzer type="index">

        <tokenizer class="solr.PatternTokenizerFactory"
            pattern="\R" group="-1"/>

        <filter class="solr.NGramFilterFactory"
            minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>

    </analyzer>

    <analyzer type="query">

        <tokenizer class="solr.NGramTokenizerFactory"
            minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>

    </analyzer>
</fieldType>
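
To make the intent of the index analyzer concrete, here is a rough Python
sketch of what it should produce. This is not Solr code: the function name
index_trigrams is made up for illustration, and the line-break pattern is a
simplified stand-in for \R.

```python
import re


def index_trigrams(content, n=3):
    """Approximate the index-time chain: split on line breaks, then emit
    lowercased n-grams for each line (n=3 gives trigrams)."""
    grams = []
    # Simplified stand-in for the \R line-break pattern used by the tokenizer.
    for line in re.split(r"\r\n|\r|\n", content):
        # Lowercasing before or after building the grams gives the same result.
        line = line.lower()
        # Lines shorter than n characters produce no grams at all.
        grams.extend(line[i:i + n] for i in range(len(line) - n + 1))
    return grams


print(index_trigrams("\nEncryption\n"))
# ['enc', 'ncr', 'cry', 'ryp', 'ypt', 'pti', 'tio', 'ion']
```

Whitespace and punctuation are kept as ordinary characters, which is the
point of not using a word-based tokenizer here.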

Then in the search, I use the edismax parser with mm=100%, so given the
documents


{"id":"test1","content":"
encryption
10.0.100.22
description
"}

{"id":"test2","content":"
10.100.0.22
description
"}

and the query content:encryption, this will turn into

"parsedquery_toString":

"+((content:enc content:ncr content:cry content:ryp
content:ypt content:pti content:tio content:ion)~8)",

and return only the first document. All fine and dandy. But I have a
problem with possible false positives. If the search is e.g.

content:.100.22

then the generated query will be

"parsedquery_toString":
"+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",

and because all five of these tokens are also generated for document
test2 (the ~5 only requires that all five clauses match, regardless of
their position or order), both documents will wrongly be returned.
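
The false positive can be reproduced outside Solr with a small Python sketch.
It assumes that mm=100% behaves like "every query trigram occurs somewhere in
the document"; the helper names (trigrams, bag_match, grep_match) are
hypothetical and only serve the illustration.

```python
def trigrams(text, n=3):
    """Lowercased character n-grams of a single string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_match(query, content):
    """Assumed mm=100% behaviour: all query trigrams must occur in the
    document, but their order and position are ignored."""
    doc_grams = set()
    for line in content.splitlines():
        doc_grams.update(trigrams(line))
    return all(g in doc_grams for g in trigrams(query))


def grep_match(query, content):
    """What I actually want: a case-insensitive substring match per line."""
    return any(query.lower() in line.lower() for line in content.splitlines())


test1 = "\nencryption\n10.0.100.22\ndescription\n"
test2 = "\n10.100.0.22\ndescription\n"

print(trigrams(".100.22"))            # ['.10', '100', '00.', '0.2', '.22']
print(bag_match(".100.22", test1))    # True
print(bag_match(".100.22", test2))    # True  (the false positive)
print(grep_match(".100.22", test2))   # False (what grep would say)
```

The line 10.100.0.22 contains every one of the five query trigrams, just not
next to each other, so a pure "all terms present" check cannot tell the two
documents apart.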

So somehow I'd need to express the query "content:.10 content:100
content:00. content:0.2 content:.22" with *the tokens exactly in this
order and nothing in between*. Is this somehow possible, maybe by using
the termvectors/termpositions stuff? Or am I trying to do something
that's fundamentally impossible? Other good ideas how to achieve this
kind of behaviour?
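
To pin down the semantics I am after, here is a minimal Python sketch of
"tokens exactly in this order and nothing in between". It only describes the
desired matching behaviour under the assumption that each trigram keeps its
position within the line; it is not a Solr feature or API, and ordered_match
is a made-up name.

```python
def trigrams(text, n=3):
    """Lowercased character n-grams of a single string, in position order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ordered_match(query, content, n=3):
    """True if the query trigrams occur at consecutive positions, in order,
    within a single line of the document."""
    wanted = trigrams(query, n)
    if not wanted:
        # Queries shorter than n characters yield no trigrams at all.
        return False
    for line in content.splitlines():
        grams = trigrams(line, n)
        # Slide a window over the line's trigram sequence.
        for start in range(len(grams) - len(wanted) + 1):
            if grams[start:start + len(wanted)] == wanted:
                return True
    return False


test1 = "\nencryption\n10.0.100.22\ndescription\n"
test2 = "\n10.100.0.22\ndescription\n"

print(ordered_match(".100.22", test1))  # True
print(ordered_match(".100.22", test2))  # False
```

For queries of at least three characters this is equivalent to a plain
substring match, which is why some position-aware query over the trigram
terms looks like the right direction.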

Thanks
Christian
