n-Gram, only works with queries of 2 letters

Norberto Meijome Sun, 22 Jun 2008 23:24:34 -0700

Hi there,
My use case: I want to be able to match documents when only a partial word is
provided, i.e. searching for 'roc' or 'ock' should match documents containing
'rock'.


As I understand it, the way to solve this problem is to use the nGram tokenizer
at index time and the nGram analyzer at search time. I want 'sub-terms' of at
least 2 and up to 6 characters indexed, so I can match substrings of that length.
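
To make the intended behaviour concrete, here is a minimal sketch of character n-gram generation. It is only an illustration and not Solr's implementation; the function name `ngrams` and its parameters are hypothetical, chosen to mirror the minGramSize/maxGramSize settings.

```python
def ngrams(text, min_size=2, max_size=6):
    """Return all character n-grams of text, shortest grams first.

    Illustration only: a hypothetical helper, not Solr's tokenizer.
    """
    grams = []
    for size in range(min_size, max_size + 1):
        # No grams are produced once size exceeds len(text).
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


print(ngrams("rock"))  # ['ro', 'oc', 'ck', 'roc', 'ock', 'rock']
```

Both 'roc' and 'ock' appear among the grams of 'rock', which is why I expect them to match.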

What I have (using Solr 1.3, nightly build from 2008-06-15):

schema.xml :
[..]
    <!-- n-gram tokenization -->
    <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                   minGramSize="2" maxGramSize="6"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                   minGramSize="2" maxGramSize="6"/>
      </analyzer>
    </fieldType>
[..]
    <field name="artist" type="text" indexed="true" stored="true"
           required="true" />

    <!-- make this field not stored after testing. -->
    <field name="artist_ngram" type="ngram" indexed="true" stored="true"
           required="true" />

[...]

        <copyField source="artist" dest="artist_ngram" />

----

solrconfig.xml is the same as in the sample app, minus a few comments.

After each change in schema or solrconfig, I delete the index and load the 
documents (240 of them) again.

When I search:
http://localhost:8983/solr/_test_/select?q=ock&df=artist_ngram&debugQuery=true

I get 0 results:
=====================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">38</int>
    <lst name="params">
      <str name="q">ock</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">ock</str>
    <str name="querystring">ock</str>
    <str name="parsedquery">PhraseQuery(artist_ngram:"oc ck ock")</str>
    <str name="parsedquery_toString">artist_ngram:"oc ck ock"</str>
    <lst name="explain"/>
    <lst name="timing">
      <double name="time">37.0</double>
      <lst name="prepare">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">1.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">0.0</double>
        </lst>
      </lst>
      <lst name="process">
        <double name="time">35.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">35.0</double>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
=====================

The same happens when I search for 'roc'. BUT if I search for any of the
2-character tokens (ro, oc, ck), the matching documents are found fine.

===================================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="q">oc</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0">
    <doc>
      <str name="artist">Jay Rock</str>
      <str name="artist_ngram">Jay Rock</str>
      <str name="artistid">Jay Rock</str>
      <date name="index_timestamp">2008-06-23T04:46:05.933Z</date>
    </doc>
    [...]
  </result>
  <lst name="debug">
    <str name="rawquerystring">oc</str>
    <str name="querystring">oc</str>
    <str name="parsedquery">artist_ngram:oc</str>
    <str name="parsedquery_toString">artist_ngram:oc</str>
    <lst name="explain">
      <str name="Jay Rock">

0.87916493 = (MATCH) fieldWeight(artist_ngram:oc in 82), product of:
  1.0 = tf(termFreq(artist_ngram:oc)=1)
  4.6888795 = idf(docFreq=5, numDocs=240)
  0.1875 = fieldNorm(field=artist_ngram, doc=82)
      </str>
      [...]
</response>
===================================
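
As a sanity check on the explain output above, the score can be reproduced by hand. This sketch assumes the classic Lucene default similarity formulas, tf = sqrt(termFreq) and idf = 1 + ln(numDocs / (docFreq + 1)); the fieldNorm value is simply copied from the output, and the function name `field_weight` is hypothetical.

```python
import math


def field_weight(term_freq, doc_freq, num_docs, field_norm):
    """Recompute fieldWeight = tf * idf * fieldNorm.

    Assumes Lucene's classic default similarity; field_norm is taken
    as given from the explain output rather than derived.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return tf * idf * field_norm


score = field_weight(term_freq=1, doc_freq=5, num_docs=240, field_norm=0.1875)
print(round(score, 6))  # 0.879165
```

The result agrees with the reported 0.87916493 up to floating-point rounding, so the 2-character match is being scored as an ordinary single-term query.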

On analysis.jsp I can see that all the tokens are generated, at both index and
query time:
Field name = artist_ngram
Field value (index) = rock
Field value (query) = ock

(terms marked with * * are the ones shown as a match)

Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}
term position     1      2      3      4      5      6
term text         ro     *oc*   *ck*   roc    *ock*  rock
term type         word   word   word   word   word   word
source start,end  0,2    1,3    2,4    0,3    1,4    0,4

Query Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}
term position     1      2      3
term text         oc     ck     ock
term type         word   word   word
source start,end  0,2    1,3    0,3
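
For reference, the term positions from the two analyzer outputs can be compared with a small sketch. It rests on an assumption that is not confirmed anywhere in this message: that the PhraseQuery shown in the debug output requires the query terms to sit at consecutive positions in the indexed field. The helper `phrase_matches` is hypothetical and a simplified model, not Lucene code; it also assumes each term occurs only once in the field.

```python
# Term positions copied from the Index Analyzer output for 'rock'.
index_positions = {"ro": 1, "oc": 2, "ck": 3, "roc": 4, "ock": 5, "rock": 6}


def phrase_matches(query_terms, positions):
    """Simplified exact-phrase check (assumption: no slop allowed).

    Every query term must exist in the index, and each one must sit
    exactly one position after the previous query term.
    """
    if any(term not in positions for term in query_terms):
        return False
    first = positions[query_terms[0]]
    return all(
        positions[term] == first + offset
        for offset, term in enumerate(query_terms)
    )


print(phrase_matches(["oc"], index_positions))               # True
print(phrase_matches(["oc", "ck", "ock"], index_positions))  # False
```

Under that assumption, 'oc' and 'ck' line up at positions 2 and 3, but 'ock' sits at position 5 rather than 4, which would be consistent with the query for 'ock' returning nothing.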

What am I doing wrong here? Why do only the searches for 'oc', 'ro' and 'ck'
return matching documents, while 3- or 4-letter terms (even 'rock') don't match?

Thanks for any help you can provide,
B
_________________________
{Beto|Norberto|Numard} Meijome

"That's what I love about GUIs: They make simple tasks easier, and complex 
tasks impossible."
   John William Chambless

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
