Hi there,

My use case: I want to be able to match documents when only part of a word is
provided, i.e. searching for 'roc' or 'ock' should match documents containing
'rock'.

As I understand it, the way to solve this is to use the nGram tokenizer at
both index time and query time. I want 'sub terms' of 2 to 6 characters
indexed, so that I can match substrings of those lengths.
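
To make the goal concrete, here is a small Python sketch (not Solr code; the function name `ngrams` is only for illustration) that lists the 2 to 6 character substrings I expect to be indexed for a value, shortest first and then by start offset:

```python
def ngrams(text, min_size=2, max_size=6):
    """Return every substring of `text` whose length is between
    min_size and max_size, shortest first, then by start offset."""
    grams = []
    for size in range(min_size, max_size + 1):
        # No grams are produced once `size` exceeds len(text).
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


print(ngrams("rock"))  # ['ro', 'oc', 'ck', 'roc', 'ock', 'rock']
```

So for 'rock' I would expect 'roc' and 'ock' to be among the indexed terms.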

What I have (using Solr 1.3, nightly build from 2008-06-15):

schema.xml:
[..]
    <!-- n-gram tokenization -->
    <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                   minGramSize="2" maxGramSize="6"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                   minGramSize="2" maxGramSize="6"/>
      </analyzer>
    </fieldType>
[..]
    <field name="artist" type="text" indexed="true" stored="true"
           required="true" />

    <!-- make this field not stored after testing. -->
    <field name="artist_ngram" type="ngram" indexed="true" stored="true"
           required="true" />

[...]

    <copyField source="artist" dest="artist_ngram" />

----

solrconfig.xml is the same as in the sample app, minus a few comments.

After each change in schema or solrconfig, I delete the index and load the 
documents (240 of them) again.

When I search:
http://localhost:8983/solr/_test_/select?q=ock&df=artist_ngram&debugQuery=true

I get 0 results:
=====================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">38</int>
    <lst name="params">
      <str name="q">ock</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">ock</str>
    <str name="querystring">ock</str>
    <str name="parsedquery">PhraseQuery(artist_ngram:"oc ck ock")</str>
    <str name="parsedquery_toString">artist_ngram:"oc ck ock"</str>
    <lst name="explain"/>
    <lst name="timing">
      <double name="time">37.0</double>
      <lst name="prepare">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">1.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">0.0</double>
        </lst>
      </lst>
      <lst name="process">
        <double name="time">35.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">35.0</double>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
=====================

The same happens when I search for 'roc'. BUT if I search for any of the
2-character tokens (ro, oc, ck), the matching documents are found:

===================================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="q">oc</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0">
    <doc>
      <str name="artist">Jay Rock</str>
      <str name="artist_ngram">Jay Rock</str>
      <str name="artistid">Jay Rock</str>
      <date name="index_timestamp">2008-06-23T04:46:05.933Z</date>
    </doc>
    [...]
  </result>
  <lst name="debug">
    <str name="rawquerystring">oc</str>
    <str name="querystring">oc</str>
    <str name="parsedquery">artist_ngram:oc</str>
    <str name="parsedquery_toString">artist_ngram:oc</str>
    <lst name="explain">
      <str name="Jay Rock">

0.87916493 = (MATCH) fieldWeight(artist_ngram:oc in 82), product of:
  1.0 = tf(termFreq(artist_ngram:oc)=1)
  4.6888795 = idf(docFreq=5, numDocs=240)
  0.1875 = fieldNorm(field=artist_ngram, doc=82)
      </str>
      [...]
</response>
===================================

On analysis.jsp I can see that all the tokens are generated, both at index and
at query time:
Field name = artist_ngram
Field value (index) = rock
Field value (query) = ock

(Terms that analysis.jsp highlights as a match are marked with * *.)

Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}
term position      1      2      3      4      5      6
term text          ro     *oc*   *ck*   roc    *ock*  rock
term type          word   word   word   word   word   word
source start,end   0,2    1,3    2,4    0,3    1,4    0,4

Query Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}
term position      1      2      3
term text          oc     ck     ock
term type          word   word   word
source start,end   0,2    1,3    0,3
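
One thing I notice is that the debug output shows the query for 'ock' parsed as a PhraseQuery ("oc ck ock"), while 'oc' becomes a plain term query. The Python sketch below is only my attempt to model that, not Solr or Lucene code: `ngrams` and `phrase_matches` are made-up helper names, and it assumes a phrase only matches when its terms sit at consecutive positions in the indexed field, in the same order. I have not confirmed that this is what happens internally.

```python
def ngrams(text, min_size=2, max_size=6):
    """Grams in the order analysis.jsp shows them: by length, then offset.
    The list index + 1 corresponds to the 'term position' row."""
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


def phrase_matches(index_terms, query_terms):
    """ASSUMPTION: a phrase matches only if the query terms appear at
    consecutive positions of the indexed terms, in the same order."""
    n = len(query_terms)
    return any(
        index_terms[i:i + n] == query_terms
        for i in range(len(index_terms) - n + 1)
    )


indexed = ngrams("rock")  # ro oc ck roc ock rock (positions 1..6)

print(phrase_matches(indexed, ngrams("oc")))   # single term 'oc'
print(phrase_matches(indexed, ngrams("ock")))  # phrase 'oc ck ock'
```

Under that assumption the single term 'oc' is found, but the phrase 'oc ck ock' is not, because 'ock' is indexed at position 5 rather than directly after 'ck' at position 3. That would be consistent with the results above, but it is only a guess at the cause.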

What am I doing wrong here? Why do only the searches for 'oc', 'ro' and 'ck'
return matching documents, while 3- or 4-letter terms (even 'rock') don't match?

Thanks for any help you can provide,
B
_________________________
{Beto|Norberto|Numard} Meijome

"That's what I love about GUIs: They make simple tasks easier, and complex 
tasks impossible."
   John William Chambless

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
