Hi there,

My use case: I want to be able to match documents when only a partial word is provided, i.e. searching for 'roc' or 'ock' should match documents containing 'rock'.
As I understand it, the way to solve this problem is to use the nGram tokenizer at index time and the nGram analyzer at search time. I want at least 'sub terms' of 2 to 6 characters indexed, so I can match substrings of that length.

What I have (using 1.3, nightly build from 2008-06-15):

schema.xml:

[..]
    <!-- n-gram tokenization -->
    <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="6"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="6"/>
      </analyzer>
    </fieldType>
[..]
    <field name="artist" type="text" indexed="true" stored="true" required="true" />
    <!-- make this field not stored after testing. -->
    <field name="artist_ngram" type="ngram" indexed="true" stored="true" required="true" />
[...]
    <copyField source="artist" dest="artist_ngram" />
----

SolrConfig is the same as in the sample app, minus a few comments. After each change in the schema or solrconfig, I delete the index and load the documents (240 of them) again.
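To make the intent of minGramSize="2" / maxGramSize="6" concrete, here is a minimal sketch in plain Python of the sub terms I expect to be produced for a single word. The function name `ngrams` is hypothetical and this is not Lucene's implementation; it only assumes that grams are emitted grouped by size, shortest first, which is the order analysis.jsp shows further down in this message.

```python
def ngrams(text, min_gram=2, max_gram=6):
    """Hypothetical helper: list every substring of length min_gram..max_gram.

    Grams are grouped by size (all 2-grams, then all 3-grams, ...), which is
    an assumption taken from the analysis.jsp output, not from Lucene's source.
    """
    grams = []
    for size in range(min_gram, max_gram + 1):
        # No grams of this size exist once size exceeds len(text).
        for start in range(0, len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


if __name__ == "__main__":
    print(ngrams("rock"))  # ['ro', 'oc', 'ck', 'roc', 'ock', 'rock']
    print(ngrams("ock"))   # ['oc', 'ck', 'ock']
```

Note that the sketch works on a bare word. The tokenizer in the schema above is applied to the whole field value, so the real index may contain additional grams.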
When I search:

http://localhost:8983/solr/_test_/select?q=ock&df=artist_ngram&debugQuery=true

I get 0 results:

=====================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">38</int>
    <lst name="params">
      <str name="q">ock</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">ock</str>
    <str name="querystring">ock</str>
    <str name="parsedquery">PhraseQuery(artist_ngram:"oc ck ock")</str>
    <str name="parsedquery_toString">artist_ngram:"oc ck ock"</str>
    <lst name="explain"/>
    <lst name="timing">
      <double name="time">37.0</double>
      <lst name="prepare">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">1.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">0.0</double>
        </lst>
      </lst>
      <lst name="process">
        <double name="time">35.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">35.0</double>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
=====================

The same happens when I search for 'roc'.
BUT if I search for any of the 2-character tokens (ro, oc, ck), I find everything fine.

===================================
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="q">oc</str>
      <str name="df">artist_ngram</str>
      <str name="debugQuery">true</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0">
    <doc>
      <str name="artist">Jay Rock</str>
      <str name="artist_ngram">Jay Rock</str>
      <str name="artistid">Jay Rock</str>
      <date name="index_timestamp">2008-06-23T04:46:05.933Z</date>
    </doc>
    [...]
  </result>
  <lst name="debug">
    <str name="rawquerystring">oc</str>
    <str name="querystring">oc</str>
    <str name="parsedquery">artist_ngram:oc</str>
    <str name="parsedquery_toString">artist_ngram:oc</str>
    <lst name="explain">
      <str name="Jay Rock">
0.87916493 = (MATCH) fieldWeight(artist_ngram:oc in 82), product of:
  1.0 = tf(termFreq(artist_ngram:oc)=1)
  4.6888795 = idf(docFreq=5, numDocs=240)
  0.1875 = fieldNorm(field=artist_ngram, doc=82)
      </str>
      [...]
</response>
===================================

On analysis.jsp I can see that all the tokens are generated, both at index and at query time:

Field name          = artist_ngram
Field value (index) = rock
Field value (query) = ock

(terms marked with * * are the ones shown as a match)

Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}

term position      1      2      3      4      5      6
term text          ro     *oc*   *ck*   roc    *ock*  rock
term type          word   word   word   word   word   word
source start,end   0,2    1,3    2,4    0,3    1,4    0,4

Query Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=6, minGramSize=2}

term position      1      2      3
term text          oc     ck     ock
term type          word   word   word
source start,end   0,2    1,3    0,3

What am I doing wrong here? Why do only the searches for 'oc', 'ro' and 'ck' return matching documents, while 3- or 4-letter terms (even 'rock') don't match?
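One thing that might be worth checking is term positions, since the debug output shows 'ock' being parsed into PhraseQuery(artist_ngram:"oc ck ock") while 'oc' becomes a plain term query. The sketch below is only a simulation in plain Python, not Solr or Lucene code. It assumes that an exact phrase query (no slop) needs its terms at the same relative positions in the index as in the query, and it uses the positions from the analysis.jsp tables above. The names `ngram_positions` and `exact_phrase_match` are hypothetical.

```python
def ngram_positions(text, min_gram=2, max_gram=6):
    """Hypothetical helper: (term, position) pairs, positions starting at 1.

    Grams are grouped by size, mirroring the order shown on analysis.jsp.
    """
    pairs = []
    position = 1
    for size in range(min_gram, max_gram + 1):
        for start in range(0, len(text) - size + 1):
            pairs.append((text[start:start + size], position))
            position += 1
    return pairs


def exact_phrase_match(indexed, query):
    """Simulate a slop-0 phrase match over (term, position) pairs.

    Assumption: every query term must occur in the index at the same
    offset from the first query term as it has in the query itself.
    """
    if not query:
        return False
    index_positions = {}
    for term, position in indexed:
        index_positions.setdefault(term, set()).add(position)
    first_term, first_query_pos = query[0]
    for base in index_positions.get(first_term, set()):
        if all(
            base + (query_pos - first_query_pos) in index_positions.get(term, set())
            for term, query_pos in query
        ):
            return True
    return False


if __name__ == "__main__":
    indexed = ngram_positions("rock")
    # 'ock' yields oc, ck, ock at query positions 1, 2, 3.
    print(exact_phrase_match(indexed, ngram_positions("ock")))
    # 'oc' yields a single term, so there is no position constraint.
    print(exact_phrase_match(indexed, ngram_positions("oc")))
```

Under these assumptions the phrase for 'ock' cannot match the indexed value 'rock': 'oc' and 'ck' sit at positions 2 and 3, but 'ock' is at position 5 rather than 4. The single-term query for 'oc' matches. Whether this is what actually happens inside Solr is something I have not confirmed.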
Thanks for any help you can provide,
B

_________________________
{Beto|Norberto|Numard} Meijome

"That's what I love about GUIs: They make simple tasks easier, and complex tasks impossible."
   John William Chambless

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.