On Thu, 26 Jun 2008 10:44:32 +1000 Norberto Meijome <[EMAIL PROTECTED]> wrote:
> On Wed, 25 Jun 2008 15:37:09 -0300
> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>
> > I've been trying to use the NGramTokenizer and I ran into a problem.
> > It seems like solr is trying to match documents with all the tokens that the
> > analyzer returns from the query term. So if I index a document with a title
> > field with the value "nice dog" and search for "dog" (where the
> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get
> > any results.
>
> Hi Jonathan,
> I don't have the expertise yet to have gone straight into testing code with
> lucene, but my 'black box' testing with ngramtokenizer seems to agree with what
> you found - see my latest posts over the last couple of days about this.
>
> Have you tried searching for 'do' or 'ni' or any search term with size =
> minGramSize ? I've found that Solr matches results just fine then.

Hi there,

I did some more tests with NGramTokenizer.

Summary: 5 tests are shown below; 4 work as expected, 1 fails. The failure
occurs when searching on a field that uses NGramTokenizerFactory with
minGramSize != maxGramSize and length(q) > minGramSize. I've reproduced it
with several variations of minGramSize, length(q) and terms, both in the
stored field and in the query.

My setup: 1.3 nightly code from 2008-06-25, FreeBSD 7, JDK 1.6, Jetty from
the sample app.

My documents are loaded via CSV, with 1 field copied with copyField to all
the artist_ngram variants.

Relevant data loaded into documents:
  "nice dog"
  "the nice dog canine"
  "Triumph The Insult Comic Dog"

The id field holds the same data as a string. I am searching directly on the
field with q=field:query, qt=standard. After each schema or solrconfig
change, I stop the service, delete the data directory, start the server and
post the docs again.
------------------
<fieldType name="ngram2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="2" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="2" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<fieldType name="ngram4" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="4" maxGramSize="4" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="4" maxGramSize="4" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<fieldType name="ngram_var" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="10" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory" minGramSize="2" maxGramSize="10" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<field name="artist_ngram" type="ngram" indexed="true" stored="true" required="true" />
<field name="artist_ngram2" type="ngram2" indexed="true" stored="true" required="true" />
<field name="artist_var_ngram" type="ngram_var" indexed="true" stored="true" required="true" />
---------------------

Test 1: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:dog&debugQuery=true&qt=standard

Returns all 3 docs as expected. If I understood your mail correctly,
Jonathan, you aren't getting results here?

Test 2: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram:dog&debugQuery=true&qt=standard

Returns 0 documents as expected. artist_ngram has 4 letters per token, and we
gave it 3. Same result when searching on the artist_var_ngram field, for the
same reasons.

Test 3: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:insul&debugQuery=true&qt=standard

Returns 1 doc, "Triumph The Insult Comic Dog", as expected. The query gets
tokenized into 2-letter tokens, which match tokens in the index. Same result
when searching on the artist_ngram field, for the same reasons (except that
we get 4-char tokens out of the 5-char query).

Test 4: FAIL!!
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:insul&debugQuery=true&qt=standard

Returns 0 docs. I think it should have matched the same doc as in Test 3,
because the query is tokenized into 2- to 5-char tokens (see the parsed query
below), all of which are included in the index, as the field is tokenized
with the whole range between 2 and 10 chars.

Using Luke (the java app, not the filter), the field shows the tokens listed
after my signature. Using analysis.jsp, it shows that we should get a match
on several tokens.

The query is parsed as follows:

[..]
<lst name="debug">
  <str name="rawquerystring">artist_var_ngram:insul</str>
  <str name="querystring">artist_var_ngram:insul</str>
  <str name="parsedquery">
    PhraseQuery(artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul")
  </str>
  <str name="parsedquery_toString">
    artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul"
  </str>
  <lst name="explain"/>
  <str name="QParser">OldLuceneQParser</str>
[...]

Test 5: OK
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:ul&debugQuery=true&qt=standard

Searching for a query which won't be tokenized further (i.e., its length =
minGramSize) works as expected.

It seems to me there is a problem with matching on fields where
minGramSize != maxGramSize. I don't know enough to point to the cause.
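
As a rough illustration of the five tests, the sketch below re-creates the tokenization in plain Python rather than calling Lucene. The ordering used by `ngrams` (all grams of the smallest size first, then the next size up) is taken from the parsed query in Test 4 and from the Luke listing in this message. The `phrase_matches` and `count_matches` helpers are hypothetical names and only a model, not Solr code: they assume every token is given the next consecutive position and that the query behaves as an exact phrase, as the PhraseQuery in the debug output suggests. Whether that is the actual cause of the Test 4 failure is not confirmed in this thread.

```python
# Documents from the test setup described in this message.
DOCS = ["nice dog", "the nice dog canine", "Triumph The Insult Comic Dog"]


def ngrams(text, min_gram, max_gram):
    """Return every n-gram of `text`, lowercased, smallest gram size first."""
    text = text.lower()
    tokens = []
    for size in range(min_gram, max_gram + 1):
        # When size > len(text) the range is empty, so no token is produced.
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """ASSUMPTION: tokens sit at consecutive positions and the query is an
    exact phrase, so the query tokens must appear as one contiguous run."""
    n = len(query_tokens)
    if n == 0:
        return False  # a query that yields no tokens matches nothing
    return any(index_tokens[i:i + n] == query_tokens
               for i in range(len(index_tokens) - n + 1))


def count_matches(query, min_gram, max_gram, docs=DOCS):
    """Count the documents whose token stream contains the query phrase."""
    query_tokens = ngrams(query, min_gram, max_gram)
    return sum(phrase_matches(ngrams(doc, min_gram, max_gram), query_tokens)
               for doc in docs)


if __name__ == "__main__":
    cases = [
        ("Test 1", "dog", 2, 2),
        ("Test 2", "dog", 4, 4),
        ("Test 3", "insul", 2, 2),
        ("Test 4", "insul", 2, 10),
        ("Test 5", "ul", 2, 10),
    ]
    for name, query, lo, hi in cases:
        print(name, ngrams(query, lo, hi), count_matches(query, lo, hi))
```

Under these assumptions the model counts 3, 0, 1, 0 and 1 matching documents for Tests 1 to 5, which is the pattern reported above. In the variable-size case the token that follows "ul" in the document's token stream is "lt", while the query phrase expects "ins" next.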

In the meantime, I am creating multiple n-gram fields, with growing sizes,
min == max, and using dismax across the lot... not pretty, but it'll do until
I understand why 'Test 4' isn't working.

Please let me know if any more info / tests are needed, or if I should open
an issue in JIRA.

cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

"A dream you dream together is reality."
  John Lennon

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

-----
tokens for test4 in index, as per luke, field artist_var_ngram

tr, ri, iu, um, mp, ph, h , t, th, he, e , i, in, ns, su, ul, lt, t , c, co, om, mi, ic, c , d, do, og,
tri, riu, ium, ump, mph, ph , h t, th, the, he , e i, in, ins, nsu, sul, ult, lt , t c, co, com, omi, mic, ic , c d, do, dog,
triu, rium, iump, umph, mph , ph t, h th, the, the , he i, e in, ins, insu, nsul, sult, ult , lt c, t co, com, comi, omic, mic , ic d, c do, dog,
trium, riump, iumph, umph , mph t, ph th, h the, the , the i, he in, e ins, insu, insul, nsult, sult , ult c, lt co, t com, comi, comic, omic , mic d, ic do, c dog,
triump, riumph, iumph , umph t, mph th, ph the, h the , the i, the in, he ins, e insu, insul, insult, nsult , sult c, ult co, lt com, t comi, comic, comic , omic d, mic do, ic dog,
triumph, riumph , iumph t, umph th, mph the, ph the , h the i, the in, the ins, he insu, e insul, insult, insult , nsult c, sult co, ult com, lt comi, t comic, comic , comic d, omic do, mic dog,
triumph , riumph t, iumph th, umph the, mph the , ph the i, h the in, the ins, the insu, he insul, e insult, insult , insult c, nsult co, sult com, ult comi, lt comic, t comic , comic d, comic do, omic dog,
triumph t, riumph th, iumph the, umph the , mph the i, ph the in, h the ins, the insu, the insul, he insult, e insult , insult c, insult co, nsult com, sult comi, ult comic, lt comic , t comic d, comic do, comic dog,
triumph th, riumph the, iumph the , umph the i, mph the in, ph the ins, h the insu, the insul, the insult, he insult , e insult c, insult co, insult com, nsult comi, sult comic, ult comic , lt comic d, t comic do, comic dog