Re: NGramTokenizer issue

Norberto Meijome Wed, 25 Jun 2008 21:01:27 -0700

On Thu, 26 Jun 2008 10:44:32 +1000
Norberto Meijome <[EMAIL PROTECTED]> wrote:


> On Wed, 25 Jun 2008 15:37:09 -0300
> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
> 
> > I've been trying to use the NGramTokenizer and I ran into a problem.
> > It seems like solr is trying to match documents with all the tokens that the
> > analyzer returns from the query term. So if I index a document with a title
> > field with the value "nice dog" and search for "dog" (where the
> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get
> > any results.
> 
> Hi Jonathan,
> I don't have the expertise yet to have gone straight into testing code with
> lucene, but my 'black box' testing with ngramtokenizer seems to agree with
> what you found - see my latest posts over the last couple of days about this.
> 
> Have you tried searching for 'do' or 'ni' or any search term with size =
> minGramSize ? I've found that Solr matches results just fine then.

hi there,
I did some more tests with nGramTokenizer ... 

Summary : 
5 tests are shown below; 4 work as expected, 1 fails. The failure occurs when
searching on a field that uses the NGramTokenizerFactory with
minGramSize != maxGramSize and length(q) > minGramSize. I've reproduced it with
several variations of minGramSize, length(q) and terms, both in the stored
field and in the query.

My setup:
1.3 nightly code from 2008-06-25,  FreeBSD 7,  JDK 1.6, Jetty from sample app.

My documents are loaded via CSV; 1 field is copied with copyField to all the
artist_ngram variants.
Relevant data loaded into documents: "nice dog", "the nice dog canine", "Triumph
The Insult Comic Dog".
The id field holds the same data as a string.

I am searching directly on the field with q=field:query, qt=standard.
After each schema or solrconfig change, I stop the service, delete the data
directory, start the server and post the docs again.

------------------
<fieldType name="ngram2" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="2" maxGramSize="2" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="2" maxGramSize="2" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
</fieldType>

<fieldType name="ngram4" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="4" maxGramSize="4" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="4" maxGramSize="4" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
</fieldType>


<fieldType name="ngram_var" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="2" maxGramSize="10" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
                <tokenizer class="org.apache.solr.analysis.NGramTokenizerFactory"
                           minGramSize="2" maxGramSize="10" />
                <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
</fieldType>

<field name="artist_ngram" type="ngram" indexed="true" stored="true"
required="true" />
<field name="artist_ngram2" type="ngram2" indexed="true" stored="true"
required="true" />
<field name="artist_var_ngram" type="ngram_var" indexed="true" stored="true"
required="true" />
---------------------

Test 1: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:dog&debugQuery=true&qt=standard

Returns all 3 docs as expected. If I understood your mail correctly, Jonathan,
you aren't getting results here?

Test 2 : OK
http://localhost:8983/solr/_test_/select?q=artist_ngram:dog&debugQuery=true&qt=standard

returns 0 documents as expected. artist_ngram has 4 letters per token, we gave
it 3. 

Same result when searching on artist_var_ngram field for same reasons.

Test 3: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:insul&debugQuery=true&qt=standard

Returns 1 doc, "Triumph The Insult Comic Dog", as expected. The query gets
tokenized into 2-letter tokens, which match tokens in the index.

same result when searching on artist_ngram field , same reasons (except that we
get 4 char tokens out of the 5 char query)
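
To make the tokenization in Tests 1-3 concrete, here is a minimal Python sketch
of character n-gram generation. It is not the Lucene code: the function name
ngrams and the "shortest grams first" ordering are my assumptions, chosen to
mirror what the debug output and Luke show in this mail.

```python
def ngrams(text, min_gram, max_gram):
    """Return character n-grams of `text`, shortest grams first.

    Assumption: lowercasing up front stands in for LowerCaseFilterFactory,
    which in the real analyzer chain runs after the tokenizer.
    """
    text = text.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        # A text shorter than `size` yields no grams of that size.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


print(ngrams("insul", 2, 2))  # ['in', 'ns', 'su', 'ul']
print(ngrams("insul", 4, 4))  # ['insu', 'nsul']
print(ngrams("dog", 4, 4))    # []
```

The empty list for "dog" with 4-char grams is consistent with Test 2 returning
no documents.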

Test 4 : FAIL!!
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:insul&debugQuery=true&qt=standard

Returns 0 docs. I think it should have matched the same doc as in Test 3,
because the query would be tokenized into 2- to 5-char tokens - all of which
are included in the index, as the field is tokenized with the whole range
between 2 and 10 chars.
Using Luke (the java app, not the filter), the field shows the tokens listed
after my signature.
Using analysis.jsp, it shows that we should get a match on several tokens.

The query is parsed as follows :
[..]
<lst name="debug">
<str name="rawquerystring">artist_var_ngram:insul</str>
<str name="querystring">artist_var_ngram:insul</str>
        <str name="parsedquery">
PhraseQuery(artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul")
</str>
        <str name="parsedquery_toString">
artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul"
</str>
<lst name="explain"/>
<str name="QParser">OldLuceneQParser</str>
[...]
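
One possible reading of the parsed query above, offered as a guess rather than
a diagnosis: a PhraseQuery needs its terms at consecutive positions in the
indexed field. The Python sketch below simulates that under two assumptions of
mine: that every gram gets the next position in the order it is emitted, and
that grams are emitted shortest first (which is how Luke lists them). The
helper names are hypothetical and nothing here calls Lucene or Solr.

```python
def ngrams(text, min_gram, max_gram):
    """Character n-grams of `text`, shortest grams first (assumed order)."""
    text = text.lower()
    return [text[i:i + size]
            for size in range(min_gram, max_gram + 1)
            for i in range(len(text) - size + 1)]


def contains_phrase(index_tokens, query_tokens):
    """True if query_tokens occur as one contiguous run in index_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(index_tokens[i:i + n] == query_tokens
               for i in range(len(index_tokens) - n + 1))


DOC = "Triumph The Insult Comic Dog"


def matches(query, min_gram, max_gram):
    """Simulate a phrase match of the analyzed query against the analyzed doc."""
    return contains_phrase(ngrams(DOC, min_gram, max_gram),
                           ngrams(query, min_gram, max_gram))


print(matches("insul", 2, 2))   # True  (Test 3)
print(matches("insul", 2, 10))  # False (Test 4)
print(matches("ul", 2, 10))     # True  (Test 5)
```

Under these assumptions the indexed field has "lt" right after "ul", while the
query phrase expects "ins" there, so the phrase cannot line up. That would
explain why only the min != max case fails, but I have not confirmed it against
the Lucene source.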


Test 5 : OK
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:ul&debugQuery=true&qt=standard

A query that won't be tokenized further (i.e., its length = minGramSize)
works as expected.


It seems to me there is a problem with matching on fields where
minGramSize != maxGramSize. I don't know enough to point to the cause.

In the meantime, I am creating multiple n-gram fields, with growing sizes,
min == max, and using dismax across the lot... not pretty, but it'll do until
I understand why 'Test 4' isn't working.
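
For anyone wanting to try the same workaround, a rough Python sketch of the
kind of request I mean. The field list is illustrative only (artist_ngram3 does
not exist in the schema above), and qt=dismax assumes a request handler named
"dismax" is registered in solrconfig.xml, as in the sample app.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/_test_/select"

# Hypothetical set of fixed-size n-gram fields (min == max on each one).
FIELDS = ["artist_ngram2", "artist_ngram3", "artist_ngram4"]


def dismax_url(query, fields=FIELDS):
    """Build a select URL that searches all fixed-size n-gram fields."""
    params = {
        "q": query,
        "qt": "dismax",           # assumes a handler with this name exists
        "qf": " ".join(fields),   # dismax query fields, space separated
        "debugQuery": "true",
    }
    return BASE + "?" + urlencode(params)


print(dismax_url("insul"))
```

A field whose gram size is larger than the query length produces no tokens for
that query, so it simply contributes nothing to the match.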

Please let me know if any more info / tests are needed, or if I should open
an issue in JIRA.

cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

"A dream you dream together is reality."
  John Lennon

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

-----
tokens for test4 in index, as per luke, field artist_var_ngram


tr, ri, iu, um, mp, 
ph, h ,  t, th, he, 
e ,  i, in, ns, su, 
ul, lt, t ,  c, co, 
om, mi, ic, c ,  d, 
do, og, tri, riu, ium, 
ump, mph, ph , h t,  th, 
the, he , e i,  in, ins, 
nsu, sul, ult, lt , t c, 
 co, com, omi, mic, ic , 
c d,  do, dog, triu, rium, 
iump, umph, mph , ph t, h th, 
 the, the , he i, e in,  ins, 
insu, nsul, sult, ult , lt c, 
t co,  com, comi, omic, mic , 
ic d, c do,  dog, trium, riump, 
iumph, umph , mph t, ph th, h the, 
 the , the i, he in, e ins,  insu, 
insul, nsult, sult , ult c, lt co, 
t com,  comi, comic, omic , mic d, 
ic do, c dog, triump, riumph, iumph , 
umph t, mph th, ph the, h the ,  the i, 
the in, he ins, e insu,  insul, insult, 
nsult , sult c, ult co, lt com, t comi, 
 comic, comic , omic d, mic do, ic dog, 
triumph, riumph , iumph t, umph th, mph the, 
ph the , h the i,  the in, the ins, he insu, 
e insul,  insult, insult , nsult c, sult co, 
ult com, lt comi, t comic,  comic , comic d, 
omic do, mic dog, triumph , riumph t, iumph th, 
umph the, mph the , ph the i, h the in,  the ins, 
the insu, he insul, e insult,  insult , insult c, 
nsult co, sult com, ult comi, lt comic, t comic , 
 comic d, comic do, omic dog, triumph t, riumph th, 
iumph the, umph the , mph the i, ph the in, h the ins, 
 the insu, the insul, he insult, e insult ,  insult c, 
insult co, nsult com, sult comi, ult comic, lt comic , 
t comic d,  comic do, comic dog, triumph th, riumph the, 
iumph the , umph the i, mph the in, ph the ins, h the insu, 
 the insul, the insult, he insult , e insult c,  insult co, 
insult com, nsult comi, sult comic, ult comic , lt comic d, 
t comic do,  comic dog
