Re: Getting an ngram fieldtype to work

Markus Jelsma Fri, 08 Oct 2010 06:52:54 -0700


On Friday, October 08, 2010 03:40:09 pm Allistair Crossley wrote:
> Well, a lot of this is working but not all.
> 
> Consider the company name Shooters Inc
> 
> My ngram field is able to match queries to the name for shoot and hoot and
> so on. This works.
> 
> However consider the company name
> 
> Location Scotland
> 
> If I query scot I get one result back - but it's for a company called
> Prescott Inc
> 
> I looked at the analyzer and realised that the NGramTokenizer was
> generating substrings from the start (left) of the *whole phrase*
> 
> location scotland
> 
> Because my max was set to 15 it was not generating a token for scot


Huh? Your supplied config does generate scot as a token. The 15 is just the
maximum size of a gram; it does not limit how many new terms are generated.
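
To illustrate the point, here is a rough Python sketch of what a tokenizer
followed by an n-gram filter with minGramSize=4 and maxGramSize=15 produces.
This is not Solr's code: the function names are made up for the example, and
it assumes a plain whitespace split and lowercasing in place of the real
tokenizer and LowerCaseFilterFactory.

```python
def ngrams(token, min_gram=4, max_gram=15):
    """Return every substring of `token` with length min_gram..max_gram.

    Grams are emitted shortest first, matching the order shown on the
    analysis page in this thread.
    """
    grams = []
    # max_gram only caps the length of a single gram; it never limits
    # how many grams are produced.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def index_terms(text, min_gram=4, max_gram=15):
    """Hypothetical stand-in for the index analyzer: split on whitespace,
    lowercase, then n-gram each token separately."""
    terms = []
    for token in text.lower().split():
        terms.extend(ngrams(token, min_gram, max_gram))
    return terms


terms = index_terms("Location Scotland")
print(len(terms))        # 15 grams per 8-letter word, 30 in total
print("scot" in terms)   # True: "scot" is indexed for this name
```

Under these assumptions "scot" is among the indexed terms for both
"Location Scotland" and "Prescott Inc", so a query for scot should match both.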

Are you querying the correct server? Did you reindex on the correct server? It 
should work.

> So I figured I would change to a whitespace tokenizer first and then apply
> the ngram as a filter.
> 
> This now looks like it is generating scot in the tokens as shown below:
> Index Analyzer
> 
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> 
> position  text      type  start,end
> 1         location  word  0,8
> 2         scotland  word  9,17
> payload: (empty)
> 
> org.apache.solr.analysis.NGramFilterFactory {maxGramSize=15, minGramSize=4}
> 
> position  text      type  start,end
> 1         loca      word  0,4
> 2         ocat      word  1,5
> 3         cati      word  2,6
> 4         atio      word  3,7
> 5         tion      word  4,8
> 6         locat     word  0,5
> 7         ocati     word  1,6
> 8         catio     word  2,7
> 9         ation     word  3,8
> 10        locati    word  0,6
> 11        ocatio    word  1,7
> 12        cation    word  2,8
> 13        locatio   word  0,7
> 14        ocation   word  1,8
> 15        location  word  0,8
> 16        scot      word  9,13
> 17        cotl      word  10,14
> 18        otla      word  11,15
> 19        tlan      word  12,16
> 20        land      word  13,17
> 21        scotl     word  9,14
> 22        cotla     word  10,15
> 23        otlan     word  11,16
> 24        tland     word  12,17
> 25        scotla    word  9,15
> 26        cotlan    word  10,16
> 27        otland    word  11,17
> 28        scotlan   word  9,16
> 29        cotland   word  10,17
> 30        scotland  word  9,17
> payload: (empty)
> 
> Query Analyzer
> 
> scot
> scot
> 
> BUT it still returns no results for scot, but does continue to return the
> Prescott result.
> 
> So ngramming is working but it is not working when the query is something
> far to the right of the indexed value.
> 
> Is this another user-error or have I missed something else here?
> 
> Cheers
> 
> On Oct 8, 2010, at 9:02 AM, Allistair Crossley wrote:
> > Oh my. I am basically being a total monkey. Every time I was changing my
> > schema.xml to try new things out I was then reindexing our staging
> > server's index instead of my local dev index so no changes were
> > occurring locally.
> > 
> > Dear me.
> > 
> > This is working now, surprise.
> > 
> > On Oct 8, 2010, at 8:53 AM, Markus Jelsma wrote:
> >> How come your query analyser spits out grams? It isn't configured to do
> >> so or you posted an older field definition. Anyway, do you actually
> >> search on your new field?
> >> 
> >> On Friday, October 08, 2010 02:46:08 pm Allistair Crossley wrote:
> >>> Hi,
> >>> 
> >>> Yep, I was just looking at the analyzer jsp. The ngrams *do* exist as
> >>> expected, so it's not my configuration that is at fault (he says)
> >>> 
> >>> Index Analyzer
> >>> 
> >>> sh  ho  oo  ot  te  er  sho  hoo  oot  ote  ter  shoo  hoot  oote  oter
> >>> shoot  hoote  ooter  shoote  hooter
> >>> 
> >>> Query Analyzer
> >>> 
> >>> sh  ho  oo  ot  te  er  sho  hoo  oot  ote  ter  shoo  hoot  oote  oter
> >>> shoot  hoote  ooter  shoote  hooter
> >>> 
> >>> 
> >>> Yet, searching either
> >>> 
> >>> /solr/select?q=hoot
> >>> 
> >>> or
> >>> 
> >>> /solr/select?q=name:hoot
> >>> 
> >>> does not yield results.
> >>> 
> >>> When searching for shooter I see 2 results with names:
> >>> 
> >>> 1. <str name="name">Shooters International Inc</str>
> >>> 2. <str name="name">Hong Kong Shooter</str>
> >>> 
> >>> Yours, puzzled :)
> >>> 
> >>> On Oct 8, 2010, at 8:38 AM, Jan Høydahl / Cominvent wrote:
> >>>> Hi,
> >>>> 
> >>>> The first thing I would try is to go to the analysis page, enter your
> >>>> test data, and report back what each analysis stage prints out:
> >>>> http://localhost:8983/solr/admin/analysis.jsp
> >>>> 
> >>>> --
> >>>> Jan Høydahl, search solution architect
> >>>> Cominvent AS - www.cominvent.com
> >>>> 
> >>>> On 8. okt. 2010, at 14.19, Allistair Crossley wrote:
> >>>>> Morning all,
> >>>>> 
> >>>>> I would like to ngram a company name field in our index. I have read
> >>>>> about the costs of doing so in the great David Smiley Solr 1.4 book and
> >>>>> just to get started I have followed his example in setting up an ngram
> >>>>> field type as follows:
> >>>>> 
> >>>>> <fieldType name="text_substring" class="solr.TextField"
> >>>>>            positionIncrementGap="100" stored="false" multiValued="true">
> >>>>>   <analyzer type="index">
> >>>>>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>     <filter class="solr.NGramFilterFactory" minGramSize="4"
> >>>>>             maxGramSize="15" />
> >>>>>   </analyzer>
> >>>>>   <analyzer type="query">
> >>>>>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>   </analyzer>
> >>>>> </fieldType>
> >>>>> 
> >>>>> I have restarted/reindexed everything but I still cannot search
> >>>>> 
> >>>>> hoot
> >>>>> 
> >>>>> and get back the company named Shooter. Searching shooter is fine.
> >>>>> 
> >>>>> I have followed other examples on the internet regarding an ngram field
> >>>>> type. Some examples seem to use an index analyzer that has an ngram
> >>>>> tokenizer rather than filter if this makes a difference. But in all
> >>>>> cases I am not seeing the expected result, just 0 results.
> >>>>> 
> >>>>> Is there anything else I should be considering here? I feel like I
> >>>>> must be very close, it doesn't seem complicated but yet it's not
> >>>>> working like everything else I have done with solr to date :)
> >>>>> 
> >>>>> Any guidance appreciated,
> >>>>> 
> >>>>> Allistair

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
