Re: Help with spellchecker integration

Otis Gospodnetic Fri, 22 Dec 2006 08:48:19 -0800

Hi Thorsten,

Some comments to your comments, inlined and prefixed with "OG".

----- Original Message ----
From: Thorsten Scherler <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 22, 2006 5:53:19 AM
Subject: Re: Help with spellchecker integration

On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote: 
> Hi,
> I'm trying to integrate the Lucene-based spellchecker 
> (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker 
> under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in 
> order to provide a query spellchecking service (you enter Speers and it 
> suggests pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ 
> NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
> 
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code - 
> SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field 
> names, like 3start, 4start, gram1, gram2, gram3...? Is there a schema.xml 
> trick one can use to accomplish this?

It is in the issue:
<!-- Here you define what happens if the field "gram2" gets indexed.
     The solr.NGramTokenizerFactory will return the different combinations
     of tokens -->
<fieldtype name="gram2" class="solr.TextField"> 
  <analyzer> 
    <!--more tokenizer --> 
    <tokenizer 
      class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/> 
  </analyzer> 
</fieldtype>

OG: Yes, adding those separate fieldtype definitions was my attempt at
getting separate sets of n-grams of different sizes: uni-gram,
bi-gram... But how do I get "3start", "4start", "2end", and "4end"?  It looks 
like I'd have to do this:
- To get 3start, pass "query string" to "gram3" type tokenizer, and keep only 
the first token.
- To get 3end, pass "query string" to "gram3" type tokenizer, and keep only the 
last token (this could be the same n-gram if query string is a 3-letter word)

But can this be configured somehow?  I don't see a way to configure Solr to do 
this.

<!-- Here you map the @source="word" to @dest="gram2".
     What it does is copy the word input to the gram2 field -->
<copyField source="word" dest="gram2"/>
...

OG: But doesn't this tell Solr to copy the _whole_ "word" into a field _named_ 
"gram2"?  The above fieldtype is a definition for a field of _type_ "gram2".
What I need to tell Solr is:
"Take the field named word, analyze it as fieldtype gram2 and index it into a 
field named gram2"
"Take the field named word, analyze it as fieldtype gram3 and index it into a 
field named gram3"
...
"Take the field named word, analyze it as fieldtype gram2 and index only the 
1st token into a field named 2start"

"Take the field named word, analyze it as fieldtype gram3 and index only the 
1st token into a field named 3start"

...
"Take the field named word, analyze it as fieldtype gram2 and index only the 
last token into a field named 2end"

"Take the field named word, analyze it as fieldtype gram3 and index only the 
last token into a field named 3end"

OG: I think :).  Doable?
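
To make the mapping above concrete, here is a minimal standalone sketch in plain Java of which tokens would land in which field for one word. It does not use Solr or Lucene; the class NGramFieldSketch and its methods are hypothetical names made up for illustration, and the field names (gram2, 2start, 2end, ...) are simply the ones used in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class NGramFieldSketch {

    // All n-grams of exactly size n, in order of appearance in the word.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Maps field name -> tokens, using the field names from this thread.
    static Map<String, List<String>> fieldsFor(String word, int minGram, int maxGram) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (int n = minGram; n <= maxGram; n++) {
            List<String> grams = ngrams(word, n);
            if (grams.isEmpty()) {
                continue; // word is shorter than n, so there is nothing to index
            }
            fields.put("gram" + n, grams);
            // Only the first token goes into "<n>start", only the last into "<n>end".
            fields.put(n + "start", List.of(grams.get(0)));
            fields.put(n + "end", List.of(grams.get(grams.size() - 1)));
        }
        return fields;
    }

    public static void main(String[] args) {
        fieldsFor("spears", 2, 3).forEach((name, tokens) ->
                System.out.println(name + " = " + tokens));
    }
}
```

For a 3-letter word the 3start and 3end tokens are the same n-gram, as noted earlier in this message.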

The above shows how to configure the second (spellcheck) index, however
if you want to update both indexes at the same time you need to write
your own implementation of the update servlet.

OG: Right.  I think the spellchecker index will be small enough that it could 
be rebuilt from scratch on demand or at least separately from the main index 
being searched.

> 3) once 2) is done, getting the.... request handler(?) to n-gram the query 
> appropriately and hit the SpellChecker index to try and find alternative 
> spelling suggestions.

Hmm, not sure. IMHO that highly depends on how you plan to use
it in the end. I mean, there is more than one way to use spell check.

In the issue they talked about AJAX suggestions, but that would IMO happen
before the actual search request. If you want to have it in the request
handler then you need to decide how and when the spellchecker comes into
play.

OG: The goal is a "did you mean" type of functionality.  In other words, run 
the real query + run the query against the spellchecker index.  If the 
spellchecker returns something, offer that on the results page as a "did you 
mean: <suggested query>"
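
A rough sketch of that flow in plain Java, with both indexes stubbed out as functions. Everything here is an assumption for illustration: DidYouMeanSketch, Page, mainIndex and spellIndex are hypothetical names, not Solr or Lucene APIs.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; neither function below corresponds to a real Solr or Lucene API.
final class DidYouMeanSketch {

    // What the results page needs: hits for the real query plus an optional suggestion.
    static final class Page {
        final List<String> hits;
        final Optional<String> didYouMean;

        Page(List<String> hits, Optional<String> didYouMean) {
            this.hits = hits;
            this.didYouMean = didYouMean;
        }
    }

    static Page search(String query,
                       Function<String, List<String>> mainIndex,
                       Function<String, Optional<String>> spellIndex) {
        // Run the real query against the main index.
        List<String> hits = mainIndex.apply(query);
        // Run the same query against the spellchecker index, and drop the
        // suggestion if it is just the query itself.
        Optional<String> suggestion = spellIndex.apply(query)
                .filter(s -> !s.equalsIgnoreCase(query));
        return new Page(hits, suggestion);
    }
}
```

The suggestion is computed whether or not the main query returns hits, which matches the "run both" description above; running it only on zero hits would be a one-line change.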

For example, the spellchecker could run only if the normal search returns no
results, or in parallel with it. In the parallel case you would search the
spellcheck index for alternatives, use these alternatives to dispatch the
alternative word query, and later write the result directly into the output
writer. Here again you have different options: you can attack the Solr index
directly (losing all the cool features).

Or you want the Google thingy "Did you mean".

... in any form 
start with:
public class NGramRequestHandler extends StandardRequestHandler
        implements SolrRequestHandler, SolrInfoMBean {
    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        // Depending on the use case do your processing here
    }
}

This way you just need to implement the class-specific methods.

OG: I see I'll be losing my RequestHandler virginity.  Ah, the innocence.  I 
suppose at this point, if I manage to get all the n-grams into the right 
fields, I can use Spellchecker.suggest(....) from the Lucene spellchecker and 
return any suggestions as matching documents.
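
For context on how candidates get ranked: as far as I know, the TRStringDistance class mentioned at the top of this thread computes an edit distance between the query word and each candidate. The following is a minimal standalone Levenshtein sketch to illustrate the idea. It is not the Lucene code, and EditDistanceSketch is a hypothetical name.

```java
// Hypothetical standalone sketch of Levenshtein edit distance; not the Lucene implementation.
final class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("speers", "spears"));
    }
}
```

A candidate with a smaller distance to the query word is a better suggestion, so "spears" (one substitution away from "speers") would rank ahead of more distant words.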

> Damn, that's a lot of unknowns... on top of that my computer started freezing 
> every half an hour.  Hi Murphy.
> Any pointers will be greatly appreciated. Thanks,

HTH a wee bit.

Thanks!
Otis
