Hi Thorsten, Some comments to your comments, inlined and prefixed with "OG".
----- Original Message ----
From: Thorsten Scherler <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 22, 2006 5:53:19 AM
Subject: Re: Help with spellchecker integration

On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote:
> Hi,
> I'm trying to integrate the Lucene-based spellchecker
> (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker
> under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in
> order to provide a query spellchecking service (you enter Speers and it
> suggests pant^H^H ... Spears). I've created a generic NGramTokenizer (+
> NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
>
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code -
> SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field
> names, like 3start, 4start, gram1, gram2, gram3.... is there a schema.xml
> trick one can use to accomplish this?

It is in the issue:

<!-- Here you define what happens if the field "gram2" gets indexed.
     The solr.NGramTokenizerFactory will return the different
     combinations of tokens -->
<fieldtype name="gram2" class="solr.TextField">
  <analyzer>
    <!--more tokenizer -->
    <tokenizer class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
  </analyzer>
</fieldtype>

OG: Yes, adding those separate fieldtype definitions was my attempt at getting separate sets of n-grams of different sizes: uni-gram, bi-gram... But how do I get "3start", "4start", "2end", and "4end"? It looks like I'd have to do this:
- To get 3start, pass "query string" to the "gram3" type tokenizer, and keep only the first token.
- To get 3end, pass "query string" to the "gram3" type tokenizer, and keep only the last token (this could be the same n-gram if the query string is a 3-letter word).
But can this be configured somehow?
I don't see a way to configure Solr to do this.

<!-- Here you map the @source="word" to @dest="gram2".
     What it does is copy the word input to the gram2 field -->
<copyField source="word" dest="gram2"/>
...

OG: But doesn't this tell Solr to copy the _whole_ "word" into a field _named_ "gram2"? The above fieldtype is a definition for a field of _type_ "gram2". What I need to tell Solr is:
"Take the field named word, analyze it as fieldtype gram2 and index it into a field named gram2"
"Take the field named word, analyze it as fieldtype gram3 and index it into a field named gram3"
...
"Take the field named word, analyze it as fieldtype gram2 and index only the 1st token into a field named 2start"
"Take the field named word, analyze it as fieldtype gram3 and index only the 1st token into a field named 3start"
...
"Take the field named word, analyze it as fieldtype gram2 and index only the last token into a field named 2end"
"Take the field named word, analyze it as fieldtype gram3 and index only the last token into a field named 3end"
OG: I think :). Doable?

The above shows how to configure the second (spellcheck) index; however, if you want to update both indexes at the same time you need to write your own implementation of the update servlet.

OG: Right. I think the spellchecker index will be small enough that it could be rebuilt from scratch on demand, or at least separately from the main index being searched.

> 3) once 2) is done, getting the.... request handler(?) to n-gram the query
> appropriately and hit the SpellChecker index to try and find alternative
> spelling suggestions.

Hmm, not sure, actually IMHO that highly depends on how you plan to use it in the end. I mean there is more than one way to use spell check. In the issue they talked about AJAX suggestions, but that would be IMO before the actual search request. If you want to have it in the request handler then you need to decide how and when the spellchecker comes into place.
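
[Editorial sketch, not from the thread.] The field layout asked for under point 2 (gramN holding every n-gram, Nstart only the first, Nend only the last) can be illustrated outside Solr in plain Java. The class NGramFields and its methods are hypothetical names made up for this illustration; they are not part of Solr, Lucene, or the SOLR-81 patch, and this only shows which values would end up in which field, not how to configure Solr to produce them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of Solr or Lucene. */
public class NGramFields {

    /** Returns all contiguous substrings of length n, in order of position. */
    public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Builds field name -> values for one word. For each n in [minGram, maxGram]:
     * "gramN" holds every n-gram, "Nstart" only the first, "Nend" only the last.
     */
    public static Map<String, List<String>> fieldsFor(String word, int minGram, int maxGram) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("word", List.of(word));
        for (int n = minGram; n <= maxGram; n++) {
            List<String> grams = ngrams(word, n);
            if (grams.isEmpty()) {
                continue; // word is shorter than n: nothing to index for this size
            }
            fields.put("gram" + n, grams);
            fields.put(n + "start", List.of(grams.get(0)));
            fields.put(n + "end", List.of(grams.get(grams.size() - 1)));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Print the fields one spellchecker-index document would carry.
        fieldsFor("spears", 2, 3)
            .forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

For a 3-letter word, 3start and 3end come out as the same n-gram, which matches the remark made earlier in the thread.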
OG: The goal is a "did you mean" type of functionality. In other words, run the real query + run the query against the spellchecker index. If the spellchecker returns something, offer that on the results page as a "did you mean: <suggested query>".

Like if the normal search does not return a result, or in parallel. Parallel would search in the spell check index for alternatives, use these alternatives to dispatch the alternative word query, and later on parse the result directly into the output writer.

Here you have again different alternatives: you can attack the Solr index directly (losing all the cool features), or you want the Google thingy "Did you mean".

... in any form start with:

public class NGramRequestHandler extends StandardRequestHandler
    implements SolrRequestHandler, SolrInfoMBean {
  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
    // Depending on the use case do your processing here
  }
}

This way you just need to implement the class-specific methods.

OG: I see I'll be losing my RequestHandler virginity. Ah, the innocence. I suppose at this point, if I manage to get all the n-grams into the right fields, I can use Spellchecker.suggest(....) from the Lucene spellchecker and return any suggestions as matching documents.

> Damn, that's a lot of unknowns... on top of that my computer started freezing
> every half an hour. Hi Murphy.
> Any pointers will be greatly appreciated.
> Thanks,

HTH a wee bit.

Thanks!
Otis
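
[Editorial sketch, not from the thread.] The "did you mean" flow described above (run the real query, also ask the spellchecker index, and attach a suggestion to the response if one comes back) might look roughly like this. Everything here is an assumption for illustration: DidYouMean, Response, and handle are hypothetical names, and the two Function parameters merely stand in for the main Solr search and the spellchecker lookup; no real Solr or Lucene API is used.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Illustrative only: hypothetical glue code, not Solr's or Lucene's API. */
public class DidYouMean {

    /** Result of one request: the main hits plus an optional suggestion. */
    public static final class Response {
        public final List<String> hits;
        public final Optional<String> suggestion;

        Response(List<String> hits, Optional<String> suggestion) {
            this.hits = hits;
            this.suggestion = suggestion;
        }
    }

    /**
     * Runs the real query and the spellchecker lookup, then combines them.
     * mainSearch and spellIndex are stand-ins for the two indexes (assumptions).
     */
    public static Response handle(String query,
                                  Function<String, List<String>> mainSearch,
                                  Function<String, List<String>> spellIndex) {
        List<String> hits = mainSearch.apply(query);
        // Take the first alternative that differs from what the user typed.
        Optional<String> suggestion = spellIndex.apply(query).stream()
            .filter(s -> !s.equalsIgnoreCase(query))
            .findFirst();
        return new Response(hits, suggestion);
    }

    public static void main(String[] args) {
        // Toy stand-ins for the main index and the spellchecker index.
        Map<String, List<String>> docs = Map.of("spears", List.of("doc1", "doc7"));
        Map<String, List<String>> spell = Map.of("speers", List.of("spears"));

        Response r = handle("speers",
            q -> docs.getOrDefault(q, List.of()),
            q -> spell.getOrDefault(q, List.of()));

        System.out.println("hits: " + r.hits);
        r.suggestion.ifPresent(s -> System.out.println("did you mean: " + s));
    }
}
```

The variant that consults the spellchecker only when the normal search returns nothing would wrap the spellIndex call in a check on hits.isEmpty(); the sketch above follows the parallel variant.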