I followed your instructions exactly, but I still have trouble with multiword queries. For example, q=grapics returns 'graphics', but q=grapics card returns nothing. I even tried the latest nightly build, but that didn't solve the problem. Is there a solution available?

scott.tabar wrote:
>
> Matthew,
>
> Thanks for the question. The answer is that they come from your own
> indexes, so the dictionary is based upon the actual words that are
> already stored in Solr. This makes sense: if the spell checker suggests
> a word that is not in the Solr index, it will not help the user find
> what they are looking for.
>
> You can control which fields in Solr feed the spell checker. You can
> also have more than one spell checker, each focused on a specific
> subject.
>
> The following example of a SpellCheckerRequestHandler is based upon the
> one I created for the test case. You need to add this to your
> solrconfig.xml file. You can view the whole thing within the Solr source
> code once it is committed to the main stream. The path is
> /src/test/test-files/solr/conf/solrconfig-spellchecker.xml, with
> schema-spellchecker.xml in the same directory.
>
> <!-- SpellCheckerRequestHandler takes in a word (or several words) as
>      the value of the "q" parameter and returns a list of alternative
>      spelling suggestions. If invoked with a ...&cmd=rebuild, it will
>      rebuild the spellchecker index.
> -->
> <requestHandler name="spellchecker"
>                 class="solr.SpellCheckerRequestHandler" startup="lazy">
>   <!-- default values for query parameters -->
>   <lst name="defaults">
>     <int name="suggestionCount">20</int>
>     <float name="accuracy">0.60</float>
>   </lst>
>
>   <!-- Main init params for handler -->
>
>   <!-- The directory where your SpellChecker Index should live. -->
>   <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
>   <!-- If this option is not specified, a RAM directory will be used -->
>   <str name="spellcheckerIndexDir">spell</str>
>
>   <!-- the field in your schema that you want to be able to build -->
>   <!-- your spell index on. This should be a field that uses a very -->
>   <!-- simple FieldType without a lot of Analysis (ie: string) -->
>   <str name="termSourceField">spell</str>
>
> </requestHandler>
>
> Some comments:
> - The termSourceField should be a field you have defined within your
>   Solr schema file. See the notes below about the use of this field.
> - The spellcheckerIndexDir is the name of the directory that contains
>   the spellchecker indexes. In my example I used "spell", and it will be
>   at the same level as data and conf. You can name it whatever you
>   would like.
> - If you use the name "/spellchecker", the URL will be more RESTful.
> - If you need more than one spell checker in use at a time, you will
>   need to change the name, spellcheckerIndexDir, and termSourceField
>   for each one.
> - If you have more than one spell checker hitting the same index
>   directory, then when you rebuild the index through one of the
>   handlers, the other handlers will not know it has been reindexed. To
>   resolve this issue, you may have to restart Solr.
>
>
> The following components are from the schema-spellchecker.xml file:
>
> <fieldType name="spellText" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> <field name="spell" type="spellText" indexed="true" stored="true" />
>
>
>
> Some comments on the schema items above:
> - The fieldType must be contained within the "types" section
> - The spellText fieldType can be named whatever you want
> - The spellText fieldType should not be too aggressive on stemming or
>   otherwise modifying the contents of the field
> - You could use string instead of the spellText fieldType defined
>   here, but it does not have to be that restrictive
>
> - The "spell" field needs to be within the "fields" group with your
>   other defined fields
> - You can always use copyField to copy another field's content into
>   your "spell" field:
>     <copyField source="misc" dest="spell"/>
>
>
> Some notes on the name of the handler:
> - If you precede the name with "/", you can use the first of the
>   following URLs instead of the second:
>   - using the name "/spellchecker":
>     http://yourSolrSite/solr/spellchecker?q=sialophosphoprotein
>   - using the name "spellchecker":
>     http://yourSolrSite/solr/select?qt=spellchecker&q=sialophosphoprotein
>
>
> Matthew, I hope you find this somewhat helpful.
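
A minimal sketch, in Java, of the two URL styles from the handler-name notes above. It only builds strings and never contacts a server. The class and method names (SpellCheckUrls, pathStyle, qtStyle, withRebuild) are hypothetical helpers and not part of Solr, the host is the placeholder from the examples, and appending cmd=rebuild follows the comment in the handler configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Solr: builds the request URLs for the two
// ways of naming the handler in solrconfig.xml.
final class SpellCheckUrls {

    // URL-encode the query so that multiword input survives as one "q" value.
    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Handler registered as "/spellchecker": addressed directly by path.
    static String pathStyle(String solrBase, String query) {
        return solrBase + "/spellchecker?q=" + encode(query);
    }

    // Handler registered as "spellchecker": selected through the "qt" parameter.
    static String qtStyle(String solrBase, String query) {
        return solrBase + "/select?qt=spellchecker&q=" + encode(query);
    }

    // Appends cmd=rebuild, which asks the handler to rebuild its spell index.
    static String withRebuild(String url) {
        return url + "&cmd=rebuild";
    }

    public static void main(String[] args) {
        String base = "http://yourSolrSite/solr"; // placeholder host from the examples
        System.out.println(pathStyle(base, "sialophosphoprotein"));
        System.out.println(qtStyle(base, "sialophosphoprotein"));
        System.out.println(withRebuild(pathStyle(base, "sialophosphoprotein")));
    }
}
```

Encoding the query matters for multiword input: a query such as "grapics card" is sent as q=grapics+card rather than being cut off at the space.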
>
> Scott Tabar
>
> ---- Matthew Runo <[EMAIL PROTECTED]> wrote:
> Where does the index come from in the first place? Do we have to
> enter the words, or are they entered as documents enter the SOLR index?
>
> I'd love to be able to use my own documents as the spell check index
> of "correctly spelled words".
>
> +--------------------------------------------------------+
> | Matthew Runo
> | Zappos Development
> | [EMAIL PROTECTED]
> | 702-943-7833
> +--------------------------------------------------------+
>
>
> On Oct 11, 2007, at 7:08 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:
>
>> Climbingrose,
>>
>> I think you make a valid point. Each person may have a different
>> concept of how something should work with their application.
>>
>> My thoughts on the subject of spell checking multiple words:
>> - the parameter "multiWords" enables spell checking on each word
>>   in the "q" parameter instead of on the whole field
>> - each word is then represented by its own entry in a list of all
>>   words that are checked
>> - within that entry, the word being checked is identified by the
>>   key "words"
>> - the "exist" key indicates whether the word was found exactly as
>>   it is within the spell checker's index
>> - since there can be suggestions for both misspelled words and
>>   words that are spelled correctly, the list of suggestions is
>>   included for both, even if the suggestion list is empty
>>
>> - My vision is that if a user has a search query of multiple
>>   words and wants to check them, "multiWords" will check all words
>>   at one time, independently of each other, and return the list.
>>   The presenting web app can then visually identify for the user
>>   which words are misspelled and which ones have suggestions. The
>>   user can then work with the various lists of suggestions without
>>   having to re-hit Solr.
>>   Naturally, if the user manually changes a word, then Solr will
>>   have to be re-hit, but providing a single list of all words,
>>   including suggestions for correct words along with incorrect
>>   words, will help simplify applications (by reducing iteration
>>   over each word) and will help reduce the number of hits to the
>>   Solr server.
>>
>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For
>>> example, if the user enters "life expectancy calculatar", which has
>>> "calculator" misspelled, we should only spellcheck "calculatar".
>>
>> I think I understand what you mean in the above statement, but you
>> must admit, it does sound funny. After all, how do you identify
>> that a word is misspelled without using the spelling checker?
>> Correct me if I am wrong, but I think you intended to say that
>> suggestions should only be included for words that are identified
>> as misspelled. If this is the case, then I would have to disagree
>> with you. The user may be interested in finding words that might
>> mean the same but are more popular (appear in more indexed
>> documents within the Lucene index). Hence the reason why I added
>> the result field "exist": it identifies that a word is spelled
>> correctly even if there is a list of suggestions. Please note that
>> a word can also be misspelled and have no suggestions, so one
>> cannot use the suggestion list as an indicator of the correctness
>> of the individual word(s).
>>
>>
>>> 2) I only return the best string for a misspelled query.
>>
>> You can also use the parameter "suggestionCount=1" to control how
>> many words are returned. In this case, it will do what your code
>> is doing, but still allow the client to change this value
>> dynamically without hard coding it within the main source code.
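
A rough Java model of the per-word result described above: one entry per word, holding the word, an "exist" flag, and a suggestion list that may be empty. Everything in it is an assumption made for illustration. MultiWordCheck, WordResult, and ToyDictionary are hypothetical names, the dictionary is a toy in-memory stand-in rather than the Lucene spell checker, and the suggestions are canned. It does not reproduce the patch's actual response format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; none of these types exist in Solr.
final class MultiWordCheck {

    // One entry per checked word: the word itself, whether it exists in the
    // spell checker's index, and its (possibly empty) suggestion list.
    static final class WordResult {
        final String word;
        final boolean exist;
        final List<String> suggestions;

        WordResult(String word, boolean exist, List<String> suggestions) {
            this.word = word;
            this.exist = exist;
            this.suggestions = suggestions;
        }
    }

    // Toy in-memory stand-in for the spell checker index.
    static final class ToyDictionary {
        private final Set<String> known = new HashSet<String>();
        private final Map<String, List<String>> canned = new HashMap<String, List<String>>();

        ToyDictionary know(String... words) {
            known.addAll(Arrays.asList(words));
            return this;
        }

        ToyDictionary suggestFor(String word, String... suggestions) {
            canned.put(word, Arrays.asList(suggestions));
            return this;
        }

        boolean exist(String word) {
            return known.contains(word);
        }

        // Returns at most "count" canned suggestions, or an empty list.
        List<String> suggest(String word, int count) {
            List<String> all = canned.get(word);
            if (all == null) {
                return Collections.emptyList();
            }
            return all.subList(0, Math.min(count, all.size()));
        }
    }

    // Checks every whitespace-separated word independently of the others.
    static List<WordResult> check(String q, ToyDictionary dict, int suggestionCount) {
        List<WordResult> results = new ArrayList<WordResult>();
        for (String word : q.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // a blank query yields no entries
            }
            results.add(new WordResult(word, dict.exist(word),
                    dict.suggest(word, suggestionCount)));
        }
        return results;
    }

    public static void main(String[] args) {
        ToyDictionary dict = new ToyDictionary()
                .know("life", "expectancy", "calculator")
                .suggestFor("calculatar", "calculator", "calculated");
        for (WordResult r : check("life expectancy calculatar", dict, 1)) {
            System.out.println(r.word + " exist=" + r.exist
                    + " suggestions=" + r.suggestions);
        }
    }
}
```

Keeping "exist" separate from the suggestion list reflects the point made above: a correctly spelled word may still have suggestions, and a misspelled word may have none.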
>>
>>
>> As far as only including terms that are more popular than the word
>> being checked, there is already a parameter "onlyMorePopular" that
>> you can use to control this feature dynamically from the client
>> side, so it does not have to be hard coded within the spelling
>> checker.
>>
>> Review these parameter options on the wiki, but keep in mind I have
>> not updated the wiki with my changes or the new parameter and
>> result fields:
>> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>>
>> Thanks Climbingrose,
>>
>> Scott Tabar
>>
>>
>>
>>
>> ---- climbingrose <[EMAIL PROTECTED]> wrote:
>> Just to clarify this line of code:
>>
>> String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
>>     req.getSearcher().getReader(), restrictToField, true);
>>
>> I only return suggestions if they are more popular than termText. You
>> probably need to use the code in Scott's patch to make this behaviour
>> configurable.
>>
>> On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi all,
>>>
>>> I've been so busy the last few days that I haven't replied to this
>>> email. I modified SpellCheckerHandler a while ago to include support
>>> for multiword queries. To be honest, I didn't have time to write
>>> unit tests for the code. However, I deployed it in a production
>>> environment and it has been working for me so far. My version,
>>> however, makes two assumptions:
>>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For
>>> example, if the user enters "life expectancy calculatar", which has
>>> "calculator" misspelled, we should only spellcheck "calculatar".
>>> 2) I only return the best string for a misspelled query.
>>>
>>> I guess I can just paste the code directly here so that others can
>>> adapt it for their own purposes. If you have any questions, just
>>> send me an email. I'll be happy to help you.
>>>
>>> StringBuffer buf = null;
>>> if (null != words && !"".equals(words.trim())) {
>>>     // Tokenize the query with the analyzer of the configured field.
>>>     Analyzer analyzer =
>>>         req.getSchema().getField(field).getType().getAnalyzer();
>>>     TokenStream source =
>>>         analyzer.tokenStream(field, new StringReader(words));
>>>     Token t;
>>>     boolean hasSuggestion = false;
>>>     boolean termExists = false;
>>>     while (true) {
>>>         try {
>>>             t = source.next();
>>>         } catch (IOException e) {
>>>             t = null;
>>>         }
>>>         if (t == null)
>>>             break;
>>>
>>>         String termText = t.termText();
>>>         // The final "true" restricts results to more popular terms.
>>>         String[] suggestions = spellChecker.suggestSimilar(termText,
>>>             numSug, req.getSearcher().getReader(), restrictToField, true);
>>>         if (suggestions != null && suggestions.length > 0) {
>>>             // Use the best suggestion for this term.
>>>             if (!suggestions[0].equals(termText)) {
>>>                 hasSuggestion = true;
>>>             }
>>>             if (buf == null) {
>>>                 buf = new StringBuffer(suggestions[0]);
>>>             } else
>>>                 buf.append(" ").append(suggestions[0]);
>>>         } else if (spellChecker.exist(termText)) {
>>>             // No suggestion, but the term is in the index: keep it.
>>>             termExists = true;
>>>             if (buf == null) {
>>>                 buf = new StringBuffer(termText);
>>>             } else
>>>                 buf.append(" ").append(termText);
>>>         } else {
>>>             // Unknown term with no suggestion: give up on the query.
>>>             hasSuggestion = false;
>>>             termExists = false;
>>>             break;
>>>         }
>>>     }
>>>     try {
>>>         source.close();
>>>     } catch (IOException e) {
>>>         // ignore
>>>     }
>>>     // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
>>>     //     nullReader, restrictToField, onlyMorePopular);
>>>     if (hasSuggestion || termExists)
>>>         rsp.add("suggestions", buf.toString());
>>>     else
>>>         rsp.add("suggestions", null);
>>> }
>>>
>>>
>>>
>>> On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hoss,
>>>>
>>>> I had a feeling someone would be quoting Yonik's Law of
>>>> Patches! ;-)
>>>>
>>>> For now, this is done.
>>>>
>>>> I created the changes, created JavaDoc comments on the various
>>>> settings and their expected output, created a JUnit test for the
>>>> SpellCheckerRequestHandler which tests various components of the
>>>> handler, and I also created the supporting configuration files for
>>>> the JUnit tests (schema and solrconfig files).
>>>>
>>>> I attached the patch to the JIRA issue, so now we just have to
>>>> wait until it gets added back into the main code stream.
>>>>
>>>> For anyone who is interested, here is a link to the JIRA:
>>>> https://issues.apache.org/jira/browse/SOLR-375
>>>>
>>>> Could someone please drop me a hint on how to update the wiki or
>>>> any other documentation that could benefit from being updated? I'd
>>>> like to help out as much as possible, but first I need to know
>>>> "how". ;-)
>>>>
>>>> When these changes do get committed back into the daily build,
>>>> please review the generated JavaDoc for information on how to
>>>> utilize these new features. If anyone has any questions or
>>>> comments, please do not hesitate to ask.
>>>>
>>>>
>>>> As a general note of self-critique on these changes, I am not 100%
>>>> sure of the way I implemented the "nested" structure when the
>>>> "multiWords" parameter is used. My interest is that it should work
>>>> smoothly with some other technology such as Prototype using the
>>>> JSON output type. Unfortunately, I will not get a chance to start
>>>> on that coding until next week, so it is up in the air whether
>>>> this structure will be conducive or not. I am planning on
>>>> providing more details in the documentation on how to utilize
>>>> these modifications in Prototype and Ajax when I get a chance (and
>>>> even providing links to a production site so you can see it in
>>>> action and view the source if interested). So stay tuned...
>>>>
>>>> Thanks for everyone's time,
>>>> Scott Tabar
>>>>
>>>> ---- Chris Hostetter <[EMAIL PROTECTED]> wrote:
>>>>
>>>> : If you like, I can post the source code changes that I made to the
>>>> : SpellCheckerRequestHandler, but at this time I am not ready to
>>>> : open a JIRA issue and submit the changes back through the
>>>> : subversion. I will need to do a little more testing,
>>>> : documentation, and create some unit tests to cover all of these
>>>> : changes, but what I have been able to perform, it is working very
>>>> : well.
>>>>
>>>> Keep in mind "Yonik's Law Of Patches" ...
>>>>
>>>>     "A half-baked patch in Jira, with no documentation, no tests
>>>>      and no backwards compatibility is better than no patch at
>>>>      all."
>>>>     http://wiki.apache.org/solr/HowToContribute
>>>>
>>>> ...even if you don't think the code is "solid" yet, if you want to
>>>> eventually make it available to people, making a "rough" version
>>>> available to people early gives other people the opportunity to
>>>> help you make it solid (by writing unit tests, fixing bugs, and
>>>> adding documentation).
>>>>
>>>>
>>>> -Hoss
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Cuong Hoang
>>
>>
>>
>>
>> --
>> Regards,
>>
>> Cuong Hoang
>>
>
>
>

--
View this message in context: http://www.nabble.com/Re%3A-Spell-Check-Handler-tp13090498p15093599.html
Sent from the Solr - User mailing list archive at Nabble.com.