Re: Spell Check Handler

scott.tabar Thu, 11 Oct 2007 07:15:54 -0700

Climbingrose,

I think you make a valid point.  Each person may have a different concept of 
how something should work with their application.


My thoughts on spell checking multiple words:
  - the "multiWords" parameter enables spell checking of each word in the "q" 
parameter instead of on the whole field
  - each word is represented by its own entry in a list of all the words that 
are checked
  - within that entry, the word being checked is identified by the key "words"
  - the "exist" key indicates whether the word was found exactly as it is 
within the spell checker's index
  - since there can be suggestions for both misspelled and correctly spelled 
words, the list of suggestions is included for both, even if it is empty

  - My vision is that if a user has a search query of multiple words and 
wants to check them, "multiWords" will check all of the words at one time, 
independently of each other, and return the list.  The presenting web app can 
then show the user which words are misspelled and which ones also have 
suggestions.  The user can then work with the various lists of suggestions 
without having to re-hit Solr.  Naturally, if the user manually changes a 
word, Solr will have to be re-hit, but providing a single list of all words, 
with suggestions for correct words as well as incorrect ones, will help 
simplify applications (by reducing iteration over each word) and will help 
reduce the number of hits to the Solr server.


> 1) I assume that when a user enters a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if a user
> enters "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".

I think I understand what you mean in the above statement, but you must admit, 
it does sound funny.  After all, how do you identify that a word is misspelled 
without using the spelling checker?  Correct me if I am wrong, but I think you 
intended to say that suggestions should only be included for words that have 
been identified as misspelled.  If this is the case, then I would have to 
disagree with you.  The user may be interested in finding words that might mean 
the same but are more popular (appear in more indexed documents within the 
Lucene index).  That is why I added the result field "exist": it identifies 
that a word is spelled correctly even if there is a list of suggestions.  
Please note that a word can also be misspelled and have no suggestions, so one 
cannot use the suggestion list as an indicator of the correctness of the 
individual word(s).
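
To illustrate the point about "exist" versus the suggestion list, here is a 
minimal client-side sketch.  It assumes each entry of the response has already 
been parsed into a small holder object; the WordResult class, the method names, 
and the sample data are all made up for illustration and are not part of the 
patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MultiWordsSketch {
    // Hypothetical holder for one parsed entry; not a class from the patch.
    static final class WordResult {
        final String word;              // value of the "words" key
        final boolean exist;            // value of the "exist" key
        final List<String> suggestions; // may be empty in either case

        WordResult(String word, boolean exist, List<String> suggestions) {
            this.word = word;
            this.exist = exist;
            this.suggestions = suggestions;
        }
    }

    // Misspelled words are decided by "exist" alone, never by the suggestion list.
    static List<String> misspelled(List<WordResult> results) {
        List<String> out = new ArrayList<String>();
        for (WordResult r : results) {
            if (!r.exist) {
                out.add(r.word);
            }
        }
        return out;
    }

    // Correctly spelled words that still come with alternatives.
    static List<String> correctWithSuggestions(List<WordResult> results) {
        List<String> out = new ArrayList<String>();
        for (WordResult r : results) {
            if (r.exist && !r.suggestions.isEmpty()) {
                out.add(r.word);
            }
        }
        return out;
    }

    // Made-up sample data: a correct word with a suggestion, a misspelled word
    // with a suggestion, and a misspelled word with no suggestions at all.
    static List<WordResult> sample() {
        return Arrays.asList(
            new WordResult("life", true, Arrays.asList("live")),
            new WordResult("calculatar", false, Arrays.asList("calculator")),
            new WordResult("xyzzyq", false, Collections.<String>emptyList()));
    }

    public static void main(String[] args) {
        List<WordResult> results = sample();
        System.out.println("misspelled: " + misspelled(results));
        System.out.println("correct with suggestions: " + correctWithSuggestions(results));
    }
}
```

The third sample entry is the case mentioned above: a misspelled word with an 
empty suggestion list, which only the "exist" value can reveal.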
 

> 2) I only return the best string for a misspelled query.

You can also use the parameter "suggestionCount=1" to control how many words 
are returned.  In this case, it will do what your code is doing, but still 
allow the client to dynamically change this value without the need to hard code 
it within the main source code.


As far as only including terms that are more popular than the word that is 
being checked, there is already a parameter "onlyMorePopular" that you can use 
to dynamically control this feature from the client side so it does not have to 
be hard coded within the spelling checker.

Review these parameter options on the wiki, but keep in mind I have not updated 
the wiki with my changes or the new parameter and result fields:
http://wiki.apache.org/solr/SpellCheckerRequestHandler
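
As a rough sketch of how a client might set these parameters per request 
instead of hard coding them, the snippet below builds a request URL.  The 
parameter names "q", "suggestionCount", "onlyMorePopular", and "multiWords" 
are the ones discussed above; the base URL, the class name, and the method 
name are assumptions for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SpellCheckRequestSketch {
    // Builds a GET URL; baseUrl is assumed to have no query string of its own.
    static String buildUrl(String baseUrl, String query, int suggestionCount,
                           boolean onlyMorePopular, boolean multiWords) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", query);
        params.put("suggestionCount", Integer.toString(suggestionCount));
        params.put("onlyMorePopular", Boolean.toString(onlyMorePopular));
        params.put("multiWords", Boolean.toString(multiWords));

        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator).append(e.getKey()).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Form-encodes a value as UTF-8 (spaces become '+').
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical host and handler path; adjust to the local configuration.
        System.out.println(buildUrl("http://localhost:8983/solr/spellchecker",
                "life expectancy calculatar", 1, true, true));
    }
}
```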

   Thanks Climbingrose,

     Scott Tabar




---- climbingrose <[EMAIL PROTECTED]> wrote: 
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.
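
One way this could be made configurable, sketched without any Solr or Lucene 
dependency: read the flag from the request parameters and pass it through 
instead of the literal true.  The Suggester interface, the map of parameters, 
and the default of false are assumptions for illustration, not code from 
Scott's patch.

```java
import java.util.HashMap;
import java.util.Map;

public class MorePopularSketch {
    // Hypothetical stand-in for the spell checker; only the flag is of interest.
    interface Suggester {
        String[] suggest(String term, int count, boolean onlyMorePopular);
    }

    // Absent or non-"true" values fall back to false (an assumed default).
    static boolean readOnlyMorePopular(Map<String, String> params) {
        return Boolean.parseBoolean(params.get("onlyMorePopular"));
    }

    // Passes the request-supplied flag through instead of a hard-coded value.
    static String[] suggest(Suggester s, String term, int count, Map<String, String> params) {
        return s.suggest(term, count, readOnlyMorePopular(params));
    }

    public static void main(String[] args) {
        // Fake suggester that just reports which flag value it received.
        Suggester echo = new Suggester() {
            public String[] suggest(String term, int count, boolean onlyMorePopular) {
                return new String[] { term + ":" + onlyMorePopular };
            }
        };
        Map<String, String> params = new HashMap<String, String>();
        System.out.println(suggest(echo, "calculatar", 1, params)[0]);
        params.put("onlyMorePopular", "true");
        System.out.println(suggest(echo, "calculatar", 1, params)[0]);
    }
}
```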

On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I've been so busy the last few days that I haven't replied to this email. I
> modified SpellCheckerHandler a while ago to include support for multiword
> queries. To be honest, I didn't have time to write unit tests for the code.
> However, I deployed it in a production environment and it has been working
> for me so far. My version, however, makes two assumptions:
>
> 1) I assume that when a user enters a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if a user
> enters "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".
> 2) I only return the best string for a misspelled query.
>
> I guess I can just directly paste the code here so that others can adapt it
> for their own purposes. If you have any questions, just send me an email.
> I'll be happy to help you.
>
>         // Build a corrected query by checking each analyzed token on its own.
>         StringBuffer buf = null;
>         if (null != words && !"".equals(words.trim())) {
>             Analyzer analyzer =
>                 req.getSchema().getField(field).getType().getAnalyzer();
>
>             TokenStream source =
>                 analyzer.tokenStream(field, new StringReader(words));
>             Token t;
>             boolean hasSuggestion = false;
>             boolean termExists = false;
>             while (true) {
>                 try {
>                     t = source.next();
>                 } catch (IOException e) {
>                     t = null;  // treat a read error as end of stream
>                 }
>                 if (t == null)
>                     break;
>
>                 String termText = t.termText();
>                 // The last argument (morePopular) is hard coded to true.
>                 String[] suggestions = spellChecker.suggestSimilar(termText,
>                     numSug, req.getSearcher().getReader(), restrictToField,
>                     true);
>                 if (suggestions != null && suggestions.length > 0) {
>                     // Use the best suggestion in place of the term.
>                     if (!suggestions[0].equals(termText)) {
>                         hasSuggestion = true;
>                     }
>                     if (buf == null) {
>                         buf = new StringBuffer(suggestions[0]);
>                     } else
>                         buf.append(" ").append(suggestions[0]);
>                 } else if (spellChecker.exist(termText)) {
>                     // No suggestion, but the term is in the index: keep it.
>                     termExists = true;
>                     if (buf == null) {
>                         buf = new StringBuffer(termText);
>                     } else
>                         buf.append(" ").append(termText);
>                 } else {
>                     // Unknown term with no suggestion: give up on the query.
>                     hasSuggestion = false;
>                     termExists = false;
>                     break;
>                 }
>             }
>             try {
>                 source.close();
>             } catch (IOException e) {
>                 // ignore
>             }
>             // String[] suggestions = spellChecker.suggestSimilar(words,
>             //     numSug, nullReader, restrictToField, onlyMorePopular);
>             if (hasSuggestion || termExists)
>                 rsp.add("suggestions", buf.toString());
>             else
>                 rsp.add("suggestions", null);
>         }
>
>
>
> On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> > Hoss,
> >
> > I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
> >
> > For now, this is done.
> >
> > I made the changes, wrote JavaDoc comments on the various settings and
> > their expected output, created a JUnit test for the
> > SpellCheckerRequestHandler which tests various components of the handler,
> > and also created the supporting configuration files for the JUnit tests
> > (schema and solrconfig files).
> >
> > I attached the patch to the JIRA issue, so now we just have to wait until
> > it gets added back in to the main code stream.
> >
> > For anyone who is interested, here is a link to the JIRA:
> > https://issues.apache.org/jira/browse/SOLR-375
> >
> > Could someone please drop me a hint on how to update the wiki or any other
> > documentation that could benefit from being updated?  I'd like to help out
> > as much as possible, but first I need to know "how". ;-)
> >
> > When these changes do get committed back in to the daily build, please
> > review the generated JavaDoc for information on how to utilize these new
> > features.  If anyone has any questions or comments, please do not hesitate
> > to ask.
> >
> >
> > As a general note of self-critique on these changes, I am not 100% sure of
> > the way I implemented the "nested" structure when the "multiWords"
> > parameter is used.  My interest is that it should work smoothly with other
> > technology such as Prototype using the JSON output type.  Unfortunately, I
> > will not get a chance to start on that coding until next week, so it is up
> > in the air whether this structure will be conducive or not.  I am planning
> > on providing more details in the documentation on how to utilize these
> > modifications with Prototype and Ajax when I get a chance (even provide
> > links to a production site so you can see it in action and view the source
> > if interested).  So stay tuned...
> >
> >    Thanks for everyone's time,
> >       Scott Tabar
> >
> > ---- Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : If you like, I can post the source code changes that I made to the
> > : SpellCheckerRequestHandler, but at this time I am not ready to open a
> > : JIRA issue and submit the changes back through the subversion.  I will
> > : need to do a little more testing, documentation, and create some unit
> > : tests to cover all of these changes, but what I have been able to
> > : perform, it is working very well.
> >
> > Keep in mind "Yonik's Law Of Patches" ...
> >
> >         "A half-baked patch in Jira, with no documentation, no tests
> >         and no backwards compatibility is better than no patch at all."
> >         http://wiki.apache.org/solr/HowToContribute
> >
> > ...even if you don't think the code is "solid" yet, if you want to
> > eventually make it available to people, making a "rough" version available
> > to people early gives other people the opportunity to help you make it
> > solid (by writing unit tests, fixing bugs, and adding documentation).
> >
> >
> > -Hoss
> >
> >
> >
>
>
> --
> Regards,
>
> Cuong Hoang




-- 
Regards,

Cuong Hoang
