Re: Improving Solr Spell Checker Results

David Radunz Sun, 22 Jan 2012 04:42:51 -0800

Hey James,

I have played around a bit more with the settings and tried setting spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of the corrections, but it's the second one and not the first. Which is wrong if this is a patch for 'context sensitive', because it doesn't really seem to honor any context at all. Unless I am misunderstanding this? Also, I don't really like maxResultsForSuggest as it means 'all or nothing'. If you set it to 10 and there are 100 results, then you offer no corrections at all even if the term is missing in the dictionary entirely.

If I set spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3 and choose the collation with the largest 'hits' I get Sigourney Weaver and other 'popular' terms. But say I searched for 'pork and chups', the 'popular' correction is 'park and chips' whereas the first correction was correct: 'pork and chips'.
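
A minimal sketch of the trade-off just described, in Python. The hit counts are invented for illustration, and `pick_first` and `pick_most_hits` are hypothetical helper names, not part of Solr:

```python
# Hypothetical collations, in the order the spell checker returned them.
# The hit counts are invented for illustration only.
collations = [
    {"query": "pork and chips", "hits": 12},
    {"query": "park and chips", "hits": 40},
]


def pick_first(collations):
    """Trust the spell checker's own ordering."""
    return collations[0]["query"]


def pick_most_hits(collations):
    """Prefer whichever collation matches the most documents."""
    return max(collations, key=lambda c: c["hits"])["query"]


print(pick_first(collations))      # pork and chips
print(pick_most_hits(collations))  # park and chips
```

Neither rule looks at how the corrected terms relate to each other, which is the gap raised in the next paragraph.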

So really, none of the solutions either in this patch or Solr offer what I would truly call context sensitive spell checking. That being, in a full text search engine you find documents based on terms and how close they are together in the document. It makes more than perfect sense to treat the dictionary like this, so that when there are multiple terms it offers suggestions for the terms that match closely to what's entered surrounding the term.


Example:

"Sigourney Wever" would never appear in a document ever.

"Sigourney Weaver" however has many 'hits' in exactly that order of words.

So there needs to be a way to boost suggestions based on adjacency... much like the full text search operates.
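
One way the adjacency idea could look, as a rough Python sketch rather than anything Solr does today. The toy `documents` list, the per-term candidate lists, and the function names are all assumptions made for illustration; the scoring is plain bigram counting, swapped in as the simplest stand-in for "terms that appear next to each other":

```python
from collections import Counter
from itertools import product

# Toy corpus standing in for the indexed field (invented for illustration).
documents = [
    "alien starring sigourney weaver",
    "sigourney weaver returns in aliens",
    "the dream wever",  # a rare term that happens to exist in the index
]


def bigram_counts(docs):
    """Count how often each ordered pair of adjacent words occurs."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        counts.update(zip(words, words[1:]))
    return counts


def rank_by_adjacency(candidates_per_term, counts):
    """Score every combination of per-term candidates by its bigram counts."""
    scored = []
    for combo in product(*candidates_per_term):
        score = sum(counts[pair] for pair in zip(combo, combo[1:]))
        scored.append((score, " ".join(combo)))
    # Highest adjacency score first; ties keep the original candidate order.
    return sorted(scored, key=lambda item: -item[0])


counts = bigram_counts(documents)
ranked = rank_by_adjacency([["sigourney"], ["wever", "weaver"]], counts)
print(ranked[0])  # (2, 'sigourney weaver')
```

Here 'wever' exists in the corpus on its own, yet 'sigourney weaver' wins because only that pair ever occurs side by side.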


Thoughts?

David

On 22/01/2012 9:56 PM, David Radunz wrote:

James,
I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. So I have done that now and it seems to return 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could something have changed in the trunk to make your patch no longer work? I had to manually merge the setup for the test case due to a new 'hyphens' test case. The settings I am using are:
<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>

  <str name="spellcheck.onlyMorePopular">false</str>
  <int name="spellcheck.count">10</int>
  <str name="spellcheck.extendedResults">true</str>
  <str name="spellcheck.collate">true</str>
  <str name="spellcheck.collateExtendedResults">true</str>
  <int name="spellcheck.maxCollationTries">10</int>
  <int name="spellcheck.maxCollations">1</int>

  <int name="spellcheck.alternativeTermCount">5</int>
  <int name="spellcheck.maxResultsForSuggest">1</int>
</lst>

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spell</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.5</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">5</int>
  <int name="minQueryLength">4</int>
  <float name="maxQueryFrequency">0.01</float>

  <str name="spellcheckIndexDir">spellchecker</str>
  <str name="buildOnCommit">true</str>
</lst>

With the query:
spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
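
For readers who want to reproduce a request like this, here is a hedged Python sketch that assembles an equivalent query string with the standard library. It covers only a subset of the parameters above (the `fl` and `facet` settings are left out for brevity), and the variable names are illustrative:

```python
from urllib.parse import urlencode, parse_qs

phrase = "sigorney wever"

# Subset of the parameters from the request above; values are URL-encoded
# by urlencode, so spaces become '+' and quotes become %22.
params = {
    "spellcheck": "true",
    "spellcheck.q": phrase,
    "q": f'{phrase} title:"{phrase}"^100 series_name:"{phrase}"^50',
    "fq": 'store_id:"1"',
    "sort": "score desc,name asc,year_made desc",
    "start": 0,
    "rows": 5,
}

query_string = urlencode(params)
print(query_string)

# Round-trip check: decoding gives back the original values.
decoded = parse_qs(query_string)
print(decoded["spellcheck.q"])
```

Keeping the raw phrase in one variable makes it harder for `q` and `spellcheck.q` to drift apart when testing different misspellings.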
Cheers,

David


On 22/01/2012 2:03 AM, David Radunz wrote:
James,
Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. So I proceeded with trying your suggestions (for question 1 so far):
On 17/01/2012 5:32 AM, Dyer, James wrote:
David,
The spellchecker normally won't give suggestions for any term in your index. So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it. There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java. Any suggestions?
2. Try "onlyMorePopular=true" in your request (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.
Trying this did produce 'Signourney Weaver' as you would hope, but I am a little afraid of the downside. I would much prefer a context sensitive spell check that involves the terms around the correction.
3. If you're building your index on a <copyField />, you can add a stopword filter that filters out all of the misspelt or rare words from the field that the dictionary is based on. This could be an arduous task, and it may or may not work well for your data.
I am currently using a copyField for all terms that are relevant, which is quite a lot, and the dictionary would encompass a huge amount of data. Adding stopword filters would be out of the question as we presently have more than 30,000 products and this is for the initial launch; we intend to have many, many more.
As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right? The only way I know to handle this is to create a <copyField> that combines all of the fields you search across. Use this combined field to base your dictionary. Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits. The downside to doing this, of course, is it will make your first problem more acute in that there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context. Once again, SOLR-2585 is designed to tackle this problem but it is still in its early stages, and thus far it is Trunk-only.
I tried setting spellcheck.maxCollationTries to 5 to see if it would help with the above problem, but it did not.
I have now tried using it in the context of question 2. I tried searching for 'Sigorney Wever' in the series name (which it's not present in, as it's an actor):
spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
Suggestions for 'Sigourney' Wever were returned, but either no spelling suggestions or only ones from series names (which I doubt there would be) should have been returned.
You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 . Although this is unrelated to your two questions, the patch on this issue introduces a new "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly what you want. That is, you could (theoretically) create separate dictionaries for each of the fields you're searching and let the CSSC combine the results & generate collations, etc.
During the upgrade I switched to solr.DirectSolrSpellChecker, which I presume will help with this? I am a senior developer (in Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr source code. So I am in the dark when you say it could be tailored for my needs. Also, how would it work, query wise? Would it be like spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that sounds tempting to try and achieve. But if you could provide any pointers on what exactly would be required, that would really help.
Thanks again for your time,

David
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: David Radunz [mailto:[email protected]]
Sent: Friday, January 13, 2012 11:42 PM
To: [email protected]
Subject: Improving Solr Spell Checker Results

Hey,

      Firstly I would like to thank you all for creating such a great
searching platform. What I was wondering is whether it is possible to:

1. Have the spell checker take into account multiple words. For example
if I search for "Sigourney Wever" it doesn't flag as a spelling issue as
'wever' is a correctly spelled word. And if I searched for "Sigourney
Wevr" the suggestion is "Sigourney Wever". Of course the correct
spelling is: Sigourney Weaver
2. Have the spell checker return corrections only for dictionary items
added on the field being searched. i.e. Searching for an actor would
only use the dictionary fields from the actor. This makes sense on many
levels, as when you are field searching it's useless to get a correction
from another field as no values would match in any case.

Hopefully someone can help!

Thanks in advance,

David
