Re: Improving Solr Spell Checker Results

Erick Erickson Sun, 22 Jan 2012 18:02:14 -0800

David:

There's some good info here:
http://wiki.apache.org/solr/HowToContribute#Working_With_Patches


But the short form is to go into solr_home and issue this command:
'svn diff > SOLR-2585.patch'. IDEs may also have a "create patch"
feature, but I find the straight SVN command more reliable.
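
If you'd rather script that step, here is a minimal Python sketch of the
same command. The helper names (patch_command, write_patch) are made up for
illustration, and it assumes the svn client is on your PATH and that the
directory you pass in is the root of your checkout.

```python
import subprocess
from pathlib import Path


def patch_command(issue_key):
    """Return the svn argument list and the patch file name for a JIRA issue key."""
    return ["svn", "diff"], f"{issue_key}.patch"


def write_patch(checkout_dir, issue_key):
    """Run 'svn diff' in checkout_dir and save its output as <issue_key>.patch."""
    cmd, patch_name = patch_command(issue_key)
    patch_path = Path(checkout_dir) / patch_name
    with patch_path.open("w") as out:
        # Same effect as: cd checkout_dir && svn diff > SOLR-2585.patch
        subprocess.run(cmd, cwd=checkout_dir, stdout=out, check=True)
    return patch_path
```

Calling write_patch("/path/to/solr_home", "SOLR-2585") (the path is a
placeholder) should leave SOLR-2585.patch in the checkout root, ready to
attach to the issue.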

Note I'm not saying that your patch will necessarily be picked up, but
it's a thoughtful gesture to upload a more current patch. In your
comments please identify what code line you're working on (4.x? 3.x?).

And when you upload, down near the bottom of the dialog box there'll be
a radio button about "grant ASF license" which is fairly important to
click for legal reasons....

Thanks
Erick

On Sun, Jan 22, 2012 at 5:54 PM, David Radunz <da...@boxen.net> wrote:
> Hey Erick,
>
>    Sure. Can you explain the process to create the patch and upload it?
> I'll do it first thing tomorrow.
>
> Thanks again for your help,
>
> David
>
>
> On 23/01/2012 12:51 PM, Erick Erickson wrote:
>>
>> I can't help with your *real* problem, but when looking at patches,
>> if the "resolution" field isn't set to something like "fixed" it means
>> that the patch has NOT been applied to any code lines. There
>> also should be commit revisions specified in the comments.
>> If "Fix Versions" has values, that doesn't mean the patch has
>> been applied either. That's often just a statement of where
>> the patch *should* go.
>>
>> And, between the time someone uploads a patch and the time it actually
>> gets *committed*, the underlying code line can indeed change
>> so that the patch no longer applies cleanly. Since you've already had
>> to deal with this, could you upload your version that *does* apply
>> cleanly?
>>
>> Best
>> Erick
>>
>> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz <da...@boxen.net> wrote:
>>>
>>> James,
>>>
>>>    I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
>>> So I have done that now and it seems to return 'correctlySpelled=true' for
>>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
>>> something have changed in the trunk to make your patch no longer work? I had
>>> to manually merge the setup for the test case due to a new 'hyphens' test
>>> case. The settings I am using are:
>>>
>>> <lst name="defaults">
>>> <str name="echoParams">explicit</str>
>>> <int name="rows">10</int>
>>>
>>> <str name="spellcheck.onlyMorePopular">false</str>
>>> <int name="spellcheck.count">10</int>
>>> <str name="spellcheck.extendedResults">true</str>
>>> <str name="spellcheck.collate">true</str>
>>> <str name="spellcheck.collateExtendedResults">true</str>
>>> <int name="spellcheck.maxCollationTries">10</int>
>>> <int name="spellcheck.maxCollations">1</int>
>>>
>>> <int name="spellcheck.alternativeTermCount">5</int>
>>> <int name="spellcheck.maxResultsForSuggest">1</int>
>>> </lst>
>>>
>>>
>>> <lst name="spellchecker">
>>> <str name="name">default</str>
>>> <str name="field">spell</str>
>>> <str name="classname">solr.DirectSolrSpellChecker</str>
>>>
>>> <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
>>> <str name="distanceMeasure">internal</str>
>>> <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
>>> <float name="accuracy">0.5</float>
>>> <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
>>> <int name="maxEdits">2</int>
>>> <!-- the minimum shared prefix when enumerating terms -->
>>> <int name="minPrefix">1</int>
>>> <!-- maximum number of inspections per result. -->
>>> <int name="maxInspections">5</int>
>>> <!-- minimum length of a query term to be considered for correction -->
>>> <int name="minQueryLength">4</int>
>>> <!-- maximum fraction of documents a query term can appear in and still be
>>> considered for correction -->
>>> <float name="maxQueryFrequency">0.01</float>
>>> <!-- require suggestions to occur in 0.1% of the documents -->
>>> <!--
>>> <float name="thresholdTokenFrequency">0.001</float>
>>>      -->
>>>
>>> <str name="spellcheckIndexDir">spellchecker</str>
>>> <str name="buildOnCommit">true</str>
>>> </lst>
>>>
>>> With the query:
>>>
>>>
>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
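
A side note on the request above: hand-assembling a query string like this
makes it easy to get the escaping wrong. Below is a minimal Python sketch
that builds a cut-down version of it with the standard library. The function
name build_spellcheck_query is hypothetical, and only a subset of the
parameters from the request is included.

```python
from urllib.parse import urlencode


def build_spellcheck_query(terms, store_id="1", rows=5):
    """Build a URL-encoded Solr query string that also asks for spellchecking."""
    phrase = f'"{terms}"'
    params = [
        ("spellcheck", "true"),
        # The raw terms go to the spellchecker, without field boosts.
        ("spellcheck.q", terms),
        # Main query: bare terms plus boosted phrase matches on two fields.
        ("q", f"{terms} title:{phrase}^100 series_name:{phrase}^50"),
        ("fq", f'store_id:"{store_id}"'),
        ("rows", str(rows)),
    ]
    # urlencode escapes quotes, colons and carets, and turns spaces into '+'.
    return urlencode(params)
```

build_spellcheck_query("sigorney wever") returns a string that can be
appended after the '?' of the select handler URL. Keeping spellcheck.q
separate from q means the spellchecker only sees the words themselves, not
the field names and boosts.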
>>>
>>> Cheers,
>>>
>>> David
>>>
>>>
>>>
>>> On 22/01/2012 2:03 AM, David Radunz wrote:
>>>>
>>>> James,
>>>>
>>>>    Thanks again for your lengthy and informative response. I updated from
>>>> SVN trunk again today and was successfully able to run 'ant test'. So I
>>>> proceeded with trying your suggestions (for question 1 so far):
>>>>
>>>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>>>
>>>>> David,
>>>>>
>>>>> The spellchecker normally won't give suggestions for any term in your
>>>>> index.  So even if "wever" is misspelled in context, if it exists in the
>>>>> index the spell checker will not try correcting it.  There are 3
>>>>> workarounds:
>>>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>>>>>  See https://issues.apache.org/jira/browse/SOLR-2585
>>>>
>>>> I have tried using this with the original test case of 'Signorney Wever'.
>>>> I didn't notice any difference, although I am a little unclear as to what
>>>> exactly this patch does. Nor am I really clear what to set either of the
>>>> options to, so I set them both to '5'. I tried to find the test case it
>>>> mentions, but it's not present in SpellCheckCollatorTest.java. Any
>>>> suggestions?
>>>>
>>>>> 2. Try "onlyMorePopular=true" in your request
>>>>>  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>>>>>  But see the September 2, 2011 comment in SOLR-2585 about why this might
>>>>> not do what you'd hope it would.
>>>>
>>>>
>>>> Trying this did produce 'Signourney Weaver' as you would hope, but I am a
>>>> little afraid of the downside. I would much prefer a context-sensitive
>>>> spell check that takes the terms around the correction into account.
>>>>>
>>>>>
>>>>> 3. If you're building your index on a <copyField />, you can add a
>>>>> stopword filter that filters out all of the misspelt or rare words from
>>>>> the field that the dictionary is based on.  This could be an arduous
>>>>> task, and it may or may not work well for your data.
>>>>
>>>> I am currently using a copyField for all terms that are relevant, which is
>>>> quite a lot, so the dictionary would encompass a huge amount of data.
>>>> Adding stopword filters would be out of the question, as we presently have
>>>> more than 30,000 products and this is only the initial launch; we intend
>>>> to have many, many more.
>>>>>
>>>>>
>>>>> As for your second question, I take it you're using (e)dismax with
>>>>> multiple fields in "qf", right?  The only way I know to handle this is to
>>>>> create a <copyField> that combines all of the fields you search across.
>>>>> Base your dictionary on this combined field.  Also, specifying
>>>>> "spellcheck.maxCollationTries" with a non-zero value will weed out the
>>>>> nonsense word combinations that are likely to occur when doing this,
>>>>> ensuring that any collations provided will indeed yield hits.  The
>>>>> downside to doing this, of course, is it will make your first problem
>>>>> more acute, in that there will be even more terms in your index that the
>>>>> spellchecker will ignore entirely, even if they're misspelled in context.
>>>>> Once again, SOLR-2585 is designed to tackle this problem but it is still
>>>>> in its early stages, and thus far it is Trunk-only.
>>>>
>>>> I tried setting spellcheck.maxCollationTries to 5 to see if it would help
>>>> with the above problem, but it did not.
>>>>
>>>> I have now tried using it in the context of question 2. I tried searching
>>>> for 'Sigorney Wever' in the series name (which it's not present in, as
>>>> it's an actor):
>>>>
>>>>
>>>>
>>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
>>>>
>>>> Suggestions for 'Sigourney' Wever were returned, but either no spelling
>>>> suggestions at all, or only ones drawn from series names (which I doubt
>>>> there would be), should have been returned.
>>>>
>>>>> You might also be interested in
>>>>> https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is
>>>>> unrelated to your two questions, the patch on this issue introduces a new
>>>>> "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do
>>>>> exactly what you want.  That is, you could (theoretically) create
>>>>> separate dictionaries for each of the fields you're searching and let the
>>>>> CSSC combine the results & generate collations, etc.
>>>>
>>>>
>>>> During the upgrade I switched to solr.DirectSolrSpellChecker, which I
>>>> presume will help with this? I am a senior developer (in
>>>> Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr
>>>> source code, so I am in the dark when you say it could be tailored for my
>>>> needs. Also, how would it work, query-wise? Would it be something like
>>>> spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that
>>>> sounds tempting to try and achieve. If you could provide any pointers on
>>>> what exactly would be required, that would really help.
>>>>
>>>> Thanks again for your time,
>>>>
>>>> David
>>>>>
>>>>>
>>>>> James Dyer
>>>>> E-Commerce Systems
>>>>> Ingram Content Group
>>>>> (615) 213-4311
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Radunz [mailto:da...@boxen.net]
>>>>> Sent: Friday, January 13, 2012 11:42 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Improving Solr Spell Checker Results
>>>>>
>>>>> Hey,
>>>>>
>>>>>      Firstly I would like to thank you all for creating such a great
>>>>> searching platform. What I was wondering is whether it is possible to:
>>>>>
>>>>> 1. Have the spell checker take into account multiple words. For example,
>>>>> if I search for "Sigourney Wever" it doesn't flag a spelling issue, as
>>>>> 'wever' is a correctly spelled word. And if I search for "Sigourney
>>>>> Wevr" the suggestion is "Sigourney Wever". Of course the correct
>>>>> spelling is: Sigourney Weaver
>>>>> 2. Have the spell checker return corrections only for dictionary items
>>>>> added on the field being searched, i.e. searching for an actor would
>>>>> only use the dictionary fields from the actor. This makes sense on many
>>>>> levels, as when you are field searching it's useless to get a correction
>>>>> from another field, since no values would match in any case.
>>>>>
>>>>> Hopefully someone can help!
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> David
>>>>
>>>>
>
