Re: Improving Solr Spell Checker Results

David Radunz Mon, 23 Jan 2012 06:47:35 -0800

Hey,

    Thanks for that, I have uploaded a new patch as advised.


Cheers,

David

On 23/01/2012 1:01 PM, Erick Erickson wrote:

David:

There's some good info here:
http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

But the short form is to go into solr_home and issue this command:
'svn diff > SOLR-2585.patch'. IDEs may also have a "create patch"
feature, but I find the straight SVN command more reliable.

Note I'm not saying that your patch will necessarily be picked up, but
it's a thoughtful gesture to upload a more current patch. In your
comments please identify what code line you're working on (4.x? 3.x?).

And when you upload, down near the bottom of the dialog box there'll be
a radio button about "grant ASF license" which is fairly important to
click for legal reasons....

Thanks
Erick

On Sun, Jan 22, 2012 at 5:54 PM, David Radunz <da...@boxen.net> wrote:

Hey Erick,

    Sure, can you explain the process to create the patch and upload it?
I'll do it first thing tomorrow.

Thanks again for your help,

David


On 23/01/2012 12:51 PM, Erick Erickson wrote:

I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either, that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed, change
and the patch doesn't apply cleanly. Since you've already had
to do this, could you upload your version that *does* apply
cleanly?

Best
Erick

On Sun, Jan 22, 2012 at 2:56 AM, David Radunz <da...@boxen.net> wrote:

James,

    I worked out that I actually needed to 'apply' patch SOLR-2585,
whoops. So I have done that now and it seems to return
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even
in the dictionary). Could something have changed in the trunk to make
your patch no longer work? I had to manually merge the setup for the
test case due to a new 'hyphens' test case. The settings I am using are:

<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>

<str name="spellcheck.onlyMorePopular">false</str>
<int name="spellcheck.count">10</int>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<int name="spellcheck.maxCollationTries">10</int>
<int name="spellcheck.maxCollations">1</int>

<int name="spellcheck.alternativeTermCount">5</int>
<int name="spellcheck.maxResultsForSuggest">1</int>
</lst>


<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spell</str>
<str name="classname">solr.DirectSolrSpellChecker</str>

<!-- the spellcheck distance measure used, the default is the internal levenshtein -->
<str name="distanceMeasure">internal</str>
<!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
<float name="accuracy">0.5</float>
<!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
<int name="maxEdits">2</int>
<!-- the minimum shared prefix when enumerating terms -->
<int name="minPrefix">1</int>
<!-- maximum number of inspections per result. -->
<int name="maxInspections">5</int>
<!-- minimum length of a query term to be considered for correction -->
<int name="minQueryLength">4</int>
<!-- maximum threshold of documents a query term can appear in to be considered for correction -->
<float name="maxQueryFrequency">0.01</float>
<!-- require suggestions to occur in 0.1% of the documents -->
<!--
<float name="thresholdTokenFrequency">0.001</float>
      -->

<str name="spellcheckIndexDir">spellchecker</str>
<str name="buildOnCommit">true</str>
</lst>

With the query:


spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,

David



On 22/01/2012 2:03 AM, David Radunz wrote:

James,

    Thanks again for your lengthy and informative response. I updated
from SVN trunk again today and was successfully able to run 'ant test'.
So I proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your
index.  So even if "wever" is misspelled in context, if it exists in the
index the spell checker will not try correcting it.  There are 3
workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
   See https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Signorney
Wever'. I didn't notice any difference, although I am a little unclear
as to what exactly this patch does. Nor am I really clear what to set
either of the options to, so I set them both to '5'. I tried to find the
test case it mentions, but it's not present in
SpellCheckCollatorTest.java. Any suggestions?

2. Try "onlyMorePopular=true" in your request
   (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
   But see the September 2, 2011 comment in SOLR-2585 about why this
   might not do what you'd hope it would.


Trying this did produce 'Signourney Weaver' as you would hope, but I am
a little afraid of the downside. I would much prefer a context-sensitive
spell check that takes into account the terms around the correction.


3. If you're building your index on a <copyField />, you can add a
   stopword filter that filters out all of the misspelt or rare words
   from the field that the dictionary is based on.  This could be an
   arduous task, and it may or may not work well for your data.
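
A minimal sketch of what workaround 3 could look like in schema.xml,
assuming the dictionary is built from a copyField named "spell". The
type name "textSpell", the source field "name" and the file
"spell_stopwords.txt" are hypothetical and would need to match your own
schema:

```xml
<!-- schema.xml sketch; textSpell, name and spell_stopwords.txt are assumed names -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- removes every term listed in the file, so it never reaches the dictionary field -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="spell_stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- the field the spellchecker reads, populated by copying from a searched field -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="spell"/>
```

The stopword file would hold one misspelt or rare term per line, which
is where the maintenance effort comes from.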

I am currently using a copyField for all terms that are relevant, which
is quite a lot, and the dictionary would encompass a huge amount of
data. Adding stopword filters would be out of the question, as we
presently have more than 30,000 products and this is only the initial
launch; we intend to have many, many more.


As for your second question, I take it you're using (e)dismax with
multiple fields in "qf", right?  The only way I know to handle this is
to create a <copyField> that combines all of the fields you search
across.  Use this combined field to base your dictionary.  Also,
specifying "spellcheck.maxCollationTries" with a non-zero value will
weed out the nonsense word combinations that are likely to occur when
doing this, ensuring that any collations provided will indeed yield
hits.  The downside to doing this, of course, is it will make your first
problem more acute in that there will be even more terms in your index
that the spellchecker will ignore entirely, even if they're misspelled
in context.  Once again, SOLR-2585 is designed to tackle this problem
but it is still in its early stages, and thus far it is Trunk-only.
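
As a rough sketch of the combined-field approach, assuming three
searched fields. The source field names "name", "actor" and
"series_name" and the type "text_general" are assumptions, not taken
from your schema:

```xml
<!-- schema.xml: one dictionary field fed by every searched field (source names assumed) -->
<field name="spell" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="spell"/>
<copyField source="actor" dest="spell"/>
<copyField source="series_name" dest="spell"/>

<!-- solrconfig.xml, request handler defaults: try up to 5 candidate collations against the index -->
<lst name="defaults">
  <str name="spellcheck.collate">true</str>
  <int name="spellcheck.maxCollationTries">5</int>
</lst>
```

The spellchecker's "field" setting would then point at "spell", so that
suggestions are drawn from the terms of all three source fields.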

I tried setting spellcheck.maxCollationTries to 5 to see if it would
help with the above problem, but it did not.

I have now tried using it in the context of question 2. I tried
searching for 'Sigorney Wever' in the series name (which it's not
present in, as it's an actor):



spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5

Suggestions for 'Sigourney' Wever were returned, but either no spelling
suggestions, or only ones from series names (which I doubt there would
be), should have been returned.

You might also be interested in
https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is
unrelated to your two questions, the patch on this issue introduces a
new "ConjunctionSolrSpellChecker" which theoretically could be enhanced
to do exactly what you want.  That is, you could (theoretically) create
separate dictionaries for each of the fields you're searching and let
the CSSC combine the results & generate collations, etc.
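
A sketch of what separate per-field dictionaries might look like in
solrconfig.xml. The dictionary names "actor_spell" and "series_spell"
and the field names "actor" and "series_name" are hypothetical, and this
shows only the dictionary definitions, not the combining step that the
SOLR-2993 patch is concerned with:

```xml
<!-- solrconfig.xml sketch; dictionary names and field names are assumed -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- dictionary built only from the terms of the actor field -->
  <lst name="spellchecker">
    <str name="name">actor_spell</str>
    <str name="field">actor</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <!-- dictionary built only from the terms of the series_name field -->
  <lst name="spellchecker">
    <str name="name">series_spell</str>
    <str name="field">series_name</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

A request could then select one dictionary by name, for example with
spellcheck.dictionary=actor_spell when the search is against the actor
field.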


During the upgrade I switched to solr.DirectSolrSpellChecker, which I
presume will help with this? I am a senior developer (in
Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr
source code, so I am in the dark when you say it could be tailored for
my needs. Also, how would it work, query-wise? Would it be like
spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that
sounds tempting to try and achieve. If you could provide any pointers on
what exactly would be required, that would really help.

Thanks again for your time,

David


James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: David Radunz [mailto:da...@boxen.net]
Sent: Friday, January 13, 2012 11:42 PM
To: solr-user@lucene.apache.org
Subject: Improving Solr Spell Checker Results

Hey,

      Firstly I would like to thank you all for creating such a great
searching platform. What I was wondering is whether it is possible to:

1. Have the spell checker take into account multiple words. For example,
   if I search for "Sigourney Wever" it doesn't flag as a spelling issue,
   as 'wever' is a correctly spelled word. And if I searched for
   "Sigourney Wevr" the suggestion is "Sigourney Wever". Of course the
   correct spelling is: Sigourney Weaver
2. Have the spell checker return corrections only for dictionary items
   added on the field being searched, i.e. searching for an actor would
   only use the dictionary fields from the actor. This makes sense on
   many levels, as when you are field searching it's useless to get a
   correction from another field as no values would match in any case.

Hopefully someone can help!

Thanks in advance,

David
