Hey,
I am trying to send this again as 'plain-text' to see if it
delivers ok this time. All of the previous messages I sent should be below..
Cheers,
David
On 22/01/2012 11:42 PM, David Radunz wrote:
Hey James,
I have played around a bit more with the settings and tried
setting spellcheck.maxResultsForSuggest=100 and
spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of
the corrections, but it's the second one and not the first. That seems
wrong if this is a patch for 'context sensitive' spell checking, because
it doesn't really seem to honor any context at all. Unless I am
misunderstanding this? Also, I don't really like maxResultsForSuggest as
it means 'all or nothing'. If you set it to 10 and there are 100 results,
then you offer no corrections at all, even if the term is missing from
the dictionary entirely.
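To make the 'all or nothing' complaint concrete, here is a toy model of the gate as described above. It is only a sketch of the behaviour I am describing, not Solr's actual code; the function name and arguments are made up for illustration.

```python
def suggestions_for(query_hits, candidates, max_results_for_suggest):
    """Toy model (hypothetical, not Solr code) of a hit-count gate.

    If the user's query already returns more hits than the threshold,
    every candidate correction is dropped, regardless of whether the
    query terms are in the dictionary.
    """
    if query_hits > max_results_for_suggest:
        return []  # all or nothing: no corrections offered at all
    return list(candidates)
```

With a threshold of 10 and a query returning 100 results, suggestions_for(100, ["weaver"], 10) returns an empty list, while suggestions_for(5, ["weaver"], 10) passes the candidate through.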
If I set spellcheck.maxResultsForSuggest=100 and
spellcheck.maxCollations=3 and choose the collation with the largest
'hits' I get Sigourney Weaver and other 'popular' terms. But say I
searched for 'pork and chups': the 'popular' correction is 'park and
chips', whereas the first correction was the correct one: 'pork and chips'.
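The two selection strategies can be sketched as follows. The hit counts and the first 'sigorney wever' collation are invented purely for illustration (only the ordering matters), and the names are hypothetical.

```python
from typing import NamedTuple


class Collation(NamedTuple):
    query: str
    hits: int


def first_collation(collations):
    """Take the collation the spell checker returned first."""
    return collations[0]


def most_popular_collation(collations):
    """Take the collation with the largest hit count."""
    return max(collations, key=lambda c: c.hits)


# Invented numbers: the correct answer is second in one list, first in the other.
ACTOR = [Collation("sidney weber", 3), Collation("sigourney weaver", 57)]
FOOD = [Collation("pork and chips", 12), Collation("park and chips", 40)]
```

Neither strategy is right for both queries: picking the first collation works for FOOD but not ACTOR, and picking the most popular works for ACTOR but not FOOD.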
So really, none of the solutions, either in this patch or in Solr,
offer what I would truly call context sensitive spell checking. That
is, a full text search engine finds documents based on terms and how
close together they are in the document. It makes perfect sense to
treat the dictionary the same way, so that when there are multiple
terms it offers suggestions that closely match what was entered
around the term.
Example:
"Sigourney Wever" would never appear in a document ever.
"Sigourney Weaver" however has many 'hits' in exactly that order
of words.
So there needs to be a way to boost suggestions based on adjacency...
Much like the full text search operates.
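A minimal sketch of what I mean by boosting on adjacency, assuming we can count how often two words appear next to each other in the indexed text. This is not how Solr or the SOLR-2585 patch works; the function names and the tiny corpus in the usage note are made up.

```python
from collections import Counter
from itertools import product


def bigram_counts(documents):
    """Count adjacent word pairs across all documents."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def rank_by_adjacency(candidates_per_term, counts):
    """Rank every combination of per-term candidates by adjacency.

    candidates_per_term holds one list of options per query term.
    A combination scores the sum of the bigram counts of its adjacent
    pairs; ties keep their original order because the sort is stable.
    """
    scored = []
    for combo in product(*candidates_per_term):
        score = sum(counts[pair] for pair in zip(combo, combo[1:]))
        scored.append((score, combo))
    scored.sort(key=lambda item: -item[0])
    return [" ".join(combo) for _, combo in scored]
```

Given documents such as "Sigourney Weaver stars in Alien" and "pork and chips", the candidates [["sigourney"], ["wever", "weaver"]] rank "sigourney weaver" first, and [["pork", "park"], ["and"], ["chips", "chops"]] rank "pork and chips" first, because only those word sequences actually occur side by side.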
Thoughts?
David
On 22/01/2012 9:56 PM, David Radunz wrote:
James,
I worked out that I actually needed to 'apply' patch SOLR-2585,
whoops. So I have done that now and it seems to return
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't
even in the dictionary). Could something have changed in the trunk to
make your patch no longer work? I had to manually merge the setup for
the test case due to a new 'hyphens' test case. The settings I am using
are:
<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="spellcheck.onlyMorePopular">false</str>
  <int name="spellcheck.count">10</int>
  <str name="spellcheck.extendedResults">true</str>
  <str name="spellcheck.collate">true</str>
  <str name="spellcheck.collateExtendedResults">true</str>
  <int name="spellcheck.maxCollationTries">10</int>
  <int name="spellcheck.maxCollations">1</int>
  <int name="spellcheck.alternativeTermCount">5</int>
  <int name="spellcheck.maxResultsForSuggest">1</int>
</lst>
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spell</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.5</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">2</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">1</int>
  <!-- maximum number of inspections per result. -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- require suggestions to occur in 0.1% of the documents -->
  <!--
  <float name="thresholdTokenFrequency">0.001</float>
  -->
  <str name="spellcheckIndexDir">spellchecker</str>
  <str name="buildOnCommit">true</str>
</lst>
With the query:
spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
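For readability, the request above can be split into its individual parameters. A small sketch using only Python's standard library (the variable names are mine):

```python
from urllib.parse import parse_qs

# The query string exactly as sent in the request above.
RAW_QUERY = (
    'spellcheck=true&facet=on'
    '&fl=id,sku,name,format,thumbnail,release_date,url_path,price,'
    'special_price,year_made_attr_opt_combo,primary_cat_id'
    '&sort=score+desc,name+asc,year_made+desc&start=0'
    '&q=sigorney+wever+title:"sigorney+wever"^100'
    '+series_name:"sigorney+wever"^50'
    '&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5'
)

# parse_qs decodes '+' as a space and returns a list of values per name.
params = parse_qs(RAW_QUERY)
for name in sorted(params):
    print(f"{name} = {params[name][0]}")
```

Note that the request passes rows=5, which overrides the default of 10 from the handler settings, and that spellcheck.q carries only the raw terms without the field boosts.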
Cheers,
David
On 22/01/2012 2:03 AM, David Radunz wrote:
James,
Thanks again for your lengthy and informative response. I
updated from SVN trunk again today and was successfully able to run
'ant test'. So I proceeded with trying your suggestions (for
question 1 so far):
On 17/01/2012 5:32 AM, Dyer, James wrote:
David,
The spellchecker normally won't give suggestions for any term in
your index. So even if "wever" is misspelled in context, if it
exists in the index the spell checker will not try correcting it.
There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x
only). See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Signorney
Wever'. I didn't notice any difference, although I am a little
unclear as to what exactly this patch does. Nor am I really clear
what to set either of the options to, so I set them both to '5'. I
tried to find the test case it mentions, but it's not present in
SpellCheckCollatorTest.java .. Any suggestions?
2. try "onlyMorePopular=true" in your request.
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
But see the September 2, 2011 comment in SOLR-2585 about why this
might not do what you'd hope it would.
Trying this did produce 'Sigourney Weaver' as you would hope, but I
am a little afraid of the downside. I would much prefer a context
sensitive spell check that involves the terms around the correction.
3. If you're building your index on a <copyField />, you can add a
stopword filter that filters out all of the misspelt or rare words
from the field that the dictionary is based on. This could be an
arduous task, and it may or may not work well for your data.
I am currently using a copyField for all terms that are relevant,
which is quite a lot, so the dictionary would encompass a huge
amount of data. Adding stopword filters would be out of the question,
as we presently have more than 30,000 products and this is only for
the initial launch; we intend to have many, many more.
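For what it's worth, the stopword idea in workaround 3 could be bootstrapped from document frequencies rather than by hand. A rough sketch, assuming the field text can be exported as plain strings; the function name and threshold are hypothetical, and this is not a Solr feature.

```python
from collections import Counter


def rare_terms(documents, min_doc_freq):
    """Return terms that occur in fewer than min_doc_freq documents.

    These are only *candidates* for a stopword list: a rare term may be
    a misspelling ('wever') or a perfectly valid but uncommon word.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, n in doc_freq.items() if n < min_doc_freq)
```

On the three documents "sigourney weaver", "sigourney weaver alien" and "wever street" with a threshold of 2, this flags 'wever' but also 'alien' and 'street', which shows why the approach may or may not work well on real data.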
As for your second question, I take it you're using (e)dismax with
multiple fields in "qf", right? The only way I know to handle this
is to create a <copyField> that combines all of the fields you
search across. Use this combined field to base your dictionary.
Also, specifying "spellcheck.maxCollationTries" with a non-zero
value will weed out the nonsense word combinations that are likely
to occur when doing this, ensuring that any collations provided
will indeed yield hits. The downside to doing this, of course, is
it will make your first problem more acute in that there will be
even more terms in your index that the spellchecker will ignore
entirely, even if they're misspelled in context. Once again,
SOLR-2585 is designed to tackle this problem but it is still in its
early stages, and thus far it is Trunk-only.
I tried setting spellcheck.maxCollationTries to 5 to see if it would
help with the above problem, but it did not.
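My understanding of what a non-zero spellcheck.maxCollationTries does, written as a toy model: candidate word combinations are tried against the index and only those that return hits are kept. The enumeration order, the count_hits callback, and the function name are assumptions for illustration, not Solr's implementation.

```python
from itertools import islice, product


def collate(candidates_per_term, count_hits, max_tries, max_collations):
    """Try up to max_tries combinations; keep those that yield hits.

    count_hits is a hypothetical callback that runs a query against the
    index and returns the number of matching documents.
    """
    collations = []
    for combo in islice(product(*candidates_per_term), max_tries):
        query = " ".join(combo)
        hits = count_hits(query)
        if hits > 0:
            collations.append((query, hits))
            if len(collations) == max_collations:
                break
    return collations
```

This explains why it weeds out nonsense combinations, and also why it cannot help with the first problem: if 'wever' is never offered a correction, no combination containing 'weaver' is ever tried.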
I have now tried using it in the context of question 2. I tried
searching for 'Sigorney Wever' in the series name (which it's not
present in, as it's an actor):
spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
Suggestions for 'Sigourney Wever' were returned, but I would have
expected either no spelling suggestions at all, or only ones drawn
from series names (which I doubt exist).
You might also be interested in
https://issues.apache.org/jira/browse/SOLR-2993 . Although this is
unrelated to your two questions, the patch on this issue introduces
a new "ConjunctionSolrSpellChecker" which theoretically could be
enhanced to do exactly what you want. That is, you could
(theoretically) create separate dictionaries for each of the fields
you're searching and let the CSSC combine the results & generate
collations, etc.
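The per-field idea can be sketched outside of Solr like this. It is a toy illustration only: it does not use the ConjunctionSolrSpellChecker API, the dictionaries are invented, and difflib's similarity ratio stands in for a real edit-distance measure.

```python
from difflib import get_close_matches

# Invented per-field dictionaries, one per searchable field.
FIELD_DICTIONARIES = {
    "actor": ["sigourney", "weaver"],
    "series_name": ["alien", "firefly"],
}


def suggest(term, fields, dictionaries=FIELD_DICTIONARIES, cutoff=0.7):
    """Merge suggestions from the dictionaries of the given fields only."""
    merged = []
    for field in fields:
        matches = get_close_matches(term.lower(), dictionaries[field], n=3, cutoff=cutoff)
        for match in matches:
            if match not in merged:
                merged.append(match)
    return merged
```

Here suggest("wever", ["actor"]) offers 'weaver', while suggest("wever", ["series_name"]) offers nothing, which is the behaviour asked for in the second question: corrections come only from the fields actually being searched.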
During the upgrade I switched to solr.DirectSolrSpellChecker, which
I presume will help with this? I am a senior developer (in
Java/Perl/Python/PHP) but I have not as yet looked at any of the
Solr source code, so I am in the dark when you say it could be
tailored for my needs. Also, how would it work query-wise? Would
it be something like spellcheck.series_name.q= and
spellcheck.actor.q= and so on? If so, that sounds tempting to try
and achieve. If you could provide any pointers on what exactly
would be required, that would really help.
Thanks again for your time,
David
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
-----Original Message-----
From: David Radunz [mailto:da...@boxen.net]
Sent: Friday, January 13, 2012 11:42 PM
To: solr-user@lucene.apache.org
Subject: Improving Solr Spell Checker Results
Hey,
Firstly I would like to thank you all for creating such a great
searching platform. What I was wondering is whether it is possible to:
1. Have the spell checker take into account multiple words. For example,
if I search for "Sigourney Wever" it doesn't flag it as a spelling issue,
as 'wever' is a correctly spelled word. And if I search for "Sigourney
Wevr" the suggestion is "Sigourney Wever". Of course the correct
spelling is: Sigourney Weaver
2. Have the spell checker return corrections only for dictionary items
added on the field being searched, i.e. searching for an actor would
only use the dictionary fields from the actor. This makes sense on many
levels, as when you are field searching it's useless to get a correction
from another field as no values would match in any case.
Hopefully someone can help!
Thanks in advance,
David