We have a catalog of media content that is ingested into Solr. We are trying to spell-check the titles of catalog items so that the client can correctly predict and correct (mis)typed text. The requirement is that the corrected text match a title in the catalog.

I have been playing around with the spellcheck component and the handler on Solr 4.2.

solrconfig.xml
--------------
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">mySpell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

<queryConverter name="queryConverter" class="com.foo.MultiWordSpellingQueryConverter"/>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">mySpell</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

schema.xml
----------
<types>
  <fieldType name="text_spell" class="solr.TextField" sortMissingLast="true"
             omitNorms="true" omitTermFreqAndPositions="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              splitOnCaseChange="1" preserveOriginal="0" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldType>
</types>

<fields>
  <field name="mySpell" type="text_spell" indexed="true" stored="true" multiValued="true" />
</fields>

<copyField source="title" dest="mySpell" />

Note that I am using a custom QueryConverter, defined as follows:

/* MultiWordSpellingQueryConverter.java */
package com.foo;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    private static Logger log = Logger.getLogger(MultiWordSpellingQueryConverter.class);

    // Debug output so I can tell whether Solr loads this class at all.
    static {
        System.out.println("********* Loading class MultiWordSpellingQueryConverter");
        log.fatal("********* Loading class MultiWordSpellingQueryConverter");
    }

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * The whole query string is returned as a single token.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert( String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        System.out.println("Original String : " + original);
        log.error("Original String : " + original);
        // One token spanning the entire input, offsets 0..length.
        final Token token = new Token( original.toCharArray(), 0, original.length(), 0, original.length() );
        return Arrays.asList( token );
    }
}

I followed the directions in another thread, http://lucene.472066.n3.nabble.com/Full-sentence-spellcheck-tt3265257.html#a3281189 , because I feel this is what I really want.

I have tried two ways of deploying the converter: placing the jar in the ${solr.home}/lib directory, and un-jarring solr.war, adding the jar built from the Java code above to the WEB-INF/lib directory, re-jarring it, and placing it in the web server's deploy directory. I cannot tell whether this class is even being invoked at spellcheck time. The queryConverter tag is defined in solrconfig.xml (see the solrconfig.xml definitions above).

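
As a sanity check on the maxEdits=2 setting above, here is a minimal sketch that computes a textbook Levenshtein distance for the terms in the test query below. The class and method names (EditDistanceCheck, levenshtein) are made up for illustration, and this is plain Levenshtein, which may not be exactly what the "internal" distance measure of DirectSolrSpellChecker computes.

```java
// Hypothetical helper, not part of Solr or Lucene.
final class EditDistanceCheck {

    /** Textbook Levenshtein distance (insert, delete, substitute) using two DP rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Per-word distances for the misspelled terms in spellcheck.q.
        System.out.println(levenshtein("charles", "charlie"));
        System.out.println(levenshtein("chocolat", "chocolate"));
        // Distance when the whole phrase is treated as one token.
        System.out.println(levenshtein("charles and the chocolat factory",
                                       "charlie and the chocolate factory"));
    }
}
```

The last call is there because my converter hands the whole phrase over as a single token. If the spellchecker really compares whole phrases, the phrase-level distance is the one that has to fit within maxEdits, not the per-word one. I have not confirmed which of the two applies in my setup.
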
Query:

http://localhost/solr/spell?q=((title:("charles%20and%20the%20chocolate%20factory")))&spellcheck.q=charles%20and%20the%20chocolat%20factory&spellcheck=true&spellcheck.collate=true

Of course, I have spelled "charles" incorrectly. The catalog does in fact contain a title named "Charlie and the chocolate factory", but the above query neither finds it nor collates well enough to correct the spelling. I believe the edit distance is about 2: "charles" should be spelled "charlie", so based on Levenshtein's algorithm I would expect the spellchecker to find it quickly and suggest it as the best match.

The suggested collations and their hit counts, extracted by my script from the response XML for the above query, look like this:

Title|Hits
charles and the chocolate factory|205808|
charles and the chocolate factor|205631|
charles and the chocolates factory|205508|
charley and the chocolate factory|203594|
charles and the chocolata factory|205506|
charles and the chocolate factoria|205544|
charles and the chocolates factor|205330|
charlet and the chocolate factory|203441|
charley and the chocolate factor|203417|
charley and the chocolates factory|203294|

What I would expect to see is "Charlie and the Chocolate Factory" at the very top of the list, since it is in my catalog verbatim. None of the collated suggestions listed above are in the catalog.

I am not sure how to achieve my goal of suggesting a corrected phrase that exists as a title in my catalog. I would appreciate any help on this front.

Thanks in advance.

Regards,
-- Sandeep