Hi, all. Sorry for any duplication - seems like what I sent yesterday never made it through...
We're having some trouble with the Solr spellcheck response. We're running version 3.1.

Overview:

If we search for something really ugly like "kljhklsdjahfkljsdhf book rck", the response comes back with a suggestions list for 'rck', but no suggestions list for the other two words. For 'book', that's fine, because it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be any suggestions. For the ugly thing, though, there aren't any hits. The problem is that when we're handling the result, we can't tell the difference between no suggestions for a 'correctly spelled' term and no suggestions for something odd like this.

(This is also happening with searches that aren't as obviously garbage, i.e. real words that just don't show up in the index and have no suggestions. The garbage term was just to illustrate the point.)

Our setup:

We're running multiple shards, which may be part of the issue. For example, 'book' might be found in one of the shards, but not another.

I don't *think* this has anything to do with our schema, since it's really about how the search suggestions are being returned to us. But here are some bits and pieces:

From schema.xml:

<!-- Text field for spell checking -->
<field name="textSpell" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>

From solrconfig.xml:

<!-- The spell check component can return a list of alternative spelling suggestions. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">textSpell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>

What we'd really like to see is the response coming back with an indication that a word wasn't found / had no suggestions.
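To make the goal concrete, the client-side handling could look something like the sketch below. It is only an illustration, not Solr code: TermStatusSketch, TermStatus and classify are hypothetical names, and it assumes a response shape in which a term that isn't in the index is listed with an empty suggestion list, while a correctly spelled term is left out altogether.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Solr.
public class TermStatusSketch {

    /** The three outcomes the client needs to tell apart. */
    public enum TermStatus { CORRECT, HAS_SUGGESTIONS, UNKNOWN_NO_SUGGESTIONS }

    /**
     * Classifies each query term from an already-parsed "suggestions" section.
     * ASSUMPTION: an absent term means "spelled correctly", and a term mapped to
     * an empty list means "not in the index, and nothing to suggest".
     */
    public static Map<String, TermStatus> classify(List<String> queryTerms,
                                                   Map<String, List<String>> suggestions) {
        // LinkedHashMap keeps the terms in the order the user typed them.
        Map<String, TermStatus> statuses = new LinkedHashMap<String, TermStatus>();
        for (String term : queryTerms) {
            List<String> alternatives = suggestions.get(term);
            if (alternatives == null) {
                statuses.put(term, TermStatus.CORRECT);
            } else if (alternatives.isEmpty()) {
                statuses.put(term, TermStatus.UNKNOWN_NO_SUGGESTIONS);
            } else {
                statuses.put(term, TermStatus.HAS_SUGGESTIONS);
            }
        }
        return statuses;
    }

    public static void main(String[] args) {
        Map<String, List<String>> suggestions = new LinkedHashMap<String, List<String>>();
        suggestions.put("rck", Arrays.asList("rock", "rick", "rack", "reck", "ruck"));
        suggestions.put("kljhklsdjahfkljsdhf", Collections.<String>emptyList());

        List<String> terms = Arrays.asList("kljhklsdjahfkljsdhf", "book", "rck");
        // prints {kljhklsdjahfkljsdhf=UNKNOWN_NO_SUGGESTIONS, book=CORRECT, rck=HAS_SUGGESTIONS}
        System.out.println(classify(terms, suggestions));
    }
}
```

With the stock response, the first two terms are indistinguishable, because neither of them appears in the suggestions section at all.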
We've hacked around in the code a little bit to do this, but were wondering if anyone has come across this, and what approaches you've taken.

We created new classes which extend IndexBasedSpellChecker and SpellCheckComponent, as follows (package and imports excluded for (sort of) brevity). The methods are as taken from the overridden classes, with changes noted by "SD" type comments...

/**
 * This has a slight modification of Solr's AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the suggestion. This modification working in tandem
 * with the SirsiDynixSpellCheckComponent allows words with no suggestions to be returned from the spell check component
 * even in a sharded search.
 * Changes are marked with SD in the comments.
 */
public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker{

  @Override
  public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
    boolean shardRequest = false;
    SolrParams params = options.customParams;
    if(params!=null)
    {
      shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
    }
    SpellingResult result = new SpellingResult(options.tokens);
    IndexReader reader = determineReader(options.reader);
    Term term = field != null ? new Term(field, "") : null;
    float theAccuracy = (options.accuracy == Float.MIN_VALUE) ? spellChecker.getAccuracy() : options.accuracy;

    int count = Math.max(options.count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);
    for (Token token : options.tokens) {
      String tokenText = new String(token.buffer(), 0, token.length());
      String[] suggestions = spellChecker.suggestSimilar(tokenText,
              count,
              field != null ? reader : null, //workaround LUCENE-1295
              field,
              options.onlyMorePopular, theAccuracy);
      if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
        //These are spelled the same, continue on
        List<String> suggList = Arrays.asList(suggestions); //SD added
        result.add(token, suggList); //SD added
        continue;
      }

      if (options.extendedResults == true && reader != null && field != null) {
        term = term.createTerm(tokenText);
        result.add(token, reader.docFreq(term));
        int countLimit = Math.min(options.count, suggestions.length);
        if(countLimit>0)
        {
          for (int i = 0; i < countLimit; i++) {
            term = term.createTerm(suggestions[i]);
            result.add(token, suggestions[i], reader.docFreq(term));
          }
        } else if(shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      } else {
        if (suggestions.length > 0) {
          List<String> suggList = Arrays.asList(suggestions);
          if (suggestions.length > options.count) {
            suggList = suggList.subList(0, options.count);
          }
          result.add(token, suggList);
        } else if(shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      }
    }
    return result;
  }
}

/**
 * This is a slight modification of Solr's SpellCheckComponent.toNamedList(boolean, SpellingResult, String, boolean, boolean).
 * The modification is designed so this class may work in tandem with SirsiDynixIndexBasedSpellChecker to return misspelled
 * words with no suggestions.
 */
public class SirsiDynixSpellCheckComponent extends SpellCheckComponent{

  @Override
  protected NamedList toNamedList(boolean shardRequest, SpellingResult spellingResult, String origQuery, boolean extendedResults, boolean collate) {
    NamedList result = new NamedList();
    Map<Token, LinkedHashMap<String, Integer>> suggestions = spellingResult.getSuggestions();
    boolean hasFreqInfo = spellingResult.hasTokenFrequencyInfo();
    boolean isCorrectlySpelled = false;
    int numSuggestions = 0;
    for(LinkedHashMap<String, Integer> theSuggestion : suggestions.values())
    {
      if(theSuggestion.size()>0)
      {
        numSuggestions++;
      }
    }

    // will be flipped to false if any of the suggestions are not in the index and hasFreqInfo is true
    if(numSuggestions > 0) {
      isCorrectlySpelled = true;
    }

    for (Map.Entry<Token, LinkedHashMap<String, Integer>> entry : suggestions.entrySet()) {
      Token inputToken = entry.getKey();
      Map<String, Integer> theSuggestions = entry.getValue();
      if (theSuggestions != null) {
        //SD removed "&& (theSuggestions.size()>0 || shardRequest)" This is to allow misspelled words with no suggestions.
        //(theSuggestions.size()>=0 is always true hence the removal)
        if(theSuggestions.size()>0 && !shardRequest && theSuggestions.containsKey(new String(inputToken.buffer(), 0, inputToken.length())))
        { //SD added if block
          continue; //if this is not a shardRequest and the word is not misspelled, don't add it to the list of misspelled words
        }
        SimpleOrderedMap suggestionList = new SimpleOrderedMap();
        suggestionList.add("numFound", theSuggestions.size());
        suggestionList.add("startOffset", inputToken.startOffset());
        suggestionList.add("endOffset", inputToken.endOffset());

        // Logical structure of normal (non-extended) results:
        // "suggestion":["alt1","alt2"]
        //
        // Logical structure of the extended results:
        // "suggestion":[
        //     {"word":"alt1","freq":7},
        //     {"word":"alt2","freq":4}
        // ]
        if (extendedResults && hasFreqInfo) {
          suggestionList.add("origFreq", spellingResult.getTokenFrequency(inputToken));

          ArrayList<SimpleOrderedMap> sugs = new ArrayList<SimpleOrderedMap>();
          suggestionList.add("suggestion", sugs);
          for (Map.Entry<String, Integer> suggEntry : theSuggestions.entrySet()) {
            SimpleOrderedMap sugEntry = new SimpleOrderedMap();
            sugEntry.add("word",suggEntry.getKey());
            sugEntry.add("freq",suggEntry.getValue());
            sugs.add(sugEntry);
          }
        } else {
          suggestionList.add("suggestion", theSuggestions.keySet());
        }

        if (hasFreqInfo) {
          isCorrectlySpelled = isCorrectlySpelled && spellingResult.getTokenFrequency(inputToken) > 0;
        }

        result.add(new String(inputToken.buffer(), 0, inputToken.length()), suggestionList);
      }
    }
    if (hasFreqInfo) {
      result.add("correctlySpelled", isCorrectlySpelled);
    } else if(extendedResults && suggestions.size() == 0) { // if the word is misspelled, it's added to suggestions with freqinfo
      result.add("correctlySpelled", true);
    }
    return result;
  }
}

Here's the XML we're getting back from the search (before applying the modified code):

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">56</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="facet">true</str>
      <str name="sort">score desc, RELEVANCE_SORT_nsort desc</str>
      <str name="shards.qt">spellcheckedStandard</str>
      <str name="hl.mergeContiguous">true</str>
      <str name="facet.limit">1000</str>
      <str name="hl">true</str>
      <str name="fl"> ELECTRONIC_ACCESS_display ISBN_display TITLE_boost FORMAT_display score MEDIA_TYPE_display AUTHOR_boost LOCALURL_display UPC_display id DOC_ID_display CHILD_SITE_display DS_EC PRIMARY_AUTHOR_boost PRIMARY_TITLE_boost DS_ID TOPIC_display ASSET_NAME_display OCLC_display</str>
      <str name="shards">localhost:8983/solr/SD_ILS/,localhost:8983/solr/SD_ASSET/</str>
      <arr name="facet.field">
        <str>AUTHOR_facet</str>
        <str>FORMAT_facet</str>
        <str>LANGUAGE_facet</str>
        <str>PUBDATE_nfacet</str>
        <str>SUBJECT_facet</str>
        <str>ABCDEF_cfacet</str>
      </arr>
      <str name="qt">spellcheckedStandard</str>
      <arr name="fq">
        <str>ACCESS_LEVEL_nfacet:"0"</str>
        <str>CLEARANCE_nfacet:"0"</str>
        <str>NEED_TO_KNOWS_facet:"@@EMPTY@@"</str>
        <str>CITIZENSHIPS_facet:"@@EMPTY@@"</str>
        <str>RESTRICTIONS_facet:"@@EMPTY@@"</str>
      </arr>
      <str name="facet.mincount">1</str>
      <str name="indent">true</str>
      <str name="hl.fl">*</str>
      <str name="rows">12</str>
      <str name="hl.snippets">5</str>
      <str name="start">0</str>
      <str name="q">TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^200.0 OR PRIMARY_AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^100.0 OR DOC_TEXT:"kljhklsdjahfkljsdhf book rck"~100^2 OR PRIMARY_TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^1000.0 OR AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^20.0 OR textFuzzy:kljhklsdjahfkljsdhf~0.7 AND textFuzzy:book~0.7 AND textFuzzy:rck~0.7</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="AUTHOR_facet"/>
      <lst name="FORMAT_facet"/>
      <lst name="LANGUAGE_facet"/>
      <lst name="PUBDATE_nfacet"/>
      <lst name="SUBJECT_facet"/>
      <lst name="ABCDEF_cfacet"/>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
  <lst name="highlighting"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="rck">
        <int name="numFound">5</int>
        <int name="startOffset">362</int>
        <int name="endOffset">365</int>
        <int name="origFreq">0</int>
        <arr name="suggestion">
          <lst>
            <str name="word">rock</str>
            <int name="freq">24000</int>
          </lst>
          <lst>
            <str name="word">rick</str>
            <int name="freq">6048</int>
          </lst>
          <lst>
            <str name="word">rack</str>
            <int name="freq">84</int>
          </lst>
          <lst>
            <str name="word">reck</str>
            <int name="freq">78</int>
          </lst>
          <lst>
            <str name="word">ruck</str>
            <int name="freq">30</int>
          </lst>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
</response>

Thanks!

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com<http://www.sirsidynix.com/>