Solr Multiword Search
We have a catalog of media content which is ingested into Solr. We are trying to spell-check the title of each catalog item, so that the client can correctly predict and correct (mis)typed text. The requirement is that the corrected text match a title in the catalog.

I have been experimenting with the spellcheck component and its handler on Solr 4.2.

solrconfig.xml --

    text_spell default mySpell solr.DirectSolrSpellChecker internal 0.5 2 1 5 4 0.01 mySpell default on true 10 5 5 true true 10 10 spellcheck

schema.xml

Notice that I am using a custom QueryConverter, defined as follows:

    /* MultiWordSpellingQueryConverter.java */
    package com.foo;

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.Collections;

    import org.apache.log4j.Logger;
    import org.apache.lucene.analysis.Token;
    import org.apache.solr.spelling.QueryConverter;

    public class MultiWordSpellingQueryConverter extends QueryConverter {

        private static Logger log = Logger.getLogger(MultiWordSpellingQueryConverter.class);

        // Debug output to confirm that the class is loaded at all.
        static {
            System.out.println("* Loading class MultiWordSpellingQueryConverter");
            log.fatal("* Loading class MultiWordSpellingQueryConverter");
        }

        /**
         * Converts the original query string to a collection of Lucene Tokens.
         *
         * @param original the original query string
         * @return a Collection of Lucene Tokens
         */
        @Override
        public Collection<Token> convert( String original ) {
            if ( original == null ) {
                return Collections.emptyList();
            }
            System.out.println("Original String : " + original);
            log.error("Original String : " + original);
            // Wrap the whole query string in a single token so it is spell-checked as one unit.
            final Token token = new Token( original.toCharArray(), 0, original.length(), 0, original.length() );
            return Arrays.asList( token );
        }
    }

I followed the directions in another thread, http://lucene.472066.n3.nabble.com/Full-sentence-spellcheck-tt3265257.html#a3281189 , because I feel this is what I really want.
I have tried both placing the jar in the ${solr.home}/lib directory and un-jarring solr.war, adding the jar built from the Java code above to the WEB-INF/lib directory, re-jarring it, and placing it in the web server's deploy directory. I cannot tell whether this class is even being invoked at spellcheck time. I have the queryConverter tag defined in the solrconfig.xml file (refer to the solrconfig.xml definitions above).

Query:

    http://localhost/solr/spell?q=((title:("charles%20and%20the%20chocolate%20factory")))&spellcheck.q=charles%20and%20the%20chocolat%20factory&spellcheck=true&spellcheck.collate=true

Of course I have spelt charles incorrectly. The catalog does in fact contain a title named "Charlie and the chocolate factory", and the above query neither finds it nor collates well enough to correct the spelling. I believe the edit distance is about 2: Charles should be spelt Charlie, so based on Levenshtein's algorithm I would expect it to find this quickly and suggest it as the best match.

Suggestions from my script look like the following:

    Title|Hits
    charles and the chocolate factory|205808|
    charles and the chocolate factor|205631|
    charles and the chocolates factory|205508|
    charley and the chocolate factory|203594|
    charles and the chocolata factory|205506|
    charles and the chocolate factoria|205544|
    charles and the chocolates factor|205330|
    charlet and the chocolate factory|203441|
    charley and the chocolate factor|203417|
    charley and the chocolates factory|203294|

The list above shows the suggested collations and their hit counts, all extracted from the response XML for the above query. What I would expect to see is "Charlie and the Chocolate Factory" at the very top of the list, since it is in my catalog verbatim. None of the collations listed above are in the catalog. I am not sure how to achieve my goal of suggesting a corrected phrase that exists as a title in my catalog.
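To make the requirement concrete ("the corrected text must match a title in the catalog"), here is a minimal client-side sketch that keeps only the collations equal to a catalog title. It is an illustration only, not part of the Solr configuration above: the class name CatalogCollationFilter and the method keepCatalogTitles are made up for this example, and the catalog is assumed to be available as an in-memory list of titles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Solr class: filters collations against known titles.
public class CatalogCollationFilter {

    // Returns the collations that equal some catalog title, ignoring case.
    static List<String> keepCatalogTitles(List<String> collations, List<String> catalogTitles) {
        Set<String> titles = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        titles.addAll(catalogTitles);
        List<String> kept = new ArrayList<>();
        for (String collation : collations) {
            if (titles.contains(collation.trim())) {
                kept.add(collation);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed catalog content: the one title mentioned in this post.
        List<String> catalog = Arrays.asList("Charlie and the Chocolate Factory");
        // The ten collations reported in this post.
        List<String> collations = Arrays.asList(
                "charles and the chocolate factory",
                "charles and the chocolate factor",
                "charles and the chocolates factory",
                "charley and the chocolate factory",
                "charles and the chocolata factory",
                "charles and the chocolate factoria",
                "charles and the chocolates factor",
                "charlet and the chocolate factory",
                "charley and the chocolate factor",
                "charley and the chocolates factory");
        System.out.println(keepCatalogTitles(collations, catalog));
    }
}
```

With the ten collations from this post the filter keeps nothing, which is the problem in a nutshell: every suggestion is a plausible word-by-word correction, but none of them is an actual title.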
I would appreciate any help on this front. Thanks in advance.

Regards,
-- Sandeep

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Multiword Search
I have been trying to use MultiWordSpellingQueryConverter.java, since I need to be able to find the documents that correspond to the suggested collations. At the moment it seems to be producing collations based on word matches: arbitrary words from the field are picked up to form a collation, so nothing corresponds to any of the titles in our set of indexed documents.

Could anyone please confirm that this would work if I took the following steps?

Steps:
1. Get the solr4.2.war file.
2. Go to WEB-INF/lib and add lucene-core-4.2.0.jar and solr-core-4.2.0.jar to the classpath to compile MultiWordSpellingQueryConverter.java. The code for this is in my previous post in this thread.
3. jar cvf multiwordspellchecker.jar com/foo/MultiWordSpellingQueryConverter.java
4. Copy this jar to the $SOLR_HOME/lib directory.
5. Define queryConverter. Question: where does this need to go? I have just put it somewhere between the searchComponent and the requestHandler for spell checks.
6. Start the web server. I see this jar file getting registered at startup:
   2013-04-03 12:56:22,243 INFO [org.apache.solr.core.SolrResourceLoader] (coreLoadExecutor-3-thread-1) Adding 'file:/solr/lib/multiwordspellchecker.jar' to classloader
7. When I run the spell query, I don't see my print statements, so I am not sure if this code is really being called. I don't think it is the logging that is failing, but rather that this code is not being called at all.

I would appreciate any information on what I might be doing wrong. Please help. Thanks.

Regards,
-- Sandeep
RE: Solr Multiword Search
Hi James,

Thanks for the information you have provided. I tried your suggestion and it helped a lot. However, as close as this seems to what I want, I still need it to match the exact phrases that closely match my search words.

I am now using the search words in both q and spellcheck.q (which I believe only comes into play when there are no matches for the phrase entered and collations have to be provided). It not only finds "Charlie and the Chocolate Factory", it also finds any title that contains factory or charles (just as you said it would). I also tried your suggestion of spellcheck.alternativeTermCount and set it to 5 (>0) in my solrconfig.xml, and this still did the same thing. I am not using queryConverter at all any more, thanks for that suggestion. I still need it to find the closest match for the phrase among the titles.

My query now is:

    solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

The results are anything that matches any of the words (charles, and, the, factory), so I get lots of matches (bad for performance). If I group the above query on the content type, it ends up producing bogus results in categories that don't have a title even remotely close to "Charlie and the Chocolate Factory".

Can this work somehow? If it finds a doc that has a low score, it should just not include it in the results. Is there a way to set a score threshold and only present documents above it, from the terms-matched perspective? I am getting a lot of matches for "and the" just because that is in the phrase being searched. I know I can make them stopwords so that they are ignored. Suggestions should be the closest matches and nothing more. Can this be done?

Appreciate your help. Thanks.
-- Sandeep
RE: Solr Multiword Search
The following query does a word search (based on my previous post):

    solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

It produces a lot of unwanted matches. In order to do a phrase search, I changed it to:

    solr/spell?q=("charles+and+the+choclit+factory"+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

This does not find any match for the words in the phrase I am looking for, and it does poorly in the suggested collations. I want phrase corrections. How do I achieve this?

"charles and the chocolit factory" produces the following collations:

    false
    charles and the chocolat factory 2849777 charles and the chocolat factory
    charles and the chocalit factory 2849464 charles and the chocalit factory
    charles and the chocolat factors 2841190 charles and the chocolat factors
    charley and the chocolat factory 2827908 charley and the chocolat factory
    charles and the chocalit factors 2840877 charles and the chocalit factors
    charles and the chocklit factory 2849464 charles and the chocklit factory
    charles and the chocolat factorz 2841173 charles and the chocolat factorz
    charley and the chocalit factory 2827595 charley and the chocalit factory
    charley and the chocolat factors 2819321 charley and the chocolat factors
    charlies and the chocolat factory 2826661 charlies and the chocolat factory

Notice the number of hits. This does not look right. Please help.

Thanks.
-- Sandeep
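The only difference between the two requests above is whether the free-text clause is quoted. The sketch below builds both forms from the same words so the difference is easy to see. It uses only java.net.URLEncoder; the class name SpellQueryUrl and the method build are invented for this example, and the field name title2 and the relative path solr/spell are simply copied from the queries in this post. Note that URLEncoder also percent-encodes the parentheses, quotes and colon, which is equivalent to the hand-written form once the server decodes it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
public class SpellQueryUrl {

    // URL-encodes one parameter value as UTF-8 (spaces become '+').
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds the spell request. When phrase is true, the free-text clause is quoted,
    // which turns the word search into a phrase search.
    static String build(String words, boolean phrase) {
        String clause = phrase ? "\"" + words + "\"" : words;
        String q = "(" + clause + " OR (title2:(\"" + words + "\")))";
        return "solr/spell?q=" + enc(q)
                + "&spellcheck.collate=true"
                + "&spellcheck=true"
                + "&spellcheck.q=" + enc(words);
    }

    public static void main(String[] args) {
        String words = "charles and the choclit factory";
        System.out.println(build(words, false)); // word search
        System.out.println(build(words, true));  // phrase search
    }
}
```

In both variants spellcheck.q carries the bare words, so the spellchecker sees the same input either way; only the main query changes.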
RE: Solr Multiword Search
Hi James,

Thanks for the response. Nope, I'm not using dismax or edismax, just the standard Solr query parser. Also, by using the parameter "spellcheck.collateParam.q.op=AND" I see this working. This also means that every word needs to be correctable, and maxEdits can be at most 2, otherwise it won't suggest collations.

How can I get correction on an entire sentence? maxEdits seems to be limited to a maximum of 2; otherwise I see an exception in the logs. From my experiments with the same search terms:

    "charles and the chocolit factry"     did not work (too many edits)
    "charlei and the chocolate factory"   worked
    "charlie and the choclit factory"     did not work
    "charlie and the chocolate factry"    worked
    "charlie and the chocoleat factory"   worked

I tried the same thing with spellcheck.alternativeTermCount=100 and this did not help with collations. I need more ideas. Appreciate your help.

Regards,
-- Sandeep
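To relate the experiments above to the maxEdits limit of 2, the per-word edit distance from the intended title can be tabulated. The sketch below uses plain Levenshtein distance (insert, delete, substitute), which is an assumption: Solr's "internal" distance measure may score some edits differently (a transposition, for example). The class name TermEditDistance and its methods are made up for this illustration, and words are compared position by position, which only works because each phrase has the same number of words as the title.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not the distance implementation Solr uses.
public class TermEditDistance {

    // Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance of each query word from the word at the same position in the title.
    static int[] perTerm(String query, String title) {
        String[] q = query.split(" ");
        String[] t = title.split(" ");
        int[] distances = new int[Math.min(q.length, t.length)];
        for (int i = 0; i < distances.length; i++) {
            distances[i] = levenshtein(q[i], t[i]);
        }
        return distances;
    }

    public static void main(String[] args) {
        String title = "charlie and the chocolate factory";
        String[] queries = {
            "charles and the chocolit factry",
            "charlei and the chocolate factory",
            "charlie and the choclit factory",
            "charlie and the chocolate factry",
            "charlie and the chocoleat factory"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + Arrays.toString(perTerm(query, title)));
        }
    }
}
```

Under these assumptions "choclit" is three edits away from "chocolate", which is outside a limit of 2 and is consistent with that phrase not being corrected. In "charles and the chocolit factry" every word is within 2 edits, but three words are misspelled at once, so the per-word limit alone does not account for that failure.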
RE: Solr Multiword Search
Hi James,

Thanks for the very useful tips. However, I am looking for searches that also produce collations. I need functionality where someone searching for "madona" sees results for "madona" and also gets collations for "madonna", so that a "Did you mean" feature can be provided. We need to return exact matches and also provide suggestions if better ones exist within our catalog.

What I am seeing right now is that when searching for "madona", "madona" is returned but no collations for "madonna" appear. I am using DirectSolrSpellChecker and have minQueryFrequency set at 0.01. In theory it should produce some collations for madonna, but I am not seeing any. I am not sure what I need to do for this.

I would appreciate any help. Thanks.

-- Sandeep
RE: Solr Multiword Search
Hi James, Your suggestions/tips for our spellcheck requirements were all very good. Thanks a lot for your help.

-- Sandeep