Solr Multiword Search

2013-04-01 Thread skmirch
We have a catalog of media content which is ingested into Solr. We are
trying to do a spell check on the title of the catalog item, so that the
client can predict and correct (mis)typed text. The requirement is that the
corrected text match a title in the catalog.

I have been playing around with the spellcheck component and the handler on
Solr 4.2.

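solrconfig.xml
--
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">mySpell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

<queryConverter name="queryConverter"
                class="com.foo.MultiWordSpellingQueryConverter"/>

<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">mySpell</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
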

schema.xml
--
(field and fieldType definitions not preserved)

Notice that I am using a custom QueryConverter, with definitions as follows:

/* MultiWordSpellingQueryConverter.java */
package com.foo;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {
    private static final Logger log =
        Logger.getLogger(MultiWordSpellingQueryConverter.class);

    // Debug output to confirm that Solr actually loads this class.
    static {
        System.out.println("* Loading class MultiWordSpellingQueryConverter");
        log.fatal("* Loading class MultiWordSpellingQueryConverter");
    }

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * The whole string becomes a single token, so the spellchecker sees the
     * phrase as one unit instead of one token per word.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert(String original) {
        if (original == null) {
            return Collections.emptyList();
        }
        // Debug output to confirm that the converter runs at spellcheck time.
        System.out.println("Original String : " + original);
        log.error("Original String : " + original);
        final Token token = new Token(original.toCharArray(), 0,
            original.length(), 0, original.length());
        return Arrays.asList(token);
    }
}

I have followed the directions from another thread,
http://lucene.472066.n3.nabble.com/Full-sentence-spellcheck-tt3265257.html#a3281189
, because I feel this is what I really want.

I have tried both placing the jar in the ${solr.home}/lib directory and
unpacking solr.war, adding the jar built from the above Java code to
WEB-INF/lib, re-packing the war and placing it in the web server's deploy
directory. I cannot tell whether this class is even being invoked at
spellcheck time. I have the queryConverter tag defined in solrconfig.xml
(refer to the solrconfig.xml definitions above).

Query:
http://localhost/solr/spell?q=((title:("charles%20and%20the%20chocolate%20factory")))&spellcheck.q=charles%20and%20the%20chocolat%20factory&spellcheck=true&spellcheck.collate=true

Of course I have spelt charles incorrectly. There is in fact a title in the
catalog named "Charlie and the chocolate factory", and the above query
neither finds it nor collates well enough to correct the spelling. I believe
the error distance (or edits) is about 2. Charles should be spelt Charlie,
so based on the Levenshtein algorithm I would expect it to quickly find this
as the best match and suggest it.
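
To sanity-check the edit-distance claim, here is a minimal sketch of plain
Levenshtein distance. It is illustrative only: the class name is made up, and
the "internal" distance measure of DirectSolrSpellChecker may score some edits
(transpositions, for example) differently from this textbook version.

```java
// Illustrative only: textbook Levenshtein distance, not Solr's internal code.
final class EditDistance {
    /** Counts the single-character inserts, deletes and substitutions needed to turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Under this measure, EditDistance.levenshtein("charles", "charlie") is 2 and
EditDistance.levenshtein("chocolat", "chocolate") is 1, so both words are
within a maxEdits of 2 when looked at one word at a time.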

Suggestions from my script look like the following:
Title|Hits
charles and the chocolate factory|205808|
charles and the chocolate factor|205631|
charles and the chocolates factory|205508|
charley and the chocolate factory|203594|
charles and the chocolata factory|205506|
charles and the chocolate factoria|205544|
charles and the chocolates factor|205330|
charlet and the chocolate factory|203441|
charley and the chocolate factor|203417|
charley and the chocolates factory|203294|

The list above shows the suggested collations and their hit counts, all
extracted from the response XML for the above query.

What I would expect to see is "Charlie and the Chocolate Factory" right at
the top of the list, since it is in my catalog verbatim. None of the
collations listed above are in the catalog.

Not sure how I can achieve my goal of being able to suggest a corrected
phrase that exists in the title in my catalog.  I would appreciate any help
on this front.
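
To make the goal concrete, here is a rough sketch of what "suggest a phrase
that exists in the catalog" means when done outside Solr. It compares the
whole typed phrase against whole titles using character trigram overlap (Dice
coefficient). The class and method names are hypothetical, and this is not how
the spellcheck component works internally. It only illustrates whole-phrase
matching as opposed to word-by-word correction.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Solr: ranks whole titles against the typed phrase.
final class ClosestTitle {
    /** Character trigrams of the lower-cased, space-padded phrase. */
    static Set<String> trigrams(String phrase) {
        String s = "  " + phrase.toLowerCase(Locale.ROOT).trim() + "  ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Dice coefficient over trigram sets: 1.0 for identical phrases, 0.0 for no overlap. */
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    /** Returns the title most similar to the typed phrase, or null for an empty list. */
    static String closest(String typed, List<String> titles) {
        String best = null;
        double bestScore = -1.0;
        for (String title : titles) {
            double score = similarity(typed, title);
            if (score > bestScore) {
                bestScore = score;
                best = title;
            }
        }
        return best;
    }
}
```

A linear scan like this would not scale to a full catalog. The point is only
that the unit being compared is the whole title, so the answer is always a
title that actually exists.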

Thanks in advance.
Regards,
-- Sandeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Multiword Search

2013-04-03 Thread skmirch
I have been trying to use MultiWordSpellingQueryConverter.java since I need
to be able to find the documents that correspond to the suggested
collations. At the moment it seems to produce collations word by word:
arbitrary words from the field are picked up to form a collation, so nothing
corresponds to any of the titles in our set of indexed documents.

Could anyone please confirm that this would work if I took the following
steps?

Steps:
1. Get the solr4.2.war file.
2. Go to its WEB-INF/lib and add lucene-core-4.2.0.jar and
solr-core-4.2.0.jar to the classpath to compile
MultiWordSpellingQueryConverter.java. The code for this is in my previous
post in this thread.
3. jar cvf multiwordspellchecker.jar
com/foo/MultiWordSpellingQueryConverter.java
4. Copy this jar to the $SOLR_HOME/lib directory.
5. Define queryConverter. Question: where does this need to go? I have just
put it somewhere between the searchComponent and the requestHandler for
spell checks.
6. Start the webserver. I see this jar file getting registered at startup:
2013-04-03 12:56:22,243 INFO  [org.apache.solr.core.SolrResourceLoader]
(coreLoadExecutor-3-thread-1) Adding
'file:/solr/lib/multiwordspellchecker.jar' to classloader
7. When I run the spell query, I don't see my print statements, so I am not
sure whether this code is really being called. I don't think the logging is
failing; rather, the code does not seem to be called at all.
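
One thing that may be worth ruling out in step 3: the jar command as written
packages the .java source file, whereas a class loader looks for the compiled
.class entry. If that command is what was actually run, the jar would be
added to the classloader at startup but could never supply the class. A
minimal sketch for checking a jar's contents follows. The class and method
names are made up for illustration, and only standard java.util.jar calls are
used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper for inspecting a jar; not part of Solr.
final class JarCheck {
    /** Entry path a class loader looks up for a fully qualified class name. */
    static String classEntry(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** True if the given entry names include the compiled class. */
    static boolean containsClass(List<String> entryNames, String className) {
        return entryNames.contains(classEntry(className));
    }

    /** Lists all entry names in the jar at the given path. */
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }
}
```

Calling JarCheck.containsClass(JarCheck.entryNames(path),
"com.foo.MultiWordSpellingQueryConverter") should return true for a usable
jar; "jar tf multiwordspellchecker.jar" shows the same listing from the
command line.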

I would appreciate any information on what I might be doing wrong.  Please
help.

Thanks.
Regards,
-- Sandeep





RE: Solr Multiword Search

2013-04-03 Thread skmirch
Hi James,

Thanks for the information you have provided. I tried your suggestion and
it helped a lot. However, close as this is to what I want, I still need it
to return only the exact phrases that closely match my search words. I am
now using the search words in both q and spellcheck.q (which I believe only
comes into play when the entered phrase has no matches and collations have
to be provided). It not only finds "Charlie and the Chocolate Factory", it
also finds any title that contains factory or charles (just like you
mentioned it would).

I also tried your suggestion of spellcheck.alternativeTermCount and set it
to 5 (>0) in my solrconfig.xml and this still did the same thing.  I am not
using queryConverter at all any more, thanks for that suggestion.  

I still need it to find the closest match for the phrase that it finds in
title.  My query now is:
solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory
The results are anything that matches charles, and, the or factory, so I get
lots of matches (bad for performance). If I group the above query on the
content type, it ends up producing bogus results in categories that don't
have a title even remotely close to "Charlie and the Chocolate Factory".

Can this work somehow? If a doc has a low score, it should simply not be
returned in the results. Is there a way to set a score threshold and only
present docs that are above it, from the terms-matched perspective? I am
getting a lot of matches for "and the" just because that is in the phrase
being searched. I know I can make them stopwords so that they are ignored.
Suggestions should be the closest matches and nothing more.
Can this be done?
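
To illustrate the kind of threshold I mean, here is a rough client-side
sketch. It assumes the hits come back sorted by score and drops anything
scoring below a fraction of the top hit, since raw Lucene scores are not
normalized and an absolute cutoff would not carry over between queries. The
Hit type and the method name are made up for illustration; nothing here is a
Solr API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter applied by the client after the search returns.
final class ScoreFilter {
    /** Minimal stand-in for a search hit; a real client would read these from the response. */
    static final class Hit {
        final String title;
        final double score;

        Hit(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    /** Keeps hits scoring at least ratio * top score; expects hits sorted by score, descending. */
    static List<Hit> keepRelativeToTop(List<Hit> hits, double ratio) {
        List<Hit> kept = new ArrayList<>();
        if (hits.isEmpty()) {
            return kept;
        }
        double cutoff = hits.get(0).score * ratio;
        for (Hit hit : hits) {
            if (hit.score >= cutoff) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

The ratio would have to be tuned by hand, and filtering on the client does
nothing for the performance cost of the broad match itself.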

Appreciate your help.
Thanks.
-- Sandeep







RE: Solr Multiword Search

2013-04-03 Thread skmirch
The following query is doing a word search (based on my previous post)...

solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory
 

It produces a lot of unwanted matches.


In order to do a phrase search, I changed it to:
solr/spell?q=("charles+and+the+choclit+factory"+OR+(title2:("charles+and+the+choclit+factory")))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory
 

It does not find any match for the words in the phrase I am looking for and
does poorly in the suggested collations.  I want phrase corrections.  How do
I achieve this?

"charles and the chocolit factory"
produces the following collations (correctlySpelled: false):
Collation|Hits
charles and the chocolat factory|2849777
charles and the chocalit factory|2849464
charles and the chocolat factors|2841190
charley and the chocolat factory|2827908
charles and the chocalit factors|2840877
charles and the chocklit factory|2849464
charles and the chocolat factorz|2841173
charley and the chocalit factory|2827595
charley and the chocolat factors|2819321
charlies and the chocolat factory|2826661

(Each collation also lists its per-word corrections, which are simply the
words of the collation itself.)

Notice the number of hits. This does not look right. Please help.

Thanks.
-- Sandeep





RE: Solr Multiword Search

2013-04-04 Thread skmirch
Hi James,
Thanks for the response.

Nope, I'm not using dismax or edismax.  Just the standard solr query parser.

Also, by setting the parameter "spellcheck.collateParam.q.op=AND" I see this
working. This also means that every word has to be correctable, and maxEdits
can be at most 2; otherwise it won't suggest collations.
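
For reference, a request with that parameter can be assembled as in the
sketch below. The class and method names are made up for illustration, the
base URL is the localhost one from the earlier query in this thread, and only
parameters already mentioned in this thread are used. As I understand it,
spellcheck.collateParam.q.op=AND overrides q.op only for the test queries
that the collator runs, so a collation counts as a hit only if all of its
words match.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds the spellcheck request URL.
final class SpellQueryUrl {
    /** URL-encodes a parameter value as UTF-8 (spaces become '+'). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds a /spell request whose collation test queries require every term. */
    static String build(String baseUrl, String phrase) {
        return baseUrl + "/spell"
            + "?q=" + encode(phrase)
            + "&spellcheck=true"
            + "&spellcheck.collate=true"
            + "&spellcheck.q=" + encode(phrase)
            + "&spellcheck.collateParam.q.op=AND";
    }
}
```

For example, SpellQueryUrl.build("http://localhost/solr", "charlie and the
chocolate factry") produces the query string used in the experiments below.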

How can I get correction on an entire sentence? maxEdits seems to be limited
to a maximum of 2; otherwise I see an exception in the logs.

Therefore from my experiments with the same search terms:
"charles and the chocolit factry" did not work.  Too many edits
"charlei and the chocolate factory" worked
"charlie and the choclit factory" did not work
"charlie and the chocolate factry" worked.
"charlie and the chocoleat factory" worked.

I tried the same thing with spellcheck.alternativeTermCount=100 and this did
not help with collations.

Need more ideas.  Appreciate your help.
Regards,
-- Sandeep





RE: Solr Multiword Search

2013-04-05 Thread skmirch
Hi James,
Thanks for the very useful tips, however, I am looking for searches that
produce collations.

I need functionality where someone searching for "madona" sees results for
"madona" and also gets collations for "madonna", so that a "Did you mean"
feature can be provided. We need to return exact matches and also provide
suggestions if better ones exist within our catalog.
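
The decision I have in mind on the client side looks roughly like the sketch
below: show the results for what was typed, and add a "Did you mean" prompt
only when some collation has more hits than the original query. The class and
method names are hypothetical, and the sketch assumes the response already
contains collations with hit counts, which is exactly the part I cannot get
Solr to return.

```java
import java.util.Map;

// Hypothetical client-side logic; not part of Solr.
final class DidYouMean {
    /**
     * Returns the collation with the most hits if it beats the original
     * query's hit count, otherwise null (no "Did you mean" prompt).
     */
    static String suggest(long originalHits, Map<String, Long> collationHits) {
        String best = null;
        long bestHits = originalHits;
        for (Map.Entry<String, Long> entry : collationHits.entrySet()) {
            if (entry.getValue() > bestHits) {
                bestHits = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```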

What I am seeing right now is that when searching for "madona", results for
"madona" are returned but no collations for "madonna" appear. I am using
DirectSolrSpellChecker and have maxQueryFrequency set to 0.01. In theory it
should produce some collations for madonna, but I am not seeing any.

Not sure what I need to do for this. I would appreciate any help.
Thanks.
-- Sandeep





RE: Solr Multiword Search

2013-04-11 Thread skmirch
Hi James,
Your suggestions/tips for our spellcheck requirements were all very good. 
Thanks a lot for your help.
-- Sandeep


