Improving Solr Spell Checker Results

2012-01-13 Thread David Radunz

Hey,

Firstly I would like to thank you all for creating such a great 
searching platform. What I was wondering is whether it is possible to:


1. Have the spell checker take into account multiple words. For example 
if I search for "Sigourney Wever" it doesn't flag as a spelling issue as 
'wever' is a correctly spelled word. And if I searched for "Sigourney 
Wevr" the suggestion is "Sigourney Wever". Of course the correct 
spelling is: Sigourney Weaver
2. Have the spell checker return corrections only for dictionary items 
added on the field being searched. i.e. Searching for an actor would 
only use the dictionary fields from the actor. This makes sense on many 
levels, as when you are field searching it's useless to get a correction 
from another field as no values would match in any case.


Hopefully someone can help!

Thanks in advance,

David


Re: Improving Solr Spell Checker Results

2012-01-20 Thread David Radunz

Hey,

Thanks so much for your outstanding response. I have been busy for 
a few days so have not had a chance to try it out. I have now tried to 
install trunk of Solr and when I run 'ant test' I encounter the following:


[junit] Testsuite: org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader
[junit] Testcase: testRefreshReadRecreatedTaxonomy(org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader): FAILED
[junit] Expected InconsistentTaxonomyException
[junit] junit.framework.AssertionFailedError: Expected InconsistentTaxonomyException
[junit] at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.doTestReadRecreatedTaxono(TestDirectoryTaxonomyReader.java:168)
[junit] at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy(TestDirectoryTaxonomyReader.java:130)
[junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)



Should I ignore this (and other failed tests) and continue anyway?

Cheers,

David

On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your index.  So even if 
"wever" is misspelled in context, if it exists in the index the spell checker 
will not try correcting it.  There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).  See 
https://issues.apache.org/jira/browse/SOLR-2585

2. Try "onlyMorePopular=true" in your request.  
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  But see 
the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it 
would.

3. If you're building your index on a copyField, you can add a stopword 
filter that filters out all of the misspelt or rare words from the field that the 
dictionary is based on.  This could be an arduous task, and it may or may not work well 
for your data.

As for your second question, I take it you're using (e)dismax with multiple fields in "qf", 
right?  The only way I know to handle this is to create a copyField that combines all of the 
fields you search across.  Use this combined field to base your dictionary.  Also, specifying 
"spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word 
combinations that are likely to occur when doing this, ensuring that any collations provided will indeed 
yield hits.  The downside to doing this, of course, is it will make your first problem more acute in that 
there will be even more terms in your index that the spellchecker will ignore entirely, even if they're 
misspelled in context.  Once again, SOLR-2585 is designed to tackle this problem but it is still in its 
early stages, and thus far it is Trunk-only.

You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 .  Although 
this is unrelated to your two questions, the patch on this issue introduces a new 
"ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly 
what you want.  That is, you could (theoretically) create separate dictionaries for each of 
the fields you're searching and let the CSSC combine the results & generate collations, 
etc.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
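
The request-side parameters named in this reply can be put together roughly as follows. This is a minimal Python sketch, not something posted in the thread: the base URL and the helper name build_spellcheck_url are assumptions for illustration, and spellcheck.collate is switched on because the collation settings only matter when collations are requested.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_spellcheck_url(base_url, user_query, max_collation_tries=5,
                         only_more_popular=False):
    """Build a select URL that asks Solr for verified collations.

    base_url is a hypothetical endpoint; adjust it to your own install.
    """
    params = {
        "q": user_query,
        "spellcheck": "true",
        "spellcheck.q": user_query,
        "spellcheck.collate": "true",
        # A non-zero value makes Solr test candidate collations against the index.
        "spellcheck.maxCollationTries": str(max_collation_tries),
    }
    if only_more_popular:
        # Workaround 2 from the reply above; see the caveat in SOLR-2585.
        params["spellcheck.onlyMorePopular"] = "true"
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_spellcheck_url("http://localhost:8983/solr/select",
                               "sigourney wever", only_more_popular=True)
    # Print the decoded parameters so the request is easy to inspect.
    for name, values in sorted(parse_qs(urlsplit(url).query).items()):
        print(name, "=", values[0])
```

Keeping the parameters in one dictionary makes it easy to toggle a single workaround at a time and compare the responses.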




Re: Improving Solr Spell Checker Results

2012-01-21 Thread David Radunz

James,

Thanks again for your lengthy and informative response. I updated 
from SVN trunk again today and was successfully able to run 'ant test'. 
So I proceeded with trying your suggestions (for question 1 so far):


On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your index.  So even if 
"wever" is misspelled in context, if it exists in the index the spell checker 
will not try correcting it.  There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).  See 
https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Sigourney 
Wever'. I didn't notice any difference, although I am a little unclear 
as to what exactly this patch does. Nor am I really clear what to set 
either of the options to, so I set them both to '5'. I tried to find the 
test case it mentions, but it's not present in 
SpellCheckCollatorTest.java. Any suggestions?



2. Try "onlyMorePopular=true" in your request.  
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  But see 
the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it 
would.


Trying this did produce 'Sigourney Weaver' as you would hope, but I am 
a little afraid of the downside. I would much prefer a context 
sensitive spell check that involves the terms around the correction.


3. If you're building your index on a copyField, you can add a stopword 
filter that filters out all of the misspelt or rare words from the field that the 
dictionary is based on.  This could be an arduous task, and it may or may not work well 
for your data.

I am currently using a copyField for all terms that are relevant, which 
is quite a lot, and the dictionary would encompass a huge amount of data. 
Adding stopword filters would be out of the question as we presently 
have more than 30,000 products, and this is only for the initial launch; we 
intend to have many more.


As for your second question, I take it you're using (e)dismax with multiple fields in "qf", 
right?  The only way I know to handle this is to create a copyField that combines all of the 
fields you search across.  Use this combined field to base your dictionary.  Also, specifying 
"spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word 
combinations that are likely to occur when doing this, ensuring that any collations provided will indeed 
yield hits.  The downside to doing this, of course, is it will make your first problem more acute in that 
there will be even more terms in your index that the spellchecker will ignore entirely, even if they're 
misspelled in context.  Once again, SOLR-2585 is designed to tackle this problem but it is still in its 
early stages, and thus far it is Trunk-only.

I tried setting spellcheck.maxCollationTries to 5 to see if it would 
help with the above problem, but it did not.


I have now tried using it in the context of question 2. I tried 
searching for 'Sigorney Wever' in the series name (which it's not 
present in, as it's an actor):


spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5

Suggestions for 'Sigourney Wever' were returned, but either no spelling 
suggestions at all, or only ones for series names (which I doubt there 
would be), should have been returned.




You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 .  Although 
this is unrelated to your two questions, the patch on this issue introduces a new 
"ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly 
what you want.  That is, you could (theoretically) create separate dictionaries for each of 
the fields you're searching and let the CSSC combine the results & generate collations, 
etc.


During the upgrade I switched to solr.DirectSolrSpellChecker, which I 
presume will help with this? I am a senior developer (in 
Java/Perl/Python/PHP) but I have not yet looked at any of the Solr 
source code, so I am in the dark when you say it could be tailored to 
my needs. Also, how would it work query-wise? Would it be something like 
spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that 
sounds tempting to try to achieve. If you could provide any 
pointers on what exactly would be required, that would really help.


Thanks again for your time,

David
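
For what it's worth, the per-field parameters guessed at above (spellcheck.series_name.q and so on) do not appear anywhere in this thread as real options. One thing Solr does offer is the spellcheck.dictionary request parameter, which selects a spellchecker by its configured name. The sketch below assumes one dictionary has been configured per searchable field; the dictionary names spell_actor and spell_series_name are hypothetical, and this does not implement anything like the ConjunctionSolrSpellChecker discussed above.

```python
from urllib.parse import urlencode

# Hypothetical mapping from the field being searched to the name of a
# spellchecker assumed to be configured for that field.
FIELD_TO_DICTIONARY = {
    "actor": "spell_actor",
    "series_name": "spell_series_name",
}


def spellcheck_params_for_field(field, text, default_dictionary="default"):
    """Return spellcheck parameters that target the dictionary for one field."""
    return {
        "spellcheck": "true",
        "spellcheck.q": text,
        # Fall back to the combined dictionary when the field has none of its own.
        "spellcheck.dictionary": FIELD_TO_DICTIONARY.get(field, default_dictionary),
    }


if __name__ == "__main__":
    print(urlencode(spellcheck_params_for_field("actor", "sigorney wever")))
```

The application picks the dictionary before sending the request, so a search restricted to series names never receives corrections drawn from actor names.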



Re: Improving Solr Spell Checker Results

2012-01-21 Thread David Radunz

On 19/01/2012 12:21 AM, O. Klein wrote:

Dyer, James wrote

David,

The spellchecker normally won't give suggestions for any term in your
index.  So even if "wever" is misspelled in context, if it exists in the
index the spell checker will not try correcting it.  There are 3
workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
See https://issues.apache.org/jira/browse/SOLR-2585


When using trunk and DirectSolrSpellChecker I do get suggestions for terms
that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
case is giving me very good suggestions even if documents with the
misspelled word in them were found.

This combined with maxCollationTries (with all terms required) is giving
some sort of context sensitive suggestions.

Is this correct or is there something I'm missing?


Hey,

Thanks for the input, but setting the thresholdTokenFrequency to 
0.001 has now excluded spell check suggestions that were correctly 
working. I.e. 'Matrx' now does not work, but when I remove the threshold 
again it suggests 'Matrix'. So I guess to use this I would have to 
constantly reconfigure this property as the product database grows, 
which isn't really what I wanted.


Thanks for your input though,

David
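
A rough way to see why a fractional threshold has to be retuned as the index grows is sketched below. It assumes simplified semantics that are not confirmed in this thread: a value below 1 is read as a fraction of all documents, and a value of 1 or more as an absolute document count. The document counts in the demo are made up for illustration.

```python
def required_doc_count(threshold, total_docs):
    """Documents a term must appear in before it can be suggested.

    Assumed semantics: threshold < 1 is a fraction of the index,
    threshold >= 1 is an absolute number of documents.
    """
    if threshold >= 1:
        return threshold
    return threshold * total_docs


def is_suggestable(doc_freq, threshold, total_docs):
    """True when a term that occurs in doc_freq documents clears the threshold."""
    return doc_freq >= required_doc_count(threshold, total_docs)


if __name__ == "__main__":
    # A term found in 20 documents (made-up figure) against a growing catalogue.
    for total_docs in (10_000, 30_000, 100_000):
        print(total_docs, is_suggestable(20, 0.001, total_docs))
```

Under these assumptions the same title can drop out of the suggestions simply because more unrelated products were indexed, which matches the need to keep adjusting the setting.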







Failure noticed from new...@zju.edu.cn

2012-01-21 Thread David Radunz

Hey,

Every time I send a reply to the list I get a failure for 
new...@zju.edu.cn. Should I just ignore this? I am unsure if the message 
has been delivered...


Cheers,

David


Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

James,

I worked out that I actually needed to 'apply' patch SOLR-2585, 
whoops. So I have done that now and it seems to return 
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even 
in the dictionary). Could something have changed in the trunk to make 
your patch no longer work? I had to manually merge the setup for the 
test case due to a new 'hyphens' test case. The settings I am using are:



explicit
10

false
10
true
true
true
10
1

5
1




default
spell
solr.DirectSolrSpellChecker



internal


0.5


2

1

5

4


0.01



spellchecker
true


With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,

David



Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey James,

I have played around a bit more with the settings and tried setting 
spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This 
yields 'Sigourney Weaver' as ONE of the corrections, but it's the second 
one and not the first. Which is wrong if this is a patch for 'context 
sensitive', because it doesn't really seem to honor any context at all. 
Unless I am misunderstanding this? Also, I don't really like 
maxResultsForSuggest as it means 'all or nothing'. If you set it to 10 
and there are 100 results, then you offer no corrections at all even if 
the term is missing in the dictionary entirely.


If I set spellcheck.maxResultsForSuggest=100 and 
spellcheck.maxCollations=3 and choose the collation with the largest 
'hits' I get Sigourney Weaver and other 'popular' terms. But say I 
searched for 'pork and chups', the 'popular' correction is 'park and 
chips' whereas the first correction was correct: 'pork and chips'.
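
The two selection strategies compared above can be written down as a small Python sketch. The dictionary keys collationQuery and hits are modelled on Solr's extended collation output and are an assumption about the response shape; the hit counts are invented purely to reproduce the 'pork and chups' situation.

```python
def pick_first(collations):
    """Trust the spellchecker's own ordering and take the first collation."""
    return collations[0]["collationQuery"]


def pick_most_hits(collations):
    """Take the collation that matches the most documents."""
    return max(collations, key=lambda c: c["hits"])["collationQuery"]


if __name__ == "__main__":
    # Invented hit counts: the popular collation is not the intended one.
    collations = [
        {"collationQuery": "pork and chips", "hits": 4},
        {"collationQuery": "park and chips", "hits": 9},
    ]
    print("first:", pick_first(collations))
    print("most hits:", pick_most_hits(collations))
```

Neither rule looks at how the corrected words relate to each other, which is why each one fails on one of the two examples.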


So really, none of the solutions either in this patch or Solr offer 
what I would truly call context sensitive spell checking. That being, 
in a full text search engine you find documents based on terms and how 
close they are together in the document. It makes more than perfect 
sense to treat the dictionary like this, so that when there are multiple 
terms it offers suggestions for the terms that match closely to what's 
entered surrounding the term.


Example:

"Sigourney Wever" would never appear in a document ever.
"Sigourney Weaver" however has many 'hits' in exactly that order of 
words.


So there needs to be a way to boost suggestions based on adjacency...  
Much like the full text search operates.
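
To make the adjacency idea concrete, here is a toy Python sketch. It is not how Solr or the SOLR-2585 patch works; it only shows what ranking candidate corrections by how often they follow the neighbouring word could look like. The sample documents and function names are made up.

```python
from collections import Counter


def bigram_counts(documents):
    """Count adjacent word pairs across a list of plain-text documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def rank_by_adjacency(previous_word, candidates, counts):
    """Order candidates by how often each one directly follows previous_word."""
    prev = previous_word.lower()
    return sorted(candidates,
                  key=lambda word: counts[(prev, word.lower())],
                  reverse=True)


if __name__ == "__main__":
    docs = [
        "Sigourney Weaver stars in Alien",
        "Aliens with Sigourney Weaver",
        "Wever street documentary",
    ]
    counts = bigram_counts(docs)
    print(rank_by_adjacency("Sigourney", ["wever", "weaver"], counts))
```

A candidate that exists in the index but never appears next to the surrounding term sinks to the bottom, which is the behaviour asked for here.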


Thoughts?

David


Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey,

I am trying to send this again as 'plain-text' to see if it 
delivers ok this time. All of the previous messages I sent should be below..


Cheers,

David


Re: Failure noticed from new...@zju.edu.cn

2012-01-22 Thread David Radunz

Hey,

That seems to have helped, I didn't get a failure notice re-sending 
the message. I'll have to keep that in mind.


Thanks very much,

David

On 23/01/2012 12:41 PM, Erick Erickson wrote:

I've seen the spam filter be pretty aggressive with HTML formatting etc,
what happens when you just send them as "plain text"?

Best
Erick





Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey Erick,

Sure, can you explain the process to create the patch and upload it, 
and I'll do it first thing tomorrow.


Thanks again for your help,

David

On 23/01/2012 12:51 PM, Erick Erickson wrote:

I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either, that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed, change
and the patch doesn't apply cleanly. Since you've already had
to do this, could you upload your version that *does* apply
cleanly?

Best
Erick


Re: Improving Solr Spell Checker Results

2012-01-23 Thread David Radunz

Hey,

Thanks for that, I have uploaded a new patch as advised.

Cheers,

David

On 23/01/2012 1:01 PM, Erick Erickson wrote:

David:

There's some good info here:
http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

But the short form is to go into solr_home and issue this command:
'svn diff > SOLR-2585.patch'. IDEs may also have a "create patch"
feature, but I find the straight SVN command more reliable.

Note I'm not saying that your patch will necessarily be picked up, but
it's a thoughtful gesture to upload a more current patch. In your
comments please identify what code line you're working on (4.x? 3.x?).

And when you upload, down near the bottom of the dialog box there'll be
a radio button about "grant ASF license" which is fairly important to
click for legal reasons....

Thanks
Erick

On Sun, Jan 22, 2012 at 5:54 PM, David Radunz  wrote:

Hey Erick,

Sure, can you explain the process to create the patch and upload it and
i'll do it first thing tomorrow.

Thanks again for your help,

David


On 23/01/2012 12:51 PM, Erick Erickson wrote:

I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT  been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either, that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed,  change
and the patch doesn't apply cleanly. Since you've already had
to do this, could you upload your version that *does* apply
cleanly?

Best
Erick

On Sun, Jan 22, 2012 at 2:56 AM, David Radunz wrote:

James,

I worked out that I actually needed to 'apply' patch SOLR-2585,
whoops.
So I have done that now and it seems to return 'correctlySpelled=true'
for
'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
something have changed in the trunk to make your patch no longer work? I
had
to manually merge the setup for the test case due to a new 'hyphens' test
case. The settings I am using are:


explicit
10

false
10
true
true
true
10
1

5
1




default
spell
solr.DirectSolrSpellChecker


internal

0.5

2

1

5

4

0.01



spellchecker
true


With the query:


spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
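
For anyone following along, here is a minimal sketch of how the spellcheck-related part of a request like the one above could be assembled. It assumes the standard SpellCheckComponent parameters discussed in this thread (spellcheck.q, spellcheck.collate, spellcheck.maxCollationTries); the helper name build_spellcheck_params and the default of 5 tries are made up for illustration and are not part of the original configuration.

```python
from urllib.parse import urlencode


def build_spellcheck_params(user_query, max_collation_tries=5):
    """Return URL-encoded parameters for a spellcheck-enabled Solr request."""
    params = [
        ("q", user_query),
        ("spellcheck", "true"),
        # Spell-check the raw user input rather than the full boosted query.
        ("spellcheck.q", user_query),
        ("spellcheck.collate", "true"),
        # A non-zero value asks Solr to test candidate collations against the index.
        ("spellcheck.maxCollationTries", str(max_collation_tries)),
        ("rows", "5"),
    ]
    return urlencode(params)


query_string = build_spellcheck_params("sigorney wever")
print(query_string)
```

The resulting string can be appended to the /select URL of whichever handler has the spellcheck component attached.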

Cheers,

David



On 22/01/2012 2:03 AM, David Radunz wrote:

James,

Thanks again for your lengthy and informative response. I updated
from
SVN trunk again today and was successfully able to run 'ant test'. So I
proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your
index.  So even if "wever" is misspelled in context, if it exists in
the
index the spell checker will not try correcting it.  There are 3
workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
  See https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Signorney
Wever'.
I didn't notice any difference, although I am a little unclear as to
what
exactly this patch does. Nor am I really clear what to set either of the
options to, so I set them both to '5'. I tried to find the test case it
mentions, but it's not present in SpellCheckCollatorTest.java .. Any
suggestions?


2. try "onlyMorePopular=true" in your request.

  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
  But see the September 2, 2011 comment in SOLR-2585 about why this
might not
do what you'd hope it would.


Trying this did produce 'Signourney Weaver' as you would hope, but I am
a
little afraid of the downside. I would much prefer a context-sensitive
spell check that takes the terms around the correction into account.


3. If you're building your index on a copyField, you can add a
stopword filter that filters out all of the misspelt or rare words from
the field on which the dictionary is based.  This could be an arduous task,
and it may or may not work well for your data.

I am currently using a copyField for all terms that are relevant, which
is
quite a lot and the dictionary would encompass a huge amount of data.
Adding
stopword filters would be out of the question as we presentl

Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

Shouldn't you be asking this question to the Magento people? You 
have an Enterprise edition, so you have paid for their support.


Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:

I am integrating solr 3.5 with jetty in magento EE 1.11.

I have followed all the necessary steps, and configured and tested the Solr
connection in the Magento catalog system config.

I have copied the contents of magento/lib/Solr/conf/ to the Solr installation. I have
run the index management and restarted Jetty, but when I search for any word or
a misspelling, it does not show me the "Did you mean ?" string, which means it is
not correcting the misspelling. It seems Solr is not returning results.

Please let me know how I can tell whether Solr is working with Magento, and where
Solr saves the XML documents when Magento pushes attributes and product
information to Solr. Which directory does it store them in?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-not-working-with-magento-enterprise-1-11-tp3686773p3686773.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

I am using Magento Community Edition, and I wrote my own Magento 
extension to integrate Solr, which works fine. So I really don't know 
what the Enterprise edition does. On a personal and unrelated note, I 
would never use Windows for a server: it is unreliable and most of the system 
resources go towards the OS.


Cheers,

David

On 25/01/2012 3:30 PM, vishal_asc wrote:

Thanks David. As of now we are configuring it on a local WAMP server, and we have 
only a development version provided by the sales team.

Do you know where Solr saves information or pushes the XML docs when we run index 
management in Magento?

I followed this site: 
http://www.summasolutions.net/blogposts/magento-apache-solr-set

Please let me know if you have some other info also.

Best Regards,
Vishal Porwal

From: David Radunz [via Lucene] 
[mailto:ml-node+s472066n3686805...@n3.nabble.com]
Sent: Wednesday, January 25, 2012 9:47 AM
To: Vishal Porwal
Subject: Re: solr not working with magento enterprise 1.11

Hey,

  Shouldn't you be asking this question to the Magento people? You
have an Enterprise edition, so you have paid for their support.

Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:


I am integrating solr 3.5 with jetty in magento EE 1.11.

I have followed all the necessary steps, and configured and tested the Solr
connection in the Magento catalog system config.

I have copied the contents of magento/lib/Solr/conf/ to the Solr installation. I have
run the index management and restarted Jetty, but when I search for any word or
a misspelling, it does not show me the "Did you mean ?" string, which means it is
not correcting the misspelling. It seems Solr is not returning results.

Please let me know how I can tell whether Solr is working with Magento, and where
Solr saves the XML documents when Magento pushes attributes and product
information to Solr. Which directory does it store them in?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-not-working-with-magento-enterprise-1-11-tp3686773p3686773.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-not-working-with-magento-enterprise-1-11-tp3686773p3686818.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Multiple Data Directories and 1 SOLR instance

2012-01-26 Thread David Radunz

Hey,

Sounds like what you need to set up is a "Multiple Cores" 
configuration. At first I confused this with "Multi Core CPU", but 
that's not what it's about. Basically it's a way to run multiple 'solr' 
cores/indexes/configurations from a single Solr instance (which will 
scale better as the resources will be shared). Have a read anyway: 
http://wiki.apache.org/solr/CoreAdmin
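
As a rough sketch of what this could look like for the one-core-per-organization case, the snippet below builds CoreAdmin CREATE request URLs. It assumes the CoreAdmin handler is reachable at /admin/cores under the Solr base URL and that each organization gets its own instance directory; the names org1 and org2, the directory layout, and the helper core_create_url are hypothetical.

```python
from urllib.parse import urlencode


def core_create_url(base_url, org_name):
    """Build a CoreAdmin CREATE request URL for one organization's core."""
    params = {
        "action": "CREATE",
        "name": org_name,
        # Hypothetical layout: one instance directory per organization.
        "instanceDir": org_name,
        "dataDir": "data",
    }
    return f"{base_url}/admin/cores?{urlencode(params)}"


for org in ["org1", "org2"]:
    print(core_create_url("http://localhost:8983/solr", org))
```

Once a core exists, index and search requests for that organization go to the core's own path (for example /solr/org1/select), so the application only has to pick the core name per request.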


Cheers,

David

On 27/01/2012 8:18 AM, Nitin Arora wrote:

Hi,

We are using SOLR/Lucene to index/search data about the users of an
organization. The data consists of brief information about each user's work.
Our indexing requirement is to have segregated stores for each
organization. Currently we have 10 organizations, so we have to run 10
different instances of SOLR, each serving search results for one organization. As
new organizations join, it is getting difficult to manage this
many instances.

I think now there is a need to use 1 SOLR instance and then have 10/multiple
different data directories for each organization.

When index/search request is received in SOLR we decide the data directory
based on the organization.

1. Is it possible to do the same in SOLR and how can we achieve the 
same?
2. Will it be a good design to use SOLR like this?
3. Is there any impact on the scalability if we are able to manage the
separate data directories inside SOLR?

Thanks in advance

Nitin


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Data-Directories-and-1-SOLR-instance-tp3691644p3691644.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: SpellCheck Help

2012-01-26 Thread David Radunz

Hey,

I really recommend you contact Magento pre-sales to find out why 
THEIR stuff doesn't work. The information you have provided is specific 
to magento... You can't expect people on a Solr mailing list to help you 
with a Magento problem. I guarantee you the issue is probably something 
Magento is doing, so try seeking support there first (try their mailing 
lists if they have any, or on IRC: irc.freenode.org #magento).


I am not trying to be rude, rather to save you time and others' effort.

Cheers,

David

On 27/01/2012 5:37 PM, vishal_asc wrote:

I downloaded Apache Solr from http://apache.dattatec.com//lucene/solr/ and
extracted it on my Windows machine.

Then I started Solr: in [solr-path]/example, I typed the following in a
terminal: java -jar start.jar.
It started, and I can see the Solr page at http://localhost:8983/solr/admin/

Next I copied Magento's [magento-instance-root]/lib/Apache/Solr/conf to
[Solr-instance-root]/example/solr/conf.

Then I restarted Solr again, and lots of activity was going on there. I then ran
System->Index Management, and in the front-end search box I tried to search for a
product with an incorrect spelling. In the Solr console I can see some activity, but
on the Magento front end I couldn't get any results. Why?

I followed the steps given at this URL:
http://www.summasolutions.net/blogposts/magento-apache-solr-set#comment-615

Please look into it and let me know any other information you require.

I also want to know how I can implement faceting and highlighting on the
search results.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/SpellCheck-Help-tp3648589p3692518.html
Sent from the Solr - User mailing list archive at Nabble.com.




Performance improvement in large OR query using boosting (also, cache doesn't work?)

2012-12-14 Thread David Radunz

Hey Guys,

I have really been enjoying Solr and I can't really blame the 
slowness on solr as this is a pretty insane query. However, I am a 
little curious why a repeated query moments later also suffers from the 
same load time? Anyway, the queries are:


// 1st Query

INFO: [] webapp=/solr path=/select/ 
params={facet=on&fl=id,name,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date,tax_class_id&sort=score+desc,retail_rating+desc,release_date+desc,year_made+desc&start=&q=**+-sku:"1029996"+-movie_id:"2665"+(series_names_attr_opt_id:"426317"^9000+OR+cat_id:"307"^1000+OR+cat_id:"308"^1000+OR+matching_genres:"Science+Fiction"^2000+OR+matching_genres:"Action"^1000+OR+matching_genres:"Thriller"^800+OR+matching_keywords:"superhero+team"^400+OR+matching_keywords:"superhero"^400+OR+matching_keywords:"superheroine"^300+OR+matching_keywords:"marvel+comic"^200+OR+matching_keywords:"costumed+hero"^100+OR+matching_keywords:"alien+life-form"^100+OR+matching_keywords:"thor"^100+OR+matching_keywords:"captain+america"^100+OR+matching_keywords:"the+incredible+hulk"^100+OR+matching_keywords:"iron+man"^100+OR+matching_keywords:"shape+shifting+alien"^50+OR+matching_keywords:"world+domination"^50+OR+matching_keywords:"human+alien"^50+OR+matching_keywords:"alien+invasion"^25+OR+matching_keywords:"super+strength"^25+OR+matching_keywords:"invisibility+cloak"^10+OR+matching_keywords:"warrior+race"^10+OR+matching_keywords:"alien+race"^5+OR+matching_keywords:"super+speed"^5+OR+matching_keywords:"flying+fortress"^5+OR+matching_keywords:"teleportation"^5+OR+matching_keywords:"creature"^5+OR+matching_keywords:"electromagnetic+pulse"^5+OR+matching_keywords:"immortality"^5+OR+matching_keywords:"mothership"^5+OR+matching_keywords:"mind+control"^5+OR+matching_keywords:"god"^5+OR+matching_keywords:"inventor"^5+OR+matching_keywords:"space+travel"^5+OR+matching_keywords:"fictional+government+agency"^5+OR+matching_keywords:"beautiful+woman"^5+OR+matching_keywords:"based+on+comic+book"^5+OR+matching_keywords:"army"^5+OR+matching_keywords:"blockbuster"^5+OR+matching_keywords:"mercenary"^5+OR+matching_keywords:"martial+arts"^5+OR+matching_keywords:"shield"^5+OR+matching_keywords:"captain"^5+OR+matching_keywords:"shot+in+the+head"^5+OR+matching_keywords:"shootout"^5+OR+matchi
ng_keywords:"flashback"^5+OR+matching_keywords:"pistol"^5+OR+matching_keywords:"airplane"^5+OR+matching_keywords:"helmet"^5+OR+matching_keywords:"car+accident"^5+OR+matching_keywords:"body+landing+on+a+car"^5+OR+matching_keywords:"spear"^5+OR+matching_keywords:"laboratory"^5+OR+matching_keywords:"warrior+woman"^5+OR+matching_keywords:"punching+bag"^5+OR+matching_keywords:"banquet"^5+OR+matching_keywords:"macguffin"^5+OR+matching_keywords:"mission"^5+OR+matching_keywords:"attack"^5+OR+matching_keywords:"hand+to+hand+combat"^5+OR+matching_keywords:"police+officer"^5+OR+matching_keywords:"robot"^5+OR+matching_keywords:"disguise"^5+OR+matching_keywords:"beating"^5+OR+matching_keywords:"falling+from+height"^5+OR+matching_keywords:"government+agent"^5+OR+matching_keywords:"battleship"^5+OR+matching_keywords:"parking+garage"^5+OR+matching_keywords:"head+butt"^5+OR+matching_keywords:"forest"^5+OR+matching_keywords:"crushed+to+death"^5+OR+matching_keywords:"deception"^5+OR+matching_keywords:"philanthropist"^5+OR+matching_keywords:"knife+fight"^5+OR+matching_keywords:"portal"^5+OR+matching_keywords:"knife"^5+OR+matching_keywords:"underwater+scene"^5+OR+matching_keywords:"exploding+plane"^5+OR+matching_keywords:"robot+suit"^5+OR+matching_keywords:"outer+space"^5+OR+matching_keywords:"stabbed+in+the+stomach"^5+OR+matching_keywords:"bodyguard"^5+OR+matching_keywords:"disaster+in+new+york"^5+OR+matching_keywords:"shot+in+the+chest"^5+OR+matching_keywords:"security+camera"^5+OR+matching_keywords:"rocket+launcher"^5+OR+matching_keywords:"tough+guy"^5+OR+matching_keywords:"secretary"^5+OR+matching_keywords:"monster"^5+OR+matching_keywords:"elevator"^5+OR+matching_keywords:"severed+arm"^5+OR+matching_keywords:"revenge"^5+OR+matching_keywords:"missile"^5+OR+matching_keywords:"kneeling"^5+OR+matching_keywords:"brawl"^5+OR+matching_keywords:"russian"^5+OR+matching_keywords:"scientist"^5+OR+matching_keywords:"super+computer"^5+OR+matching_keywords:"assault+rifle"^5+OR+matching_keywords:"
adopted+brother"^5+OR+matching_keywords:"villain+arrested"^5+OR+matching_keywords:"man+punching+a+woman"^5+OR+matching_keywords:"soldier"^5+OR+matching_keywords:"national+guard"^5+OR+matching_keywords:"hammer"^5+OR+matching_keywords:"chase"^5+OR+matchin

Re: Performance improvement in large OR query using boosting (also, cache doesn't work?)

2012-12-14 Thread David Radunz

Hey,

Sorry for the delay; I had to enable larger header buffers in Jetty 
to do this as a GET query (LOL). Anyway, I have put the results on 
pastebin to try and make it more presentable, though it's mostly failed.
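
One way to sidestep the header-size limit is to send the parameters in a POST body instead of the URL, since Solr's select handler also accepts form-encoded POST requests. The following is only a sketch under that assumption: it rebuilds a small slice of the boosted OR group from the logged query and wraps it in a POST request without sending it. The helper names and the localhost URL are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_boosted_or_query(boosts):
    """Join field:"value"^boost clauses with OR, in the style of the logged query."""
    clauses = [f'{field}:"{value}"^{boost}' for (field, value), boost in boosts.items()]
    return "(" + " OR ".join(clauses) + ")"


def build_post_request(select_url, query, rows=10):
    """Put the parameters in the request body instead of the URL."""
    body = urlencode({"q": query, "rows": rows}).encode("utf-8")
    return Request(
        select_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


boosts = {
    ("matching_genres", "Science Fiction"): 2000,
    ("matching_keywords", "superhero team"): 400,
    ("matching_keywords", "iron man"): 100,
}
query = build_boosted_or_query(boosts)
request = build_post_request("http://localhost:8983/solr/select", query)
print(request.get_method(), len(request.data), "bytes in body")
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out here so the sketch runs without a Solr server.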


1st Query: http://pastebin.com/uSGtQjA3   (query with a freshly started 
solr)
2nd Query: http://pastebin.com/4NbSdEHC  (as before, just the same query 
again)


Seemingly, the slowness happens in 'processing'. But yeah, I'm sure 
you guys understand all of that better than I do :)


Cheers,

David

On 14/12/2012 11:13 PM, Markus Jelsma wrote:

Hi,

This is insane indeed! Please enable debugging and report the prepare and 
process times for the query component. I think the prepare time is very high in 
both queries and the process time is slightly less for the second query due to 
caching.

Cheers,

  
  
-Original message-

From: David Radunz
Sent: Fri 14-Dec-2012 13:04
To: solr-user@lucene.apache.org
Subject: Performance improvement in large OR query using boosting (also, cache doesn't work?)

Hey Guys,

  I have really been enjoying Solr and I can't really blame the
slowness on solr as this is a pretty insane query. However, I am a
little curious why a repeated query moments later also suffers from the
same load time? Anyway, the queries are:

// 1st Query

INFO: [] webapp=/solr path=/select/
params={facet=on&fl=id,name,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date,tax_class_id&sort=score+desc,retail_rating+desc,release_date+desc,year_made+desc&start=&q=**+-sku:"1029996"+-movie_id:"2665"+(series_names_attr_opt_id:"426317"^9000+OR+cat_id:"307"^1000+OR+cat_id:"308"^1000+OR+matching_genres:"Science+Fiction"^2000+OR+matching_genres:"Action"^1000+OR+matching_genres:"Thriller"^800+OR+matching_keywords:"superhero+team"^400+OR+matching_keywords:"superhero"^400+OR+matching_keywords:"superheroine"^300+OR+matching_keywords:"marvel+comic"^200+OR+matching_keywords:"costumed+hero"^100+OR+matching_keywords:"alien+life-form"^100+OR+matching_keywords:"thor"^100+OR+matching_keywords:"captain+america"^100+OR+matching_keywords:"the+incredible+hulk"^100+OR+matching_keywords:"iron+man"^100+OR+matching_keywords:"shape+shifting+alien"^50+OR+matching_keywords:"worl

  
d+domination"^50+OR+matching_keywords:"human+alien"^50+OR+matching_keywords:"alien+invasion"^25+OR+matching_keywords:"super+strength"^25+OR+matching_keywords:"invisibility+cloak"^10+OR+matching_keywords:"warrior+race"^10+OR+matching_keywords:"alien+race"^5+OR+matching_keywords:"super+speed"^5+OR+matching_keywords:"flying+fortress"^5+OR+matching_keywords:"teleportation"^5+OR+matching_keywords:"creature"^5+OR+matching_keywords:"electromagnetic+pulse"^5+OR+matching_keywords:"immortality"^5+OR+matching_keywords:"mothership"^5+OR+matching_keywords:"mind+control"^5+OR+matching_keywords:"god"^5+OR+matching_keywords:"inventor"^5+OR+matching_keywords:"space+travel"^5+OR+matching_keywords:"fictional+government+agency"^5+OR+matching_keywords:"beautiful+woman"^5+OR+matching_keywords:"based+on+comic+book"^5+OR+matching_keywords:"army"^5+OR+matching_keywords:"blockbuster"^5+OR+matching_keywords:"me
  
rcenary"^5+OR+matching_keywords:"martial+arts"^5+OR+matching_keywords:"shield"^5+OR+matching_keywords:"captain"^5+OR+matching_keywords:"shot+in+the+head"^5+OR+matching_keywords:"shootout"^5+OR+matching_keywords:"flashback"^5+OR+matching_keywords:"pistol"^5+OR+matching_keywords:"airplane"^5+OR+matching_keywords:"helmet"^5+OR+matching_keywords:"car+accident"^5+OR+matching_keywords:"body+landing+on+a+car"^5+OR+matching_keywords:"spear"^5+OR+matching_keywords:"laboratory"^5+OR+matching_keywords:"warrior+woman"^5+OR+matching_keywords:"punching+bag"^5+OR+matching_keywords:"banquet"^5+OR+matching_keywords:"macguffin"^5+OR+matching_keywords:"mission"^5+OR+matching_keywords:"attack"^5+OR+matching_keywords:"hand+to+hand+combat"^5+OR+matching_keywords:"police+officer"^5+OR+matching_keywords:"robot"^5+OR+matching_keywords:"disguise"^5+OR+matching_keywords:"beating"^5+OR+matching_keywords:
  
"falling+from+height"^5+OR+matching_keywords:"government+agent"^5+OR+matching_keywords:"battleship"^5+OR+matching_keywords:"parking+garage"^5+OR+matching_keywords:"head+butt"^5+OR+matching_keywords:"forest"^5+OR+matching_keywords:"crushed+to+death"^5+OR+matching_keywords:"deception"^5+OR+matching_keywords:"philanthropist"^5+OR+matching_keywords:"knife+fight"^5+OR+matching_keywords:"portal"^5+OR+matching_keywords:"knife"^5+OR+matching_keywords:"underwater+scene"^5+OR+matching_keywords:"exploding+plane"^5+OR+matching_keywords:"robot+suit"^5+OR+matching_keywords:"outer+space"^5+OR+matching_keywords:"stabbed+in+the+sto

Re: Advanced search with results matrix

2012-05-04 Thread David Radunz

Hey Gnanam,

1. If I understand correctly, you just need to perform one query, like so 
(translated to proper syntax, of course):
  ("SQL Server" OR SQL) OR ("Visual Basic" OR VB.NET) OR (Java AND JavaScript)
2. Every query you perform with Solr returns the 'results' count. If you 
ONLY want the results count, simply set rows to 0 (but I'm guessing you 
will want both the results and the count, so as to avoid 2 trips).
  - The 'results count' is the numFound attribute of the result element:
    <result name="response" numFound="..." start="0"/>
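
If a separate count is needed for each cell of the matrix, one option that differs from the single OR query above is to issue one request with rows=0 and one facet.query per combination; Solr reports each count under facet_counts/facet_queries in the response. The sketch below assumes that approach and that the textbox expressions are valid for the default query parser and default search field; the helper name matrix_params is made up.

```python
from itertools import combinations
from urllib.parse import urlencode


def matrix_params(textboxes):
    """Build one request whose facet.query entries cover every textbox combination."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
    for size in range(1, len(textboxes) + 1):
        for combo in combinations(textboxes, size):
            # AND between textboxes, as described in the question.
            facet_query = " AND ".join(f"({expr})" for expr in combo)
            params.append(("facet.query", facet_query))
    return params


textboxes = ['"SQL Server" OR SQL', '"Visual Basic" OR VB.NET', "Java AND JavaScript"]
params = matrix_params(textboxes)
print(urlencode(params))
```

With three textboxes this yields seven facet.query parameters, one per row of the matrix, so the whole page can be filled from a single round trip.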


David


On 4/05/2012 4:46 PM, Gnanakumar wrote:

Hi,

First off, we're a happy user of Apache Solr v3.1 Enterprise search server,
integrated and successfully running in our LIVE Production server.

Now, we're enhancing our existing search feature in our web application as
explained below, that truly helps application users in making informed
decision before getting their search results:

There will be 3 textboxes provided and users can enter keyword phrases with
OR, AND combination within each textbox as shown below, for example:
Textbox 1: "SQL Server" OR SQL
Textbox 2: "Visual Basic" OR VB.NET
Textbox 3: Java AND JavaScript

If User clicks "Search" button, we want to present an intermediate or
"results matrix" page that would generate all possible combinations for 3
textboxes with how many records found for each combination as given below
(between combination it is AND operation).  This, as I said before, truly
helps application users in making informed decision/choice before getting
their search results:
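+---------+---------------------+--------------------------+---------------------+
| Matches | Textbox 1           | Textbox 2                | Textbox 3           |
+---------+---------------------+--------------------------+---------------------+
|     200 | "SQL Server" OR SQL |                          |                     |
|     300 |                     | "Visual Basic" OR VB.NET |                     |
|     400 |                     |                          | Java AND JavaScript |
|     250 | "SQL Server" OR SQL | "Visual Basic" OR VB.NET |                     |
|     350 |                     | "Visual Basic" OR VB.NET | Java AND JavaScript |
|     300 | "SQL Server" OR SQL |                          | Java AND JavaScript |
|     100 | "SQL Server" OR SQL | "Visual Basic" OR VB.NET | Java AND JavaScript |
+---------+---------------------+--------------------------+---------------------+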
Only clicking one of these "Matches" counts will display the actual results of
that particular search.

My questions are,
1) Do I need to run search separately for each combination or is it
possible to combine and obtain "results matrix" page by making "only" one
single call to Apache Solr?  Or are there any plug-ins available
that provides functionality close to my use case?
2) How do I instruct Solr to return only count (not result) for the
search performed?
3) Any ideas/suggestions/approaches/resources are really appreciated
and welcomed

Regards,
Gnanam






Re: indexing unstructured text (tweets)

2012-05-28 Thread David Radunz

Hey,

I think you might be over-thinking this. Tweets are structured. You 
have the content (tweet), the user who tweeted it and various other meta 
data. So your 'document' might look like this:




ABCD1234
I bought some apples
JohnnyBoy



To get this structure, you can use any programming language you're 
comfortable with and load it into Solr via various means. Obviously you 
can add more 'meta' fields that you get from twitter if you want as well.
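
As an illustration only, here is a small sketch that turns one tweet into a Solr XML add message. The field names id, text and user are placeholders chosen for this example (the real names must match your schema.xml), and the helper tweet_to_add_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def tweet_to_add_xml(tweet_id, text, user):
    """Serialize one tweet as a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are placeholders; they must match your schema.xml.
    for name, value in (("id", tweet_id), ("text", text), ("user", user)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


xml_message = tweet_to_add_xml("ABCD1234", "I bought some apples", "JohnnyBoy")
print(xml_message)
```

The resulting message can be posted to Solr's /update handler with a text/xml content type, followed by a commit.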


David

On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

Hi all.

I am in the process of setting up Solr for my application,
which is full text search on a bunch of tweets from twitter.

I am afraid I am missing something.
From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my problem,
but also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I also need to put Tika in my pipeline?

Best regards,
Giovanni Gherdovich




Weighted Search Results / Multi-Value Value's Not Aggregating Weight

2012-08-22 Thread David Radunz

Hey,

I have been having some problems getting good search results when 
using weighting against many fields with multi-values. After quite a bit 
of testing, it seems to me that the problem (at least as far as my 
query is concerned) is that only one weighting is taken into account 
per field. For example, in a multi-value field, if we have "Comedy" and 
"Romance" and have separate weightings for those, the one with the 
highest weighting is used (and not a combined weighting). This means 
that searching for a romantic comedy returns "Alvin and the Chipmunks" 
(Family, Children Comedy).


Query:

facet=on&fl=id,name,matching_genres,score,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date&sort=score+desc,retail_rating+desc,release_date+desc&start=&q=**+-sku:"1019660"+-movie_id:"1805"+-movie_id:"1806"+(series_names_attr_opt_id:"454282"^9000+OR+cat_id:"22"^9+OR+cat_id:"248"^9+OR+cat_id:"249"^9+OR+matching_genres:"Comedy"^9+OR+matching_genres:"Romance"^7+OR+matching_genres:"Drama"^5)&fq=store_id:"1"+AND+avail_status_attr_opt_id:"available"+AND+(format_attr_opt_id:"372619")&rows=4

Now if I change matching_genres:"Romance"^7 to 
matching_genres:"Romance"^70 (adding a 0), suddenly the first result 
is "Sex and the City: The Movie / Sex and the City 2" (which, ironically, 
is "Drama", "Comedy", "Romance": the very combination we are looking for).


So is there a way to structure my query so that all of the 
multi-value values are treated individually? Aggregating the 
weighting/score?


Thanks in advance!

David



Re: Weighted Search Results / Multi-Value Value's Not Aggregating Weight

2012-08-22 Thread David Radunz

Hey,

Please disregard this, I worked out what the actual problem was. I 
am going to post another query with something else I discovered.


Thanks :)

David

On 22/08/2012 7:24 PM, David Radunz wrote:

Hey,

I have been having some problems getting good search results when 
using weighting against many fields with multi-values. After quite a 
bit of testing, it seems to me that the problem (at least as far as 
my query is concerned) is that only one weighting is taken into 
account per field. For example, in a multi-value field, if we have 
"Comedy" and "Romance" and have separate weightings for those, the 
one with the highest weighting is used (and not a combined weighting). 
This means that searching for a romantic comedy returns "Alvin and the 
Chipmunks" (Family, Children Comedy).


Query:

facet=on&fl=id,name,matching_genres,score,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date&sort=score+desc,retail_rating+desc,release_date+desc&start=&q=**+-sku:"1019660"+-movie_id:"1805"+-movie_id:"1806"+(series_names_attr_opt_id:"454282"^9000+OR+cat_id:"22"^9+OR+cat_id:"248"^9+OR+cat_id:"249"^9+OR+matching_genres:"Comedy"^9+OR+matching_genres:"Romance"^7+OR+matching_genres:"Drama"^5)&fq=store_id:"1"+AND+avail_status_attr_opt_id:"available"+AND+(format_attr_opt_id:"372619")&rows=4 



Now if I change matching_genres:"Romance"^7 to 
matching_genres:"Romance"^70 (adding a 0), suddenly the first 
result is "Sex and the City: The Movie / Sex and the City 2" (which, 
ironically, is "Drama", "Comedy", "Romance": the very combination we 
are looking for).


So is there a way to structure my query so that all of the 
multi-value values are treated individually? Aggregating the 
weighting/score?


Thanks in advance!

David