Slow queries for common terms

2013-03-21 Thread David Parks
I've got a query that takes 15 seconds to return whenever I have the term
"book" in a query that isn't cached. That's a pretty common term in our
search index. We're indexing about 120 GB of text data. We only store terms
and IDs, no document data, and the disk is virtually unused; it's all CPU
time.

 

I haven't done much yet to optimize and scale Solr, as we're only trying
to support a small number of users in a private beta. I currently have only
a couple of gigs of RAM dedicated to Solr (we've ordered more hardware for
it, but it's not in yet).

 

I wonder if there's something I can do in the short term to alleviate the
problem. Many searches work great, but the ones that take 15+ seconds are a
black eye. I'd be happy with a short-term fix followed in the near future by
a more proper long-term fix.

 

If I were to take a stab at this I'd say the following two are the short-
and long-term solutions:

* Short: Configure Solr to short-circuit queries, thus reducing query
quality but guaranteeing a certain response time (I'm ok with this tradeoff,
but is this possible, or does it carry risks I need to consider more? See
the sketch below.)

* Long: Implement sharding, get more hardware resources for these boxes,
and split up the index across multiple servers.
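
(For the "short" option, the closest built-in mechanism I'm aware of is
Solr's timeAllowed parameter, which caps the time spent collecting results
and returns partial results when the limit is hit. A minimal sketch of what
I have in mind, assuming the standard /select handler in solrconfig.xml and
a placeholder limit:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <!-- stop collecting after 2 seconds and return whatever matched so
             far; the response header then carries partialResults=true -->
        <int name="timeAllowed">2000</int>
      </lst>
    </requestHandler>

The same value can also be passed per request as &timeAllowed=2000.)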

 

Am I on track in my thinking here?

Thanks,

David

 

My long query:

(The XML markup of this response was lost in archiving; the surviving values
are summarized below.)

responseHeader:
  status: 0
  QTime: 15464
  params (names inferred from the values): debugQuery=true,
  q=cook book fourth baby, wt=xml, plus one more param set to true whose
  name was lost

result docs, ten rows of three values each (the element names were lost;
they appear to be the document id, a stored field, and _version_):
  1004505170125  0  1428753459934396420
  1005401904542  0  1428760446538612739
  1003707566177  0  1428772178357125123
  1000610053924  0  1428787238421921794
  1000611651986  0  1428796273825153026
  1001419625706  0  1428823418682212355
  1004804435353  0  1428823070202658818
  1000514089336  0  1428755943804370945
  1000329540261  0  1428805063523958786
  1001607999738  0  1428757460650295298

debug:
  rawquerystring: cook book fourth baby
  querystring: cook book fourth baby
  parsedquery: all:cook all:book all:fourth all:babi
  parsedquery_toString: all:cook all:book all:fourth all:babi

explain:

10.476225 = (MATCH) product of:

  13.9683 = (MATCH) sum of:

7.988946 = (MATCH) weight(all:cook in 3428426) [DefaultSimilarity],
result of:

  7.988946 = score(doc=3428426,freq=2.0 = termFreq=2.0

), product of:

0.5447923 = queryWeight, product of:

  5.925234 = idf(docFreq=931212, maxDocs=128248074)

  0.091944434 = queryNorm

14.664206 = fieldWeight in 3428426, product of:

  1.4142135 = tf(freq=2.0), with freq of:

2.0 = termFreq=2.0

  5.925234 = idf(docFreq=931212, maxDocs=128248074)

  1.75 = fieldNorm(doc=3428426)

0.29064935 = (MATCH) weight(all:book in 3428426) [DefaultSimilarity],
result of:

  0.29064935 = score(doc=3428426,freq=1.0 = termFreq=1.0

), product of:

0.12357436 = queryWeight, product of:

  1.3440113 = idf(docFreq=90917737, maxDocs=128248074)

  0.091944434 = queryNorm

2.3520198 = fieldWeight in 3428426, product of:

  1.0 = tf(freq=1.0), with freq of:

1.0 = termFreq=1.0

  1.3440113 = idf(docFreq=90917737, maxDocs=128248074)

  1.75 = fieldNorm(doc=3428426)

5.688705 = (MATCH) weight(all:babi in 3428426) [DefaultSimilarity],
result of:

  5.688705 = score(doc=3428426,freq=1.0 = termFreq=1.0

), product of:

0.54670167 = queryWeight, product of:

  5.9460006 = idf(docFreq=912073, maxDocs=128248074)

  0.091944434 = queryNorm

10.405501 = fieldWeight in 3428426, product of:

  1.0 = tf(freq=1.0), with freq of:

1.0 = termFreq=1.0

  5.9460006 = idf(docFreq=912073, maxDocs=128248074)

  1.75 = fieldNorm(doc=3428426)

  0.75 = coord(3/4)





10.476225 = (MATCH) product of:

  13.9683 = (MATCH) sum of:

7.988946 = (MATCH) weight(all:cook in 4132020) [DefaultSimilarity],
result of:

  7.988946 = score(doc=4132020,freq=2.0 = termFreq=2.0

), product of:

0.5447923 = queryWeight, product of:

  5.925234 = idf(docFreq=931212, maxDocs=128248074)

  0.091944434 = queryNorm

14.664206 = fieldWeight in 4132020, product of:

  1.4142135 = tf(freq=2.0), with freq of:

2.0 = termFreq=2.0

  5.925234 = idf(docFreq=931212, maxDocs=128248074)

  1.75 = fieldNorm(doc=4132020)

0.29064935 = (MATCH) weight(all:book in 4132020) [DefaultSimilarity],
result of:

  0.29064935 = score(doc=4132020,freq=1.0 = termFreq=1.0

), product of:

0.12357436 = queryWeight, product of:

  1.3440113 = idf(docFreq=90917737, maxDocs=128248074)

  0.091944434 = queryNorm

2.3520198 = fieldWeight in 4132020, product of:

  1.0 = tf(freq=1.0), with freq of:

1.0 = termFreq=1.0

  1.3440113 = idf(docFreq=90917737, maxDocs=128248074)

  1.75 = fieldNo

RE: Slow queries for common terms

2013-03-21 Thread David Parks
We have 300M documents, each about a paragraph of text on average. The index
is 140GB in size. I'm not sure how to find the IDF score; was that in the
debug query below?

It seems that any query with the word "book" in it triggers a 15 sec
response time (unless it's the 2nd time we run the same query). Looking at
terms, 'book' is the 2nd highest term, appearing in 90M documents in the index.

Calling 'book' a stop word doesn't seem reasonable, and while that article
on bigrams and common grams is fascinating, I wonder if it addresses this
situation: we aren't really likely to get a bi-gram phrase match between the
search "book sales improvement" and the terms in the document "category book
marketing and sales today the real guide to improving", right? I think this
is what's happening here: everything with the common phrase "category book"
is getting included, which seems logical and correct.



-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Thursday, March 21, 2013 5:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

I think you can start by reading this blog
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
and try out the approach using a dictionary of the most common words in your
index.

You don't say how many documents, avg. doc size, the IDF value of "book",
how much RAM, whether you utilize disk caching well enough and many other
things which could affect this situation. But the pure fact that only a few
common search words trigger such a delay would suggest commongrams as a
possible way forward.
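
For reference, CommonGrams is wired in per field type in schema.xml; a
minimal sketch, with the field type name and word-list file as placeholders
(note that it mainly pays off for phrase queries, since those are what read
the huge positions lists):

    <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index word pairs such as "category_book" alongside the single terms -->
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- at query time, prefer the pair token so the big single-term postings can be skipped -->
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>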

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 21 March 2013 at 11:09, David Parks wrote:

> I've got a query that takes 15 seconds to return whenever I have the 
> term "book" in a query that isn't cached. That's a pretty common term 
> in our search index. We're indexing about 120 GB of text data. We only 
> store terms and IDs, no document data, and the disk is virtually 
> unused, it's all CPU time.
> 
> 
> 
> I haven't done much yet to optimizing and scale solr, as we're only 
> trying to support a small number of users in a private beta. I 
> currently only have a couple of gigs of ram dedicated to Solr (we've 
> ordered more hardware for it, but it's not in yet).
> 
> 
> 
> I wonder if there's something I can do in the short term to alleviate 
> the problem. Many searches work great, but these ones that take 15+ 
> sec are a black eye. I'd be happy with a short term fix followed in 
> the near future by a more proper long-term fix.
> 
> 
> 
> If I were to take a stab at this I'd say the following two are the 
> short and long term solutions:
> 
> . Short: Configure solr to short-circuit queries, thus reducing query
> quality, but guaranteeing a certain response time (I'm ok with this 
> tradeoff, but is this possible or contains risks I need to consider 
> more?)
> 
> . Long: Implement sharding, get more hardware resources for these
> boxes and split up the index across multiple servers.
> 
> 
> 
> Am I on track in my thinking here?
> 
> Thanks,
> 
> David
> 
> 
> 


RE: Slow queries for common terms

2013-03-21 Thread David Parks
I figured I was trying to pull a coup here, but this is a temporary
configuration while we only run a few users through an early beta. The
performance is perfectly good for most terms; it's just this "book" term. I'm
curious how adding RAM will solve that. I can see how deploying SolrCloud
and sharding should affect it, but would simply giving Solr 16GB of RAM
improve query time with this one term that is common to 90M of the 300M
documents?

In due time I do plan to implement solr cloud and run the whole thing
through proper load testing. Right now I'm just trying to get it to "work"
for a few users. If you could elaborate a bit on your thinking I'd be quite
grateful.

David


-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Thursday, March 21, 2013 8:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

If you say that you try to index 300M docs in ONE single Solr server, with
"a few gigs" of RAM, then that's the reason for some bad performance right
there. You should benchmark to find the sweet-spot of how many documents you
want to fit per node/shard and still have acceptable indexing/query
performance.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 21 March 2013 at 12:43, David Parks wrote:

> We have 300M documents, each about a paragraph of text on average. The 
> index is 140GB in size. I'm not sure how to find the IDF score, was 
> that in the debug query below?
> 
> It seems that any query with the word "book" in it triggers a 15 sec 
> response time (unless it's the 2nd time we run the same query). 
> Looking at terms, 'book' is the 2nd highest term with 90M documents in the
index.
> 
> Calling 'book' a stop word doesn't seem reasonable, and while that 
> article on bigrams and common grams is fascinating, I wonder if it 
> addresses this situation, in which we aren't really likely to manage a 
> bi-gram phrase match between the search "book sales improvement", and the
terms in the document:
> "category book marketing and sales today the real guide to improving"
right?
> I think this is what's happening here, everything with a common phrase 
> "category book" is getting included, which seems logical and correct.
> 
> 
> 
> -Original Message-
> From: Jan Høydahl [mailto:jan@cominvent.com]
> Sent: Thursday, March 21, 2013 5:43 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Slow queries for common terms
> 
> Hi,
> 
> I think you can start by reading this blog 
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-co
> mmon-w
> ords-part-2 and try out the approach using a dictionary of the most 
> common words in your index.
> 
> You don't say how many documents, avg. doc size, the IDF value of 
> "book", how much RAM, whether you utilize disk caching well enough and 
> many other things which could affect this situation. But the pure fact 
> that only a few common search words trigger such a delay would suggest 
> commongrams as a possible way forward.
> 
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com Solr Training - www.solrtraining.com
> 
> On 21 March 2013 at 11:09, David Parks wrote:
> 
>> I've got a query that takes 15 seconds to return whenever I have the 
>> term "book" in a query that isn't cached. That's a pretty common term 
>> in our search index. We're indexing about 120 GB of text data. We 
>> only store terms and IDs, no document data, and the disk is virtually 
>> unused, it's all CPU time.
>> 
>> 
>> 
>> I haven't done much yet to optimizing and scale solr, as we're only 
>> trying to support a small number of users in a private beta. I 
>> currently only have a couple of gigs of ram dedicated to Solr (we've 
>> ordered more hardware for it, but it's not in yet).
>> 
>> 
>> 
>> I wonder if there's something I can do in the short term to alleviate 
>> the problem. Many searches work great, but these ones that take 15+ 
>> sec are a black eye. I'd be happy with a short term fix followed in 
>> the near future by a more proper long-term fix.
>> 
>> 
>> 
>> If I were to take a stab at this I'd say the following two are the 
>> short and long term solutions:
>> 
>> . Short: Configure solr to short-circuit queries, thus reducing
> query
>> quality, but guaranteeing a certain response time (I'm ok with this 
>> tradeoff, but is this possible or con

RE: Slow queries for common terms

2013-03-23 Thread David Parks
I see the CPU working very hard, and at the same time I see 2 MB/sec disk
access for that 15 seconds. I am not running it this instant, but it seems
to me that there were more CPU cycles available, so unless it's an issue of
not being able to multithread it any further, I'd say it's more I/O related.

I'm going to set up SolrCloud and shard across the 2 servers I have
available for now. It's not an optimal setup, but it's what we have while
we're in a private beta period, and maybe it'll improve things (I've got 2
servers with 2x 4TB disks in RAID-0, shared with the webservers).

I'll work towards some improved IO performance and maybe more shards and see
how things go. I'll also be able to up the RAM in just a couple of weeks.

Are there any settings I should think of in terms of improving cache
performance when I can give it say 10GB of RAM?

Thanks, this has been tremendously helpful.

David


-Original Message-
From: Tom Burton-West [mailto:tburt...@umich.edu] 
Sent: Saturday, March 23, 2013 1:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi David and Jan,

I wrote the blog post, and David, you are right, the problem we had was with
phrase queries, because our positions lists are so huge.  Boolean queries
don't need to read the positions lists.  I think you need to determine
whether you are CPU bound or I/O bound.  It is possible that you are I/O
bound and reading the term frequency postings for 90 million docs is taking
a long time.  In that case, more memory in the machine (but not dedicated to
Solr) might help, because Solr relies on OS disk caching for caching the
postings lists.  You would still need to do some cache warming with your
most common terms.
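
(If it helps, that warming can be automated in solrconfig.xml; a minimal
sketch, with the query terms as placeholders for whatever your own heaviest
terms are:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!-- pull the postings for the heaviest terms into the OS disk cache
             whenever a new searcher is opened -->
        <lst><str name="q">book</str></lst>
        <lst><str name="q">cook book</str></lst>
      </arr>
    </listener>

The same listener can also be registered for the firstSearcher event so the
cache is warm right after startup.)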

On the other hand as Jan pointed out, you may be cpu bound because Solr
doesn't have early termination and has to rank all 90 million docs in order
to show the top 10 or 25.

Did you try the OR search to see if your CPU is at 100%?

Tom

On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl  wrote:

> Hi
>
> There might not be a final cure with more RAM if you are CPU bound.
> Scoring 90M docs is some work. Can you check what's going on during 
> those
> 15 seconds? Is your CPU at 100%? Try an (foo OR bar OR baz) search 
> which generates >100mill hits and see if that is slow too, even if you 
> don't use frequent words.
>
> I'm sure you can find other frequent terms in your corpus which 
> display similar behaviour, words which are even more frequent than 
> "book". Are you using "AND" as default operator? You will benefit from 
> limiting the number of results as much as possible.
>
> The real solution is to shard across N number of servers, until you 
> reach the desired performance for the desired indexing/querying load.
>
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com Solr Training - www.solrtraining.com
>
>



RE: Slow queries for common terms

2013-03-25 Thread David Parks
"book" by itself returns in 4s (non-optimized disk IO), running it a second
time returned 0s, so I think I can presume that the query was not cached the
first time. This system has been up for week, so it's warm.

I'm going to give your article a good long read, thanks for that.   

I guess good fast disks/SSDs and sharding should also improve on the base 4
sec query time. How _does_ Google get their query times down to 0.35s
anyway? I presume their indexes are larger than my 150GB index. :)

I'm still a bit worried about what will happen when my index is 500GB
(it'll happen soon enough), even with sharding... well... I'd just need a
lot of servers it seems, and my feeling is that if I need a lot of servers
for a few users, how will it scale to many users?

Thanks for the great discussion,
Dave


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, March 25, 2013 10:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

take a look here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

looking at memory consumption can be a bit tricky to interpret with
MMapDirectory.

But you say "I see the CPU working very hard" which implies that your issue
is just scoring 90M documents. A way to test: try q=*:*&fq=field:book. My
bet is that that will be much faster, in which case scoring is your
choke-point and you'll need to spread that load across more servers, i.e.
shard.

When running the above, make sure of a couple of things:
1> you haven't run the fq query before (or you have filterCache turned
completely off).
2> you _have_ run a query or two that warms up your low-level caches.
Doesn't matter what, just as long as it doesn't have an fq clause.
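
Concretely, something along these lines (host, port, and the "all" field are
taken from the debug output earlier in the thread; adjust to your setup):

    # warm the low-level caches first with an unrelated query (no fq clause)
    curl 'http://localhost:8081/solr/select?q=fourth+baby&rows=10'

    # the actual test: match everything, filter on the common term;
    # no relevance scoring of the ~90M matching docs is involved
    curl 'http://localhost:8081/solr/select?q=*:*&fq=all:book&rows=10'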

Best
Erick



On Sat, Mar 23, 2013 at 3:10 AM, David Parks  wrote:

> I see the CPU working very hard, and at the same time I see 2 MB/sec 
> disk access for that 15 seconds. I am not running it this instant, but 
> it seems to me that there was more CPU cycles available, so unless 
> it's an issue of not being able to multithread it any  further I'd say
it's more IO related.
>
> I'm going to set up solr cloud and shard across the 2 servers I have 
> available for now. It's not an optimal setup we have while we're in a 
> private beta period, but maybe it'll improve things (I've got 2 
> servers with 2x 4TB disks in raid-0 shared with the webservers).
>
> I'll work towards some improved IO performance and maybe more shards 
> and see how things go. I'll also be able to up the RAM in just a 
> couple of weeks.
>
> Are there any settings I should think of in terms of improving cache 
> performance when I can give it say 10GB of RAM?
>
> Thanks, this has been tremendously helpful.
>
> David
>
>
> -Original Message-
> From: Tom Burton-West [mailto:tburt...@umich.edu]
> Sent: Saturday, March 23, 2013 1:38 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Slow queries for common terms
>
> Hi David and Jan,
>
> I wrote the blog post, and David, you are right, the problem we had 
> was with phrase queries because our positions lists are so huge.  
> Boolean
> queries don't need to read the positions lists.   I think you need to
> determine whether you are CPU bound or I/O bound.It is possible that
> you are I/O bound and reading the term frequency postings for 90 
> million docs is taking a long time.  In that case, More memory in the 
> machine (but not dedicated to Solr) might help because Solr relies on 
> OS disk caching for caching the postings lists.  You would still need 
> to do some cache warming with your most common terms.
>
> On the other hand as Jan pointed out, you may be cpu bound because 
> Solr doesn't have early termination and has to rank all 90 million 
> docs in order to show the top 10 or 25.
>
> Did you try the OR search to see if your CPU is at 100%?
>
> Tom
>
> On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl 
> wrote:
>
> > Hi
> >
> > There might not be a final cure with more RAM if you are CPU bound.
> > Scoring 90M docs is some work. Can you check what's going on during 
> > those
> > 15 seconds? Is your CPU at 100%? Try an (foo OR bar OR baz) search 
> > which generates >100mill hits and see if that is slow too, even if 
> > you don't use frequent words.
> >
> > I'm sure you can find other frequent terms in your corpus which 
> > display similar behaviour, words which are even more frequent than 
> > "book". Are you using "AND" as default operator? You will benefit 
> > from limiting the number of results as much as possible.
> >
> > The real solution is to shard across N number of servers, until you 
> > reach the desired performance for the desired indexing/querying load.
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS - 
> > www.cominvent.com Solr Training - www.solrtraining.com
> >
> >
>
>



RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-02 Thread David Parks
Isn't this an AWS security groups question? You should probably post this 
question on the AWS forums, but for the moment, here's the basic reading 
material - go set up your EC2 security groups and lock down your systems.


http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

If you just want to password protect Solr here are the instructions:

http://wiki.apache.org/solr/SolrSecurity

But I most certainly would not leave it open to the world even with a password 
(note that the basic password authentication sends passwords in clear text if 
you're not using HTTPS, best lock the thing down behind a firewall).
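
(If you end up firewalling the box itself rather than relying on EC2
security groups, a minimal iptables sketch would be something like the
following; the port (Solr's default 8983 here) and the app-server subnet are
placeholders:

    # let only the internal app servers reach Solr, drop everyone else
    iptables -A INPUT -p tcp --dport 8983 -s 10.0.0.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP
)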

Dave


-Original Message-
From: DC tech [mailto:dctech1...@gmail.com] 
Sent: Tuesday, April 02, 2013 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis - Odd results - what am I doing wrong?

OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 



Gagandeep singh  wrote:

>say &debugQuery=true&mlt=true and see the scores for the MLT query, not 
>a sample query. You can use Amazon ec2 to bring up your solr, you 
>should be able to get a micro instance for free trial.
>
>
>On Mon, Apr 1, 2013 at 5:10 AM, dc tech  wrote:
>
>> I did try the raw query against the *simi* field and those seem to 
>> return results in the order expected.
>> For instance, Acura MDX has  ( large, SUV, 4WD   Luxury) in the simi field.
>> Running a query with those words against the simi field returns the 
>> expected models (X5, Audi Q5, etc) and then the subsequent documents 
>> have decreasing relevance. So the basic query mechanism seems to be fine.
>>
>> The issue just seems to be with MoreLikeThis component and handler.
>> I can post the index on a public SOLR instance - any suggestions? (or 
>> for
>> hosting)
>>
>>
>> On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh 
>> > >wrote:
>>
>> > If you can bring up your solr setup on a public machine then im 
>> > sure a
>> lot
>> > of debugging can be done. Without that, i think what you should 
>> > look at
>> is
>> > the tf-idf scores of the terms like "camry" etc. Usually idf is the 
>> > deciding factor into which results show at the top (tf should be 1 
>> > for
>> your
>> > data).
>> > Enable &debugQuery=true and look at explain section to see show 
>> > score is getting calculated.
>> >
>> > You should try giving different boosts to class, type, drive, size 
>> > to control the results.
>> >
>> >
>> > On Sun, Mar 31, 2013 at 8:52 PM, dc tech  wrote:
>> >
>> >> I am running some experiments on more like this and the results 
>> >> seem rather odd - I am doing something wrong but just cannot figure out 
>> >> what.
>> >> Basically, the similarity results are decent - but not great.
>> >>
>> >> *Issue 1  = Quality*
>> >> Toyota Camry : finds Altima (good) but then next one is Camry 
>> >> Hybrid whereas it should have found Accord.
>> >> I have normalized the data into a simi field which has only the 
>> >> attributes that I care about.
>> >> Without the simi field, I could not get mlt.qf boosts to work well
>> enough
>> >> to return results
>> >>
>> >> *Issue 2*
>> >> Some fields do not work at all. For instance, text+simi (in 
>> >> mlt.fl)
>> works
>> >> whereas just simi does not.
>> >> So some weirdness that am just not understanding.
>> >>
>> >> Would be grateful for your guidance !
>> >>
>> >>
>> >> Here is the setup:
>> >> *1. SOLR Version*
>> >> solr-spec 4.2.0.2013.03.06.22.32.13
>> >> solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
>> >> lucene-spec 4.2.0
>> >> lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
>> >>
>> >> *2. Machine Information*
>> >> Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
>> >> 19.0-b09)
>> >> Windows 7 Home 64 Bit with 4 GB RAM
>> >>
>> >> *3. Sample Data *
>> >> I created this 'dummy' data of cars  - the idea being that these 
>> >> would
>> be
>> >> sufficient and simple to generate similarity and understand how it 
>> >> would work.
>> >> There are 181 rows in the data set (I have attached it for 
>> >> reference in CSV format)
>> >>
>> >> [image: Inline image 1]
>> >>
>> >> *4. SCHEMA*
>> >> *Field Definitions*
>> >> [The field and copyField definitions were stripped in archiving; all of
>> >> the fields were declared with termVectors="true", and one of them was
>> >> stored and multiValued.]
>> >> Note that the "simi" field ends u

SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Step 1: distribute processing

We have 2 servers on which we'll run 2 SolrCloud instances.

We'll define 2 shards so that both servers are busy for each request
(improving response time of the request).

 

Step 2: Failover

We would now like to ensure that if either of the servers goes down (we're
very unlucky with disks), that the other will be able to take over
automatically.

So we define 2 shards with a replication factor of 2.

 

So we have:

* Server 1: Shard 1, Replica 2

* Server 2: Shard 2, Replica 1

 

Question:

But in SolrCloud, replicas are active, right? So isn't it now possible that
the load balancer will have Server 1 process *both* parts of a request?
After all, it has both shards due to the replication.
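
(For reference, the layout above corresponds to a collection created roughly
like this via the Collections API; the collection name, host, and port are
placeholders:

    curl 'http://server1:8983/solr/admin/collections?action=CREATE&name=products&numShards=2&replicationFactor=2&maxShardsPerNode=2'
)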



RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
But my concern is this, when we have just 2 servers:
 - I want 1 to be able to take over in case the other fails, as you point
out.
 - But when *both* servers are up I don't want the SolrCloud load balancer
to have Shard1 and Replica2 do the work (as they would both reside on the
same physical server).

Does that make sense? I want *both* server1 & server2 sharing the processing
of every request, *and* I want the failover capability.

I'm probably missing some bit of logic here, but I want to be sure I
understand the architecture.

Dave



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Thursday, April 18, 2013 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Correct. This is what you want if server 2 goes down.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:11 AM, "David Parks"  wrote:

> Step 1: distribute processing
>
> We have 2 servers in which we'll run 2 SolrCloud instances on.
>
> We'll define 2 shards so that both servers are busy for each request 
> (improving response time of the request).
>
>
>
> Step 2: Failover
>
> We would now like to ensure that if either of the servers goes down 
> (we're very unlucky with disks), that the other will be able to take 
> over automatically.
>
> So we define 2 shards with a replication factor of 2.
>
>
>
> So we have:
>
> . Server 1: Shard 1, Replica 2
>
> . Server 2: Shard 2, Replica 1
>
>
>
> Question:
>
> But in SolrCloud, replicas are active right? So isn't it now possible 
> that the load balancer will have Server 1 process *both* parts of a 
> request, after all, it has both shards due to the replication, right?
>
>



RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
I think I still don't understand something here. 

My concern right now is that query times are very slow for the 120GB index
(14s on average), and I've seen a lot of disk activity when running queries.

I'm hoping that distributing the query across 2 servers is going to improve
the query time; specifically, I'm hoping that we can distribute that disk
activity, because we don't have great disks on there (yet).

So, with disk I/O being a factor, running the query on one box vs. across 2
*should* make a difference, right?

Admittedly, this is the first step in what will probably be many to try to
work our query times down from 14s to what I want to be around 1s.

Dave


-Original Message-
From: Timothy Potter [mailto:thelabd...@gmail.com] 
Sent: Thursday, April 18, 2013 9:16 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Hi Dave,

This sounds more like a budget / deployment issue vs. anything
architectural. You want 2 shards with replication so you either need
sufficient capacity on each of your 2 servers to host 2 Solr instances or
you need 4 servers. You need to avoid starving Solr of necessary RAM, disk
performance, and CPU regardless of how you lay out the cluster otherwise
performance will suffer. My guess is if each Solr had sufficient resources,
you wouldn't actually notice much difference in query performance.

Tim


On Thu, Apr 18, 2013 at 8:03 AM, David Parks  wrote:

> But my concern is this, when we have just 2 servers:
>  - I want 1 to be able to take over in case the other fails, as you 
> point out.
>  - But when *both* servers are up I don't want the SolrCloud load 
> balancer to have Shard1 and Replica2 do the work (as they would both 
> reside on the same physical server).
>
> Does that make sense? I want *both* server1 & server2 sharing the 
> processing of every request, *and* I want the failover capability.
>
> I'm probably missing some bit of logic here, but I want to be sure I 
> understand the architecture.
>
> Dave
>
>
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Thursday, April 18, 2013 8:13 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud loadbalancing, replication, and failover
>
> Correct. This is what you want if server 2 goes down.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Apr 18, 2013 3:11 AM, "David Parks"  wrote:
>
> > Step 1: distribute processing
> >
> > We have 2 servers in which we'll run 2 SolrCloud instances on.
> >
> > We'll define 2 shards so that both servers are busy for each request 
> > (improving response time of the request).
> >
> >
> >
> > Step 2: Failover
> >
> > We would now like to ensure that if either of the servers goes down 
> > (we're very unlucky with disks), that the other will be able to take 
> > over automatically.
> >
> > So we define 2 shards with a replication factor of 2.
> >
> >
> >
> > So we have:
> >
> > . Server 1: Shard 1, Replica 2
> >
> > . Server 2: Shard 2, Replica 1
> >
> >
> >
> > Question:
> >
> > But in SolrCloud, replicas are active right? So isn't it now 
> > possible that the load balancer will have Server 1 process *both* 
> > parts of a request, after all, it has both shards due to the
replication, right?
> >
> >
>
>



RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Wow! That was the most pointed, concise discussion of hardware requirements
I've seen to date, and it's fabulously helpful, thank you Shawn!  We
currently have 2 servers on which I can dedicate about 12GB of RAM to Solr
(we're moving to these 2 servers now). I can upgrade further if it's needed
& justified, and your discussion helps me justify that such an upgrade is
the right thing to do.

So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I should
be in the free and clear then right?  This seems reasonable and doable.

In this more extreme example the failover properties of SolrCloud become
clearer. I couldn't possibly run a replica shard without doubling the
memory, so replication really isn't reasonable until I have double the
hardware; then the load balancing scheme makes perfect sense. With 3
servers, 50GB of RAM, and a 120GB index, I should just back up the index
directory I think.

My previous thought to run replication just for failover would have actually
resulted in LOWER performance, because I would have halved the memory
available to the master & replica. So the previous question is answered as
well now.

Question: if I had 1 server with 60GB of memory and a 120GB index, would
Solr make full use of the 60GB of memory, thus trimming disk access in half?
Or is it an all-or-nothing thing? In a dev environment, I didn't notice Solr
consuming the full 5GB of RAM assigned to it with a 120GB index.

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 11:51 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/18/2013 8:12 PM, David Parks wrote:
> I think I still don't understand something here. 
> 
> My concern right now is that query times are very slow for 120GB index 
> (14s on avg), I've seen a lot of disk activity when running queries.
> 
> I'm hoping that distributing that query across 2 servers is going to 
> improve the query time, specifically I'm hoping that we can distribute 
> that disk activity because we don't have great disks on there (yet).
> 
> So, with disk IO being a factor in mind, running the query on one box, vs.
> across 2 *should* be a concern right?
> 
> Admittedly, this is the first step in what will probably be many to 
> try to work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about
your setup.  One thing that I can't seem to find is a mention of how much
total RAM is in each of your servers.  I apologize if it was actually there
and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or
IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is the
slowest piece of the puzzle. The way to get good performance out of Solr is
to have enough memory that you can take the disk mostly out of the equation
by having the operating system cache the index in RAM.  If you don't have
enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy
in iowait, unable to do much real work.  If you DO have enough RAM to cache
all (or most) of your index, then Solr will be CPU-bound.

With 120GB of total index data on each server, you would want at least 128GB
of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and
that Solr is the only thing running on the machine.  If you have more
servers and shards, you can reduce the per-server memory requirement because
the amount of index data on each server would go down.  I am aware of the
cost associated with this kind of requirement - each of my Solr servers has
64GB.

If you are sharing the server with another program, then you want to have
enough RAM available for Solr's heap, Solr's data, the other program's heap,
and the other program's data.  Some programs (like
MySQL) completely skip the OS disk cache and instead do that caching
themselves with heap memory that's actually allocated to the program.
If you're using a program like that, then you wouldn't need to count its
data.

Using SSDs for storage can speed things up dramatically and may reduce the
total memory requirement to some degree, but even an SSD is slower than RAM.
The transfer speed of RAM is faster, and from what I understand, the latency
is at least an order of magnitude quicker - nanoseconds vs microseconds.

In another thread, you asked about how Google gets such good response times.
Although Google's software probably works differently than Solr/Lucene, when
it comes right down to it, all search engines do similar jobs and have
similar requirements.  I would imagine that Google gets incredible response
time because they have incredible amounts of RAM at their disposal that keep
the important bits of their index instantly availabl

RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Interesting. I'm trying to correlate this new understanding with what I see
on my servers. I've got one server with 5GB dedicated to Solr; the Solr
dashboard actually reports a 167GB index.

When I do many typical queries I see between 3MB and 9MB of disk reads
(watching iostat).

But solr's dashboard only shows 710MB of memory in use (this box has had
many hundreds of queries put through it, and has been up for 1 week). That
doesn't quite correlate with my understanding that Solr would cache the
index as much as possible. 

Should I be thinking that things aren't configured correctly here?

Dave


-Original Message-
From: John Nielsen [mailto:j...@mcb.dk] 
Sent: Friday, April 19, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Well, to consume 120GB of RAM with a 120GB index, you would have to query
over every single GB of data.

If you only actually query over, say, 500MB of the 120GB data in your dev
environment, you would only use 500MB worth of RAM for caching. Not 120GB


On Fri, Apr 19, 2013 at 7:55 AM, David Parks  wrote:

> Wow! That was the most pointed, concise discussion of hardware 
> requirements I've seen to date, and it's fabulously helpful, thank you 
> Shawn!  We currently have 2 servers that I can dedicate about 12GB of 
> ram to Solr on (we're moving to these 2 servers now). I can upgrade 
> further if it's needed & justified, and your discussion helps me 
> justify that such an upgrade is the right thing to do.
>
> So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I 
> should be in the free and clear then right?  This seems reasonable and 
> doable.
>
> In this more extreme example the failover properties of solr cloud 
> become more clear. I couldn't possibly run a replica shard without 
> doubling the memory, so really replication isn't reasonable until I 
> have double the hardware, then the load balancing scheme makes perfect 
> sense. With 3 servers, 50GB of RAM and 120GB index I should just 
> backup the index directory I think.
>
> My previous though to run replication just for failover would have 
> actually resulted in LOWER performance because I would have halved the 
> memory available to the master & replica. So the previous question is 
> answered as well now.
>
> Question: if I had 1 server with 60GB of memory and 120GB index, would 
> solr make full use of the 60GB of memory? Thus trimming disk access in 
> half. Or is it an all-or-nothing thing?  In a dev environment, I 
> didn't notice SOLR consuming the full 5GB of RAM assigned to it with a
120GB index.
>
> Dave
>
>
> -Original Message-
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Friday, April 19, 2013 11:51 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud loadbalancing, replication, and failover
>
> On 4/18/2013 8:12 PM, David Parks wrote:
> > I think I still don't understand something here.
> >
> > My concern right now is that query times are very slow for 120GB 
> > index (14s on avg), I've seen a lot of disk activity when running
queries.
> >
> > I'm hoping that distributing that query across 2 servers is going to 
> > improve the query time, specifically I'm hoping that we can 
> > distribute that disk activity because we don't have great disks on there
(yet).
> >
> > So, with disk IO being a factor in mind, running the query on one 
> > box,
> vs.
> > across 2 *should* be a concern right?
> >
> > Admittedly, this is the first step in what will probably be many to 
> > try to work our query times down from 14s to what I want to be around
1s.
>
> I went through my mailing list archive to see what all you've said 
> about your setup.  One thing that I can't seem to find is a mention of 
> how much total RAM is in each of your servers.  I apologize if it was 
> actually there and I overlooked it.
>
> In one email thread, you wanted to know whether Solr is CPU-bound or 
> IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O 
> is the slowest piece of the puzzle. The way to get good performance 
> out of Solr is to have enough memory that you can take the disk mostly 
> out of the equation by having the operating system cache the index in 
> RAM.  If you don't have enough RAM for that, then Solr becomes 
> IO-bound, and your CPUs will be busy in iowait, unable to do much real 
> work.  If you DO have enough RAM to cache all (or most) of your index, 
> then Solr will be CPU-bound.
>
> With 120GB of total index data on each server, you would want at least 
> 128GB of RAM per server, assuming you are only giving 8-16GB of RAM 

RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Ok, I understand better now.

The physical memory is 90% utilized (21.18GB of 23.54GB). Solr has a dark
grey allocation of 602MB and light grey of an additional 108MB, for a JVM
total of 710MB allocated. If I understand correctly, Solr memory utilization
is *not* for caching (unless I configured document caches or some of the
other cache options in Solr, which don't seem to apply in this case and
which I haven't altered from their defaults).

So assuming this box was dedicated to 1 solr instance/shard. What JVM heap
should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure
the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server
instance).

Would I be wise to just put the index on a RAM disk and guarantee
performance?  Assuming I installed sufficient RAM?

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 4:19 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 2:15 AM, David Parks wrote:
> Interesting. I'm trying to correlate this new understanding to what I 
> see on my servers.  I've got one server with 5GB dedicated to solr, 
> solr dashboard reports a 167GB index actually.
> 
> When I do many typical queries I see between 3MB and 9MB of disk reads 
> (watching iostat).
> 
> But solr's dashboard only shows 710MB of memory in use (this box has 
> had many hundreds of queries put through it, and has been up for 1 
> week). That doesn't quite correlate with my understanding that Solr 
> would cache the index as much as possible.

There are two memory sections on the dashboard.  The one at the top shows
the operating system view of physical memory.  That is probably showing
virtually all of it in use.  Most UNIX platforms will show you the same info
with 'top' or 'free'.  Some of them, like Solaris, require different tools.
I assume you're not using Windows, because you mention iostat.

The other memory section is for the JVM, and that only covers the memory
used by Solr.  The dark grey section is the amount of Java heap memory
currently utilized by Solr and its servlet container.  The light grey
section represents the memory that the JVM has allocated from system memory.
If any part of that bar is white, then Java has not yet requested the
maximum configured heap.  Typically a long-running Solr install will have
only dark and light grey, no white.

The operating system is what caches your index, not Solr.  The bulk of your
RAM should be unallocated.  With your index size, the OS will use all
unallocated RAM for the disk cache.  If a program requests some of that RAM,
the OS will instantly give it up.
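
On Linux you can watch this directly; for example (a sketch):

    # the "cached" column is the OS page cache that holds index data; the
    # "-/+ buffers/cache" line shows how much memory applications really use
    free -m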

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Wow, thank you for those benchmarks Toke, that really gives me some firm 
footing to stand on in knowing what to expect and thinking out which path to 
venture down. It's tremendously appreciated!

Dave


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: Friday, April 19, 2013 5:17 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
> Using SSDs for storage can speed things up dramatically and may reduce 
> the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our clear 
experience that "to some degree" should be replaced with "very much" in the 
above.

Our current SSD-equipped servers each hold a total of 127GB of index data
spread over 3 instances. The machines each have 16GB of RAM, of which about
7GB is left for disk cache.

"We" are the State and University Library, Denmark and our search engine is the 
primary (and arguably only) way to locate resources for our users. The average 
raw search time is 32ms for non-faceted queries and 616ms for heavy faceted 
(which is much too slow. Dang! I thought I fixed that).

>  but even an SSD is slower than RAM.  The transfer speed of RAM is 
> faster, and from what I understand, the latency is at least an order 
> of magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the fastest CPU 
possible, as it will be, well, faster than the slower ones.

The question is almost never to get the fastest possible, but to get a good 
price/performance tradeoff. I would argue that SSDs fit that bill very well for 
a great deal of the "My search is too slow"-threads that are spun on this 
mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Again, thank you for this incredible information, I feel on much firmer
footing now. I'm going to test distributing this across 10 servers,
borrowing a Hadoop cluster temporarily, and see how it does with enough
memory to have the whole index cached. But I'm thinking that we'll try the
SSD route, as our index will probably rest in the 1/2 terabyte range
eventually; there's still a lot of active development.

I guess the RAM disk would work in our case also, as we only index in
batches, and eventually I'd like to do that off of Solr and just update the
index (I'm presuming this is doable in solr cloud, but I haven't put it to
task yet). If I could purpose Hadoop to index the shards, that would be
ideal, though I haven't quite figured out how to go about it yet.

David


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 9:42 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 3:48 AM, David Parks wrote:
> The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has 
> dark grey allocation of 602MB, and light grey of an additional 108MB, 
> for a JVM total of 710MB allocated. If I understand correctly, Solr 
> memory utilization is
> *not* for caching (unless I configured document caches or some of the 
> other cache options in Solr, which don't seem to apply in this case, 
> and I haven't altered from their defaults).

Right.  Solr does have caches, but they serve specific purposes.  The OS is
much better at general large-scale caching than Solr is.  Solr caches get
cleared (and possibly re-warmed) whenever you issue a commit on your index
that makes new documents visible.

> So assuming this box was dedicated to 1 solr instance/shard. What JVM 
> heap should I set? Does that matter? 24GB JVM heap? Or keep it lower 
> and ensure the OS cache has plenty of room to operate? (this is an 
> Ubuntu 12.10 server instance).

The JVM heap to use is highly dependent on the nature of your queries, the
number of documents, the number of unique terms, etc.  The best thing to do
is try it out with a relatively large heap, see how much memory actually
gets used inside the JVM.  The jvisualvm and jconsole tools will give you
nice graphs of JVM memory usage.  The jstat program will give you raw
numbers on the commandline that you'll need to add to get the full picture.
Due to the garbage collection model that Java uses, what you'll see is a
sawtooth pattern - memory usage goes up to max heap, then garbage collection
reduces it to the actual memory used.
 Generally speaking, you want to have more heap available than the "low"
point of that sawtooth pattern.  If that low point is around 3GB when you
are hitting your index hard with queries and updates, then you would want to
give Solr a heap of 4 to 6 GB.
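
For example, something along these lines (the pid is a placeholder for the
Solr process id):

    # print JVM heap/GC utilization every 5 seconds; the low point of the
    # old-generation (O) column after each collection approximates the heap
    # Solr actually needs
    jstat -gcutil <solr-pid> 5000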

> Would I be wise to just put the index on a RAM disk and guarantee 
> performance?  Assuming I installed sufficient RAM?

A RAM disk is a very good way to guarantee performance - but RAM disks are
ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
reindex.  Also remember that you actually need at *least* twice the size of
your index so that Solr (Lucene) has enough room to do merges, and the
worst-case scenario is *three* times the index size.  Merging happens during
normal indexing, not just when you optimize.  If you have enough RAM for
three times your index size and it takes less than an hour or two to rebuild
the index, then a RAM disk might be a viable way to go.  I suspect that this
won't work for you.

Thanks,
Shawn



Bug? JSON output changes when switching to solr cloud

2013-04-21 Thread David Parks
We just took an installation of 4.1 which was working fine and changed it to
run as solr cloud. We encountered the most incredibly bizarre apparent bug:

In the JSON output, a colon ':' changed to a comma ',', which of course
broke the JSON parser.  I'm guessing I should file this as a bug, but it was
so odd I thought I'd post here before doing so. Demo below:

Here is a query on our previous single-server instance:

Query:
--
http://10.1.3.28:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalog_name
&start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&group.li
mit=50

Response:
-
{"responseHeader":{"status":0,"QTime":15714,"params":{"fl":"score,id,unique_
catalog_name","start":"0","q":"book","group.limit":"50","group.field":"uniqu
e_catalog_name","group":"true","wt":"json","rows":"50"}},"grouped":{"unique_
catalog_name":{"matches":106711214,"groups":[{"groupValue":"ls:2653","doclis
t":{"numFound":103981882,"start":0,"maxScore":4.7039795,"docs":[{"id":"10055
02088784","score":4.7039795},{"id":"1005500291075","score":4.7039795},{"id":
"1000810546074","score":4.7039795},{"id":"1000611003270","score":4.7039795},

Note this part:
--
  {"unique_catalog_name":{"matches":



Now we run that same query on a server that was derived from the same build,
just configuration changes to run it in distributed "solr cloud" mode.

Query:
-
http://10.1.3.18:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalog_name
&start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&group.li
mit=50

Response:
-{"responseHeader":{"status":0,"QTime":8855,"params":{"fl":"scor
e,id,unique_catalog_name","start":"0","q":"book","group.limit":"50","group.f
ield":"unique_catalog_name","group":"true","wt":"json","rows":"50"}},"groupe
d":["unique_catalog_name",{"matches":106711214,"groups":[{"groupValue":"ls:2
653","doclist":{"numFound":103981882,"start":0,"maxScore":4.7042913,"docs":[
{"id":"1005502088784","score":4.7042913},{"id":"1000611003270","score":4.704
2913},{"id":"1005500291075","score":4.703668},{"id":"1000810546074","score":
4.703668},

Note how it's changed:

  "unique_catalog_name",{"matches":






RE: Bug? JSON output changes when switching to solr cloud

2013-04-22 Thread David Parks
Thanks Yonik! That was fast!
We switched over to XML for the moment and will switch back to JSON when 4.3
comes out.
Dave


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, April 22, 2013 8:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Bug? JSON output changes when switching to solr cloud

Thanks David,

I've confirmed this is still a problem in trunk and opened
https://issues.apache.org/jira/browse/SOLR-4746

-Yonik
http://lucidworks.com


On Sun, Apr 21, 2013 at 11:16 PM, David Parks 
wrote:
> We just took an installation of 4.1 which was working fine and changed 
> it to run as solr cloud. We encountered the most incredibly bizarre
apparent bug:
>
> In the JSON output, a colon ':' changed to a comma ',', which of 
> course broke the JSON parser.  I'm guessing I should file this as a 
> bug, but it was so odd I thought I'd post here before doing so. Demo
below:
>
> Here is a query on our previous single-server instance:
>
> Query:
> --
> http://10.1.3.28:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalo
> g_name 
> &start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&gr
> oup.li
> mit=50
>
> Response:
> -
> {"responseHeader":{"status":0,"QTime":15714,"params":{"fl":"score,id,u
> nique_ 
> catalog_name","start":"0","q":"book","group.limit":"50","group.field":
> "uniqu 
> e_catalog_name","group":"true","wt":"json","rows":"50"}},"grouped":{"u
> nique_ 
> catalog_name":{"matches":106711214,"groups":[{"groupValue":"ls:2653","
> doclis
> t":{"numFound":103981882,"start":0,"maxScore":4.7039795,"docs":[{"id":
> "10055
>
02088784","score":4.7039795},{"id":"1005500291075","score":4.7039795},{"id":
> "1000810546074","score":4.7039795},{"id":"1000611003270","score":4.703
> 9795},
>
> Note this part:
> --
>   {"unique_catalog_name":{"matches":
>
>
>
> Now we run that same query on a server that was derived from the same 
> build, just configuration changes to run it in distributed "solr cloud"
mode.
>
> Query:
> -
> http://10.1.3.18:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalo
> g_name 
> &start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&gr
> oup.li
> mit=50
>
> Response:
> -{"responseHeader":{"status":0,"QTime":8855,"params":{"fl"
> :"scor 
> e,id,unique_catalog_name","start":"0","q":"book","group.limit":"50","g
> roup.f 
> ield":"unique_catalog_name","group":"true","wt":"json","rows":"50"}},"
> groupe
> d":["unique_catalog_name",{"matches":106711214,"groups":[{"groupValue"
> :"ls:2 
> 653","doclist":{"numFound":103981882,"start":0,"maxScore":4.7042913,"d
> ocs":[
> {"id":"1005502088784","score":4.7042913},{"id":"1000611003270","score"
> :4.704
>
2913},{"id":"1005500291075","score":4.703668},{"id":"1000810546074","score":
> 4.703668},
>
> Note how it's changed:
> 
>   "unique_catalog_name",{"matches":
>
>
>
>



Indexing off of the production servers

2013-05-06 Thread David Parks
I've had trouble figuring out what options exist if I want to perform all
indexing off of the production servers (I'd like to keep them only for user
queries).

 

We index data in batches roughly daily; ideally I'd index all solr cloud
shards offline, then move the final index files to the solr cloud instance
that needs them, flip a switch, and have it use the new index.

 

Is this possible via either:

1.   Doing the indexing in Hadoop?? (this would be ideal as we have a
significant investment in a hadoop cluster already), or

2.   Maintaining a separate "master" server that handles indexing and
the nodes that receive user queries update their index from there (I seem to
recall reading about this configuration in 3.x, but now we're using solr
cloud)

 

Is there some ideal solution I can use to "protect" the production solr
instances from degraded performance during large index processing periods?

 

Thanks!

David



RE: Indexing off of the production servers

2013-05-06 Thread David Parks
I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) as I am with just off-loading
the whole indexing process. We may just want to re-index the whole thing to
add some index time boosts or whatever else we conjure up to make queries
faster and better quality. We're doing a lot of work on optimization right
now.

To re-index the whole thing is a 5-10 hour process for us, so when we move
some update to production that requires full re-indexing (every week or so),
right now we're just re-building new instances of solr to handle the
re-indexing and then copying the final VMs to the production environment
(slow process). I'm leery of letting a heavy duty full re-index process
loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this now,
though. I thought I had heard of a master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that had
carried over into solr cloud. Erick might be right that it's not worth the
effort if there isn't some existing strategy.

Dave


-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing parallelizes only as much as the number
of leaders you have in your SolrCloud, doesn't it?

2013/5/6 Erick Erickson 

> The only problem with using Hadoop (or whatever) is that you need to 
> be sure that documents end up on the same shard, which means that you 
> have to use the same routing mechanism that SolrCloud uses. The custom 
> doc routing may help here
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just
> do the indexing during off hours. Or... have you measured an indexing
> degradation during your heavy indexing? Indexing has costs, no
> question, but it's worth asking whether the costs are heavy enough to
> be worth the bother.
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
> wrote:
> > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
> > use Map/Reduce jobs you split your workload, process it, and then the
> > reduce step takes over. Let me explain the new SolrCloud
> > architecture. You start your SolrCloud with a numShards parameter.
> > Let's assume that you have 5 shards. Then you will have 5 leaders in
> > your SolrCloud. These leaders
> > will be responsible for indexing your data. It means that your
> > indexing workload will be divided into 5, so you have
> > parallelized your work much like Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> > They will be added as replicas for each shard. Then you will have 5
> > shards, each with a leader and 2 replicas. When you
> > send a query to a SolrCloud, every replica will help with
> > searching, and if you
> > add more replicas to your SolrCloud your search performance will
> > improve.
> >
> >
> > 2013/5/6 David Parks 
> >
> >> I've had trouble figuring out what options exist if I want to 
> >> perform
> all
> >> indexing off of the production servers (I'd like to keep them only 
> >> for
> user
> >> queries).
> >>
> >>
> >>
> >> We index data in batches roughly daily, ideally I'd index all solr 
> >> cloud shards offline, then move the final index files to the solr 
> >> cloud
> instance
> >> that needs it and flip a switch and have it use the new index.
> >>
> >>
> >>
> >> Is this possible via either:
> >>
> >> 1.   Doing the indexing in Hadoop?? (this would be ideal as we have
> a
> >> significant investment in a hadoop cluster already), or
> >>
> >> 2.   Maintaining a separate "master" server that handles indexing
> and
> >> the nodes that receive user queries update their index from there 
> >> (I
> seem
> >> to
> >> recall reading about this configuration in 3.x, but now we're using 
> >> solr
> >> cloud)
> >>
> >>
> >>
> >> Is there some ideal solution I can use to "protect" the production 
> >> solr instances from degraded performance during large index 
> >> processing
> periods?
> >>
> >>
> >>
> >> Thanks!
> >>
> >> David
> >>
> >>
>



RE: Indexing off of the production servers

2013-05-06 Thread David Parks
So, am I following this correctly by saying that, this proposed solution
would present us a way to index a collection on an offline/dev solr cloud
instance and *move* that pre-prepared index to the production server using
an alias/rename trick?

That seems like a reasonably doable solution. I also wonder how much work it
is to build the shards programmatically (e.g. directly in a hadoop/java
environment), cutting out the extra step of needing another solr instance
running on a staging environment somewhere. Then using this technique to
swap in the shards.
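A rough sketch of that swap using the Collections API alias trick discussed
below (host and collection names are hypothetical):

  # index offline into a fresh collection, e.g. products_2, then re-point the alias:
  curl 'http://solr-host:8080/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_2'
  # queries against the 'products' alias now hit products_2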

I might do something like this first and then look into simplifying, and
further automating, later on. And if it is indeed possible to build a hadoop
driver for indexing, I think that would be a useful tool for the community
at large. So I'm still curious about it, at least as a thought exercise, if
nothing else.

Thanks,
Dave


-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, May 06, 2013 9:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

Thanks for your answer. I have read this somewhere:

I believe "redirect" from replica to leader would happen only at index time,
so a doc first gets indexed to leader and from there it's replicated to
non-leader shards.

Is that true? I want to make things clear in my mind; otherwise I will
ask a separate question about what happens for indexing and querying in
SolrCloud.

2013/5/6 Shawn Heisey 

> On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> > Excellent idea !
> > And it is possible to use collection aliasing with the CREATEALIAS 
> > to make this transparent for the query side.
> >
> > ex. with 2 collections named :
> > collection_1
> > collection_2
> >
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> >
> > "collectionalias" is now a virtual collection pointing to collection_1.
> >
> > Index on collection_2, then :
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> >
> > "collectionalias" now is an alias to collection_2.
> >
> >
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>
> Awesome idea, Andre! I was wondering whether you might have to delete 
> the original alias before creating the new one, but a quick look at 
> the issue for collection aliasing shows that this isn't the case.
>
> https://issues.apache.org/jira/browse/SOLR-4497
>
> The wiki doesn't mention the DELETEALIAS action.  I won't have time 
> right now to update the wiki.
>
> Thanks,
> Shawn
>
>



RE: Solr Cloud with large synonyms.txt

2013-05-06 Thread David Parks
Wouldn't it make more sense to only store a pointer to a synonyms file in
zookeeper? Maybe just make the synonyms file accessible via http so other
boxes can copy it if needed? Zookeeper was never meant for storing
significant amounts of data.
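For reference, the workaround discussed below is to raise jute.maxbuffer on
every ZooKeeper server *and* on the Solr JVMs; a sketch (the 10MB value is
illustrative):

  # ZooKeeper side, e.g. via JVMFLAGS before starting zkServer.sh
  export JVMFLAGS="-Djute.maxbuffer=10485760"
  # Solr side
  java -Djute.maxbuffer=10485760 -Dsolr.solr.home=/opt/solr -jar start.jar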


-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Tuesday, May 07, 2013 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud with large synonyms.txt

See discussion here
http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html

One idea was compression. Perhaps if we add gzip support to SynonymFilter it
can read synonyms.txt.gz which would then fit larger raw dicts?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

6. mai 2013 kl. 18:32 skrev Son Nguyen :

> Hello,
> 
> I'm building a Solr Cloud (version 4.1.0) with 2 shards and a Zookeeper
> (the Zookeeper is on a different machine, version 3.4.5).
> I've tried to start with a 1.7MB synonyms.txt, but got a "ConnectionLossException":
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>     at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267)
>     at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436)
>     at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315)
>     at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135)
>     at org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955)
>     at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285)
>     ... 43 more
> 
> I did some research on the internet and found out that it is because the
> Zookeeper znode size limit is 1MB. I tried to increase the system property
> "jute.maxbuffer" but it didn't work.
> Does anyone have experience of dealing with it?
> 
> Thanks,
> Son



RE: Solr Cloud with large synonyms.txt

2013-05-08 Thread David Parks
I can see your point, though I think edge cases would be one concern: if
someone *can* create a very large synonyms file, someone *will* create that
file. What would you set the zookeeper max data size to be? 50MB? 100MB?
Someone is going to do something bad if there's nothing to tell them not to.
Today solr cloud just crashes if you try to create a modest-sized synonyms
file; clearly, at a minimum, some zookeeper settings should be configured out
of the box. Any reasonable setting you come up with for zookeeper is
virtually guaranteed to fail for some percentage of users over a reasonably
sized user base (which solr has).

What if I plugged in a 200MB synonyms file just for testing purposes (I
don't care about performance implications)?  I don't think most users would
catch the footnote in the docs that calls out a max synonyms file size.

Dave


-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Tuesday, May 07, 2013 11:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud with large synonyms.txt

I'm not so worried about the large file in zk issue myself.

The concern is that you start storing and accessing lots of large files in
ZK. This is not what it was made for, and everything stays in RAM, so they
guard against this type of usage.

We are talking about a config file that is loaded on Core load though. It's
uploaded and read very rarely. On modern hardware and networks, making that
file 5MB rather than 1MB is not going to ruin your day. It just won't. Solr
does not use ZooKeeper heavily - in a steady state cluster, it doesn't read
or write from ZooKeeper at all to any degree that registers. I'm going to
have to see problems loading these larger config files from ZooKeeper before
I'm worried that it's a problem.

- Mark

On May 7, 2013, at 12:21 PM, Son Nguyen  wrote:

> Mark,
> 
> I tried to set that property on both ZK (I have only one ZK instance) and
Solr, but it still didn't work.
> But I read somewhere that ZK is not really designed for keeping large data
files, so this solution - increasing jute.maxbuffer (if I can implement it)
should be just temporary.
> 
> Son
> 
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com] 
> Sent: Tuesday, May 07, 2013 9:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cloud with large synonyms.txt
> 
> 
> On May 7, 2013, at 10:24 AM, Mark Miller  wrote:
> 
>> 
>> On May 6, 2013, at 12:32 PM, Son Nguyen  wrote:
>> 
>>> I did some researches on internet and found out that because Zookeeper
znode size limit is 1MB. I tried to increase the system property
"jute.maxbuffer" but it won't work.
>>> Does anyone have experience of dealing with it?
>> 
>> Perhaps hit up the ZK list? They doc it as simply raising jute.maxbuffer,
though you have to do it for each ZK instance.
>> 
>> - Mark
>> 
> 
> "the system property must be set on all servers and clients otherwise
problems will arise."
> 
> Make sure you try passing it both to ZK *and* to Solr.
> 
> - Mark
> 



RE: More Like This and Caching

2013-05-09 Thread David Parks
I'm not the expert here, but perhaps what you're noticing is actually the
OS's disk cache. The actual solr index isn't cached by solr, but as you read
the blocks off disk the OS disk cache probably did cache those blocks for
you. On the 2nd run the index blocks were read out of memory.

There was a very extensive discussion on this list not long back titled:
"Re: SolrCloud loadbalancing, replication, and failover" look that thread up
and you'll get a lot of in-depth on the topic.

David


-Original Message-
From: Giammarco Schisani [mailto:giamma...@schisani.com] 
Sent: Thursday, May 09, 2013 2:59 PM
To: solr-user@lucene.apache.org
Subject: More Like This and Caching

Hi all,

Could anybody explain which Solr cache (e.g. queryResultCache,
documentCache, fieldCache, etc.) can be used by the More Like This handler?

One of my colleagues had previously suggested that the More Like This
handler does not take advantage of any of the Solr caches.

However, if I issue two identical MLT requests to the same Solr instance,
the second request will execute much faster than the first request (for
example, the first request will execute in 200ms and the second request will
execute in 20ms). This makes me believe that at least one of the Solr caches
is being used by the More Like This handler.

I think the "documentCache" is the cache that is most likely being used, but
would you be able to confirm?

As information, I am currently using Solr version 3.6.1.

Kind regards,
Giammarco Schisani



RE: Is the CoreAdmin RENAME method atomic?

2013-05-09 Thread David Parks
Find the discussion titled "Indexing off of the production servers" from just a
week ago in this same forum; there is a significant discussion of this feature
that you will probably want to review.


-Original Message-
From: Lan [mailto:dung@gmail.com] 
Sent: Friday, May 10, 2013 3:42 AM
To: solr-user@lucene.apache.org
Subject: Is the CoreAdmin RENAME method atomic?

We need to implement a locking mechanism for a full-reindexing SOLR server
pool. We could use a database or Zookeeper as our locking mechanism, but that's
a lot of work. Could solr do it?

I noticed the core admin RENAME function
(http://wiki.apache.org/solr/CoreAdmin#RENAME). Is this a synchronous atomic
operation?

What I'm thinking is we create a solr core named 'lock' and any process that
wants to obtain a solr server from the pool tries to rename the 'lock' core
to, say, 'lock.someuniqueid'. If it fails, it tries another server in the
pool or waits a bit. If it succeeds, it reindexes its data and then
renames 'lock.someuniqueid' back to 'lock' to return the server to the
pool.
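A sketch of what each rename attempt might look like over HTTP (host, port,
and the unique id are hypothetical):

  curl 'http://solr-host:8080/solr/admin/cores?action=RENAME&core=lock&other=lock.worker42'
  # ... reindex ...
  curl 'http://solr-host:8080/solr/admin/cores?action=RENAME&core=lock.worker42&other=lock'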









--
View this message in context:
http://lucene.472066.n3.nabble.com/Is-the-CoreAdmin-RENAME-method-atomic-tp4061944.html
Sent from the Solr - User mailing list archive at Nabble.com.



Boosting documents with terms derived from clustering - good idea?

2013-05-14 Thread David Parks
We have a number of queries that produce good results based on the textual
data, but are contextually wrong (for example, an "SSD hard drive" search
matches the music album "SSD hip hop drives us crazy").

 

Textually a fair match, but SSD is a term that strongly relates to technical
documents.

 

We'd like to be able to direct this query more strictly in the direction of
the technical documents based on the term "SSD". I am considering whether
it would be worth trying to cluster all documents, thus tending to group the
music with the music and the tech items with the tech items. Then I would pull
out the term vectors that define each group, do a human review of that data,
and plug it back into the documents of each cluster as a separate search field
that gets boosted.
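At query time, the boost might then look something like this (field names and
weights are hypothetical):

  http://solr-host:8080/solr/select?q=SSD+hard+drive&defType=edismax&qf=item_name^2+long_description+cluster_terms^5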

 

In my head it seems like a plausible way to weigh terms like SSD to the
cluster of items that it most closely associates.

 

Should I spend the effort to find out?

Yeh or neh?



MoreLikeThis supporting multiple document IDs as input?

2012-12-25 Thread David Parks
I'm unclear on this point from the documentation. Is it possible to give
Solr X # of document IDs and tell it that I want documents similar to those
X documents?

Example:

  - The user is browsing 5 different articles
  - I send Solr the IDs of these 5 articles so I can present the user other
similar articles

I see this example for sending it 1 document ID:
http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10

But can I send it 2+ document IDs as the query?



RE: MoreLikeThis supporting multiple document IDs as input?

2012-12-26 Thread David Parks
Someone else suggested this query: q=id:[1001 OR 1002], where the
numbers represent multiple IDs, but if I get it, you're saying that these
ultimately get turned into just one document and we get similar documents to
just that one.

MoreLikeThese sounds promising. Is this in one of the development builds, or is
it just an addon I need to install? I haven't done much customization of Solr
yet.

Thanks!
Dave


-Original Message-
From: Roman Chyla [mailto:roman.ch...@gmail.com] 
Sent: Wednesday, December 26, 2012 3:57 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis supporting multiple document IDs as input?

Jay Luker has written MoreLikeThese which is probably what you want. You may 
give it a try, though I am not sure if it works with Solr4.0 at this point (we 
didn't port it yet)

https://github.com/romanchyla/montysolr/blob/MLT/contrib/adsabs/src/java/org/apache/solr/handler/MoreLikeTheseHandler.java

roman

On Wed, Dec 26, 2012 at 12:06 AM, Jack Krupansky wrote:

> MLT has both a request handler and a search component.
>
> The MLT handler returns similar documents only for the first document 
> that the query matches.
>
> The MLT search component returns similar documents for each of the 
> documents in the search results, but processes each search result base 
> document one at a time and keeps its similar documents segregated by 
> each of the base documents.
>
> It sounds like you wanted to merge the base search results and then 
> find documents similar to that merged super-document. Is that what you 
> were really seeking, as opposed to what the MLT component does? 
> Unfortunately, you can't do that with the components as they are.
>
> You would have to manually merge the values from the base documents 
> and then you could POST that text back to the MLT handler and find 
> similar documents using the posted text rather than a query. Kind of 
> messy, but in theory that should work.
>
> -- Jack Krupansky
>
> -Original Message- From: David Parks
> Sent: Tuesday, December 25, 2012 5:04 AM
> To: solr-user@lucene.apache.org
> Subject: MoreLikeThis supporting multiple document IDs as input?
>
>
> I'm unclear on this point from the documentation. Is it possible to 
> give Solr X # of document IDs and tell it that I want documents 
> similar to those X documents?
>
> Example:
>
>  - The user is browsing 5 different articles
>  - I send Solr the IDs of these 5 articles so I can present the user 
> other similar articles
>
> I see this example for sending it 1 document ID:
> http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10
>
> But can I send it 2+ document IDs as the query?
>



RE: solr + jetty deployment issue

2012-12-27 Thread David Parks
Do you see any errors coming in on the console, stderr?

I start solr this way and redirect the stdout and stderr to log files; when
I have a problem, stderr generally has the answer:

java \
-server \
-Djetty.port=8080 \
-Dsolr.solr.home=/opt/solr \
-Dsolr.data.dir=/mnt/solr_data \
-jar /opt/solr/start.jar >/opt/solr/logs/stdout.log
2>/opt/solr/logs/stderr.log &



-Original Message-
From: Sushrut Bidwai [mailto:bidwai.sush...@gmail.com] 
Sent: Thursday, December 27, 2012 7:40 PM
To: solr-user@lucene.apache.org
Subject: solr + jetty deployment issue

Hi,

I am having trouble with getting solr + jetty to work. I am following all
instructions to the letter from http://wiki.apache.org/solr/SolrJetty. I
also created a work folder - /opt/solr/work. I am also setting tmpdir to a
new path in /etc/default/jetty. I am confirming the tmpdir is set to the
new path from the admin dashboard, under args.

It works like a charm. But when I restart jetty multiple times, after 3 or 4
such restarts it starts hanging. Admin pages just don't load and my app fails
to acquire a connection with solr.

What might I be missing? Should I rather be looking at my code to see if I am
not committing correctly?

Please let me know if you have faced similar issue in the past and how to
tackle it.

Thank you.

--
Best Regards,
Sushrut



MoreLikeThis only returns 1 result

2012-12-27 Thread David Parks
I'm doing a query like this for MoreLikeThis, sending it a document ID. But
the only result I ever get back is the document ID I sent it. The debug
response is below.

If I read it correctly, it's taking "id:1004401713626" as the term (not the
document ID) and only finding it once. But I want it to match the document
with ID 1004401713626 of course. I tried &q=id[1004410713626], but that
generates an exception:

Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse
'id:[1004401713626]': Encountered " "]" "] "" at line 1, column 17.
Was expecting one of:
"TO" ...
 ...
 ...

This must be easy, but the documentation is minimal.

My Query:
http://107.23.102.164:8080/solr/select/?qt=mlt&q=id:[1004401713626]&rows=10
  &mlt.fl=item_name,item_brand,short_description,long_description,catalog_names,categories,keywords,attributes,facetime
  &mlt.mintf=2&mlt.mindf=5&mlt.maxqt=100&mlt.boost=false&debugQuery=true


  
0
1

  5
  
item_name,item_brand,short_description,long_description,catalog_names,catego
ries,keywords,attributes,facetime

  false
  true
  id:1004401713626
  2
  100
  mlt
  10


  
0
1004401713626
  


  id:1004401713626
  id:1004401713626
  id:1004401713626
  id:1004401713626
  

18.29481 = (MATCH) fieldWeight(id:1004401713626 in 2843152), product of: 1.0
= tf(termFreq(id:1004401713626)=1) 18.29481 = idf(docFreq=1,
maxDocs=64873893) 1.0 = fieldNorm(field=id, doc=2843152)

  



RE: MoreLikeThis only returns 1 result

2012-12-27 Thread David Parks
Ok, that worked, I had the /mlt request handler misconfigured (forgot a
'/'). It's working now. Thanks!

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, December 28, 2012 11:38 AM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis only returns 1 result

Sounds like it is simply dispatching to the normal search request handler. 
Although you specified qt=mlt, make sure you enable the legacy select
handler dispatching in solrconfig.xml.

Change:

  <requestDispatcher handleSelect="false" >

to

  <requestDispatcher handleSelect="true" >

Or, simply address the MLT handler directly:

http://107.23.102.164:8080/solr/mlt?q=...

Or, use the MoreLikeThis search component:

http://localhost:8983/solr/select?q=...&mlt=true&...

See:
http://wiki.apache.org/solr/MoreLikeThis

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Thursday, December 27, 2012 9:59 PM
To: solr-user@lucene.apache.org
Subject: MoreLikeThis only returns 1 result

I'm doing a query like this for MoreLikeThis, sending it a document ID. But
the only result I ever get back is the document ID I sent it. The debug
response is below.

If I read it correctly, it's taking "id:1004401713626" as the term (not the
document ID) and only finding it once. But I want it to match the document
with ID 1004401713626 of course. I tried &q=id[1004410713626], but that
generates an exception:

Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse
'id:[1004401713626]': Encountered " "]" "] "" at line 1, column 17.
Was expecting one of:
"TO" ...
 ...
 ...

This must be easy, but the documentation is minimal.

My Query:
http://107.23.102.164:8080/solr/select/?qt=mlt&q=id:[1004401713626]&rows=10
  &mlt.fl=item_name,item_brand,short_description,long_description,catalog_names,categories,keywords,attributes,facetime
  &mlt.mintf=2&mlt.mindf=5&mlt.maxqt=100&mlt.boost=false&debugQuery=true


  
0
1

  5
  
item_name,item_brand,short_description,long_description,catalog_names,catego
ries,keywords,attributes,facetime

  false
  true
  id:1004401713626
  2
  100
  mlt
  10


  
0
1004401713626
  


  id:1004401713626
  id:1004401713626
  id:1004401713626
  id:1004401713626
  

18.29481 = (MATCH) fieldWeight(id:1004401713626 in 2843152), product of: 1.0
= tf(termFreq(id:1004401713626)=1) 18.29481 = idf(docFreq=1,
maxDocs=64873893) 1.0 = fieldNorm(field=id, doc=2843152) 
   



RE: MoreLikeThis supporting multiple document IDs as input?

2012-12-27 Thread David Parks
I'm somewhat new to Solr (it's running, I've been through the books, but I'm
no master). What I hear you say is that MLT *can* accept, say 5, documents
and provide results, but the results would essentially be the same as
running the query 5 times for each document?

If that's the case, I might accept it. I would just have to merge them
together at the end (perhaps I'd take the top 2 of each result, for
example).

Being somewhat new I'm a little confused by the difference between a "Search
Component" and a "Handler". I've got the /mlt handler working and I'm using
that. But how's that different from a "Search Component"? Is that referring
to the default /solr/select?q="..." style query?

And if what I said about multiple documents above is correct, what's the
syntax to try that out?

Thanks very much for the great help!
Dave


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, December 26, 2012 12:07 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis supporting multiple document IDs as input?

MLT has both a request handler and a search component.

The MLT handler returns similar documents only for the first document that
the query matches.

The MLT search component returns similar documents for each of the documents
in the search results, but processes each search result base document one at
a time and keeps its similar documents segregated by each of the base
documents.

It sounds like you wanted to merge the base search results and then find
documents similar to that merged super-document. Is that what you were
really seeking, as opposed to what the MLT component does? Unfortunately,
you can't do that with the components as they are.

You would have to manually merge the values from the base documents and then
you could POST that text back to the MLT handler and find similar documents
using the posted text rather than a query. Kind of messy, but in theory that
should work.
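A sketch of that manual-merge approach, assuming the /mlt handler is
configured and that stream.body is accepted by your request parsers (URL and
field names are hypothetical):

  curl 'http://solr-host:8080/solr/mlt?mlt.fl=item_name,long_description&mlt.mintf=1&rows=10' \
       --data-urlencode 'stream.body=merged text of the base documents goes here'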

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Tuesday, December 25, 2012 5:04 AM
To: solr-user@lucene.apache.org
Subject: MoreLikeThis supporting multiple document IDs as input?

I'm unclear on this point from the documentation. Is it possible to give
Solr X # of document IDs and tell it that I want documents similar to those
X documents?

Example:

  - The user is browsing 5 different articles
  - I send Solr the IDs of these 5 articles so I can present the user other
similar articles

I see this example for sending it 1 document ID:
http://localhost:8080/solr/select/?qt=mlt&q=id:[document
id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10

But can I send it 2+ document IDs as the query? 



RE: MoreLikeThis supporting multiple document IDs as input?

2012-12-27 Thread David Parks
So the Search Components are executed in series on _every_ request. I
presume then that they look at the request parameters and decide what and
whether to take action.

So in the case of the MLT component this was said:

> The MLT search component returns similar documents for each of the 
> documents in the search results, but processes each search result base 
> document one at a time and keeps its similar documents segregated by 
> each of the base documents.

So what I think I understand is that the Query Component (presumably this
guy: org.apache.solr.handler.component.QueryComponent) takes the input from
the "q" parameter and returns a result (the "q=id:123456" ensure that the
Query Component will return just this one document).

The MltComponent then looks at the result from the QueryComponent and
generates its results.

The part that is still confusing is understanding the difference between
these two comments:

 - The MLT search component returns similar documents for each of the
documents in the search results
 - The MLT handler returns similar documents only for the first document
that the query matches.



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Friday, December 28, 2012 1:26 PM
To: solr-user@lucene.apache.org
Subject: RE: MoreLikeThis supporting multiple document IDs as input?

Hi Dave,

Think of search components as a chain of Java classes that get executed
during each search request. If you open solrconfig.xml you will see how they
are defined and used.

HTH

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Dec 28, 2012 12:06 AM, "David Parks"  wrote:

> I'm somewhat new to Solr (it's running, I've been through the books, 
> but I'm no master). What I hear you say is that MLT *can* accept, say 
> 5, documents and provide results, but the results would essentially be 
> the same as running the query 5 times for each document?
>
> If that's the case, I might accept it. I would just have to merge them 
> together at the end (perhaps I'd take the top 2 of each result, for 
> example).
>
> Being somewhat new I'm a little confused by the difference between a 
> "Search Component" and a "Handler". I've got the /mlt handler working 
> and I'm using that. But how's that different from a "Search 
> Component"? Is that referring to the default /solr/select?q="..." 
> style query?
>
> And if what I said about multiple documents above is correct, what's 
> the syntax to try that out?
>
> Thanks very much for the great help!
> Dave
>
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Wednesday, December 26, 2012 12:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: MoreLikeThis supporting multiple document IDs as input?
>
> MLT has both a request handler and a search component.
>
> The MLT handler returns similar documents only for the first document 
> that the query matches.
>
> The MLT search component returns similar documents for each of the 
> documents in the search results, but processes each search result base 
> document one at a time and keeps its similar documents segregated by 
> each of the base documents.
>
> It sounds like you wanted to merge the base search results and then 
> find documents similar to that merged super-document. Is that what you 
> were really seeking, as opposed to what the MLT component does? 
> Unfortunately, you can't do that with the components as they are.
>
> You would have to manually merge the values from the base documents 
> and then you could POST that text back to the MLT handler and find 
> similar documents using the posted text rather than a query. Kind of 
> messy, but in theory that should work.
>
> -- Jack Krupansky
>
> -Original Message-
> From: David Parks
> Sent: Tuesday, December 25, 2012 5:04 AM
> To: solr-user@lucene.apache.org
> Subject: MoreLikeThis supporting multiple document IDs as input?
>
> I'm unclear on this point from the documentation. Is it possible to 
> give Solr X # of document IDs and tell it that I want documents 
> similar to those X documents?
>
> Example:
>
>   - The user is browsing 5 different articles
>   - I send Solr the IDs of these 5 articles so I can present the user 
> other similar articles
>
> I see this example for sending it 1 document ID:
> http://localhost:8080/solr/select/?qt=mlt&q=id:[document
> id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10
>
> But can I send it 2+ document IDs as the query?
>
>



What do I need to research to solve the problem of returning good results for a generic term?

2012-12-28 Thread David Parks
I'm sure this is a complex problem requiring many iterations of work, so I'm
just looking for pointers in the right direction of research here.

 

I have a base term, let's say "black dress", that I might search for.
Someone searching on this term is most logically looking for black dresses.
In my dataset I have black dresses, but I also have many CDs with the term
"black dress" in them (it's not such an uncommon song title).

 

I would want the CDs to show up if I search for a more specific term like
"black dress CD", but I would want the black dresses to show up for the less
specific term "black dress".

 

Google image search is excellent at handling this example. A pretty vanilla
installation of Solr isn't yet great at it.

 

So. just looking for a nudge in the right direction here. What should I go
read up on first to start learning how to improve on these results?

 

 



RE: MoreLikeThis supporting multiple document IDs as input?

2013-01-03 Thread David Parks
I'm not seeing the results I would expect. In the previous email below it's
stated that the "MLT search component" returns N results and K similar
documents per EACH of the N results.

If I'm not mistaken I access the "MLT search component" via a query to
/solr/select/?qt=mlt, such as this:

http://10.0.0.1:8080/solr/select/?qt=mlt&terms=true&q=shoes&rows=3

The query above for a simple term such as "shoes" can return many documents.
But I limited the results to 3, and I see 3 results, and the results don't
appear to me any different than doing this query:

http://107.23.102.164:8080/solr/select/?q=shoes&rows=3

So that suggests to me that solr maybe isn't handing things off to the MLT
component as expected (I don't know what results to expect so it's hard for
me to know where I'm trying to get to).

So add in a debugQuery=on parameter and I see this, possibly useful
reference:

LuceneQParser

It also appears that the MoreLikeThisComponent did indeed run



So maybe I should ask exactly what results I should be expecting here? 

Thanks very much!
David


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, December 28, 2012 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis supporting multiple document IDs as input?

Try a query that returns multiple results and you will see the difference.

MLT search component: n results, k similar documents per EACH of the n
results

MLT request handler: only FIRST result is examined, so only k similar
documents for that ONE (first) TOP search result.

Are you really saying that you don't comprehend what the difference is, or
simply that you don't LIKE the difference?! Or, maybe that you are wondering
WHY they are different? That latter question I don't have the answer to.

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Friday, December 28, 2012 2:48 AM
To: solr-user@lucene.apache.org
Subject: RE: MoreLikeThis supporting multiple document IDs as input?

So the Search Components are executed in series on _every_ request. I
presume then that they look at the request parameters and decide what and
whether to take action.

So in the case of the MLT component this was said:

> The MLT search component returns similar documents for each of the 
> documents in the search results, but processes each search result base 
> document one at a time and keeps its similar documents segregated by 
> each of the base documents.

So what I think I understand is that the Query Component (presumably this
guy: org.apache.solr.handler.component.QueryComponent) takes the input from
the "q" parameter and returns a result (the "q=id:123456" ensure that the
Query Component will return just this one document).

The MltComponent then looks at the result from the QueryComponent and
generates its results.

The part that is still confusing is understanding the difference between
these two comments:

- The MLT search component returns similar documents for each of the
documents in the search results
- The MLT handler returns similar documents only for the first document that
the query matches.



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
Sent: Friday, December 28, 2012 1:26 PM
To: solr-user@lucene.apache.org
Subject: RE: MoreLikeThis supporting multiple document IDs as input?

Hi Dave,

Think of search components as a chain of Java classes that get executed
during each search request. If you open solrconfig.xml you will see how they
are defined and used.

HTH

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Dec 28, 2012 12:06 AM, "David Parks"  wrote:

> I'm somewhat new to Solr (it's running, I've been through the books, 
> but I'm no master). What I hear you say is that MLT *can* accept, say 
> 5, documents and provide results, but the results would essentially be 
> the same as running the query 5 times for each document?
>
> If that's the case, I might accept it. I would just have to merge them 
> together at the end (perhaps I'd take the top 2 of each result, for 
> example).
>
> Being somewhat new I'm a little confused by the difference between a 
> "Search Component" and a "Handler". I've got the /mlt handler working 
> and I'm using that. But how's that different from a "Search 
> Component"? Is that referring to the default /solr/select?q="..."
> style query?
>
> And if what I said about multiple documents above is correct, what's 
> the syntax to try that out?
>
> Thanks very much for the great help!
> Dave
>
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Wednesday, December 26

RE: MoreLikeThis supporting multiple document IDs as input?

2013-01-04 Thread David Parks
Aha! &mlt=true, that was the key I hadn't worked out before (I thought it was
&qt=mlt that achieved that). Things are looking rosy now, and these results
are a perfect fit for my needs. Thanks very much for your time to help
explain this!!

David


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, January 03, 2013 8:46 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis supporting multiple document IDs as input?

The MLT search component is enabled using &mlt=true and works on any normal
Solr query. It gives a batch of similar documents for each search result of
the original query, one batch per original query result. It uses the
&mlt.count=n parameter to control how many similar results to return for
each original query result.
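For example (field names are hypothetical):

  http://localhost:8983/solr/select?q=shoes&mlt=true&mlt.fl=item_name,long_description&mlt.count=3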

The MLT request handler is a standalone request handler that does a query,
takes the first result, and then returns one batch of documents that are
similar to that one document. You have to configure the handler yourself,
but typically it would have the name "/mlt", so you would write:

http://10.0.0.1:8080/solr/mlt/?q=shoes&rows=3

It will show you both the single document from the original query and then
the batch of documents that are most similar to the top terms from that one
original document.

Add &debugQuery=true or &debug=query or &debug=results to see the terms that
are used in the secondary queries that find the similar documents.

There are a bunch of parameters that you have to tune for either approach.

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Thursday, January 03, 2013 4:11 AM
To: solr-user@lucene.apache.org
Subject: RE: MoreLikeThis supporting multiple document IDs as input?

I'm not seeing the results I would expect. In the previous email below it's
stated that the "MLT search component" returns N results and K similar
documents per EACH of the N results.

If I'm not mistaken I access the "MLT search component" via a query to
/solr/select/?qt=mlt, such as this:

http://10.0.0.1:8080/solr/select/?qt=mlt&terms=true&q=shoes&rows=3

The query above for a simple term such as "shoes" can return many documents.
But I limited the results to 3, and I see 3 results, and the results don't
appear to me any different than doing this query:

http://107.23.102.164:8080/solr/select/?q=shoes&rows=3

So that suggests to me that solr maybe isn't handing things off to the MLT
component as expected (I don't know what results to expect so it's hard for
me to know where I'm trying to get to).

So add in a debugQuery=on parameter and I see this, possibly useful
reference:

LuceneQParser

It also appears that the MoreLikeThisComponent did indeed run



So maybe I should ask exactly what results I should be expecting here?

Thanks very much!
David


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, December 28, 2012 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis supporting multiple document IDs as input?

Try a query that returns multiple results and you will see the difference.

MLT search component: n results, k similar documents per EACH of the n
results

MLT request handler: only FIRST result is examined, so only k similar
documents for that ONE (first) TOP search result.

Are you really saying that you don't comprehend what the difference is, or
simply that you don't LIKE the difference?! Or, maybe that you are wondering
WHY they are different? That latter question I don't have the answer to.

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Friday, December 28, 2012 2:48 AM
To: solr-user@lucene.apache.org
Subject: RE: MoreLikeThis supporting multiple document IDs as input?

So the Search Components are executed in series on _every_ request. I
presume then that they look at the request parameters and decide what and
whether to take action.

So in the case of the MLT component this was said:

> The MLT search component returns similar documents for each of the 
> documents in the search results, but processes each search result base 
> document one at a time and keeps its similar documents segregated by 
> each of the base documents.

So what I think I understand is that the Query Component (presumably this
guy: org.apache.solr.handler.component.QueryComponent) takes the input from
the "q" parameter and returns a result (the "q=id:123456" ensure that the
Query Component will return just this one document).

The MltComponent then looks at the result from the QueryComponent and
generates its results.

The part that is still confusing is understanding the difference between
these two comments:

- The MLT search component returns similar documents for each of the
documents in the search results
- The MLT handler returns similar documents only for 

Search strategy - improving search quality for short search terms such as "doll"

2013-01-16 Thread David Parks
I'm a beginner-to-intermediate solr admin; I've set up the basics for our
application and it runs well.

 

Now it's time for me to dig in and start tuning and improving queries.

 

My next target is searches on simple terms such as "doll" which, in google,
would return documents about, well, "toy dolls", because that's the most
common usage of the simple term "doll". But in my index it predominantly
returns documents about CDs with the song "Doll Face", and "My baby doll" in
them.

 

I'm not directly asking how to solve this as much as I'm asking what
direction I should be looking in to learn what I need to know to tackle the
general issue myself.

 

Left on my own I would start looking at categorizing the CD's into a facet
called "music", reasonably doable in my dataset. Then I need to reduce the
boost-value of the entire facet/category of music unless certain pre-defined
query terms exist, such as [music, cd, song, listen, dvd, , etc.]. 

 

I don't yet know how to do all of this, but after a couple more good books I
should be "dangerous".

 

So the question to this list:

 

-  Am I on the right track here?  If not, can you point me in a
direction to go?

 

 



RE: Search strategy - improving search quality for short search terms such as "doll"

2013-01-16 Thread David Parks
Thanks for the recommendation. I'll start this book today.

In my example, "doll" is one example of a million I might only guess at,
whereas the categories "music" and "book" tend to interfere in many places and
seem to be a more limited set of categories to deal with.

Dave


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, January 17, 2013 12:01 AM
To: solr-user@lucene.apache.org
Subject: Re: Search strategy - improving search quality for short search terms 
such as "doll"

Sounds like 'Doll' could be a category for you, while "Doll face" is a title.
Maybe the categories should get a higher boost in the eDismax definition over
the titles?
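For example, something along these lines (field names and boosts are
hypothetical):

  ...&defType=edismax&qf=category^4+item_name^2+long_description&q=doll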

Related, you may find the following book interesting:
http://rosenfeldmedia.com/books/searchanalytics/

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Jan 16, 2013 at 4:40 AM, David Parks  wrote:

> I'm a beginner-intermediate solr admin, I've set up the basics for our 
> application and it runs well.
>
>
>
> Now it's time for me to dig in and start tuning and improving queries.
>
>
>
> My next target is searches on simple terms such as "doll" which, in 
> google, would return documents about, well, "toy dolls", because 
> that's the most common usage of the simple term "doll". But in my 
> index it predominantly returns documents about CDs with the song "Doll Face", 
> and "My baby doll"
> in
> them.
>
>
>
> I'm not directly asking how to solve this as much as I'm asking what 
> direction I should be looking in to learn what I need to know to 
> tackle the general issue myself.
>
>
>
> Left on my own I would start looking at categorizing the CD's into a 
> facet called "music", reasonably doable in my dataset. Then I need to 
> reduce the boost-value of the entire facet/category of music unless 
> certain pre-defined query terms exist, such as [music, cd, song, 
> listen, dvd,  exhaustive list>, etc.].
>
>
>
> I don't yet know how to do all of this, but after a couple more good 
> books I should be "dangerous".
>
>
>
> So the question to this list:
>
>
>
> -  Am I on the right track here?  If not, can you point me in a
> direction to go?
>
>
>
>
>
>



RE: Search strategy - improving search quality for short search terms such as "doll"

2013-01-16 Thread David Parks
My issue is more that the search term "doll" shows up both in documents about
CDs and in documents about toys. But I have 10 CD documents for every toy
document, so my searches for "doll" tend to show the CDs most prominently.
But that's not the way a user thinks. If they want the CD documents they'll
search for "doll face", or "doll face song", more specific queries (which
work fine), but if they want the toy they might just search for "doll".

If I run the searches "doll" and "doll song" on google image search you'll
clearly see that google has solved this problem perfectly. "doll" returns
toy dolls, and "doll song" returns music and anime results.

I'm striving for this type of result.



-Original Message-
From: Amit Jha [mailto:shanuu@gmail.com] 
Sent: Wednesday, January 16, 2013 11:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Search strategy - improving search quality for short search
terms such as "doll"

It's all about the data set, by which I mean the index. If you have documents
containing "toy" and "doll" it will return them in the result set.

What I understood is that you are talking about the context of the query. For
example, if you search "books on MK Gandhi" and "books by MK Gandhi", the two
queries have different contexts.

Context-based search is at some level achieved by natural language processing.
That is one area you can look at for better search.

The solr wiki & mailing list would be great sources of learning.


Rgds
AJ

On 16-Jan-2013, at 15:10, "David Parks"  wrote:

> I'm a beginner-intermediate solr admin, I've set up the basics for our 
> application and it runs well.
> 
> 
> 
> Now it's time for me to dig in and start tuning and improving queries.
> 
> 
> 
> My next target is searches on simple terms such as "doll" which, in 
> google, would return documents about, well, "toy dolls", because 
> that's the most common usage of the simple term "doll". But in my 
> index it predominantly returns documents about CDs with the song "Doll 
> Face", and "My baby doll" in them.
> 
> 
> 
> I'm not directly asking how to solve this as much as I'm asking what 
> direction I should be looking in to learn what I need to know to 
> tackle the general issue myself.
> 
> 
> 
> Left on my own I would start looking at categorizing the CD's into a 
> facet called "music", reasonably doable in my dataset. Then I need to 
> reduce the boost-value of the entire facet/category of music unless 
> certain pre-defined query terms exist, such as [music, cd, song, 
> listen, dvd, , etc.].
> 
> 
> 
> I don't yet know how to do all of this, but after a couple more good 
> books I should be "dangerous".
> 
> 
> 
> So the question to this list:
> 
> 
> 
> -  Am I on the right track here?  If not, can you point me in a
> direction to go?
> 
> 
> 
> 
> 



Field Collapsing - Anything in the works for multi-valued fields?

2013-01-17 Thread David Parks
I want to configure Field Collapsing, but my target field is multi-valued
(e.g. the field I want to group on has a variable # of entries per document,
1-N entries).

I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that
grouping doesn't support multi-valued fields yet.

Anything in the works on that front by chance?  Any common work-arounds?




RE: Field Collapsing - Anything in the works for multi-valued fields?

2013-01-17 Thread David Parks
The documents are individual products which come from 1 or more vendors.
Example: a 'toy spiderman doll' sold by 2 vendors is still 1 document.
Most fields are multi-valued (short_description from each of the 2 vendors,
long_description, product_name, vendor, etc. the same).

I'd like to collapse on the vendor in an attempt to ensure that vast
collections of books, music, and movies, by just a few vendors, don't
overwhelm the results simply because, given the sheer volume of books, CDs,
and DVDs relative to other product items, they contain every search term
imaginable.

But in this case there is clearly 1...N vendors per document, solidly a
multi-valued field. And it's hard to put a maximum number of vendors
possible.

Thanks,
Dave


-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Friday, January 18, 2013 2:32 AM
To: solr-user
Subject: Re: Field Collapsing - Anything in the works for multi-valued
fields?

David,

What's the documents and the field? It can help to suggest workaround.


On Thu, Jan 17, 2013 at 5:51 PM, David Parks  wrote:

> I want to configure Field Collapsing, but my target field is 
> multi-valued (e.g. the field I want to group on has a variable # of 
> entries per document, 1-N entries).
>
> I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that 
> grouping doesn't support multi-valued fields yet.
>
> Anything in the works on that front by chance?  Any common work-arounds?
>
>
>


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 



RE: Field Collapsing - Anything in the works for multi-valued fields?

2013-01-18 Thread David Parks
If I understand the reading correctly, you've suggested that I index the
vendor names as their own documents (currently they are a multi-valued field
on each document).

Each such "vendor document" would just have a single valued 'name' field.

Each normal product document would contain a multi-valued field that is a
list of "vendor document IDs" and that we use to join the query results with
the vendor documents.

I presume this means that I would have some kind of dynamic field created
from the join which I could use as the 'group.field' value? 

I didn't quite follow the last point.



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Friday, January 18, 2013 9:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Field Collapsing - Anything in the works for multi-valued
fields?

Hi,

Instead of the multi-valued fields, would a parent-child setup work for you here?

See http://search-lucene.com/?q=solr+join&fc_type=wiki

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Jan 17, 2013 at 8:04 PM, David Parks  wrote:

> The documents are individual products which come from 1 or more vendors.
> Example: a 'toy spiderman doll' is sold by 2 vendors, that is 1 document.
> Most fields are multi valued (short_description from each of the 2 
> vendors, long_description, product_name, vendor, etc. the same).
>
> I'd like to collapse on the vendor in an attempt to ensure that vast 
> collections of books, music, and movies, by just a few vendors, don't 
> overwhelm the results simply due to the fact that they have every 
> search term imaginable due to the sheer volume of books, CDs, and 
> DVDs, in relation to other product items.
>
> But in this case there is clearly 1...N vendors per document, solidly 
> a multi-valued field. And it's hard to put a maximum number of vendors 
> possible.
>
> Thanks,
> Dave
>
>
> -Original Message-
> From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
> Sent: Friday, January 18, 2013 2:32 AM
> To: solr-user
> Subject: Re: Field Collapsing - Anything in the works for multi-valued 
> fields?
>
> David,
>
> What's the documents and the field? It can help to suggest workaround.
>
>
> On Thu, Jan 17, 2013 at 5:51 PM, David Parks 
> wrote:
>
> > I want to configure Field Collapsing, but my target field is 
> > multi-valued (e.g. the field I want to group on has a variable # of 
> > entries per document, 1-N entries).
> >
> > I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) 
> > that grouping doesn't support multi-valued fields yet.
> >
> > Anything in the works on that front by chance?  Any common work-arounds?
> >
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  
>
>



After upgrade to solr4, search doesn't work

2013-03-05 Thread David Parks
I just upgraded from solr3 to solr4, and I wiped the previous work and
reloaded 500,000 documents.

I see in solr that I loaded the documents, and from the console, if I do a
query "*:*" I see documents returned.

I copied a single word from the text of the query results I got from "*:*"
but any query I do with a term returns 0 results, even though it's clear
from the "*:*" query that solr has that document.

Any ideas on where to start looking here?

David




Re: After upgrade to solr4, search doesn't work

2013-03-05 Thread David Parks
Good thought, thanks for the quick reply too.

Seems that this is still set to my unique ID field:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">id</str>
    </lst>

I wonder if I have somehow lost the configuration that specifies that the other 
fields should be searched as well, though my schema hasn't changed and they're 
certainly indexed:






 From: Jack Krupansky 
To: solr-user@lucene.apache.org 
Sent: Wednesday, March 6, 2013 1:34 PM
Subject: Re: After upgrade to solr4, search doesn't work
 
You may simply need to set the default value of the "df" parameter in the 
/select request handler in solrconfig.xml to be your default query field name 
if it is not "text".
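
For example, something along these lines in solrconfig.xml (a sketch only;
'text' stands in for whatever your default query field actually is):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>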

-- Jack Krupansky

-----Original Message- From: David Parks
Sent: Wednesday, March 06, 2013 1:26 AM
To: solr-user@lucene.apache.org
Subject: After upgrade to solr4, search doesn't work

I just upgraded from solr3 to solr4, and I wiped the previous work and
reloaded 500,000 documents.

I see in solr that I loaded the documents, and from the console, if I do a
query "*:*" I see documents returned.

I copied a single word from the text of the query results I got from "*:*"
but any query I do with a term returns 0 results, even though it's clear
from the "*:*" query that solr has that document.

Any ideas on where to start looking here?

David

Re: After upgrade to solr4, search doesn't work

2013-03-05 Thread David Parks
All but the unique ID field use the out-of-the-box default text_en_splitting 
field type; this was copied over from v3 to v4 without change, as far as I know.

I've done the import from scratch (deleted the solr data directory and 
re-imported and committed).








 From: mani arasu 
To: solr-user@lucene.apache.org 
Sent: Wednesday, March 6, 2013 1:37 PM
Subject: Re: After upgrade to solr4, search doesn't work
 
You should probably be looking at which Analyzer you used in solr version
3.x and which one you are using in solr version 4.x.
If there is any change in that you may have to do either of the following:

   - Do a full-import so that documents are created according to your new
   schema
   - Do a search on the previously created documents, considering the way
   your documents are Analysed and Indexed as per solr version 3.x


On Wed, Mar 6, 2013 at 11:56 AM, David Parks  wrote:

> I just upgraded from solr3 to solr4, and I wiped the previous work and
> reloaded 500,000 documents.
>
> I see in solr that I loaded the documents, and from the console, if I do a
> query "*:*" I see documents returned.
>
> I copied a single word from the text of the query results I got from "*:*"
> but any query I do with a term returns 0 results, even though it's clear
> from the "*:*" query that solr has that document.
>
> Any ideas on where to start looking here?
>
> David
>
>
>

Re: After upgrade to solr4, search doesn't work

2013-03-05 Thread David Parks
Oops, I didn't include the full XML there, hopefully this formats ok.






 From: David Parks 
To: "solr-user@lucene.apache.org"  
Sent: Wednesday, March 6, 2013 1:58 PM
Subject: Re: After upgrade to solr4, search doesn't work
 
All but the unique ID field use the out-of-the-box default text_en_splitting 
field type, this copied over from v3 to v4 without change as far as I know.

I've done the import from scratch (deleted the solr data directory and 
re-imported and committed).








From: mani arasu 
To: solr-user@lucene.apache.org 
Sent: Wednesday, March 6, 2013 1:37 PM
Subject: Re: After upgrade to solr4, search doesn't work

You should probably be looking at which Analyzer you used in solr version
3.x and which one you are using in solr version 4.x.
If there is any change in that you may have to do either of the following:

   - Do a full-import so that documents are created according to your new
   schema
   - Do a search on the previously created documents, considering the way
   your documents are Analysed and Indexed as per solr version 3.x


On Wed, Mar 6, 2013 at 11:56 AM, David Parks  wrote:

> I just upgraded from solr3 to solr4, and I wiped the previous work and
> reloaded 500,000 documents.
>
> I see in solr that I loaded the documents, and from the console, if I do a
> query "*:*" I see documents returned.
>
> I copied a single word from the text of the query results I got from "*:*"
> but any query I do with a term returns 0 results, even though it's clear
> from the "*:*" query that solr has that document.
>
> Any ideas on where to start looking here?
>
> David
>
>
>

Re: After upgrade to solr4, search doesn't work

2013-03-05 Thread David Parks
Ah, I think I see the issue: in the debug results it's only searching the id 
field, which is the unique ID; that must have gotten changed in the upgrade. In 
fact, I think I might have had a misconfiguration in the 3.x version here. Can I 
set it to query multiple fields by default? I tried a comma-separated list of 
my fields here, but that was invalid.

dvddvdid:dvdid:dvd




 From: David Parks 
To: "solr-user@lucene.apache.org"  
Sent: Wednesday, March 6, 2013 1:52 PM
Subject: Re: After upgrade to solr4, search doesn't work
 
Good thought, thanks for the quick reply too.

Seems that this is still set to my unique ID field:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">id</str>
    </lst>

I wonder if I have somehow lost the configuration that specifies that the other 
fields should be searched as well, though my schema hasn't changed and they're 
certainly indexed:






From: Jack Krupansky 
To: solr-user@lucene.apache.org 
Sent: Wednesday, March 6, 2013 1:34 PM
Subject: Re: After upgrade to solr4, search doesn't work

You may simply need to set the default value of the "df" parameter in the 
/select request handler in solrconfig.xml to be your default query field name 
if it is not "text".

-- Jack Krupansky

-Original Message- From: David Parks
Sent: Wednesday, March 06, 2013 1:26 AM
To: solr-user@lucene.apache.org
Subject: After upgrade to solr4, search doesn't work

I just upgraded from solr3 to solr4, and I wiped the previous work and
reloaded 500,000 documents.

I see in solr that I loaded the documents, and from the console, if I do a
query "*:*" I see documents returned.

I copied a single word from the text of the query results I got from "*:*"
but any query I do with a term returns 0 results, even though it's clear
from the "*:*" query that solr has that document.

Any ideas on where to start looking here?

David

RE: After upgrade to solr4, search doesn't work

2013-03-07 Thread David Parks
I had actually totally blown my previous configuration and didn't know it
(luckily it didn't reach production this way). I'm glad I ran into this
problem. I had defaulted the queries to one of the most useful fields and
never realized I wasn't searching the others. Thanks very much for all your
help on this; it certainly helped me get my configuration straight, and the
upgrade to 4 is now complete.

All the best,
David


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, March 06, 2013 7:56 PM
To: solr-user@lucene.apache.org; David Parks
Subject: Re: After upgrade to solr4, search doesn't work

I imagine that you had a "qf" parameter in your old query request handler,
so add "qf" to the new query request handler. "df" is used only if "qf" 
is missing.
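
For example, roughly (a sketch only; the field names and boosts are
placeholders for your own schema, and defType would be dismax or edismax):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">product_name^2 short_description long_description</str>
    </lst>
  </requestHandler>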

-- Jack Krupansky

-Original Message-
From: David Parks
Sent: Wednesday, March 06, 2013 2:18 AM
To: solr-user@lucene.apache.org ; David Parks
Subject: Re: After upgrade to solr4, search doesn't work

Ah, I think I see the issue, in the debug results it's only searching the id
field, which is the unique ID, that must have gotten changed in the upgrade.

In fact I think I might have had a misconfiguration in the 3.x version here.

Can I set it to query multiple fields by default? I tried a comma separated
list of my fields here but that was invalid.

dvddvdid:dvdid:dvd




From: David Parks 
To: "solr-user@lucene.apache.org" 
Sent: Wednesday, March 6, 2013 1:52 PM
Subject: Re: After upgrade to solr4, search doesn't work

Good thought, thanks for the quick reply too.

Seems that this is still set to my unique ID field:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">id</str>
    </lst>

I wonder if I have somehow lost the configuration that specifies that the
other fields should be searched as well, though my schema hasn't changed and
they're certainly indexed:






From: Jack Krupansky 
To: solr-user@lucene.apache.org
Sent: Wednesday, March 6, 2013 1:34 PM
Subject: Re: After upgrade to solr4, search doesn't work

You may simply need to set the default value of the "df" parameter in the
/select request handler in solrconfig.xml to be your default query field
name if it is not "text".

-- Jack Krupansky

-Original Message- From: David Parks
Sent: Wednesday, March 06, 2013 1:26 AM
To: solr-user@lucene.apache.org
Subject: After upgrade to solr4, search doesn't work

I just upgraded from solr3 to solr4, and I wiped the previous work and
reloaded 500,000 documents.

I see in solr that I loaded the documents, and from the console, if I do a
query "*:*" I see documents returned.

I copied a single word from the text of the query results I got from "*:*"
but any query I do with a term returns 0 results, even though it's clear
from the "*:*" query that solr has that document.

Any ideas on where to start looking here?

David 



Is Solr more CPU bound or IO bound?

2013-03-17 Thread David Parks
I'm spec'ing out some hardware for a first go at our production Solr
instance, but I haven't spent enough time loadtesting it yet.

 

What I want to ask is how IO intensive Solr is vs. CPU intensive, typically.

 

Specifically, I'm considering whether to dual-purpose the Solr servers to run
Solr and another CPU-only application we have. I know Solr uses a fair amount
of CPU, but if it is also very disk intensive, it might be a net benefit to
run more Solr instances that share CPU with the other app, rather than running
Solr separately from a CPU-only app that wouldn't otherwise touch the disk.

 

Thoughts on this?

 

Thanks,

David

 



RE: Is Solr more CPU bound or IO bound?

2013-03-17 Thread David Parks
Thank you, Manu, for that excellent discussion of the topic; I could have
been more detailed about my use case.

We'll be indexing off of the main production servers (either on a master or
in Hadoop; we've yet to build out that piece of the puzzle). We don't store
documents at all; we only store the index data and return a document ID.
Each document is small, maybe 1 KB of text. We do have a few "interesting"
queries in which we do some grouping.

We currently index 100GB of input data, that'll grow 2x or 3x in the near
future.

So based on your experience, it seems likely that we'll be CPU bound (heavy
queries against a static index updated nightly from the master), thus
nullifying the advantage of dual-purposing a box with another CPU bound app.

Very useful discussion; I'll get proper load tests done in time, but this
helps direct my thinking now.

David



-Original Message-
From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel
Le Normand
Sent: Monday, March 18, 2013 9:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Is Solr more CPU bound or IO bound?

Your question is typically use-case dependent; the bottleneck will change
from user to user.

These are the two main issues that will affect the answer:
1. How you index: what is your indexing rate (how many docs a day)? How big
is a typical document? How many documents do you plan on indexing in total?
Do you store fields? Calculate their term vectors?
2. How your retrieval process looks: what query rate do you expect? Are
there common queries (taking advantage of the cache)? How complex are the
queries (faceted / highlighted / filtered / how many conditions, NRT)? Do
you plan to retrieve stored fields or only IDs?

After answering all that, there's an iterative game between hardware
configuration and software configuration (how you split your shards, use
your cache, tune your merges and flushes, etc.) that will also affect the
IO / CPU bound answer.

In my use case, for example, the indexing part is IO bound, but as my
indexing rate is well below the rate my machine could initially provide, it
didn't affect my hardware spec.
After fine-tuning my configuration, I discovered my retrieval process was CPU
bound and was directly affecting my average response time, while the IO rate
in cache usage was quite low.

Try describing your use case in more detail, answering the above questions, so
we'll be able to give you guidelines.

Best,
Manu


On Mon, Mar 18, 2013 at 3:55 AM, David Parks  wrote:

> I'm spec'ing out some hardware for a first go at our production Solr 
> instance, but I haven't spent enough time loadtesting it yet.
>
>
>
> What I want to ask is how IO intensive solr is vs. CPU intensive, 
> typically.
>
>
>
> Specifically I'm considering whether to dual-purpose the Solr servers 
> to run Solr and another CPU-only application we have. I know Solr uses 
> a fair amount of CPU, but if it also is very disk intensive it might 
> be a net benefit to have more instances running Solr and share the CPU 
> resources with the other app than to run Solr separate from the other 
> CPU app that wouldn't otherwise use the disk.
>
>
>
> Thoughts on this?
>
>
>
> Thanks,
>
> David
>
>
>
>