Slow queries for common terms
I've got a query that takes 15 seconds to return whenever I have the term "book" in a query that isn't cached. That's a pretty common term in our search index. We're indexing about 120 GB of text data. We only store terms and IDs, no document data, and the disk is virtually unused; it's all CPU time.

I haven't done much yet to optimize and scale Solr, as we're only trying to support a small number of users in a private beta. I currently only have a couple of gigs of RAM dedicated to Solr (we've ordered more hardware for it, but it's not in yet).

I wonder if there's something I can do in the short term to alleviate the problem. Many searches work great, but the ones that take 15+ sec are a black eye. I'd be happy with a short-term fix followed in the near future by a more proper long-term fix.

If I were to take a stab at this I'd say the following two are the short- and long-term solutions:

. Short: Configure Solr to short-circuit queries, thus reducing query quality but guaranteeing a certain response time (I'm OK with this tradeoff, but is this possible, or does it contain risks I need to consider more?)

. Long: Implement sharding, get more hardware resources for these boxes and split up the index across multiple servers.

Am I on track in my thinking here?

Thanks,
David

My long query (debug output, condensed):

QTime: 15464 ms, q="cook book fourth baby", wt=xml
Parsed query: all:cook all:book all:fourth all:babi
Explain for the top document (doc 3428426), score 10.476225 = 13.9683 * coord(3/4):
  all:cook  weight 7.988946   idf = 5.925234  (docFreq=931212,   maxDocs=128248074)
  all:book  weight 0.29064935 idf = 1.3440113 (docFreq=90917737, maxDocs=128248074)
  all:babi  weight 5.688705   idf = 5.9460006 (docFreq=912073,   maxDocs=128248074)
[explain output for the remaining matching documents omitted]
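For the "short-circuit" idea above: Solr has a `timeAllowed` request parameter that stops collecting results once a time budget (in milliseconds) is used up and returns whatever was gathered, marked with a `partialResults` flag in the response header. It only bounds the document-collection phase, so it is a partial safety net rather than a hard guarantee. A minimal sketch, assuming a standard /select handler and an illustrative (not recommended) 2-second budget:

```xml
<!-- solrconfig.xml (sketch): make timeAllowed a default for the /select handler.
     The 2000 ms budget is purely illustrative. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <!-- Abort result collection after roughly 2 seconds and return partial results -->
    <int name="timeAllowed">2000</int>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request (e.g. &timeAllowed=2000 on the query string), which may be a safer way to trial it than changing the handler defaults.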
RE: Slow queries for common terms
We have 300M documents, each about a paragraph of text on average. The index is 140GB in size. I'm not sure how to find the IDF score; was that in the debug query below?

It seems that any query with the word "book" in it triggers a 15 sec response time (unless it's the 2nd time we run the same query). Looking at terms, 'book' is the 2nd highest term, with 90M documents in the index.

Calling 'book' a stop word doesn't seem reasonable, and while that article on bigrams and common grams is fascinating, I wonder if it addresses this situation: we aren't really likely to get a bi-gram phrase match between the search "book sales improvement" and the terms in the document "category book marketing and sales today the real guide to improving", right? I think this is what's happening here: everything with a common phrase "category book" is getting included, which seems logical and correct.

-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com]
Sent: Thursday, March 21, 2013 5:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

I think you can start by reading this blog http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 and try out the approach using a dictionary of the most common words in your index.

You don't say how many documents, avg. doc size, the IDF value of "book", how much RAM, whether you utilize disk caching well enough and many other things which could affect this situation. But the pure fact that only a few common search words trigger such a delay would suggest commongrams as a possible way forward.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

21. mars 2013 kl. 11:09 skrev David Parks :

> I've got a query that takes 15 seconds to return whenever I have the
> term "book" in a query that isn't cached. That's a pretty common term
> in our search index. We're indexing about 120 GB of text data. We only
> store terms and IDs, no document data, and the disk is virtually
> unused, it's all CPU time.
>
> I haven't done much yet to optimize and scale Solr, as we're only
> trying to support a small number of users in a private beta. I
> currently only have a couple of gigs of RAM dedicated to Solr (we've
> ordered more hardware for it, but it's not in yet).
>
> I wonder if there's something I can do in the short term to alleviate
> the problem. Many searches work great, but these ones that take 15+
> sec are a black eye. I'd be happy with a short term fix followed in
> the near future by a more proper long-term fix.
>
> If I were to take a stab at this I'd say the following two are the
> short and long term solutions:
>
> . Short: Configure Solr to short-circuit queries, thus reducing query
> quality, but guaranteeing a certain response time (I'm ok with this
> tradeoff, but is this possible or contains risks I need to consider
> more?)
>
> . Long: Implement sharding, get more hardware resources for these
> boxes and split up the index across multiple servers.
>
> Am I on track in my thinking here?
>
> Thanks,
>
> David
>
> My long query:
>
> [quoted debug output snipped]
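One concrete way to apply the commongrams idea from the blog post Jan cites is the CommonGramsFilterFactory analysis chain, which indexes frequent terms fused with their neighbors so phrase queries on common words read much shorter postings lists. A schema.xml sketch, assuming a hypothetical commongrams.txt listing the highest-frequency terms (e.g. "book"); note that, as David's reply suggests, the win is chiefly for phrase and proximity queries rather than plain boolean term matches:

```xml
<!-- schema.xml (sketch): field type using CommonGrams.
     "commongrams.txt" is a hypothetical file of the index's most frequent terms. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits both the plain tokens and fused bigrams for the listed common words -->
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Prefers the fused bigrams at query time so phrase queries skip the huge single-term lists -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Any field switched to an analysis chain like this has to be re-indexed before the query-side filter is enabled.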
RE: Slow queries for common terms
I figured I was trying to pull a coup here, but this is a temporary configuration while we only run a few users through an early beta. The performance is perfectly good for most terms, it's just this 'book' term. I'm curious how adding RAM will solve that. I can see how deploying SolrCloud and sharding should affect it, but would simply giving Solr 16GB of RAM improve query time with this one term that is common to 90M of the 300M documents?

In due time I do plan to implement SolrCloud and run the whole thing through proper load testing. Right now I'm just trying to get it to "work" for a few users.

If you could elaborate a bit on your thinking I'd be quite grateful.

David

-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com]
Sent: Thursday, March 21, 2013 8:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

If you say that you try to index 300M docs in ONE single Solr server, with "a few gigs" of RAM, then that's the reason for some bad performance right there. You should benchmark to find the sweet-spot of how many documents you want to fit per node/shard and still have acceptable indexing/query performance.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

21. mars 2013 kl. 12:43 skrev David Parks : > We have 300M documents, each about a paragraph of text on average. The > index is 140GB in size. I'm not sure how to find the IDF score, was > that in the debug query below? > > It seems that any query with the word "book" in it triggers a 15 sec > response time (unless it's the 2nd time we run the same query). > Looking at terms, 'book' is the 2nd highest term with 90M documents in the index. > > Calling 'book' a stop word doesn't seem reasonable, and while that > article on bigrams and common grams is fascinating, I wonder if it > addresses this situation, in which we aren't really likely to manage a > bi-gram phrase match between the search "book sales improvement", and the terms in the document: > "category book marketing and sales today the real guide to improving" right? > I think this is what's happening here, everything with a common phrase > "category book" is getting included, which seems logical and correct. > > > > -Original Message- > From: Jan Høydahl [mailto:jan@cominvent.com] > Sent: Thursday, March 21, 2013 5:43 PM > To: solr-user@lucene.apache.org > Subject: Re: Slow queries for common terms > > Hi, > > I think you can start by reading this blog > http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-co > mmon-w > ords-part-2 and try out the approach using a dictionary of the most > common words in your index. > > You don't say how many documents, avg. doc size, the IDF value of > "book", how much RAM, whether you utilize disk caching well enough and > many other things which could affect this situation. But the pure fact > that only a few common search words trigger such a delay would suggest > commongrams as a possible way forward. > > -- > Jan Høydahl, search solution architect Cominvent AS - > www.cominvent.com Solr Training - www.solrtraining.com > > 21. mars 2013 kl. 11:09 skrev David Parks : > >> I've got a query that takes 15 seconds to return whenever I have the >> term "book" in a query that isn't cached. That's a pretty common term >> in our search index. We're indexing about 120 GB of text data. We >> only store terms and IDs, no document data, and the disk is virtually >> unused, it's all CPU time.
>> >> >> >> I haven't done much yet to optimizing and scale solr, as we're only >> trying to support a small number of users in a private beta. I >> currently only have a couple of gigs of ram dedicated to Solr (we've >> ordered more hardware for it, but it's not in yet). >> >> >> >> I wonder if there's something I can do in the short term to alleviate >> the problem. Many searches work great, but these ones that take 15+ >> sec are a black eye. I'd be happy with a short term fix followed in >> the near future by a more proper long-term fix. >> >> >> >> If I were to take a stab at this I'd say the following two are the >> short and long term solutions: >> >> . Short: Configure solr to short-circuit quries, thus reducing > query >> quality, but guaranteeing a certain response time (I'm ok with this >> tradeoff, but is this possible or con
RE: Slow queries for common terms
I see the CPU working very hard, and at the same time I see 2 MB/sec disk access for that 15 seconds. I am not running it this instant, but it seems to me that there were more CPU cycles available, so unless it's an issue of not being able to multithread it any further, I'd say it's more IO related.

I'm going to set up SolrCloud and shard across the 2 servers I have available for now. It's not an optimal setup while we're in a private beta period, but maybe it'll improve things (I've got 2 servers with 2x 4TB disks in raid-0 shared with the webservers).

I'll work towards some improved IO performance and maybe more shards and see how things go. I'll also be able to up the RAM in just a couple of weeks.

Are there any settings I should think of in terms of improving cache performance when I can give it say 10GB of RAM?

Thanks, this has been tremendously helpful.

David

-Original Message-
From: Tom Burton-West [mailto:tburt...@umich.edu]
Sent: Saturday, March 23, 2013 1:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi David and Jan,

I wrote the blog post, and David, you are right, the problem we had was with phrase queries because our positions lists are so huge. Boolean queries don't need to read the positions lists. I think you need to determine whether you are CPU bound or I/O bound. It is possible that you are I/O bound and reading the term frequency postings for 90 million docs is taking a long time. In that case, more memory in the machine (but not dedicated to Solr) might help because Solr relies on OS disk caching for caching the postings lists. You would still need to do some cache warming with your most common terms.

On the other hand, as Jan pointed out, you may be CPU bound because Solr doesn't have early termination and has to rank all 90 million docs in order to show the top 10 or 25.

Did you try the OR search to see if your CPU is at 100%?

Tom

On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl wrote: > Hi > > There might not be a final cure with more RAM if you are CPU bound. > Scoring 90M docs is some work. Can you check what's going on during > those > 15 seconds? Is your CPU at 100%? Try an (foo OR bar OR baz) search > which generates >100mill hits and see if that is slow too, even if you > don't use frequent words. > > I'm sure you can find other frequent terms in your corpus which > display similar behaviour, words which are even more frequent than > "book". Are you using "AND" as default operator? You will benefit from > limiting the number of results as much as possible. > > The real solution is to shard across N number of servers, until you > reach the desired performance for the desired indexing/querying load. > > -- > Jan Høydahl, search solution architect Cominvent AS - > www.cominvent.com Solr Training - www.solrtraining.com > >
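On David's question about cache settings once 10GB of RAM is available: most of that memory is usually best left to the OS page cache rather than the Solr heap, but the solrconfig.xml caches can still be sized deliberately. A rough sketch with illustrative, untuned values:

```xml
<!-- solrconfig.xml (sketch): illustrative cache sizes, not tuned recommendations.
     The bulk of the machine's RAM should stay unallocated so the OS can cache the index. -->
<query>
  <filterCache      class="solr.FastLRUCache" size="512"  initialSize="512"  autowarmCount="64"/>
  <queryResultCache class="solr.LRUCache"     size="512"  initialSize="512"  autowarmCount="32"/>
  <documentCache    class="solr.LRUCache"     size="1024" initialSize="1024" autowarmCount="0"/>
  <!-- Cache a window of results so paging through the first few pages is cheap -->
  <queryResultWindowSize>50</queryResultWindowSize>
</query>
```

The autowarmCount values control how many entries are carried over when a new searcher opens after a commit, which matters mostly for indexes that are updated frequently.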
RE: Slow queries for common terms
"book" by itself returns in 4s (non-optimized disk IO); running it a second time returned 0s, so I think I can presume that the query was not cached the first time. This system has been up for a week, so it's warm. I'm going to give your article a good long read, thanks for that.

I guess good fast disks/SSDs and sharding should also improve on the base 4 sec query time. How _does_ Google get their query times down to 0.35s anyway? I presume their indexes are larger than my 150G index. :)

I still am a bit worried about what will happen when my index is 500GB (it'll happen soon enough), even with sharding... well... I'd just need a lot of servers it seems, and my feeling is that if I need a lot of servers for a few users, how will it scale to many users?

Thanks for the great discussion,
Dave

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, March 25, 2013 10:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

take a look here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

looking at memory consumption can be a bit tricky to interpret with MMapDirectory. But you say "I see the CPU working very hard" which implies that your issue is just scoring 90M documents. A way to test: try q=*:*&fq=field:book. My bet is that that will be much faster, in which case scoring is your choke-point and you'll need to spread that load across more servers, i.e. shard.

When running the above, make sure of a couple of things:
1> you haven't run the fq query before (or you have filterCache turned completely off).
2> you _have_ run a query or two that warms up your low-level caches. Doesn't matter what, just as long as it doesn't have an fq clause.

Best
Erick

On Sat, Mar 23, 2013 at 3:10 AM, David Parks wrote: > I see the CPU working very hard, and at the same time I see 2 MB/sec > disk access for that 15 seconds. I am not running it this instant, but > it seems to me that there was more CPU cycles available, so unless > it's an issue of not being able to multithread it any further I'd say it's more IO related. > > I'm going to set up solr cloud and shard across the 2 servers I have > available for now. It's not an optimal setup we have while we're in a > private beta period, but maybe it'll improve things (I've got 2 > servers with 2x 4TB disks in raid-0 shared with the webservers). > > I'll work towards some improved IO performance and maybe more shards > and see how things go. I'll also be able to up the RAM in just a > couple of weeks. > > Are there any settings I should think of in terms of improving cache > performance when I can give it say 10GB of RAM? > > Thanks, this has been tremendously helpful. > > David > > > -Original Message- > From: Tom Burton-West [mailto:tburt...@umich.edu] > Sent: Saturday, March 23, 2013 1:38 AM > To: solr-user@lucene.apache.org > Subject: Re: Slow queries for common terms > > Hi David and Jan, > > I wrote the blog post, and David, you are right, the problem we had > was with phrase queries because our positions lists are so huge. > Boolean > queries don't need to read the positions lists. I think you need to > determine whether you are CPU bound or I/O bound.It is possible that > you are I/O bound and reading the term frequency postings for 90 > million docs is taking a long time. In that case, More memory in the > machine (but not dedicated to Solr) might help because Solr relies on > OS disk caching for caching the postings lists.
You would still need > to do some cache warming with your most common terms. > > On the other hand as Jan pointed out, you may be cpu bound because > Solr doesn't have early termination and has to rank all 90 million > docs in order to show the top 10 or 25. > > Did you try the OR search to see if your CPU is at 100%? > > Tom > > On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl > wrote: > > > Hi > > > > There might not be a final cure with more RAM if you are CPU bound. > > Scoring 90M docs is some work. Can you check what's going on during > > those > > 15 seconds? Is your CPU at 100%? Try an (foo OR bar OR baz) search > > which generates >100mill hits and see if that is slow too, even if > > you don't use frequent words. > > > > I'm sure you can find other frequent terms in your corpus which > > display similar behaviour, words which are even more frequent than > > "book". Are you using "AND" as default operator? You will benefit > > from limiting the number of results as much as possible. > > > > The real solution is to shard across N number of servers, until you > > reach the desired performance for the desired indexing/querying load. > > > > -- > > Jan Høydahl, search solution architect Cominvent AS - > > www.cominvent.com Solr Training - www.solrtraining.com > > > > > >
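Tom's suggestion above about warming the caches with the most common terms can be wired into solrconfig.xml so it happens automatically whenever a searcher opens. A minimal sketch; the terms listed and the "all" field are just examples drawn from this thread, and in practice the list would come from the dictionary of frequent terms the blog post describes:

```xml
<!-- solrconfig.xml (sketch): fire a few common-term queries to warm the OS and Solr caches.
     The query terms below are placeholders taken from this thread. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">book</str><str name="df">all</str></lst>
    <lst><str name="q">cook</str><str name="df">all</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">book</str><str name="df">all</str></lst>
  </arr>
</listener>
```

This trades a slower searcher-open for faster first queries; the firstSearcher listener matters most after a restart, the newSearcher one after each commit.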
RE: MoreLikeThis - Odd results - what am I doing wrong?
Isn't this an AWS security groups question? You should probably post this question on the AWS forums, but for the moment, here's the basic reading material - go set up your EC2 security groups and lock down your systems. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html If you just want to password protect Solr here are the instructions: http://wiki.apache.org/solr/SolrSecurity But I most certainly would not leave it open to the world even with a password (note that the basic password authentication sends passwords in clear text if you're not using HTTPS, best lock the thing down behind a firewall). Dave -Original Message- From: DC tech [mailto:dctech1...@gmail.com] Sent: Tuesday, April 02, 2013 1:02 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis - Odd results - what am I doing wrong? OK - so I have my SOLR instance running on AWS. Any suggestions on how to safely share the link? Right now, the whole SOLR instance is totally open. Gagandeep singh wrote: >say &debugQuery=true&mlt=true and see the scores for the MLT query, not >a sample query. You can use Amazon ec2 to bring up your solr, you >should be able to get a micro instance for free trial. > > >On Mon, Apr 1, 2013 at 5:10 AM, dc tech wrote: > >> I did try the raw query against the *simi* field and those seem to >> return results in the order expected. >> For instance, Acura MDX has ( large, SUV, 4WD Luxury) in the simi field. >> Running a query with those words against the simi field returns the >> expected models (X5, Audi Q5, etc) and then the subsequent documents >> have decreasing relevance. So the basic query mechanism seems to be fine. >> >> The issue just seems to be with MoreLikeThis component and handler. >> I can post the index on a public SOLR instance - any suggestions? (or >> for >> hosting) >> >> >> On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh >> > >wrote: >> >> > If you can bring up your solr setup on a public machine then im >> > sure a >> lot >> > of debugging can be done. Without that, i think what you should >> > look at >> is >> > the tf-idf scores of the terms like "camry" etc. Usually idf is the >> > deciding factor into which results show at the top (tf should be 1 >> > for >> your >> > data). >> > Enable &debugQuery=true and look at explain section to see show >> > score is getting calculated. >> > >> > You should try giving different boosts to class, type, drive, size >> > to control the results. >> > >> > >> > On Sun, Mar 31, 2013 at 8:52 PM, dc tech wrote: >> > >> >> I am running some experiments on more like this and the results >> >> seem rather odd - I am doing something wrong but just cannot figure out >> >> what. >> >> Basically, the similarity results are decent - but not great. >> >> >> >> *Issue 1 = Quality* >> >> Toyota Camry : finds Altima (good) but then next one is Camry >> >> Hybrid whereas it should have found Accord. >> >> I have normalized the data into a simi field which has only the >> >> attributes that I care about. >> >> Without the simi field, I could not get mlt.qf boosts to work well >> enough >> >> to return results >> >> >> >> *Issue 2* >> >> Some fields do not work at all. For instance, text+simi (in >> >> mlt.fl) >> works >> >> whereas just simi does not. >> >> So some weirdness that am just not understanding. >> >> >> >> Would be grateful for your guidance ! >> >> >> >> >> >> Here is the setup: >> >> *1. 
SOLR Version*
>> solr-spec 4.2.0.2013.03.06.22.32.13
>> solr-impl 4.2.0 1453694 rmuir - 2013-03-06 22:32:13
>> lucene-spec 4.2.0
>> lucene-impl 4.2.0 1453694 - rmuir - 2013-03-06 22:25:29
>>
>> *2. Machine Information*
>> Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09)
>> Windows 7 Home 64 Bit with 4 GB RAM
>>
>> *3. Sample Data*
>> I created this 'dummy' data of cars - the idea being that these would be
>> sufficient and simple to generate similarity and understand how it would work.
>> There are 181 rows in the data set (I have attached it for reference in CSV format)
>>
>> [image: Inline image 1]
>>
>> *4. SCHEMA*
>> *Field Definitions*
>> [field definitions stripped in the archive; all fields were declared with termVectors="true", most with multiValued="false", one stored multiValued field]
>>
>> *Copy Fields*
>> [copyField directives stripped in the archive]
>>
>> Note that the "simi" field ends u
SolrCloud loadbalancing, replication, and failover
Step 1: distribute processing

We have 2 servers on which we'll run 2 SolrCloud instances. We'll define 2 shards so that both servers are busy for each request (improving response time of the request).

Step 2: Failover

We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), the other will be able to take over automatically. So we define 2 shards with a replication factor of 2.

So we have:

. Server 1: Shard 1, Replica 2

. Server 2: Shard 2, Replica 1

Question:

But in SolrCloud, replicas are active, right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request? After all, it has both shards due to the replication, right?
RE: SolrCloud loadbalancing, replication, and failover
But my concern is this, when we have just 2 servers: - I want 1 to be able to take over in case the other fails, as you point out. - But when *both* servers are up I don't want the SolrCloud load balancer to have Shard1 and Replica2 do the work (as they would both reside on the same physical server). Does that make sense? I want *both* server1 & server2 sharing the processing of every request, *and* I want the failover capability. I'm probably missing some bit of logic here, but I want to be sure I understand the architecture. Dave -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Thursday, April 18, 2013 8:13 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Correct. This is what you want if server 2 goes down. Otis Solr & ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:11 AM, "David Parks" wrote: > Step 1: distribute processing > > We have 2 servers in which we'll run 2 SolrCloud instances on. > > We'll define 2 shards so that both servers are busy for each request > (improving response time of the request). > > > > Step 2: Failover > > We would now like to ensure that if either of the servers goes down > (we're very unlucky with disks), that the other will be able to take > over automatically. > > So we define 2 shards with a replication factor of 2. > > > > So we have: > > . Server 1: Shard 1, Replica 2 > > . Server 2: Shard 2, Replica 1 > > > > Question: > > But in SolrCloud, replicas are active right? So isn't it now possible > that the load balancer will have Server 1 process *both* parts of a > request, after all, it has both shards due to the replication, right? > >
RE: SolrCloud loadbalancing, replication, and failover
I think I still don't understand something here. My concern right now is that query times are very slow for 120GB index (14s on avg), I've seen a lot of disk activity when running queries. I'm hoping that distributing that query across 2 servers is going to improve the query time, specifically I'm hoping that we can distribute that disk activity because we don't have great disks on there (yet). So, with disk IO being a factor in mind, running the query on one box, vs. across 2 *should* be a concern right? Admittedly, this is the first step in what will probably be many to try to work our query times down from 14s to what I want to be around 1s. Dave -Original Message- From: Timothy Potter [mailto:thelabd...@gmail.com] Sent: Thursday, April 18, 2013 9:16 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Hi Dave, This sounds more like a budget / deployment issue vs. anything architectural. You want 2 shards with replication so you either need sufficient capacity on each of your 2 servers to host 2 Solr instances or you need 4 servers. You need to avoid starving Solr of necessary RAM, disk performance, and CPU regardless of how you lay out the cluster otherwise performance will suffer. My guess is if each Solr had sufficient resources, you wouldn't actually notice much difference in query performance. Tim On Thu, Apr 18, 2013 at 8:03 AM, David Parks wrote: > But my concern is this, when we have just 2 servers: > - I want 1 to be able to take over in case the other fails, as you > point out. > - But when *both* servers are up I don't want the SolrCloud load > balancer to have Shard1 and Replica2 do the work (as they would both > reside on the same physical server). > > Does that make sense? I want *both* server1 & server2 sharing the > processing of every request, *and* I want the failover capability. > > I'm probably missing some bit of logic here, but I want to be sure I > understand the architecture. > > Dave > > > > -Original Message- > From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] > Sent: Thursday, April 18, 2013 8:13 PM > To: solr-user@lucene.apache.org > Subject: Re: SolrCloud loadbalancing, replication, and failover > > Correct. This is what you want if server 2 goes down. > > Otis > Solr & ElasticSearch Support > http://sematext.com/ > On Apr 18, 2013 3:11 AM, "David Parks" wrote: > > > Step 1: distribute processing > > > > We have 2 servers in which we'll run 2 SolrCloud instances on. > > > > We'll define 2 shards so that both servers are busy for each request > > (improving response time of the request). > > > > > > > > Step 2: Failover > > > > We would now like to ensure that if either of the servers goes down > > (we're very unlucky with disks), that the other will be able to take > > over automatically. > > > > So we define 2 shards with a replication factor of 2. > > > > > > > > So we have: > > > > . Server 1: Shard 1, Replica 2 > > > > . Server 2: Shard 2, Replica 1 > > > > > > > > Question: > > > > But in SolrCloud, replicas are active right? So isn't it now > > possible that the load balancer will have Server 1 process *both* > > parts of a request, after all, it has both shards due to the replication, right? > > > > > >
RE: SolrCloud loadbalancing, replication, and failover
Wow! That was the most pointed, concise discussion of hardware requirements I've seen to date, and it's fabulously helpful, thank you Shawn! We currently have 2 servers that I can dedicate about 12GB of RAM to Solr on (we're moving to these 2 servers now). I can upgrade further if it's needed & justified, and your discussion helps me justify that such an upgrade is the right thing to do.

So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I should be in the free and clear then, right? This seems reasonable and doable.

In this more extreme example the failover properties of SolrCloud become more clear. I couldn't possibly run a replica shard without doubling the memory, so really replication isn't reasonable until I have double the hardware; then the load balancing scheme makes perfect sense. With 3 servers, 50GB of RAM and a 120GB index, I should just back up the index directory I think. My previous thought to run replication just for failover would have actually resulted in LOWER performance because I would have halved the memory available to the master & replica. So the previous question is answered as well now.

Question: if I had 1 server with 60GB of memory and a 120GB index, would Solr make full use of the 60GB of memory, thus trimming disk access in half? Or is it an all-or-nothing thing? In a dev environment, I didn't notice Solr consuming the full 5GB of RAM assigned to it with a 120GB index.

Dave

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Friday, April 19, 2013 11:51 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/18/2013 8:12 PM, David Parks wrote:
> I think I still don't understand something here.
>
> My concern right now is that query times are very slow for 120GB index
> (14s on avg), I've seen a lot of disk activity when running queries.
>
> I'm hoping that distributing that query across 2 servers is going to
> improve the query time, specifically I'm hoping that we can distribute
> that disk activity because we don't have great disks on there (yet).
>
> So, with disk IO being a factor in mind, running the query on one box, vs.
> across 2 *should* be a concern right?
>
> Admittedly, this is the first step in what will probably be many to
> try to work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about your setup. One thing that I can't seem to find is a mention of how much total RAM is in each of your servers. I apologize if it was actually there and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or IO-bound. Solr is heavily reliant on the index on disk, and disk I/O is the slowest piece of the puzzle. The way to get good performance out of Solr is to have enough memory that you can take the disk mostly out of the equation by having the operating system cache the index in RAM. If you don't have enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy in iowait, unable to do much real work. If you DO have enough RAM to cache all (or most) of your index, then Solr will be CPU-bound.

With 120GB of total index data on each server, you would want at least 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and that Solr is the only thing running on the machine. If you have more servers and shards, you can reduce the per-server memory requirement because the amount of index data on each server would go down.
I am aware of the cost associated with this kind of requirement - each of my Solr servers has 64GB. If you are sharing the server with another program, then you want to have enough RAM available for Solr's heap, Solr's data, the other program's heap, and the other program's data. Some programs (like MySQL) completely skip the OS disk cache and instead do that caching themselves with heap memory that's actually allocated to the program. If you're using a program like that, then you wouldn't need to count its data. Using SSDs for storage can speed things up dramatically and may reduce the total memory requirement to some degree, but even an SSD is slower than RAM. The transfer speed of RAM is faster, and from what I understand, the latency is at least an order of magnitude quicker - nanoseconds vs microseconds. In another thread, you asked about how Google gets such good response times. Although Google's software probably works differently than Solr/Lucene, when it comes right down to it, all search engines do similar jobs and have similar requirements. I would imagine that Google gets incredible response time because they have incredible amounts of RAM at their disposal that keep the important bits of their index instantly availabl
RE: SolrCloud loadbalancing, replication, and failover
Interesting. I'm trying to correlate this new understanding to what I see on my servers. I've got one server with 5GB dedicated to solr, solr dashboard reports a 167GB index actually. When I do many typical queries I see between 3MB and 9MB of disk reads (watching iostat). But solr's dashboard only shows 710MB of memory in use (this box has had many hundreds of queries put through it, and has been up for 1 week). That doesn't quite correlate with my understanding that Solr would cache the index as much as possible. Should I be thinking that things aren't configured correctly here? Dave -Original Message- From: John Nielsen [mailto:j...@mcb.dk] Sent: Friday, April 19, 2013 2:35 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Well, to consume 120GB of RAM with a 120GB index, you would have to query over every single GB of data. If you only actually query over, say, 500MB of the 120GB data in your dev environment, you would only use 500MB worth of RAM for caching. Not 120GB On Fri, Apr 19, 2013 at 7:55 AM, David Parks wrote: > Wow! That was the most pointed, concise discussion of hardware > requirements I've seen to date, and it's fabulously helpful, thank you > Shawn! We currently have 2 servers that I can dedicate about 12GB of > ram to Solr on (we're moving to these 2 servers now). I can upgrade > further if it's needed & justified, and your discussion helps me > justify that such an upgrade is the right thing to do. > > So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I > should be in the free and clear then right? This seems reasonable and > doable. > > In this more extreme example the failover properties of solr cloud > become more clear. I couldn't possibly run a replica shard without > doubling the memory, so really replication isn't reasonable until I > have double the hardware, then the load balancing scheme makes perfect > sense. With 3 servers, 50GB of RAM and 120GB index I should just > backup the index directory I think. > > My previous though to run replication just for failover would have > actually resulted in LOWER performance because I would have halved the > memory available to the master & replica. So the previous question is > answered as well now. > > Question: if I had 1 server with 60GB of memory and 120GB index, would > solr make full use of the 60GB of memory? Thus trimming disk access in > half. Or is it an all-or-nothing thing? In a dev environment, I > didn't notice SOLR consuming the full 5GB of RAM assigned to it with a 120GB index. > > Dave > > > -Original Message- > From: Shawn Heisey [mailto:s...@elyograg.org] > Sent: Friday, April 19, 2013 11:51 AM > To: solr-user@lucene.apache.org > Subject: Re: SolrCloud loadbalancing, replication, and failover > > On 4/18/2013 8:12 PM, David Parks wrote: > > I think I still don't understand something here. > > > > My concern right now is that query times are very slow for 120GB > > index (14s on avg), I've seen a lot of disk activity when running queries. > > > > I'm hoping that distributing that query across 2 servers is going to > > improve the query time, specifically I'm hoping that we can > > distribute that disk activity because we don't have great disks on there (yet). > > > > So, with disk IO being a factor in mind, running the query on one > > box, > vs. > > across 2 *should* be a concern right? 
> > > > Admittedly, this is the first step in what will probably be many to > > try to work our query times down from 14s to what I want to be around 1s. > > I went through my mailing list archive to see what all you've said > about your setup. One thing that I can't seem to find is a mention of > how much total RAM is in each of your servers. I apologize if it was > actually there and I overlooked it. > > In one email thread, you wanted to know whether Solr is CPU-bound or > IO-bound. Solr is heavily reliant on the index on disk, and disk I/O > is the slowest piece of the puzzle. The way to get good performance > out of Solr is to have enough memory that you can take the disk mostly > out of the equation by having the operating system cache the index in > RAM. If you don't have enough RAM for that, then Solr becomes > IO-bound, and your CPUs will be busy in iowait, unable to do much real > work. If you DO have enough RAM to cache all (or most) of your index, > then Solr will be CPU-bound. > > With 120GB of total index data on each server, you would want at least > 128GB of RAM per server, assuming you are only giving 8-16GB of RAM
RE: SolrCloud loadbalancing, replication, and failover
Ok, I understand better now. The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey allocation of 602MB, and light grey of an additional 108MB, for a JVM total of 710MB allocated. If I understand correctly, Solr memory utilization is *not* for caching (unless I configured document caches or some of the other cache options in Solr, which don't seem to apply in this case, and I haven't altered from their defaults). So assuming this box was dedicated to 1 solr instance/shard. What JVM heap should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server instance). Would I be wise to just put the index on a RAM disk and guarantee performance? Assuming I installed sufficient RAM? Dave -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 4:19 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/19/2013 2:15 AM, David Parks wrote: > Interesting. I'm trying to correlate this new understanding to what I > see on my servers. I've got one server with 5GB dedicated to solr, > solr dashboard reports a 167GB index actually. > > When I do many typical queries I see between 3MB and 9MB of disk reads > (watching iostat). > > But solr's dashboard only shows 710MB of memory in use (this box has > had many hundreds of queries put through it, and has been up for 1 > week). That doesn't quite correlate with my understanding that Solr > would cache the index as much as possible. There are two memory sections on the dashboard. The one at the top shows the operating system view of physical memory. That is probably showing virtually all of it in use. Most UNIX platforms will show you the same info with 'top' or 'free'. Some of them, like Solaris, require different tools. I assume you're not using Windows, because you mention iostat. The other memory section is for the JVM, and that only covers the memory used by Solr. The dark grey section is the amount of Java heap memory currently utilized by Solr and its servlet container. The light grey section represents the memory that the JVM has allocated from system memory. If any part of that bar is white, then Java has not yet requested the maximum configured heap. Typically a long-running Solr install will have only dark and light grey, no white. The operating system is what caches your index, not Solr. The bulk of your RAM should be unallocated. With your index size, the OS will use all unallocated RAM for the disk cache. If a program requests some of that RAM, the OS will instantly give it up. Thanks, Shawn
RE: SolrCloud loadbalancing, replication, and failover
Wow, thank you for those benchmarks Toke, that really gives me some firm footing to stand on in knowing what to expect and thinking out which path to venture down. It's tremendously appreciated!

Dave

-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Friday, April 19, 2013 5:17 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
> Using SSDs for storage can speed things up dramatically and may reduce
> the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our clear experience that "to some degree" should be replaced with "very much" in the above.

Our current SSD-equipped servers each hold a total of 127GB of index data spread over 3 instances. The machines each have 16GB of RAM, of which about 7GB are left for disk cache. "We" are the State and University Library, Denmark, and our search engine is the primary (and arguably only) way to locate resources for our users. The average raw search time is 32ms for non-faceted queries and 616ms for heavy faceting (which is much too slow. Dang! I thought I fixed that).

> but even an SSD is slower than RAM. The transfer speed of RAM is
> faster, and from what I understand, the latency is at least an order
> of magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the fastest CPU possible, as it will be, well, faster than the slower ones. The question is almost never to get the fastest possible, but to get a good price/performance tradeoff. I would argue that SSDs fit that bill very well for a great deal of the "My search is too slow"-threads that are spun on this mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen
RE: SolrCloud loadbalancing, replication, and failover
Again, thank you for this incredible information, I feel on much firmer footing now. I'm going to test distributing this across 10 servers, borrowing a Hadoop cluster temporarily, and see how it does with enough memory to have the whole index cached. But I'm thinking that we'll try the SSD route as our index will probably rest in the 1/2 terabyte range eventually, there's still a lot of active development. I guess the RAM disk would work in our case also, as we only index in batches, and eventually I'd like to do that off of Solr and just update the index (I'm presuming this is doable in solr cloud, but I haven't put it to task yet). If I could purpose Hadoop to index the shards, that would be ideal, though I haven't quite figured out how to go about it yet. David -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 9:42 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/19/2013 3:48 AM, David Parks wrote: > The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has > dark grey allocation of 602MB, and light grey of an additional 108MB, > for a JVM total of 710MB allocated. If I understand correctly, Solr > memory utilization is > *not* for caching (unless I configured document caches or some of the > other cache options in Solr, which don't seem to apply in this case, > and I haven't altered from their defaults). Right. Solr does have caches, but they serve specific purposes. The OS is much better at general large-scale caching than Solr is. Solr caches get cleared (and possibly re-warmed) whenever you issue a commit on your index that makes new documents visible. > So assuming this box was dedicated to 1 solr instance/shard. What JVM > heap should I set? Does that matter? 24GB JVM heap? Or keep it lower > and ensure the OS cache has plenty of room to operate? (this is an > Ubuntu 12.10 server instance). The JVM heap to use is highly dependent on the nature of your queries, the number of documents, the number of unique terms, etc. The best thing to do is try it out with a relatively large heap, see how much memory actually gets used inside the JVM. The jvisualvm and jconsole tools will give you nice graphs of JVM memory usage. The jstat program will give you raw numbers on the commandline that you'll need to add to get the full picture. Due to the garbage collection model that Java uses, what you'll see is a sawtooth pattern - memory usage goes up to max heap, then garbage collection reduces it to the actual memory used. Generally speaking, you want to have more heap available than the "low" point of that sawtooth pattern. If that low point is around 3GB when you are hitting your index hard with queries and updates, then you would want to give Solr a heap of 4 to 6 GB. > Would I be wise to just put the index on a RAM disk and guarantee > performance? Assuming I installed sufficient RAM? A RAM disk is a very good way to guarantee performance - but RAM disks are ephemeral. Reboot or have an OS crash and it's gone, you'll have to reindex. Also remember that you actually need at *least* twice the size of your index so that Solr (Lucene) has enough room to do merges, and the worst-case scenario is *three* times the index size. Merging happens during normal indexing, not just when you optimize. If you have enough RAM for three times your index size and it takes less than an hour or two to rebuild the index, then a RAM disk might be a viable way to go. I suspect that this won't work for you. 
Thanks, Shawn
Bug? JSON output changes when switching to solr cloud
We just took an installation of 4.1 which was working fine and changed it to run as solr cloud. We encountered the most incredibly bizarre apparent bug: in the JSON output, a colon ':' changed to a comma ',', which of course broke the JSON parser. I'm guessing I should file this as a bug, but it was so odd I thought I'd post here before doing so. Demo below:

Here is a query on our previous single-server instance:

Query:
------
http://10.1.3.28:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalog_name&start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&group.limit=50

Response:
---------
{"responseHeader":{"status":0,"QTime":15714,"params":{"fl":"score,id,unique_catalog_name","start":"0","q":"book","group.limit":"50","group.field":"unique_catalog_name","group":"true","wt":"json","rows":"50"}},"grouped":{"unique_catalog_name":{"matches":106711214,"groups":[{"groupValue":"ls:2653","doclist":{"numFound":103981882,"start":0,"maxScore":4.7039795,"docs":[{"id":"1005502088784","score":4.7039795},{"id":"1005500291075","score":4.7039795},{"id":"1000810546074","score":4.7039795},{"id":"1000611003270","score":4.7039795},

Note this part:
---------------
{"unique_catalog_name":{"matches":

Now we run that same query on a server that was derived from the same build, just configuration changes to run it in distributed "solr cloud" mode.

Query:
------
http://10.1.3.18:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalog_name&start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&group.limit=50

Response:
---------
{"responseHeader":{"status":0,"QTime":8855,"params":{"fl":"score,id,unique_catalog_name","start":"0","q":"book","group.limit":"50","group.field":"unique_catalog_name","group":"true","wt":"json","rows":"50"}},"grouped":["unique_catalog_name",{"matches":106711214,"groups":[{"groupValue":"ls:2653","doclist":{"numFound":103981882,"start":0,"maxScore":4.7042913,"docs":[{"id":"1005502088784","score":4.7042913},{"id":"1000611003270","score":4.7042913},{"id":"1005500291075","score":4.703668},{"id":"1000810546074","score":4.703668},

Note how it's changed:
----------------------
"unique_catalog_name",{"matches":
RE: Bug? JSON output changes when switching to solr cloud
Thanks Yonik! That was fast! We switched over to XML for the moment and will switch back to JSON when 4.3 comes out. Dave -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, April 22, 2013 8:18 PM To: solr-user@lucene.apache.org Subject: Re: Bug? JSON output changes when switching to solr cloud Thanks David, I've confirmed this is still a problem in trunk and opened https://issues.apache.org/jira/browse/SOLR-4746 -Yonik http://lucidworks.com On Sun, Apr 21, 2013 at 11:16 PM, David Parks wrote: > We just took an installation of 4.1 which was working fine and changed > it to run as solr cloud. We encountered the most incredibly bizarre apparent bug: > > In the JSON output, a colon ':' changed to a comma ',', which of > course broke the JSON parser. I'm guessing I should file this as a > bug, but it was so odd I thought I'd post here before doing so. Demo below: > > Here is a query on our previous single-server instance: > > Query: > -- > http://10.1.3.28:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalo > g_name > &start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&gr > oup.li > mit=50 > > Response: > - > {"responseHeader":{"status":0,"QTime":15714,"params":{"fl":"score,id,u > nique_ > catalog_name","start":"0","q":"book","group.limit":"50","group.field": > "uniqu > e_catalog_name","group":"true","wt":"json","rows":"50"}},"grouped":{"u > nique_ > catalog_name":{"matches":106711214,"groups":[{"groupValue":"ls:2653"," > doclis > t":{"numFound":103981882,"start":0,"maxScore":4.7039795,"docs":[{"id": > "10055 > 02088784","score":4.7039795},{"id":"1005500291075","score":4.7039795},{"id": > "1000810546074","score":4.7039795},{"id":"1000611003270","score":4.703 > 9795}, > > Note this part: > -- > {"unique_catalog_name":{"matches": > > > > Now we run that same query on a server that was derived from the same > build, just configuration changes to run it in distributed "solr cloud" mode. > > Query: > - > http://10.1.3.18:8081/solr/select?q=book&fl=score%2Cid%2Cunique_catalo > g_name > &start=0&rows=50&wt=json&group=true&group.field=unique_catalog_name&gr > oup.li > mit=50 > > Response: > -{"responseHeader":{"status":0,"QTime":8855,"params":{"fl" > :"scor > e,id,unique_catalog_name","start":"0","q":"book","group.limit":"50","g > roup.f > ield":"unique_catalog_name","group":"true","wt":"json","rows":"50"}}," > groupe > d":["unique_catalog_name",{"matches":106711214,"groups":[{"groupValue" > :"ls:2 > 653","doclist":{"numFound":103981882,"start":0,"maxScore":4.7042913,"d > ocs":[ > {"id":"1005502088784","score":4.7042913},{"id":"1000611003270","score" > :4.704 > 2913},{"id":"1005500291075","score":4.703668},{"id":"1000810546074","score": > 4.703668}, > > Note how it's changed: > > "unique_catalog_name",{"matches": > > > >
Indexing off of the production servers
I've had trouble figuring out what options exist if I want to perform all indexing off of the production servers (I'd like to keep them only for user queries). We index data in batches roughly daily; ideally I'd index all solr cloud shards offline, then move the final index files to the solr cloud instance that needs them, flip a switch, and have it use the new index. Is this possible via either:

1. Doing the indexing in Hadoop? (this would be ideal as we already have a significant investment in a hadoop cluster), or
2. Maintaining a separate "master" server that handles indexing, from which the nodes that receive user queries pull their index (I seem to recall reading about this configuration in 3.x, but now we're using solr cloud)

Is there some ideal solution I can use to "protect" the production solr instances from degraded performance during large index processing periods? Thanks! David
RE: Indexing off of the production servers
I'm less concerned with fully utilizing a hadoop cluster (due to having fewer shards than I have hadoop reduce slots) as I am with just off-loading the whole indexing process. We may just want to re-index the whole thing to add some index time boosts or whatever else we conjure up to make queries faster and better quality. We're doing a lot of work on optimization right now. To re-index the whole thing is a 5-10 hour process for us, so when we move some update to production that requires full re-indexing (every week or so), right now we're just re-building new instances of solr to handle the re-indexing and then copying the final VMs to the production environment (slow process). I'm leery of letting a heavy duty full re-index process loose for 10 hours on production on a regular basis. It doesn't sound like there are any pre-built processes for doing this now though. I thought I had heard of master/slave hierarchy in 3.x that would allow us to designate a master to do indexing and let the slaves pull finished indexes from the master, so I thought maybe something like that followed into solr cloud. Eric might be right in that it's not worth the effort if there isn't some existing strategy. Dave -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Monday, May 06, 2013 7:06 PM To: solr-user@lucene.apache.org Subject: Re: Indexing off of the production servers Hi Erick; I think that even if you use Map/Reduce you will not parallelize you indexing because indexing will parallelize as much as how many leaders you have at your SolrCloud, isn't it? 2013/5/6 Erick Erickson > The only problem with using Hadoop (or whatever) is that you need to > be sure that documents end up on the same shard, which means that you > have to use the same routing mechanism that SolrCloud uses. The custom > doc routing may help here > > My very first question, though, would be whether this is necessary. > It might be sufficient to just throttle the rate of indexing, or just > do the indexing during off hours or Have you measured an indexing > degradation during your heavy indexing? Indexing has costs, no > question, but it's worth asking whether the costs are heavy enough to > be worth the bother.. > > Best > Erick > > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI > wrote: > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you > > use Map/Reduce jobs you split your workload, process it, and then > > reduce step takes into account. Let me explain you new SolrCloud > > architecture. You start your SolrCluoud with a numShards parameter. > > Let's assume that you have 5 shards. Then you will have 5 leader at > > your SolrCloud. These > leaders > > will be responsible for indexing your data. It means that your > > indexing workload will divided into 5 so it means that you have > > parallelized your data as like Map/Reduce jobs. > > > > Let's assume that you have added 10 new Solr nodes into your SolrCloud. > > They will be added as a replica for each shard. Then you will have 5 > > shards, 5 leaders of them and every shard has 2 replica. When you > > send a query into a SolrCloud every replica will help you for > > searching and if > you > > add more replicas to your SolrCloud your search performance will improve. > > > > > > 2013/5/6 David Parks > > > >> I've had trouble figuring out what options exist if I want to > >> perform > all > >> indexing off of the production servers (I'd like to keep them only > >> for > user > >> queries). 
> >> > >> > >> > >> We index data in batches roughly daily, ideally I'd index all solr > >> cloud shards offline, then move the final index files to the solr > >> cloud > instance > >> that needs it and flip a switch and have it use the new index. > >> > >> > >> > >> Is this possible via either: > >> > >> 1. Doing the indexing in Hadoop?? (this would be ideal as we have > a > >> significant investment in a hadoop cluster already), or > >> > >> 2. Maintaining a separate "master" server that handles indexing > and > >> the nodes that receive user queries update their index from there > >> (I > seem > >> to > >> recall reading about this configuration in 3.x, but now we're using > >> solr > >> cloud) > >> > >> > >> > >> Is there some ideal solution I can use to "protect" the production > >> solr instances from degraded performance during large index > >> processing > periods? > >> > >> > >> > >> Thanks! > >> > >> David > >> > >> >
RE: Indexing off of the production servers
So, am I following this correctly by saying that, this proposed solution would present us a way to index a collection on an offline/dev solr cloud instance and *move* that pre-prepared index to the production server using an alias/rename trick? That seems like a reasonably doable solution. I also wonder how much work it is to build the shards programmatically (e.g. directly in a hadoop/java environment), cutting out the extra step of needing another solr instances running on a staging environment somewhere. Then using this technique to swap in the shards. I might do something like this first and then look into simplifying, and further automating, later on. And if it is indeed possible to build a hadoop driver for indexing, I think that would be a useful tool for the community at large. So I'm still curious about it, at least as a thought exercise, if nothing else. Thanks, Dave -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Monday, May 06, 2013 9:44 PM To: solr-user@lucene.apache.org Subject: Re: Indexing off of the production servers Hi Erick; Thanks for your answer. I have read that at somewhere: I believe "redirect" from replica to leader would happen only at index time, so a doc first gets indexed to leader and from there it's replicated to non-leader shards. Is that true? I want to make clear the things in my mind otherwise I want to ask a separate question about what happens for indexing and querying at SolrCloud. 2013/5/6 Shawn Heisey > On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote: > > Excellent idea ! > > And it is possible to use collection aliasing with the CREATEALIAS > > to make this transparent for the query side. > > > > ex. with 2 collections named : > > collection_1 > > collection_2 > > > > > /collections?action=CREATEALIAS&name=collectionalias&collections=colle > ction_1 > > > > "collectionalias" is now a virtual collection pointing to collection_1. > > > > Index on collection_2, then : > > > /collections?action=CREATEALIAS&name=collectionalias&collections=colle > ction_2 > > > > "collectionalias" now is an alias to collection_2. > > > > > http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Col > lections_API > > Awesome idea, Andre! I was wondering whether you might have to delete > the original alias before creating the new one, but a quick look at > the issue for collection aliasing shows that this isn't the case. > > https://issues.apache.org/jira/browse/SOLR-4497 > > The wiki doesn't mention the DELETEALIAS action. I won't have time > right now to update the wiki. > > Thanks, > Shawn > >
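For reference, the full swap cycle being described would look roughly like this (a sketch only; the collection and alias names come from Andre's example, and this assumes a Solr version recent enough to have Collections API aliasing, per SOLR-4497):

1. Index the new data into the offline collection, e.g. collection_2, on the staging/dev cluster.
2. Repoint the alias that the query side uses:
   /admin/collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
3. Queries against "collectionalias" now hit collection_2, and collection_1 is free to be rebuilt for the next cycle.

As Shawn notes above, re-issuing CREATEALIAS with an existing alias name simply repoints it, so no DELETEALIAS is needed between swaps.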
RE: Solr Cloud with large synonyms.txt
Wouldn't it make more sense to only store a pointer to a synonyms file in zookeeper? Maybe just make the synonyms file accessible via http so other boxes can copy it if needed? Zookeeper was never meant for storing significant amounts of data. -Original Message- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: Tuesday, May 07, 2013 4:35 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud with large synonyms.txt See discussion here http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html One idea was compression. Perhaps if we add gzip support to SynonymFilter it can read synonyms.txt.gz which would then fit larger raw dicts? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 6. mai 2013 kl. 18:32 skrev Son Nguyen : > Hello, > > I'm building a Solr Cloud (version 4.1.0) with 2 shards and a Zookeeper (the Zookeeer is on different machine, version 3.4.5). > I've tried to start with a 1.7MB synonyms.txt, but got a "ConnectionLossException": > Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt >at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) >at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) >at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266) >at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270) >at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267) >at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java :65) >at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267) >at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436) >at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315) >at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135) >at org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955) >at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285) >... 43 more > > I did some researches on internet and found out that because Zookeeper znode size limit is 1MB. I tried to increase the system property "jute.maxbuffer" but it won't work. > Does anyone have experience of dealing with it? > > Thanks, > Son
RE: Solr Cloud with large synonyms.txt
I can see your point, though I think edge cases would be one concern, if someone *can* create a very large synonyms file, someone *will* create that file. What would you set the zookeeper max data size to be? 50MB? 100MB? Someone is going to do something bad if there's nothing to tell them not to. Today solr cloud just crashes if you try to create a modest sized synonyms file, clearly at a minimum some zookeeper settings should be configured out of the box. Any reasonable setting you come up with for zookeeper is virtually guaranteed to fail for some percentage of users over a reasonably sized user-base (which solr has). What if I plugged in a 200MB synonyms file just for testing purposes (I don't care about performance implications)? I don't think most users would catch the footnote in the docs that calls out a max synonyms file size. Dave -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, May 07, 2013 11:53 PM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud with large synonyms.txt I'm not so worried about the large file in zk issue myself. The concern is that you start storing and accessing lots of large files in ZK. This is not what it was made for, and everything stays in RAM, so they guard against this type of usage. We are talking about a config file that is loaded on Core load though. It's uploaded and read very rarely. On modern hardware and networks, making that file 5MB rather than 1MB is not going to ruin your day. It just won't. Solr does not use ZooKeeper heavily - in a steady state cluster, it doesn't read or write from ZooKeeper at all to any degree that registers. I'm going to have to see problems loading these larger config files from ZooKeeper before I'm worried that it's a problem. - Mark On May 7, 2013, at 12:21 PM, Son Nguyen wrote: > Mark, > > I tried to set that property on both ZK (I have only one ZK instance) and Solr, but it still didn't work. > But I read somewhere that ZK is not really designed for keeping large data files, so this solution - increasing jute.maxbuffer (if I can implement it) should be just temporary. > > Son > > -Original Message- > From: Mark Miller [mailto:markrmil...@gmail.com] > Sent: Tuesday, May 07, 2013 9:35 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Cloud with large synonyms.txt > > > On May 7, 2013, at 10:24 AM, Mark Miller wrote: > >> >> On May 6, 2013, at 12:32 PM, Son Nguyen wrote: >> >>> I did some researches on internet and found out that because Zookeeper znode size limit is 1MB. I tried to increase the system property "jute.maxbuffer" but it won't work. >>> Does anyone have experience of dealing with it? >> >> Perhaps hit up the ZK list? They doc it as simply raising jute.maxbuffer, though you have to do it for each ZK instance. >> >> - Mark >> > > "the system property must be set on all servers and clients otherwise problems will arise." > > Make sure you try passing it both to ZK *and* to Solr. > > - Mark >
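For anyone else who hits the 1MB znode limit: the setting Mark refers to is a JVM system property, and per the ZooKeeper docs it must be set on the ZooKeeper server(s) and on every client (i.e. every Solr node). A sketch of how that might look, with the 4MB value purely as an example:

  # ZooKeeper side (JVMFLAGS is picked up by zkServer.sh):
  export JVMFLAGS="-Djute.maxbuffer=4194304"

  # Solr side, added to the startup command:
  java -Djute.maxbuffer=4194304 -Djetty.port=8080 -Dsolr.solr.home=/opt/solr -jar start.jar

Son's report above suggests this still may not take effect in every setup, so treat it as a workaround rather than a fix.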
RE: More Like This and Caching
I'm not the expert here, but perhaps what you're noticing is actually the OS's disk cache. The actual solr index isn't cached by solr, but as you read the blocks off disk the OS disk cache probably did cache those blocks for you; on the 2nd run the index blocks were read out of memory. There was a very extensive discussion on this list not long back titled "Re: SolrCloud loadbalancing, replication, and failover" - look that thread up and you'll get a lot of in-depth discussion of the topic. David -Original Message- From: Giammarco Schisani [mailto:giamma...@schisani.com] Sent: Thursday, May 09, 2013 2:59 PM To: solr-user@lucene.apache.org Subject: More Like This and Caching Hi all, Could anybody explain which Solr cache (e.g. queryResultCache, documentCache, fieldCache, etc.) can be used by the More Like This handler? One of my colleagues had previously suggested that the More Like This handler does not take advantage of any of the Solr caches. However, if I issue two identical MLT requests to the same Solr instance, the second request will execute much faster than the first request (for example, the first request will execute in 200ms and the second request will execute in 20ms). This makes me believe that at least one of the Solr caches is being used by the More Like This handler. I think the "documentCache" is the cache that is most likely being used, but would you be able to confirm? As information, I am currently using Solr version 3.6.1. Kind regards, Giammarco Schisani
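For anyone wanting to confirm which cache is actually getting the hits, one quick check (a sketch, assuming a 4.x instance where the mbeans handler is available; on 3.6 the same counters are on the admin statistics page):

  http://localhost:8983/solr/admin/mbeans?stats=true&cat=CACHE&wt=json

Run the same MLT request twice and compare the documentCache and queryResultCache lookups/hits counters before and after the second run.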
RE: Is the CoreAdmin RENAME method atomic?
Find the discussion titled "Indexing off the production servers" just a week ago in this same forum; there is a significant discussion of this feature that you will probably want to review. -Original Message- From: Lan [mailto:dung@gmail.com] Sent: Friday, May 10, 2013 3:42 AM To: solr-user@lucene.apache.org Subject: Is the CoreAdmin RENAME method atomic? We need to implement a locking mechanism for a full-reindexing SOLR server pool. We could use a database or Zookeeper as our locking mechanism, but that's a lot of work. Could solr do it? I noticed the core admin RENAME function (http://wiki.apache.org/solr/CoreAdmin#RENAME). Is this a synchronous atomic operation? What I'm thinking is we create a solr core named 'lock' and any process that wants to obtain a solr server from the pool tries to rename the 'lock' core to say 'lock.someuniqueid'. If it fails, then it tries another server in the pool or waits a bit. If it succeeds, it reindexes its data and then renames 'lock.someuniqueid' back to 'lock' to return the server back to the pool. -- View this message in context: http://lucene.472066.n3.nabble.com/Is-the-CoreAdmin-RENAME-method-atomic-tp4061944.html Sent from the Solr - User mailing list archive at Nabble.com.
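For reference, the rename call being proposed would look something like this (a sketch using the CoreAdmin HTTP API; host, port and core names are placeholders):

  http://localhost:8983/solr/admin/cores?action=RENAME&core=lock&other=lock.someuniqueid

Renaming it back swaps the core and other parameters. Whether the operation is atomic enough to serve as a lock is the open question; see the thread referenced above.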
Boosting documents with terms derived from clustering - good idea?
We have a number of queries that produce good results based on the textual data, but are contextually wrong (for example, an "SSD hard drive" search matches the music album "SSD hip hop drives us crazy"). Textually it's a fair match, but SSD is a term that strongly relates to technical documents, and we'd like to be able to direct this query more strictly in the direction of the technical documents based on the term "SSD". I am considering whether it would be worth trying to cluster all documents, thus tending to group the music with the music and the tech items with the tech items; then pulling out the term vectors that define each cluster, doing a human review of that data, and plugging it back into the documents of each cluster as a separate search field that gets boosted. In my head it seems like a plausible way to weight terms like SSD toward the cluster of items they most closely associate with. Should I spend the effort to find out? Yea or nay?
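If the cluster-derived terms were indexed into a separate field, the boost itself is the easy part; a minimal sketch with edismax, where both field names (text and cluster_terms) are hypothetical placeholders for whatever the schema and the offline clustering step actually produce:

  http://localhost:8080/solr/select?defType=edismax&q=SSD+hard+drive&qf=text+cluster_terms^3.0

The real question is whether the clustering output is clean enough to be worth indexing, not whether Solr can boost it once it's there.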
MoreLikeThis supporting multiple document IDs as input?
I'm unclear on this point from the documentation. Is it possible to give Solr X # of document IDs and tell it that I want documents similar to those X documents? Example: - The user is browsing 5 different articles - I send Solr the IDs of these 5 articles so I can present the user other similar articles I see this example for sending it 1 document ID: http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10 But can I send it 2+ document IDs as the query?
RE: MoreLikeThis supporting multiple document IDs as input?
Someone else suggested this query: q=id:[1001 OR 1002], where the numbers represent multiple IDs, but if I get it, you're saying that these ultimate get turned into just one document and we get similar documents to just that one. MoreLikeThese sounds promising. Is this in one of the development builds, or is it just and addon I need to install? I haven't done much customization of Solr yet. Thanks! Dave -Original Message- From: Roman Chyla [mailto:roman.ch...@gmail.com] Sent: Wednesday, December 26, 2012 3:57 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis supporting multiple document IDs as input? Jay Luker has written MoreLikeThese which is probably what you want. You may give it a try, though I am not sure if it works with Solr4.0 at this point (we didn't port it yet) https://github.com/romanchyla/montysolr/blob/MLT/contrib/adsabs/src/java/org/apache/solr/handler/MoreLikeTheseHandler.java roman On Wed, Dec 26, 2012 at 12:06 AM, Jack Krupansky wrote: > MLT has both a request handler and a search component. > > The MLT handler returns similar documents only for the first document > that the query matches. > > The MLT search component returns similar documents for each of the > documents in the search results, but processes each search result base > document one at a time and keeps its similar documents segregated by > each of the base documents. > > It sounds like you wanted to merge the base search results and then > find documents similar to that merged super-document. Is that what you > were really seeking, as opposed to what the MLT component does? > Unfortunately, you can't do that with the components as they are. > > You would have to manually merge the values from the base documents > and then you could POST that text back to the MLT handler and find > similar documents using the posted text rather than a query. Kind of > messy, but in theory that should work. > > -- Jack Krupansky > > -Original Message- From: David Parks > Sent: Tuesday, December 25, 2012 5:04 AM > To: solr-user@lucene.apache.org > Subject: MoreLikeThis supporting multiple document IDs as input? > > > I'm unclear on this point from the documentation. Is it possible to > give Solr X # of document IDs and tell it that I want documents > similar to those X documents? > > Example: > > - The user is browsing 5 different articles > - I send Solr the IDs of these 5 articles so I can present the user > other similar articles > > I see this example for sending it 1 document ID: > http://localhost:8080/solr/**select/?qt=mlt&q=id:[document<http://loca > lhost:8080/solr/select/?qt=mlt&q=id:[document> > id]&mlt.fl=[field1],[field2],[**field3]&fl=id&rows=10 > > But can I send it 2+ document IDs as the query? >
RE: solr + jetty deployment issue
Do you see any errors coming in on the console, stderr? I start solr this way and redirect stdout and stderr to log files; when I have a problem, stderr generally has the answer:

java \
  -server \
  -Djetty.port=8080 \
  -Dsolr.solr.home=/opt/solr \
  -Dsolr.data.dir=/mnt/solr_data \
  -jar /opt/solr/start.jar >/opt/solr/logs/stdout.log 2>/opt/solr/logs/stderr.log &

-Original Message- From: Sushrut Bidwai [mailto:bidwai.sush...@gmail.com] Sent: Thursday, December 27, 2012 7:40 PM To: solr-user@lucene.apache.org Subject: solr + jetty deployment issue Hi, I am having trouble getting solr + jetty to work. I am following all instructions to the letter from http://wiki.apache.org/solr/SolrJetty. I also created a work folder - /opt/solr/work. I am also setting tmpdir to a new path in /etc/default/jetty. I am confirming the tmpdir is set to the new path from the admin dashboard, under args. It works like a charm. But when I restart jetty multiple times, after 3/4 such restarts it starts hanging. Admin pages just don't load and my app fails to acquire a connection with solr. What might I be missing? Should I be rather looking at my code and checking whether I am committing correctly? Please let me know if you have faced a similar issue in the past and how to tackle it. Thank you. -- Best Regards, Sushrut
MoreLikeThis only returns 1 result
I'm doing a query like this for MoreLikeThis, sending it a document ID. But the only result I ever get back is the document ID I sent it. The debug response is below. If I read it correctly, it's taking "id:1004401713626" as the term (not the document ID) and only finding it once. But I want it to match the document with ID 1004401713626 of course. I tried &q=id[1004410713626], but that generates an exception: Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'id:[1004401713626]': Encountered " "]" "] "" at line 1, column 17. Was expecting one of: "TO" ... ... ... This must be easy, but the documentation is minimal.

My Query:
http://107.23.102.164:8080/solr/select/?qt=mlt&q=id:[1004401713626]&rows=10&mlt.fl=item_name,item_brand,short_description,long_description,catalog_names,categories,keywords,attributes,facetime&mlt.mintf=2&mlt.mindf=5&mlt.maxqt=100&mlt.boost=false&debugQuery=true

0 1 5 item_name,item_brand,short_description,long_description,catalog_names,categories,keywords,attributes,facetime false true id:1004401713626 2 100 mlt 10 0 1004401713626 id:1004401713626 id:1004401713626 id:1004401713626 id:1004401713626 18.29481 = (MATCH) fieldWeight(id:1004401713626 in 2843152), product of: 1.0 = tf(termFreq(id:1004401713626)=1) 18.29481 = idf(docFreq=1, maxDocs=64873893) 1.0 = fieldNorm(field=id, doc=2843152)
RE: MoreLikeThis only returns 1 result
Ok, that worked, I had the /mlt request handler misconfigured (forgot a '/'). It's working now. Thanks! -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, December 28, 2012 11:38 AM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis only returns 1 result Sounds like it is simply dispatching to the normal search request handler. Although you specified qt=mlt, make sure you enable the legacy select handler dispatching in solrconfig.xml. Change: to Or, simply address the MLT handler directly: http://107.23.102.164:8080/solr/mlt?q=... Or, use the MoreLikeThis search component: http://localhost:8983/solr/select?q=...&mlt=true&;... See: http://wiki.apache.org/solr/MoreLikeThis -- Jack Krupansky -Original Message- From: David Parks Sent: Thursday, December 27, 2012 9:59 PM To: solr-user@lucene.apache.org Subject: MoreLikeThis only returns 1 result I'm doing a query like this for MoreLikeThis, sending it a document ID. But the only result I ever get back is the document ID I sent it. The debug response is below. If I read it correctly, it's taking "id:1004401713626" as the term (not the document ID) and only finding it once. But I want it to match the document with ID 1004401713626 of course. I tried &q=id[1004410713626], but that generates an exception: Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'id:[1004401713626]': Encountered " "]" "] "" at line 1, column 17. Was expecting one of: "TO" ... ... ... This must be easy, but the documentation is minimal. My Query: http://107.23.102.164:8080/solr/select/?qt=mlt&q=id:[1004401713626]&rows=10&; mlt.fl=item_name,item_brand,short_description,long_description,catalog_names ,categories,keywords,attributes,facetime&mlt.mintf=2&mlt.mindf=5&mlt.maxqt=1 00&mlt.boost=false&debugQuery=true 0 1 5 item_name,item_brand,short_description,long_description,catalog_names,catego ries,keywords,attributes,facetime false true id:1004401713626 2 100 mlt 10 0 1004401713626 id:1004401713626 id:1004401713626 id:1004401713626 id:1004401713626 18.29481 = (MATCH) fieldWeight(id:1004401713626 in 2843152), product of: 1.0 = tf(termFreq(id:1004401713626)=1) 18.29481 = idf(docFreq=1, maxDocs=64873893) 1.0 = fieldNorm(field=id, doc=2843152)
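For anyone else hitting the same thing, the two solrconfig.xml pieces involved are roughly these (a sketch only, based on the stock 4.x example config; adjust names as needed):

  <!-- lets qt=mlt style dispatching work through /select -->
  <requestDispatcher handleSelect="true"> ... </requestDispatcher>

  <!-- or register a dedicated handler and call it directly as /solr/mlt -->
  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

Either approach avoids relying on qt= dispatching through the default /select handler.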
RE: MoreLikeThis supporting multiple document IDs as input?
I'm somewhat new to Solr (it's running, I've been through the books, but I'm no master). What I hear you say is that MLT *can* accept, say 5, documents and provide results, but the results would essentially be the same as running the query 5 times for each document? If that's the case, I might accept it. I would just have to merge them together at the end (perhaps I'd take the top 2 of each result, for example). Being somewhat new I'm a little confused by the difference between a "Search Component" and a "Handler". I've got the /mlt handler working and I'm using that. But how's that different from a "Search Component"? Is that referring to the default /solr/select?q="..." style query? And if what I said about multiple documents above is correct, what's the syntax to try that out? Thanks very much for the great help! Dave -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, December 26, 2012 12:07 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis supporting multiple document IDs as input? MLT has both a request handler and a search component. The MLT handler returns similar documents only for the first document that the query matches. The MLT search component returns similar documents for each of the documents in the search results, but processes each search result base document one at a time and keeps its similar documents segregated by each of the base documents. It sounds like you wanted to merge the base search results and then find documents similar to that merged super-document. Is that what you were really seeking, as opposed to what the MLT component does? Unfortunately, you can't do that with the components as they are. You would have to manually merge the values from the base documents and then you could POST that text back to the MLT handler and find similar documents using the posted text rather than a query. Kind of messy, but in theory that should work. -- Jack Krupansky -Original Message- From: David Parks Sent: Tuesday, December 25, 2012 5:04 AM To: solr-user@lucene.apache.org Subject: MoreLikeThis supporting multiple document IDs as input? I'm unclear on this point from the documentation. Is it possible to give Solr X # of document IDs and tell it that I want documents similar to those X documents? Example: - The user is browsing 5 different articles - I send Solr the IDs of these 5 articles so I can present the user other similar articles I see this example for sending it 1 document ID: http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10 But can I send it 2+ document IDs as the query?
RE: MoreLikeThis supporting multiple document IDs as input?
So the Search Components are executed in series an _every_ request. I presume then that they look at the request parameters and decide what and whether to take action. So in the case of the MLT component this was said: > The MLT search component returns similar documents for each of the > documents in the search results, but processes each search result base > document one at a time and keeps its similar documents segregated by > each of the base documents. So what I think I understand is that the Query Component (presumably this guy: org.apache.solr.handler.component.QueryComponent) takes the input from the "q" parameter and returns a result (the "q=id:123456" ensure that the Query Component will return just this one document). The MltComponent then looks at the result from the QueryComponent and generates its results. The part that is still confusing is understanding the difference between these two comments: - The MLT search component returns similar documents for each of the documents in the search results - The MLT handler returns similar documents only for the first document that the query matches. -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, December 28, 2012 1:26 PM To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis supporting multiple document IDs as input? Hi Dave, Think of search components as a chain of Java classes that get executed during each search request. If you open solrconfig.xml you will see how they are defined and used. HTH Otis Solr & ElasticSearch Support http://sematext.com/ On Dec 28, 2012 12:06 AM, "David Parks" wrote: > I'm somewhat new to Solr (it's running, I've been through the books, > but I'm no master). What I hear you say is that MLT *can* accept, say > 5, documents and provide results, but the results would essentially be > the same as running the query 5 times for each document? > > If that's the case, I might accept it. I would just have to merge them > together at the end (perhaps I'd take the top 2 of each result, for > example). > > Being somewhat new I'm a little confused by the difference between a > "Search Component" and a "Handler". I've got the /mlt handler working > and I'm using that. But how's that different from a "Search > Component"? Is that referring to the default /solr/select?q="..." > style query? > > And if what I said about multiple documents above is correct, what's > the syntax to try that out? > > Thanks very much for the great help! > Dave > > > -Original Message- > From: Jack Krupansky [mailto:j...@basetechnology.com] > Sent: Wednesday, December 26, 2012 12:07 PM > To: solr-user@lucene.apache.org > Subject: Re: MoreLikeThis supporting multiple document IDs as input? > > MLT has both a request handler and a search component. > > The MLT handler returns similar documents only for the first document > that the query matches. > > The MLT search component returns similar documents for each of the > documents in the search results, but processes each search result base > document one at a time and keeps its similar documents segregated by > each of the base documents. > > It sounds like you wanted to merge the base search results and then > find documents similar to that merged super-document. Is that what you > were really seeking, as opposed to what the MLT component does? > Unfortunately, you can't do that with the components as they are. 
> > You would have to manually merge the values from the base documents > and then you could POST that text back to the MLT handler and find > similar documents using the posted text rather than a query. Kind of > messy, but in theory that should work. > > -- Jack Krupansky > > -Original Message- > From: David Parks > Sent: Tuesday, December 25, 2012 5:04 AM > To: solr-user@lucene.apache.org > Subject: MoreLikeThis supporting multiple document IDs as input? > > I'm unclear on this point from the documentation. Is it possible to > give Solr X # of document IDs and tell it that I want documents > similar to those X documents? > > Example: > > - The user is browsing 5 different articles > - I send Solr the IDs of these 5 articles so I can present the user > other similar articles > > I see this example for sending it 1 document ID: > http://localhost:8080/solr/select/?qt=mlt&q=id:[document > id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10 > > But can I send it 2+ document IDs as the query? > >
What do I need to research to solve the problem of returning good results for a generic term?
I'm sure this is a complex problem requiring many iterations of work, so I'm just looking for pointers in the right direction of research here. I have a base term, let's say "black dress", that I might search for. Someone searching on this term is most logically looking for black dresses. In my dataset I have black dresses, but I also have many CDs with the term "black dress" in them (it's not such an uncommon song title). I would want the CDs to show up if I search for a more specific term like "black dress CD", but I would want the black dresses to show up for the less specific term "black dress". Google image search is excellent at handling this example. A pretty vanilla installation of Solr isn't yet great at it. So, just looking for a nudge in the right direction here: what should I go read up on first to start learning how to improve on these results?
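Not a complete answer, but one concrete thing worth reading up on first is the (e)dismax parser's field and phrase boosting; a minimal sketch, where the field names and boosts are placeholders rather than a recommendation:

  http://localhost:8080/solr/select?defType=edismax&q=black+dress&qf=item_name^3+long_description&pf=item_name^10

qf spreads the query across several weighted fields, and pf adds an extra boost when the whole phrase appears together in a field, which tends to favor actual dress listings over documents that merely contain both words somewhere. Category-level boosting is the other half of the problem.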
RE: MoreLikeThis supporting multiple document IDs as input?
I'm not seeing the results I would expect. In the previous email below it's stated that the "MLT search component" returns N results and K similar documents per EACH of the N results. If I'm not mistaken I access the "MLT search component" via a query to /solr/select/?qt=mlt, such as this: http://10.0.0.1:8080/solr/select/?qt=mlt&terms=true&q=shoes&rows=3 The query above for a simple term such as "shoes" can return many documents. But I limited the results to 3, and I see 3 results, and the results don't appear to me any different than doing this query: http://107.23.102.164:8080/solr/select/?q=shoes&rows=3 So that suggests to me that solr maybe isn't handing things off to the MLT component as expected (I don't know what results to expect so it's hard for me to know where I'm trying to get to). So add in a debugQuery=on parameter and I see this, possibly useful reference: LuceneQParser It also appears that the MoreLikeThisComponent did indeed run So maybe I should ask exactly what results I should be expecting here? Thanks very much! David -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, December 28, 2012 8:13 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis supporting multiple document IDs as input? Try a query that returns multiple results and you will see the difference. MLT search component: n results, k similar documents per EACH of the n results MLT request handler: only FIRST result is examined, so only k similar documents for that ONE (first) TOP search result. Are you really saying that you don't comprehend what the difference is, or simply that you don't LIKE the difference?! Or, maybe that you are wondering WHY they are different? That latter question I don't have the answer to. -- Jack Krupansky -Original Message- From: David Parks Sent: Friday, December 28, 2012 2:48 AM To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis supporting multiple document IDs as input? So the Search Components are executed in series an _every_ request. I presume then that they look at the request parameters and decide what and whether to take action. So in the case of the MLT component this was said: > The MLT search component returns similar documents for each of the > documents in the search results, but processes each search result base > document one at a time and keeps its similar documents segregated by > each of the base documents. So what I think I understand is that the Query Component (presumably this guy: org.apache.solr.handler.component.QueryComponent) takes the input from the "q" parameter and returns a result (the "q=id:123456" ensure that the Query Component will return just this one document). The MltComponent then looks at the result from the QueryComponent and generates its results. The part that is still confusing is understanding the difference between these two comments: - The MLT search component returns similar documents for each of the documents in the search results - The MLT handler returns similar documents only for the first document that the query matches. -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, December 28, 2012 1:26 PM To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis supporting multiple document IDs as input? Hi Dave, Think of search components as a chain of Java classes that get executed during each search request. If you open solrconfig.xml you will see how they are defined and used. 
HTH Otis Solr & ElasticSearch Support http://sematext.com/ On Dec 28, 2012 12:06 AM, "David Parks" wrote: > I'm somewhat new to Solr (it's running, I've been through the books, > but I'm no master). What I hear you say is that MLT *can* accept, say > 5, documents and provide results, but the results would essentially be > the same as running the query 5 times for each document? > > If that's the case, I might accept it. I would just have to merge them > together at the end (perhaps I'd take the top 2 of each result, for > example). > > Being somewhat new I'm a little confused by the difference between a > "Search Component" and a "Handler". I've got the /mlt handler working > and I'm using that. But how's that different from a "Search > Component"? Is that referring to the default /solr/select?q="..." > style query? > > And if what I said about multiple documents above is correct, what's > the syntax to try that out? > > Thanks very much for the great help! > Dave > > > -Original Message- > From: Jack Krupansky [mailto:j...@basetechnology.com] > Sent: Wednesday, December 26
RE: MoreLikeThis supporting multiple document IDs as input?
Aha! &mlt=true, that was the key I hadn't worked out before (thought it was &qt=mlt that achieved that), things are looking rosy now, and these results are a perfect fit for my needs. Thanks very much for your time to help explain this!! David -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, January 03, 2013 8:46 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis supporting multiple document IDs as input? The MLT search component is enabled using &mlt=true and works on any normal Solr query. It gives a batch of similar documents for each search result of the original query, one batch per original query result. It uses the &mlt.count=n parameter to control how many similar results to return for each original query result. The MLT request handler is a standalone request handler that does a query, takes the first result, and then returns one batch of documents that are similar to that one document. You have to configure the handler yourself, but typically it would have the name "/mlt", so you would write: http://10.0.0.1:8080/solr/mlt/?q=shoes&rows=3 It will show you both the single document from the original query and then the batch of documents that are most similar to the top terms from that one original document. Add &debugQuery=true or &debug=query or &debug=results to see the terms that are used in the secondary queries that find the similar documents. There are a bunch a parameters that you have to tune for either approach. -- Jack Krupansky -Original Message- From: David Parks Sent: Thursday, January 03, 2013 4:11 AM To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis supporting multiple document IDs as input? I'm not seeing the results I would expect. In the previous email below it's stated that the "MLT search component" returns N results and K similar documents per EACH of the N results. If I'm not mistaken I access the "MLT search component" via a query to /solr/select/?qt=mlt, such as this: http://10.0.0.1:8080/solr/select/?qt=mlt&terms=true&q=shoes&rows=3 The query above for a simple term such as "shoes" can return many documents. But I limited the results to 3, and I see 3 results, and the results don't appear to me any different than doing this query: http://107.23.102.164:8080/solr/select/?q=shoes&rows=3 So that suggests to me that solr maybe isn't handing things off to the MLT component as expected (I don't know what results to expect so it's hard for me to know where I'm trying to get to). So add in a debugQuery=on parameter and I see this, possibly useful reference: LuceneQParser It also appears that the MoreLikeThisComponent did indeed run So maybe I should ask exactly what results I should be expecting here? Thanks very much! David -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, December 28, 2012 8:13 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis supporting multiple document IDs as input? Try a query that returns multiple results and you will see the difference. MLT search component: n results, k similar documents per EACH of the n results MLT request handler: only FIRST result is examined, so only k similar documents for that ONE (first) TOP search result. Are you really saying that you don't comprehend what the difference is, or simply that you don't LIKE the difference?! Or, maybe that you are wondering WHY they are different? That latter question I don't have the answer to. 
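To summarize the two request shapes for anyone following along (hosts, field names and IDs below are reused from examples elsewhere in these threads):

MLT as a search component - one batch of similar docs per result of the main query:
  http://10.0.0.1:8080/solr/select/?q=shoes&rows=3&mlt=true&mlt.fl=item_name,long_description&mlt.count=2

MLT as a standalone request handler - similar docs for the single top document matching the query:
  http://10.0.0.1:8080/solr/mlt/?q=id:1004401713626&mlt.fl=item_name,long_description&rows=10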
-- Jack Krupansky -Original Message- From: David Parks Sent: Friday, December 28, 2012 2:48 AM To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis supporting multiple document IDs as input? So the Search Components are executed in series an _every_ request. I presume then that they look at the request parameters and decide what and whether to take action. So in the case of the MLT component this was said: > The MLT search component returns similar documents for each of the > documents in the search results, but processes each search result base > document one at a time and keeps its similar documents segregated by > each of the base documents. So what I think I understand is that the Query Component (presumably this guy: org.apache.solr.handler.component.QueryComponent) takes the input from the "q" parameter and returns a result (the "q=id:123456" ensure that the Query Component will return just this one document). The MltComponent then looks at the result from the QueryComponent and generates its results. The part that is still confusing is understanding the difference between these two comments: - The MLT search component returns similar documents for each of the documents in the search results - The MLT handler returns similar documents only for
Search strategy - improving search quality for short search terms such as "doll"
I'm a beginner-intermediate solr admin, I've set up the basics for our application and it runs well. Now it's time for me to dig in and start tuning and improving queries. My next target is searches on simple terms such as "doll" which, in google, would return documents about, well, "toy dolls", because that's the most common usage of the simple term "doll". But in my index it predominantly returns documents about CDs with the song "Doll Face", and "My baby doll" in them. I'm not directly asking how to solve this as much as I'm asking what direction I should be looking in to learn what I need to know to tackle the general issue myself. Left on my own I would start looking at categorizing the CD's into a facet called "music", reasonably doable in my dataset. Then I need to reduce the boost-value of the entire facet/category of music unless certain pre-defined query terms exist, such as [music, cd, song, listen, dvd, , etc.]. I don't yet know how to do all of this, but after a couple more good books I should be "dangerous". So the question to this list: - Am I on the right track here? If not, can you point me in a direction to go?
RE: Search strategy - improving search quality for short search terms such as "doll"
Thanks for the recommendation. I'll start this book today. In my example, "doll" is one example of a million I might only guess at, whereas the category "music", and "book" tend to interferes in many places and seem to be a more limited set of categories to deal with. Dave -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Thursday, January 17, 2013 12:01 AM To: solr-user@lucene.apache.org Subject: Re: Search strategy - improving search quality for short search terms such as "doll" Sounds like 'Doll' could be a category for you, while "Doll face" is a title. Maybe the categories should get a higher boost in eDismax definition over the titles? Related, you may find the following book interesting: http://rosenfeldmedia.com/books/searchanalytics/ Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jan 16, 2013 at 4:40 AM, David Parks wrote: > I'm a beginner-intermediate solr admin, I've set up the basics for our > application and it runs well. > > > > Now it's time for me to dig in and start tuning and improving queries. > > > > My next target is searches on simple terms such as "doll" which, in > google, would return documents about, well, "toy dolls", because > that's the most common usage of the simple term "doll". But in my > index it predominantly returns documents about CDs with the song "Doll Face", > and "My baby doll" > in > them. > > > > I'm not directly asking how to solve this as much as I'm asking what > direction I should be looking in to learn what I need to know to > tackle the general issue myself. > > > > Left on my own I would start looking at categorizing the CD's into a > facet called "music", reasonably doable in my dataset. Then I need to > reduce the boost-value of the entire facet/category of music unless > certain pre-defined query terms exist, such as [music, cd, song, > listen, dvd, exhaustive list>, etc.]. > > > > I don't yet know how to do all of this, but after a couple more good > books I should be "dangerous". > > > > So the question to this list: > > > > - Am I on the right track here? If not, can you point me in a > direction to go? > > > > > >
RE: Search strategy - improving search quality for short search terms such as "doll"
My issue is more that the search term doll shows up in both documents on CDs as well as documents about toys. But I have 10 CD documents for every toy document, so my searches for "doll" tend to show the CDs most prominently. But that's not the way a user thinks. If they want the CD documents they'll search for "doll face", or "doll face song", more specific queries (which work fine), but if they want the toy they might just search for "doll". If I run the searches "doll" and "doll song" on google image search you'll clearly see that google has solved this problem perfectly. "doll" returns toy dolls, and "doll song" returns music and anime results. I'm striving for this type of result. -Original Message- From: Amit Jha [mailto:shanuu@gmail.com] Sent: Wednesday, January 16, 2013 11:41 PM To: solr-user@lucene.apache.org Subject: Re: Search strategy - improving search quality for short search terms such as "doll" Its all about the data data set, here I mean index. If you have documents containing "toy" and "doll" it will return that in result set. What I understood that you are talking about the context of the query. For example if you search "books on MK Gandhi" and "books by MK Gandhi" both queries have different context. Context based search at some level achieved by natural language processing. This one you can look at for better search. Look for solr wiki & mailing list would be great source of learning. Rgds AJ On 16-Jan-2013, at 15:10, "David Parks" wrote: > I'm a beginner-intermediate solr admin, I've set up the basics for our > application and it runs well. > > > > Now it's time for me to dig in and start tuning and improving queries. > > > > My next target is searches on simple terms such as "doll" which, in > google, would return documents about, well, "toy dolls", because > that's the most common usage of the simple term "doll". But in my > index it predominantly returns documents about CDs with the song "Doll > Face", and "My baby doll" in them. > > > > I'm not directly asking how to solve this as much as I'm asking what > direction I should be looking in to learn what I need to know to > tackle the general issue myself. > > > > Left on my own I would start looking at categorizing the CD's into a > facet called "music", reasonably doable in my dataset. Then I need to > reduce the boost-value of the entire facet/category of music unless > certain pre-defined query terms exist, such as [music, cd, song, > listen, dvd, , etc.]. > > > > I don't yet know how to do all of this, but after a couple more good > books I should be "dangerous". > > > > So the question to this list: > > > > - Am I on the right track here? If not, can you point me in a > direction to go? > > > > >
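One concrete thing to experiment with once the CDs are tagged with a category is a negative-style boost through edismax's bq parameter, so the music category only wins when the rest of the query gives it a reason to. A sketch, assuming a hypothetical single-valued field named category:

  http://localhost:8080/solr/select?defType=edismax&q=doll&qf=item_name+long_description&bq=(*:*+-category:music)^2

Lucene has no true negative boost, so the usual idiom is to positively boost everything that is not in the demoted category, which has the same net effect. The right boost factor is something only experimentation against real queries will show.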
Field Collapsing - Anything in the works for multi-valued fields?
I want to configure Field Collapsing, but my target field is multi-valued (e.g. the field I want to group on has a variable # of entries per document, 1-N entries). I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that grouping doesn't support multi-valued fields yet. Anything in the works on that front by chance? Any common work-arounds?
RE: Field Collapsing - Anything in the works for multi-valued fields?
The documents are individual products which come from 1 or more vendors. Example: a 'toy spiderman doll' is sold by 2 vendors, that is 1 document. Most fields are multi valued (short_description from each of the 2 vendors, long_description, product_name, vendor, etc. the same). I'd like to collapse on the vendor in an attempt to ensure that vast collections of books, music, and movies, by just a few vendors, don't overwhelm the results simply due to the fact that they have every search term imaginable due to the sheer volume of books, CDs, and DVDs, in relation to other product items. But in this case there is clearly 1...N vendors per document, solidly a multi-valued field. And it's hard to put a maximum number of vendors possible. Thanks, Dave -Original Message- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Friday, January 18, 2013 2:32 AM To: solr-user Subject: Re: Field Collapsing - Anything in the works for multi-valued fields? David, What's the documents and the field? It can help to suggest workaround. On Thu, Jan 17, 2013 at 5:51 PM, David Parks wrote: > I want to configure Field Collapsing, but my target field is > multi-valued (e.g. the field I want to group on has a variable # of > entries per document, 1-N entries). > > I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that > grouping doesn't support multi-valued fields yet. > > Anything in the works on that front by chance? Any common work-arounds? > > > -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics <http://www.griddynamics.com>
RE: Field Collapsing - Anything in the works for multi-valued fields?
If I understand the reading, you've suggested that I index the vendor names as their own document (currently this is a multi-valued field of each document). Each such "vendor document" would just have a single valued 'name' field. Each normal product document would contain a multi-valued field that is a list of "vendor document IDs" and that we use to join the query results with the vendor documents. I presume this means that I would have some kind of dynamic field created from the join which I could use as the 'group.field' value? I didn't quite follow the last point. -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, January 18, 2013 9:34 AM To: solr-user@lucene.apache.org Subject: Re: Field Collapsing - Anything in the works for multi-valued fields? Hi, Instead of the multi-valued fields, would parent-child setup for you here? See http://search-lucene.com/?q=solr+join&fc_type=wiki Otis -- Solr & ElasticSearch Support http://sematext.com/ On Thu, Jan 17, 2013 at 8:04 PM, David Parks wrote: > The documents are individual products which come from 1 or more vendors. > Example: a 'toy spiderman doll' is sold by 2 vendors, that is 1 document. > Most fields are multi valued (short_description from each of the 2 > vendors, long_description, product_name, vendor, etc. the same). > > I'd like to collapse on the vendor in an attempt to ensure that vast > collections of books, music, and movies, by just a few vendors, don't > overwhelm the results simply due to the fact that they have every > search term imaginable due to the sheer volume of books, CDs, and > DVDs, in relation to other product items. > > But in this case there is clearly 1...N vendors per document, solidly > a multi-valued field. And it's hard to put a maximum number of vendors > possible. > > Thanks, > Dave > > > -Original Message- > From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] > Sent: Friday, January 18, 2013 2:32 AM > To: solr-user > Subject: Re: Field Collapsing - Anything in the works for multi-valued > fields? > > David, > > What's the documents and the field? It can help to suggest workaround. > > > On Thu, Jan 17, 2013 at 5:51 PM, David Parks > wrote: > > > I want to configure Field Collapsing, but my target field is > > multi-valued (e.g. the field I want to group on has a variable # of > > entries per document, 1-N entries). > > > > I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) > > that grouping doesn't support multi-valued fields yet. > > > > Anything in the works on that front by chance? Any common work-arounds? > > > > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > > >
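For what it's worth, the join being suggested would look something like this (a sketch only; the vendor documents, the vendor_name field on them, and the multi-valued vendor_ids field on products are all hypothetical names, and this covers the parent-child modelling rather than the grouping itself):

  q={!join from=id to=vendor_ids}vendor_name:somevendor

That matches the vendor documents and returns the product documents whose vendor_ids field contains those vendors' ids. How to then group the joined results per vendor is the part that still needs working out.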
After upgrade to solr4, search doesn't work
I just upgraded from solr3 to solr4, and I wiped the previous work and reloaded 500,000 documents. I see in solr that I loaded the documents, and from the console, if I do a query "*:*" I see documents returned. I copied a single word from the text of the query results I got from "*:*" but any query I do with a term returns 0 results, even though it's clear from the "*:*" query that solr has that document. Any ideas on where to start looking here? David
Re: After upgrade to solr4, search doesn't work
Good thought, thanks for the quick reply too. Seems that this is still set to my unique ID field: explicit 10 id I wonder if I have somehow lost the configuration that specifies that the other fields should be searched as well, though my schema hasn't changed and they're certainly indexed: From: Jack Krupansky To: solr-user@lucene.apache.org Sent: Wednesday, March 6, 2013 1:34 PM Subject: Re: After upgrade to solr4, search doesn't work You may simply need to set the default value of the "df" parameter in the /select request handler in solrconfig.xml to be your default query field name if it is not "text". -- Jack Krupansky -----Original Message- From: David Parks Sent: Wednesday, March 06, 2013 1:26 AM To: solr-user@lucene.apache.org Subject: After upgrade to solr4, search doesn't work I just upgraded from solr3 to solr4, and I wiped the previous work and reloaded 500,000 documents. I see in solr that I loaded the documents, and from the console, if I do a query "*:*" I see documents returned. I copied a single word from the text of the query results I got from "*:*" but any query I do with a term returns 0 results, even though it's clear from the "*:*" query that solr has that document. Any ideas on where to start looking here? David
Re: After upgrade to solr4, search doesn't work
All but the unique ID field use the out-of-the-box default text_en_splitting field type, this copied over from v3 to v4 without change as far as I know. I've done the import from scratch (deleted the solr data directory and re-imported and committed). From: mani arasu To: solr-user@lucene.apache.org Sent: Wednesday, March 6, 2013 1:37 PM Subject: Re: After upgrade to solr4, search doesn't work You should probably be looking at which Analyzer you used in solr version 3.x and which one you are using in solr version 4.x. If there is any change in that you may have to do either of the following: - Do a full-import so that documents are created according to your new schema - Do a search on the previously created documents, considering the way your documents are Analysed and Indexed as per solr version 3.x On Wed, Mar 6, 2013 at 11:56 AM, David Parks wrote: > I just upgraded from solr3 to solr4, and I wiped the previous work and > reloaded 500,000 documents. > > I see in solr that I loaded the documents, and from the console, if I do a > query "*:*" I see documents returned. > > I copied a single word from the text of the query results I got from "*:*" > but any query I do with a term returns 0 results, even though it's clear > from the "*:*" query that solr has that document. > > Any ideas on where to start looking here? > > David > > >
Re: After upgrade to solr4, search doesn't work
Oops, I didn't include the full XML there, hopefully this formats ok. From: David Parks To: "solr-user@lucene.apache.org" Sent: Wednesday, March 6, 2013 1:58 PM Subject: Re: After upgrade to solr4, search doesn't work All but the unique ID field use the out-of-the-box default text_en_splitting field type, this copied over from v3 to v4 without change as far as I know. I've done the import from scratch (deleted the solr data directory and re-imported and committed). From: mani arasu To: solr-user@lucene.apache.org Sent: Wednesday, March 6, 2013 1:37 PM Subject: Re: After upgrade to solr4, search doesn't work You should probably be looking at which Analyzer you used in solr version 3.x and which one you are using in solr version 4.x. If there is any change in that you may have to do either of the following: - Do a full-import so that documents are created according to your new schema - Do a search on the previously created documents, considering the way your documents are Analysed and Indexed as per solr version 3.x On Wed, Mar 6, 2013 at 11:56 AM, David Parks wrote: > I just upgraded from solr3 to solr4, and I wiped the previous work and > reloaded 500,000 documents. > > I see in solr that I loaded the documents, and from the console, if I do a > query "*:*" I see documents returned. > > I copied a single word from the text of the query results I got from "*:*" > but any query I do with a term returns 0 results, even though it's clear > from the "*:*" query that solr has that document. > > Any ideas on where to start looking here? > > David > > >
Re: After upgrade to solr4, search doesn't work
Ah, I think I see the issue: in the debug results it's only searching the id field, which is the unique ID; that must have gotten changed in the upgrade. In fact I think I might have had a misconfiguration in the 3.x version here. Can I set it to query multiple fields by default? I tried a comma-separated list of my fields here but that was invalid. dvddvdid:dvdid:dvd From: David Parks To: "solr-user@lucene.apache.org" Sent: Wednesday, March 6, 2013 1:52 PM Subject: Re: After upgrade to solr4, search doesn't work Good thought, thanks for the quick reply too. Seems that this is still set to my unique ID field: explicit 10 id I wonder if I have somehow lost the configuration that specifies that the other fields should be searched as well, though my schema hasn't changed and they're certainly indexed: From: Jack Krupansky To: solr-user@lucene.apache.org Sent: Wednesday, March 6, 2013 1:34 PM Subject: Re: After upgrade to solr4, search doesn't work You may simply need to set the default value of the "df" parameter in the /select request handler in solrconfig.xml to be your default query field name if it is not "text". -- Jack Krupansky -Original Message- From: David Parks Sent: Wednesday, March 06, 2013 1:26 AM To: solr-user@lucene.apache.org Subject: After upgrade to solr4, search doesn't work I just upgraded from solr3 to solr4, and I wiped the previous work and reloaded 500,000 documents. I see in solr that I loaded the documents, and from the console, if I do a query "*:*" I see documents returned. I copied a single word from the text of the query results I got from "*:*" but any query I do with a term returns 0 results, even though it's clear from the "*:*" query that solr has that document. Any ideas on where to start looking here? David
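For context, the quickest way to confirm which field a default query actually hits is debugQuery: the parsedquery section of the response shows the field each term was mapped to. Something like the following, where host, core name and term are placeholders:

    http://localhost:8983/solr/mycore/select?q=spiderman&debugQuery=true

Also, "df" accepts a single field name only, which fits the comma-separated list being rejected; querying several fields by default is what "qf" on the dismax/edismax parsers is for, as the next reply points out.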
RE: After upgrade to solr4, search doesn't work
I had actually totally blown my previous configuration and didn't know it (luckily it didn't reach production this way). I'm glad I ran into this problem. I had defaulted the queries to one of the most useful fields and never realized I wasn't searching the others. Thanks very much for all your help on this; it certainly helped me get my configuration straight, and the upgrade to 4 is now complete. All the best, David -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, March 06, 2013 7:56 PM To: solr-user@lucene.apache.org; David Parks Subject: Re: After upgrade to solr4, search doesn't work I imagine that you had a "qf" parameter in your old query request handler, so add "qf" to the new query request handler. "df" is used only if "qf" is missing. -- Jack Krupansky -Original Message- From: David Parks Sent: Wednesday, March 06, 2013 2:18 AM To: solr-user@lucene.apache.org; David Parks Subject: Re: After upgrade to solr4, search doesn't work Ah, I think I see the issue: in the debug results it's only searching the id field, which is the unique ID; that must have gotten changed in the upgrade. In fact I think I might have had a misconfiguration in the 3.x version here. Can I set it to query multiple fields by default? I tried a comma-separated list of my fields here but that was invalid. dvddvdid:dvdid:dvd From: David Parks To: "solr-user@lucene.apache.org" Sent: Wednesday, March 6, 2013 1:52 PM Subject: Re: After upgrade to solr4, search doesn't work Good thought, thanks for the quick reply too. Seems that this is still set to my unique ID field: explicit 10 id I wonder if I have somehow lost the configuration that specifies that the other fields should be searched as well, though my schema hasn't changed and they're certainly indexed: From: Jack Krupansky To: solr-user@lucene.apache.org Sent: Wednesday, March 6, 2013 1:34 PM Subject: Re: After upgrade to solr4, search doesn't work You may simply need to set the default value of the "df" parameter in the /select request handler in solrconfig.xml to be your default query field name if it is not "text". -- Jack Krupansky -Original Message- From: David Parks Sent: Wednesday, March 06, 2013 1:26 AM To: solr-user@lucene.apache.org Subject: After upgrade to solr4, search doesn't work I just upgraded from solr3 to solr4, and I wiped the previous work and reloaded 500,000 documents. I see in solr that I loaded the documents, and from the console, if I do a query "*:*" I see documents returned. I copied a single word from the text of the query results I got from "*:*" but any query I do with a term returns 0 results, even though it's clear from the "*:*" query that solr has that document. Any ideas on where to start looking here? David
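To make that concrete, a multi-field default along the lines Jack describes would go in the /select handler's defaults; the parser choice and field names below are illustrative, not David's actual configuration:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <str name="defType">edismax</str>
        <str name="qf">product_name short_description long_description</str>
        <str name="df">product_name</str>
      </lst>
    </requestHandler>

With edismax, "qf" takes a space-separated list of fields, optionally with boosts (e.g. product_name^2), and as Jack notes, "df" is only consulted when "qf" is missing.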
Is Solr more CPU bound or IO bound?
I'm spec'ing out some hardware for a first go at our production Solr instance, but I haven't spent enough time load testing it yet. What I want to ask is how IO-intensive Solr typically is vs. CPU-intensive. Specifically, I'm considering whether to dual-purpose the Solr servers to run Solr and another CPU-only application we have. I know Solr uses a fair amount of CPU, but if it's also very disk-intensive, it might be a net benefit to run more Solr instances and share the CPU resources with the other app, rather than running Solr separately from the CPU-only app that wouldn't otherwise use the disk. Thoughts on this? Thanks, David
RE: Is Solr more CPU bound or IO bound?
Thank you, Manu, for that excellent discussion of the topic; I could have been more detailed about my use case. We'll be indexing off of the main production servers (either on a master, or in Hadoop; we've yet to build out that piece of the puzzle). We don't store documents at all; we only store the index data and return a document ID. Each document is maybe 1k of text, small. We do have a few "interesting" queries in which we do some grouping. We currently index 100GB of input data; that'll grow 2x or 3x in the near future. So based on your experience, it seems likely that we'll be CPU bound (heavy queries against a static index updated nightly from the master), thus nullifying the advantage of dual-purposing a box with another CPU-bound app. Very useful discussion; I'll get proper load tests done in time, but this helps direct my thinking now. David -Original Message- From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel Le Normand Sent: Monday, March 18, 2013 9:57 AM To: solr-user@lucene.apache.org Subject: Re: Is Solr more CPU bound or IO bound? Your question is typically use-case dependent; the bottleneck will change from user to user. These are the two main issues that will affect the answer: 1. How do you index: what is your indexing rate (how many docs a day)? How big is a typical document? How many documents do you plan on indexing in total? Do you store fields? Calculate their term vectors? 2. What does your retrieval process look like: what's the expected query rate? Are there common queries (taking advantage of the cache)? How complex are the queries (faceted / highlighted / filtered / how many conditions, NRT)? Do you plan to retrieve stored fields or only IDs? After answering all that there's an iterative game between hardware configuration and software configuration (how you split your shards, use your caches, tune your merges and flushes, etc.) that will also affect the IO/CPU-bound answer. In my use case, for example, the indexing part is IO bound, but as my indexing rate is well below the rate my machine could initially provide, it didn't affect my hardware spec. After fine-tuning my configuration I discovered my retrieval process was CPU bound and was directly affecting my avg response time, while the IO rate, with the caches in use, was quite low. Try describing your use case in more detail against the above questions so we'd be able to give you guidelines. Best, Manu On Mon, Mar 18, 2013 at 3:55 AM, David Parks wrote: > I'm spec'ing out some hardware for a first go at our production Solr > instance, but I haven't spent enough time load testing it yet. > > > > What I want to ask is how IO intensive solr is vs. CPU intensive, > typically. > > > > Specifically I'm considering whether to dual-purpose the Solr servers > to run Solr and another CPU-only application we have. I know Solr uses > a fair amount of CPU, but if it also is very disk intensive it might > be a net benefit to have more instances running Solr and share the CPU > resources with the other app than to run Solr separate from the other > CPU app that wouldn't otherwise use the disk. > > > > Thoughts on this? > > > > Thanks, > > David > > > >
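For reference, the cache knobs Manuel mentions live in solrconfig.xml; the values below are stock example settings, not tuned recommendations:

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

For a query-heavy index that is rebuilt nightly, generous caches plus autowarming after the nightly commit mostly trade RAM for CPU at query time, which fits the CPU-bound picture described above.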