Re: Solr cloud performance degradation with billions of documents

2014-08-17 Thread Erick Erickson
ctions, is it a bug? > > Thanks. > > Shushuai > > > From: Erick Erickson > To: solr-user@lucene.apache.org > Sent: Friday, August 15, 2014 7:30 PM > Subject: Re: Solr cloud performance degradation with billions of documents > > > Toke: > > bq: I would h

Re: Solr cloud performance degradation with billions of documents

2014-08-16 Thread shushuai zhu
a bug? Thanks. Shushuai From: Erick Erickson To: solr-user@lucene.apache.org Sent: Friday, August 15, 2014 7:30 PM Subject: Re: Solr cloud performance degradation with billions of documents Toke: bq: I would have agreed with you fully an hour ago. Well, I now disagree with myself too

Re: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Erick Erickson
Toke: bq: I would have agreed with you fully an hour ago. Well, I now disagree with myself too :) I don't mind talking to myself. I don't even mind arguing with myself. I really _do_ mind losing the arguments I have with myself though. Scott: OK, that has a much better chance of working

RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Toke Eskildsen
Erick Erickson [erickerick...@gmail.com] wrote: > I guess that my main issue is that from everything I've seen so far, > this project is doomed. You simply cannot put 7B documents in a single > shard, period. Lucene has a 2B hard limit. I would have agreed with you fully an hour ago and actually p

RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote: > You make some very good valid points. Let me clear a few things up, though. > We are not trying to put 7B docs into one single shard, because we are using > collections, created daily, which spread the index across the 32 shards th

RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Wilburn, Scott
ideal, to ensure the project succeeds and comes in under budget. Thanks, Scott -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, August 15, 2014 7:52 AM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with billions of

Re: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Erick Erickson
Toke: You make valid points. You're completely right that my reflexes are for sub-second responses so I tend to think of lots and lots of memory being a requirement. I agree that depending on the problem space the percentage of the index that has to be in memory varies widely, I've seen a large va

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Toke Eskildsen
Erick Erickson [erickerick...@gmail.com] wrote: > Solr requires holding large parts of the index in memory. > For the entire corpus. At once. That requirement is under the assumption that one must have the lowest possible latency at each individual box. You might as well argue for the fastest po

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Wilburn, Scott
few things to try, thanks to all of your comments. I am very appreciative. Thanks, Scott -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, August 14, 2014 8:31 AM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with

Re: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Erick Erickson
You are absolutely on the bleeding edge. I know of a couple of projects that are at that scale, but 1> they aren't being done on just a few nodes. As Jack says, this scale for SolrCloud is not common and there are no OOB templates to follow. 2> AFAIK, the projects I'm talking about aren't in

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote: > Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking > into that now. > I agree what I am trying to do is a tall order, and the more I hear from all > of your > comments, the more I am convinced that lack of

Re: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Jack Krupansky
Wilburn, Scott Sent: Thursday, August 14, 2014 11:05 AM To: solr-user@lucene.apache.org Subject: RE: Solr cloud performance degradation with billions of documents Erick, Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking into that now. I agree what I am trying to do is

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Wilburn, Scott
nks, Scott -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, August 13, 2014 4:48 PM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with billions of documents Several points: 1> Have you considered using the MapReduceIn

Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Erick Erickson
Several points: 1> Have you considered using the MapReduceIndexerTool for your ingestion? Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread your indexing across as many nodes as you have in your cluster. That said, it's not entirely clear that you'll gain throughput since

Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
ent: Wednesday, August 13, 2014 5:42 PM To: solr-user@lucene.apache.org Subject: RE: Solr cloud performance degradation with billions of documents Thanks for replying, Jack. I have 4 SolrCloud instances (or clusters), each consisting of 32 shards. The clusters do not have any interaction with eac

RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Markus Jelsma
Hi - You are running mapred jobs on the same nodes as Solr runs, right? The first thing I would think of is that your OS file buffer cache is abused. The mappers read all data, presumably residing on the same node. The mapper output and shuffling part would take place on the same node, only the r

RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote: > Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the > Solr shards and > each node has 128GB of memory. The current SolrCloud setup is split into 4 > separate and > individual clouds of 32 shards each the

RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Wilburn, Scott
:17 PM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with billions of documents Could you clarify what you mean with the term "cloud", as in "per cloud" and "individual clouds"? That's not a proper Solr or SolrCloud concept per

Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
Could you clarify what you mean with the term "cloud", as in "per cloud" and "individual clouds"? That's not a proper Solr or SolrCloud concept per se. SolrCloud works with a single "cluster" of nodes. And there is no interaction between separate SolrCloud clusters. -- Jack Krupansky -Ori