To index 3.5 billion documents, you will run into bottlenecks not only in Solr but also at other stages (data acquisition, Solr document object creation, and submitting bulk batches to Solr).
This requires parallelizing the work at each of those steps to get maximum throughput. A multi-threaded, SolrJ-based Java indexer using CloudSolrClient is required, as described by Shawn. I have used ConcurrentUpdateSolrClient in the past, but with CloudSolrClient, setParallelUpdates is worth trying.

Thanks,
Susheel

On Wed, Aug 19, 2015 at 2:41 PM, Erick Erickson <[email protected]> wrote:

> If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm not
> sure that'll hit your rate; it spends some time copying things around.
> If you're not on HDFS, though, it's not an option.
>
> Best,
> Erick
>
> On Wed, Aug 19, 2015 at 11:36 AM, Upayavira <[email protected]> wrote:
> >
> > On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
> >> Troy Edwards <[email protected]> wrote:
> >> > My average document size is 400 bytes
> >> > Number of documents that need to be inserted 250000/second
> >> > (for a total of about 3.6 Billion documents)
> >>
> >> > Any ideas/suggestions on how that can be done? (use a client
> >> > or uploadcsv or stream or data import handler)
> >>
> >> Use more than one cloud. Make them fully independent. As I suggested when
> >> you asked 4 days ago. That would also make it easy to scale: Just measure
> >> how much a single setup can take and do the math.
> >
> > Yes - work out how much each node can handle, then you can work out how
> > many nodes you need.
> >
> > You could consider using implicit routing rather than compositeId, which
> > means that you take on responsibility for hashing your ID to push
> > content to the right node. (Or, if you use compositeId, you could use
> > the same algorithm, and be sure that you send docs directly to the
> > correct shard.)
> >
> > At the moment, if you push five documents to a five shard collection,
> > the node you send them to could end up doing four HTTP requests to the
> > other nodes in the collection. This means you don't need to worry about
> > where to post your content - it is just handled for you. However, there
> > is a performance hit there. Push content direct to the correct node
> > (either using implicit routing, or by replicating the compositeId hash
> > calculation in your client) and you'd increase your indexing throughput
> > significantly, I would theorise.
> >
> > Upayavira
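
The multi-threaded, batched indexer suggested at the top of this message could look roughly like the sketch below. It is only an illustration under stated assumptions, not a tested indexer: the class name ParallelBatchIndexer, the indexAll method, and the sender callback are all hypothetical, and the sketch uses only the JDK so it stays self-contained. In a real SolrJ indexer the sender would presumably wrap a call that adds a collection of documents through CloudSolrClient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: groups documents into batches and sends them in parallel.
class ParallelBatchIndexer {

    // Returns the number of documents handed to the sender successfully.
    static <T> long indexAll(Iterable<T> docs, int batchSize, int threads,
                             Consumer<List<T>> sender) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Bound the number of queued batches so a huge input cannot exhaust memory.
        Semaphore inFlight = new Semaphore(threads * 2);
        AtomicLong sent = new AtomicLong();
        AtomicReference<RuntimeException> failure = new AtomicReference<>();
        try {
            List<T> batch = new ArrayList<>(batchSize);
            for (T doc : docs) {
                batch.add(doc);
                if (batch.size() == batchSize) {
                    submit(pool, inFlight, batch, sender, sent, failure);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                submit(pool, inFlight, batch, sender, sent, failure); // final partial batch
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
        if (failure.get() != null) {
            throw failure.get(); // surface the first sender error, if any
        }
        return sent.get();
    }

    private static <T> void submit(ExecutorService pool, Semaphore inFlight, List<T> batch,
                                   Consumer<List<T>> sender, AtomicLong sent,
                                   AtomicReference<RuntimeException> failure)
            throws InterruptedException {
        inFlight.acquire(); // blocks the producer while the workers are saturated
        pool.execute(() -> {
            try {
                sender.accept(batch); // assumption: a real sender would post the batch to Solr
                sent.addAndGet(batch.size());
            } catch (RuntimeException e) {
                failure.compareAndSet(null, e);
            } finally {
                inFlight.release();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 1050; i++) {
            docs.add(i);
        }
        long count = indexAll(docs, 100, 4, b -> { /* stand-in for a Solr update call */ });
        System.out.println("sent " + count + " docs");
    }
}
```

Batch size and thread count are the two knobs to measure against a single node first, in line with the "measure and do the math" advice quoted above. The semaphore is a design choice of this sketch: it keeps the reader from racing ahead of the senders when the input is billions of documents.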
