Shawn:

bq: The bottleneck is definitely Solr.

Since you commented out the server.add(doclist) call, you're right to focus there. I've seen a few things that help:

1> Batch the documents, i.e. the doclist above should be on the order of
1,000 docs. Here are some numbers I worked up one time:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

2> If your Solr CPUs aren't running flat out, then adding threads until
they are being pretty well hammered is A Good Thing. Of course, you have
to balance that against anything else your servers are doing, like
serving queries.

3> Make sure you're using CloudSolrClient.

4> If you still need more throughput, use more shards.

Best,
Erick

On Thu, Mar 31, 2016 at 6:39 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/24/2016 11:57 AM, tedsolr wrote:
>> My post was scant on details. The numbers I gave for collection sizes
>> are projections for the future. I am in the midst of an upgrade that
>> will be completed within a few weeks. My concern is that I may not be
>> able to produce the throughput necessary to index an entire collection
>> quickly enough (3 to 4 hours) for a large customer (100M docs).
>
> I can fully rebuild one of my indexes, with 146 million docs, in 8-10
> hours. This is fairly inefficient indexing -- six large shards (not
> cloud), each one running the dataimport handler, importing from MySQL.
> I suspect I could probably get two or three times this rate (and maybe
> more) on the same hardware if I wrote a SolrJ application that uses
> multiple threads for each Solr shard.
>
> I know from experiments that the MySQL server can push over 100 million
> rows to a SolrJ program in less than an hour, including constructing
> SolrInputDocument objects. That experiment just left out the
> "client.add(docs);" line. The bottleneck is definitely Solr.
>
> Each machine holds three large shards (half the index), is running Solr
> 4.x (5.x upgrade is in the works), and has 64GB RAM with an 8GB heap.
> Each shard is approximately 24.4 million docs and 28GB. These machines
> also hold another sharded index in the same Solr install, but it's
> quite a lot smaller.
>
> Thanks,
> Shawn
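[Editor's note: Erick's point 1> (batching roughly 1,000 docs per add) can be sketched as a plain-Java partitioning helper. This is a minimal sketch, not code from the thread: in real SolrJ code the list elements would be SolrInputDocument objects and each batch would be passed to client.add(batch), shown here only as a comment so the example stays self-contained.]

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Split the full document stream into batches of ~batchSize docs,
    // so each update request carries many docs instead of one.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for 2,500 SolrInputDocument objects.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);

        List<List<Integer>> batches = partition(docs, 1000);
        for (List<Integer> batch : batches) {
            // Real SolrJ code would do: client.add(batch);
            // (and commit once at the end, not per batch).
        }
        System.out.println(batches.size());        // 3 batches
        System.out.println(batches.get(2).size()); // last batch holds 500
    }
}
```

One update request per 1,000-doc batch amortizes the per-request overhead that dominates when docs are sent one at a time.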
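[Editor's note: points 2> and 3> (add indexing threads until the Solr CPUs are busy, and send through CloudSolrClient) have this rough shape. The thread count and batch sizes below are made-up tuning knobs, and the CloudSolrClient call is left as a comment because it needs a live SolrCloud cluster; the sketch simulates the add with a counter so it runs standalone.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexSketch {
    public static void main(String[] args) throws Exception {
        int threads = 8;    // hypothetical: raise until Solr CPUs are saturated
        int batches = 100;  // hypothetical: one task per 1,000-doc batch

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong indexed = new AtomicLong();

        for (int b = 0; b < batches; b++) {
            pool.submit(() -> {
                // Real code would build a List<SolrInputDocument> and call:
                //   cloudClient.add(batchOfDocs);
                // CloudSolrClient routes each doc directly to its shard
                // leader, which is why Erick recommends it over a plain
                // HTTP client pointed at one node.
                indexed.addAndGet(1000); // simulate a 1,000-doc batch
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(indexed.get());
    }
}
```

Because the client threads mostly wait on Solr, several of them per shard keep the server busy; back off if query latency on the same boxes starts to suffer, as Erick notes.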