On 3/24/2016 11:57 AM, tedsolr wrote:
> My post was scant on details. The numbers I gave for collection sizes
> are projections for the future. I am in the midst of an upgrade that
> will be completed within a few weeks. My concern is that I may not be
> able to produce the throughput necessary to index an entire collection
> quickly enough (3 to 4 hours) for a large customer (100M docs).
I can fully rebuild one of my indexes, with 146 million docs, in 8-10
hours. This is fairly inefficient indexing -- six large shards (not
cloud), each one running the dataimport handler, importing from MySQL.

I suspect I could probably get two or three times this rate (and maybe
more) on the same hardware if I wrote a SolrJ application that uses
multiple threads for each Solr shard. I know from experiments that the
MySQL server can push over 100 million rows to a SolrJ program in less
than an hour, including constructing SolrInputDocument objects. That
experiment just left out the "client.add(docs);" line. The bottleneck is
definitely Solr.

Each machine holds three large shards (half the index), is running Solr
4.x (5.x upgrade is in the works), and has 64GB RAM with an 8GB heap.
Each shard is approximately 24.4 million docs and 28GB. These machines
also hold another sharded index in the same Solr install, but it's quite
a lot smaller.

Thanks,
Shawn
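
For illustration, here is a minimal sketch of the multi-threaded SolrJ
approach described above. It assumes SolrJ 5.x and uses
ConcurrentUpdateSolrClient, which buffers documents and sends them to a
single shard with several background threads; the shard URL, JDBC
connection details, query, and field names are placeholders, not taken
from the thread:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard URL; queue of 10000 docs, 4 sender threads.
            ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                    "http://shard1:8983/solr/core1", 10000, 4);

            List<SolrInputDocument> batch = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:mysql://dbhost/dbname", "user", "pass");
                 Statement stmt = conn.createStatement(
                        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                // Stream rows from MySQL instead of loading them all at once.
                stmt.setFetchSize(Integer.MIN_VALUE);
                ResultSet rs = stmt.executeQuery(
                        "SELECT id, title, body FROM docs");  // placeholder query

                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("title", rs.getString("title"));
                    doc.addField("body", rs.getString("body"));
                    batch.add(doc);

                    if (batch.size() >= 1000) {
                        client.add(batch);   // the line the experiment left out
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
            }
            client.blockUntilFinished();  // drain the background send queue
            client.commit();
            client.close();
        }
    }

Running one such program per shard (or one per shard with a larger
thread count) is the kind of parallel indexing the paragraph above
estimates could double or triple the DIH-based throughput.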