Shawn:

bq: The bottleneck is definitely Solr.

Since you commented out the server.add(doclist) call, you're right to focus there. I've seen a few things that help:

1> Batch the documents, i.e. the doclist above should be on the order of
1,000 docs. Here are some numbers I worked up one time:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

2> If your Solr CPUs aren't running flat out, then adding threads until
they are being pretty well hammered is A Good Thing. Of course, you have
to balance that against anything else your servers are doing, like
serving queries.

3> Make sure you're using CloudSolrClient.

4> If you still need more throughput, use more shards.

Best,
Erick

On Thu, Mar 31, 2016 at 6:39 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/24/2016 11:57 AM, tedsolr wrote:
>> My post was scant on details. The numbers I gave for collection sizes
>> are projections for the future. I am in the midst of an upgrade that
>> will be completed within a few weeks. My concern is that I may not be
>> able to produce the throughput necessary to index an entire collection
>> quickly enough (3 to 4 hours) for a large customer (100M docs).
>
> I can fully rebuild one of my indexes, with 146 million docs, in 8-10
> hours. This is fairly inefficient indexing -- six large shards (not
> cloud), each one running the dataimport handler, importing from MySQL.
> I suspect I could probably get two or three times this rate (and maybe
> more) on the same hardware if I wrote a SolrJ application that uses
> multiple threads for each Solr shard.
>
> I know from experiments that the MySQL server can push over 100 million
> rows to a SolrJ program in less than an hour, including constructing
> SolrInputDocument objects. That experiment just left out the
> "client.add(docs);" line. The bottleneck is definitely Solr.
>
> Each machine holds three large shards (half the index), is running Solr
> 4.x (5.x upgrade is in the works), and has 64GB RAM with an 8GB heap.
> Each shard is approximately 24.4 million docs and 28GB. These machines
> also hold another sharded index in the same Solr install, but it's
> quite a lot smaller.
>
> Thanks,
> Shawn
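[Editor's note: Erick's point 1> (batching roughly 1,000 docs per add) can be sketched as a plain-Java partitioning helper. This is a minimal sketch, not code from the thread: in real SolrJ code the list elements would be SolrInputDocument objects and each batch would be passed to client.add(batch), shown here only as a comment so the example stays self-contained.]

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Split the full document stream into batches of ~batchSize docs,
    // so each update request carries many docs instead of one.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for 2,500 SolrInputDocument objects.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);

        List<List<Integer>> batches = partition(docs, 1000);
        for (List<Integer> batch : batches) {
            // Real SolrJ code would do: client.add(batch);
            // (and commit once at the end, not per batch).
        }
        System.out.println(batches.size());        // 3 batches
        System.out.println(batches.get(2).size()); // last batch holds 500
    }
}
```

One update request per 1,000-doc batch amortizes the per-request overhead that dominates when docs are sent one at a time.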
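[Editor's note: points 2> and 3> (add indexing threads until the Solr CPUs are busy, and send through CloudSolrClient) have this rough shape. The thread count and batch sizes below are made-up tuning knobs, and the CloudSolrClient call is left as a comment because it needs a live SolrCloud cluster; the sketch simulates the add with a counter so it runs standalone.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexSketch {
    public static void main(String[] args) throws Exception {
        int threads = 8;    // hypothetical: raise until Solr CPUs are saturated
        int batches = 100;  // hypothetical: one task per 1,000-doc batch

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong indexed = new AtomicLong();

        for (int b = 0; b < batches; b++) {
            pool.submit(() -> {
                // Real code would build a List<SolrInputDocument> and call:
                //   cloudClient.add(batchOfDocs);
                // CloudSolrClient routes each doc directly to its shard
                // leader, which is why Erick recommends it over a plain
                // HTTP client pointed at one node.
                indexed.addAndGet(1000); // simulate a 1,000-doc batch
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(indexed.get());
    }
}
```

Because the client threads mostly wait on Solr, several of them per shard keep the server busy; back off if query latency on the same boxes starts to suffer, as Erick notes.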