You can run in parallel and that should help quite a bit. But at a really large batch job is better done like this:
https://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html Joel Bernstein http://joelsolr.blogspot.com/ On Thu, Feb 14, 2019 at 6:10 PM Gus Heck <gus.h...@gmail.com> wrote: > Hi Folks, > > I'm looking for ideas on how to speed up processing for a streaming > expression. I can't post the full details because it's customer related, > but the structure is shown here: https://imgur.com/a/98sENVT What that > does > is take the results of two queries, join them and push them back into the > collection as a new (denormalized) doc. The second (hash) join just updates > a field that distinguishes the new docs from either of the old docs so it's > hashing exactly one value, and thus this is not of concern for performance > (if there were a good way to tell select to modify only one field and keep > all the rest without listing the fields explicitly it wouldn't be needed) . > > > When I run it across a test index with 1377364 and 5146620 docs for the two > queries. The result is that it inserts 4742322 new documents, in ~10 > minutes. This seems pretty spiffy except this test index is ~1/1000 of the > real index... so obviously I want to find *at least* a factor of 10 > improvement. So far I managed a factor of about 3 to get it down to > slightly over 200 seconds by programmatically building the queries > partitioning based on a set of percentiles from a stats query on one of the > fields that is a floating point number with good distribution, but this > seems to stop helping 10-12 splits on my 50 node cluster, scaling up to > split to all 50 nodes brings things back to ~400 seconds. > > The CPU utilization on the machines mostly stabilizes around 30-50%, Disk > metrics don't seem to look bad (disk idle stat in AWS stays over 90%). > Still trying to get a good handle on network numbers, but I'm guessing that > I'm either network limited or there's an inefficiency with contention > somewhere inside solr (no I haven't put a profiler on it yet). > > Here's the interesting bit. I happen to know that the join key in the > leftJoin is on a key that is used for document routing, so we're only > joining up with documents on the same node. Furthermore, the id generated > is a concatenation of these id's with a value from one of the fields and > should also route to the same node... Is there any way to make the whole > expression run locally on the nodes to avoid throwing the data back and > forth across the network needlessly? > > Any other ideas for making this go another factor of 2-3 faster? > > -Gus >