Use large batches and fetch() instead of hashJoin, and lots of parallel workers.
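
In sketch form that looks something like this, where the collection
names, field names, batch sizes, and worker count are just placeholders
for your setup (and the select() that builds your new composite id is
left out):

    parallel(workerCollection,
             update(mainCollection,
                    batchSize=5000,
                    fetch(mainCollection,
                          search(mainCollection,
                                 q="type_s:parent",
                                 fl="id,join_key_s",
                                 sort="join_key_s asc",
                                 qt="/export",
                                 partitionKeys="join_key_s"),
                          fl="field_a_s,field_b_s",
                          on="join_key_s=join_key_s",
                          batchSize=1000)),
             workers=20,
             zkHost="localhost:9983",
             sort="batchNumber asc")

The batchSize on fetch controls how many tuples are looked up per
query, the batchSize on update controls how many docs go to Solr per
add, and sort="batchNumber asc" on parallel just merges the per-worker
batch summary tuples. Keep in mind the /export handler needs docValues
on everything in fl.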
Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Feb 15, 2019 at 7:48 PM Joel Bernstein <joels...@gmail.com> wrote:

> You can run in parallel and that should help quite a bit. But a really
> large batch job is better done like this:
>
> https://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Feb 14, 2019 at 6:10 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> I'm looking for ideas on how to speed up processing for a streaming
>> expression. I can't post the full details because it's customer
>> related, but the structure is shown here: https://imgur.com/a/98sENVT
>> What it does is take the results of two queries, join them, and push
>> the result back into the collection as a new (denormalized) doc. The
>> second (hash) join just updates a field that distinguishes the new
>> docs from either of the old docs, so it's hashing exactly one value
>> and isn't a performance concern (if there were a good way to tell
>> select to modify only one field and keep all the rest without listing
>> the fields explicitly, it wouldn't be needed).
>>
>> When I run it across a test index with 1377364 and 5146620 docs for
>> the two queries, it inserts 4742322 new documents in ~10 minutes.
>> This seems pretty spiffy, except this test index is ~1/1000 of the
>> real index... so obviously I want to find *at least* a factor of 10
>> improvement. So far I've managed a factor of about 3, down to
>> slightly over 200 seconds, by programmatically building the queries
>> to partition on a set of percentiles from a stats query on one of the
>> fields (a floating-point number with a good distribution). But this
>> seems to stop helping beyond 10-12 splits on my 50-node cluster;
>> scaling up to split across all 50 nodes brings things back up to
>> ~400 seconds.
>>
>> The CPU utilization on the machines mostly stabilizes around 30-50%,
>> and disk metrics don't look bad (the disk idle stat in AWS stays over
>> 90%). I'm still trying to get a good handle on network numbers, but
>> I'm guessing that I'm either network limited or there's some
>> inefficiency from contention somewhere inside Solr (no, I haven't put
>> a profiler on it yet).
>>
>> Here's the interesting bit. I happen to know that the join key in the
>> leftJoin is a key that is used for document routing, so we're only
>> joining up with documents on the same node. Furthermore, the id
>> generated is a concatenation of these ids with a value from one of
>> the fields and should also route to the same node... Is there any way
>> to make the whole expression run locally on the nodes, to avoid
>> throwing the data back and forth across the network needlessly?
>>
>> Any other ideas for making this go another factor of 2-3 faster?
>>
>> -Gus
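
And for the percentile-based partitioning described above: the cut
points can come from a plain stats request, and each partition's inner
search then just carries a half-open range filter. Roughly, with a
made-up field name (price_f) and made-up cut points:

    /select?q=*:*&rows=0&stats=true&stats.field={!percentiles='10,20,30,40,50,60,70,80,90'}price_f

    search(mainCollection,
           q="*:*",
           fq="price_f:[2.5 TO 5.1}",
           fl="id,join_key_s,price_f",
           sort="join_key_s asc",
           qt="/export")

The mixed brackets make the upper bound exclusive, so adjacent
partitions don't overlap.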