Re: Under-utilization during streaming expression execution

Joel Bernstein Fri, 15 Feb 2019 16:49:18 -0800

You can run in parallel and that should help quite a bit. But a really
large batch job is better done like this:


https://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
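
For reference, running in parallel generally means wrapping the expression in
parallel() and adding partitionKeys to the inner searches so each worker gets
its own slice of the tuples. A rough sketch in Python that only assembles the
expression string; the collection name, field names, and the helper functions
search_expr and parallel_expr are hypothetical placeholders, not taken from
this thread:

```python
def search_expr(collection, query, fields, sort, partition_key):
    """Build a search() expression that reads from /export and is
    partitioned across workers on partition_key (hypothetical helper)."""
    return (
        f'search({collection}, q="{query}", fl="{fields}", '
        f'sort="{sort}", qt="/export", partitionKeys="{partition_key}")'
    )


def parallel_expr(worker_collection, inner, workers, sort):
    """Wrap an inner expression in parallel() (hypothetical helper)."""
    return (
        f'parallel({worker_collection}, {inner}, '
        f'workers="{workers}", sort="{sort}")'
    )


# Assumed names: collection "docs", join field "join_key_s".
inner = search_expr("docs", "type_s:a", "id,join_key_s",
                    "join_key_s asc", "join_key_s")
expr = parallel_expr("docs", inner, 8, "join_key_s asc")
print(expr)
```

In a real job the inner expression would be the whole join/update pipeline,
with every search inside it partitioned on the join key.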

Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Feb 14, 2019 at 6:10 PM Gus Heck <gus.h...@gmail.com> wrote:

> Hi Folks,
>
> I'm looking for ideas on how to speed up processing for a streaming
> expression. I can't post the full details because it's customer related,
> but the structure is shown here: https://imgur.com/a/98sENVT What that
> does is take the results of two queries, join them, and push them back into
> the collection as a new (denormalized) doc. The second (hash) join just
> updates a field that distinguishes the new docs from either of the old docs,
> so it's hashing exactly one value and is not a performance concern (if there
> were a good way to tell select to modify only one field and keep all the
> rest without listing the fields explicitly, it wouldn't be needed).
>
>
> When I run it across a test index with 1377364 and 5146620 docs for the two
> queries, it inserts 4742322 new documents in ~10 minutes. This seems pretty
> spiffy, except this test index is ~1/1000 of the real index... so obviously
> I want to find *at least* a factor of 10 improvement. So far I have managed
> a factor of about 3, getting it down to slightly over 200 seconds by
> programmatically building the queries, partitioning based on a set of
> percentiles from a stats query on one of the fields that is a floating point
> number with good distribution. But this seems to stop helping at 10-12
> splits on my 50 node cluster; scaling up to split across all 50 nodes brings
> things back to ~400 seconds.
>
> The CPU utilization on the machines mostly stabilizes around 30-50%, and
> disk metrics don't look bad (the disk idle stat in AWS stays over 90%).
> I'm still trying to get a good handle on network numbers, but I'm guessing
> that I'm either network limited or there's an inefficiency with contention
> somewhere inside Solr (no, I haven't put a profiler on it yet).
>
> Here's the interesting bit. I happen to know that the join key in the
> leftJoin is on a key that is used for document routing, so we're only
> joining up with documents on the same node. Furthermore, the id generated
> is a concatenation of these ids with a value from one of the fields and
> should also route to the same node... Is there any way to make the whole
> expression run locally on the nodes to avoid throwing the data back and
> forth across the network needlessly?
>
> Any other ideas for making this go another factor of 2-3 faster?
>
> -Gus
>
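
The percentile-based partitioning described in the quoted message could be
sketched roughly as below. This is only an illustration under assumptions:
the field name score_f, the boundary values, and the helper range_filters are
hypothetical, and the actual queries in the thread were not posted. It turns
a sorted list of percentile boundaries into non-overlapping Solr range
filters, one per split.

```python
from typing import List, Optional, Sequence


def range_filters(field: str, boundaries: Sequence[float]) -> List[str]:
    """Turn sorted interior boundaries into non-overlapping range filters.

    Each slice has an inclusive lower bound and an exclusive upper bound,
    so a document whose value sits exactly on a boundary lands in one
    slice only. The first and last slices are open-ended.
    """
    edges: List[Optional[float]] = [None, *boundaries, None]
    filters = []
    for lo, hi in zip(edges, edges[1:]):
        lo_s = "*" if lo is None else repr(lo)
        hi_s = "*" if hi is None else repr(hi)
        close = "]" if hi is None else "}"
        filters.append(f"{field}:[{lo_s} TO {hi_s}{close}")
    return filters


# Assumed boundaries, e.g. taken from a percentile stats query on score_f.
for fq in range_filters("score_f", [0.25, 0.5]):
    print(fq)
```

Each filter would be added to both source queries of one split. Documents
with no value in the field match none of the slices, so this only covers the
whole result set if the field is always populated.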
