Use large batches and fetch() instead of hashJoin and lots of parallel
workers.
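
A rough sketch of the shape I mean (the collection names, fields, and batch
sizes below are placeholders, not taken from your setup):

update(denormCollection,
       batchSize=10000,
       fetch(rightCollection,
             search(leftCollection,
                    q="*:*",
                    fl="id,joinKey,fieldA",
                    sort="joinKey asc",
                    qt="/export"),
             fl="fieldB,fieldC",
             on="joinKey=joinKey",
             batchSize=250))

fetch() looks up the right-side fields in batches keyed on the join field,
so the full right-side stream never has to be read and hashed into memory
up front the way hashJoin requires, and the large batchSize on update()
keeps the indexing requests big.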

Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Feb 15, 2019 at 7:48 PM Joel Bernstein <joels...@gmail.com> wrote:

> You can run in parallel and that should help quite a bit. But a really
> large batch job is better done like this:
>
>
> https://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Feb 14, 2019 at 6:10 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> I'm looking for ideas on how to speed up processing for a streaming
>> expression. I can't post the full details because it's customer related,
>> but the structure is shown here: https://imgur.com/a/98sENVT
>> What it does is take the results of two queries, join them, and push them
>> back into the collection as a new (denormalized) doc. The second (hash)
>> join just updates a field that distinguishes the new docs from either of
>> the old docs, so it hashes exactly one value and isn't a performance
>> concern (if there were a good way to tell select to modify only one field
>> and keep all the rest without listing the fields explicitly, it wouldn't
>> be needed).
>>
>>
>> When I run it across a test index with 1377364 and 5146620 docs for the
>> two queries, it inserts 4742322 new documents in ~10 minutes. This seems
>> pretty spiffy, except this test index is ~1/1000 of the real index... so
>> obviously I want to find *at least* a factor of 10 improvement. So far
>> I've managed a factor of about 3, getting it down to slightly over 200
>> seconds, by programmatically partitioning the queries based on a set of
>> percentiles from a stats query on one of the fields (a floating point
>> number with good distribution). But this seems to stop helping beyond
>> 10-12 splits on my 50-node cluster; scaling up to split across all 50
>> nodes brings things back to ~400 seconds.
>>
>> The CPU utilization on the machines mostly stabilizes around 30-50%, and
>> disk metrics don't look bad (the disk idle stat in AWS stays over 90%).
>> I'm still trying to get a good handle on network numbers, but I'm
>> guessing that I'm either network-limited or there's an inefficiency with
>> contention somewhere inside Solr (no, I haven't put a profiler on it yet).
>>
>> Here's the interesting bit. I happen to know that the join key in the
>> leftJoin is the key used for document routing, so we're only joining up
>> with documents on the same node. Furthermore, the generated id is a
>> concatenation of those ids with a value from one of the fields, and it
>> should also route to the same node... Is there any way to make the whole
>> expression run locally on the nodes to avoid throwing the data back and
>> forth across the network needlessly?
>>
>> Any other ideas for making this go another factor of 2-3 faster?
>>
>> -Gus
>>
>
