Hi Folks, I'm looking for ideas on how to speed up processing for a streaming expression. I can't post the full details because it's customer related, but the structure is shown here: https://imgur.com/a/98sENVT What that does is take the results of two queries, join them and push them back into the collection as a new (denormalized) doc. The second (hash) join just updates a field that distinguishes the new docs from either of the old docs so it's hashing exactly one value, and thus this is not of concern for performance (if there were a good way to tell select to modify only one field and keep all the rest without listing the fields explicitly it wouldn't be needed) .
When I run it across a test index with 1377364 and 5146620 docs for the two queries. The result is that it inserts 4742322 new documents, in ~10 minutes. This seems pretty spiffy except this test index is ~1/1000 of the real index... so obviously I want to find *at least* a factor of 10 improvement. So far I managed a factor of about 3 to get it down to slightly over 200 seconds by programmatically building the queries partitioning based on a set of percentiles from a stats query on one of the fields that is a floating point number with good distribution, but this seems to stop helping 10-12 splits on my 50 node cluster, scaling up to split to all 50 nodes brings things back to ~400 seconds. The CPU utilization on the machines mostly stabilizes around 30-50%, Disk metrics don't seem to look bad (disk idle stat in AWS stays over 90%). Still trying to get a good handle on network numbers, but I'm guessing that I'm either network limited or there's an inefficiency with contention somewhere inside solr (no I haven't put a profiler on it yet). Here's the interesting bit. I happen to know that the join key in the leftJoin is on a key that is used for document routing, so we're only joining up with documents on the same node. Furthermore, the id generated is a concatenation of these id's with a value from one of the fields and should also route to the same node... Is there any way to make the whole expression run locally on the nodes to avoid throwing the data back and forth across the network needlessly? Any other ideas for making this go another factor of 2-3 faster? -Gus