Hi Folks,

I'm looking for ideas on how to speed up processing for a streaming
expression. I can't post the full details because it's customer related,
but the structure is shown here: https://imgur.com/a/98sENVT. What that does
is take the results of two queries, join them, and push them back into the
collection as a new (denormalized) doc. The second (hash) join just updates
a field that distinguishes the new docs from either of the old docs, so it's
hashing exactly one value and is not a concern for performance. (If there
were a good way to tell select to modify only one field and keep all the
rest without listing the fields explicitly, it wouldn't be needed.)


When I run it across a test index with 1377364 and 5146620 docs for the two
queries, it inserts 4742322 new documents in ~10 minutes. This seems pretty
spiffy, except this test index is ~1/1000 of the real index... so obviously
I want to find *at least* a factor of 10 improvement. So far I've managed a
factor of about 3, getting it down to slightly over 200 seconds, by
programmatically building the queries and partitioning them based on a set
of percentiles from a stats query on one of the fields (a floating point
number with good distribution). But this seems to stop helping at around
10-12 splits on my 50 node cluster; scaling up to split across all 50 nodes
brings things back to ~400 seconds.
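
To make the partitioning step concrete, here is a rough sketch of how the
per-split range filters could be built from the percentile values. It is
illustrative only, not the actual customer code: the field name score_f and
the boundary values are made up, and the helper name build_range_filters is
hypothetical. It relies only on the standard Solr range syntax, where
[a TO b} means inclusive lower bound and exclusive upper bound.

```python
def build_range_filters(field, boundaries):
    """Return one Solr range filter string per partition.

    `boundaries` are the interior cut points (e.g. percentile values
    taken from a stats query). N boundaries give N + 1 partitions.
    """
    # Open-ended "*" on both ends so values outside the sampled
    # percentiles are still covered by the first and last partition.
    edges = ["*"] + [repr(b) for b in sorted(boundaries)] + ["*"]
    filters = []
    for lo, hi in zip(edges, edges[1:]):
        # Inclusive lower bound and exclusive upper bound, so a doc that
        # sits exactly on a boundary lands in only one partition.
        close = "]" if hi == "*" else "}"
        filters.append(f"{field}:[{lo} TO {hi}{close}")
    return filters


if __name__ == "__main__":
    # Hypothetical field name and quartile boundaries.
    for fq in build_range_filters("score_f", [0.25, 0.5, 0.75]):
        print(fq)
```

Each string would go into the query for one split. Note that docs with no
value in the field match none of these ranges, so this only works if the
field is always populated.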

The CPU utilization on the machines mostly stabilizes around 30-50%, and
disk metrics don't look bad (the disk idle stat in AWS stays over 90%). I'm
still trying to get a good handle on network numbers, but I'm guessing that
I'm either network limited or there's an inefficiency with contention
somewhere inside Solr (no, I haven't put a profiler on it yet).

Here's the interesting bit. I happen to know that the join key in the
leftJoin is a key that is used for document routing, so we're only joining
up with documents on the same node. Furthermore, the id generated is a
concatenation of these ids with a value from one of the fields, and it
should also route to the same node... Is there any way to make the whole
expression run locally on the nodes, to avoid throwing the data back and
forth across the network needlessly?

Any other ideas for making this go another factor of 2-3 faster?

-Gus
