The primary use case seems to require a SortStream. Ignoring the large join
for now...
1. search main collection with stream (a)
2. search other collection with stream (b)
3. hash join a & b (c)
4. full sort on c
5. aggregate c with reducer
6. apply user sort criteria with top
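The pipeline above might look roughly like the following streaming expression. This is only a sketch: the collection names (`main`, `other`), the join key (`join_key`), and the field names (`field_x`, `field_y`, `group_field`, `user_field`) are placeholders, and `qt="/export"` assumes the export handler is available on both collections.

```
top(n=10, sort="user_field desc",
  reduce(
    sort(
      hashJoin(
        search(main, q="*:*", fl="id,join_key,field_x,group_field",
               sort="id asc", qt="/export"),
        hashed=search(other, q="*:*", fl="join_key,field_y",
                      sort="join_key asc", qt="/export"),
        on="join_key"),
      by="group_field asc"),
    by="group_field",
    group(sort="field_x desc", n="10")))
```

The inner `search()` calls are steps 1 and 2, `hashJoin()` is step 3, `sort()` is step 4, `reduce()` is step 5, and the outer `top()` is step 6. Note that `hashJoin()` does not require its inputs to be sorted on the join key, which is exactly why the explicit `sort()` is needed: `reduce()` assumes the underlying stream is already sorted by its `by` field.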

It's very likely that a field from 'b' will appear in both the user's search
criteria and the results list. Therefore the join (3) must occur before the
aggregation (5), and the aggregation requires the sort (4). If this can't be
done without a lot of hardware and memory, perhaps I'd be better off leaving
the data denormalized and increasing index speed by sharding (with many more
VMs). Will sharding increase re-indexing speed by a factor close to the number
of shards (i.e., a collection with 10 shards indexes ~10x faster than the same
collection with one shard)?


Joel Bernstein wrote
> The tricky thing you have is a large join coupled with a reduce() group,
> which have different sorts.
> 
> ...
> 
>> Joel Bernstein wrote
>> > A few other things for you to consider:
>> >
>> > 1) How big are the joins?
>> > 2) How fast do they need to go?
>> > 3) How many queries need to run concurrently?
>> >
>> > #1 and #2 will dictate how many shards, replicas and parallel workers
>> > are needed to perform the join. #3 needs to be carefully considered
>> > because MapReduce distributed joins are not going to scale like
>> > traditional Solr queries.
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288194.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
