The primary use case seems to require a SortStream. Ignoring the large join for now:

1. search main collection with stream (a)
2. search other collection with stream (b)
3. hash join a & b (c)
4. full sort on c
5. aggregate c with reducer
6. apply user sort criteria with top
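A rough streaming-expression sketch of those steps might look like the following. This is untested, and the collection names (main, other) and field names (join_key, group_f, m, b_field) are all placeholders, as is the aggregate/sort choice:

```
top(n=10,
    rollup(
        sort(
            hashJoin(
                search(main, q="*:*", fl="join_key,group_f,m", sort="join_key asc", qt="/export"),
                hashed=search(other, q="*:*", fl="join_key,b_field", sort="join_key asc", qt="/export"),
                on="join_key"),
            by="group_f asc"),
        over="group_f",
        sum(m)),
    sort="sum(m) desc")
```

The sort() wrapper is what worries me: rollup() needs its input ordered by the over field, and hashJoin() doesn't preserve any useful order for that, so the full sort (step 4) happens in memory on the worker.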
It's very likely that a field from 'b' will appear in both the user's search criteria and the results list. Therefore the merge (3) must occur before the aggregation (5), and the aggregation requires the sort (4). If this can't be done without a lot of hardware and memory, perhaps I'd be better off leaving the data denormalized and increasing indexing speed by sharding (with many more VMs). Will sharding increase re-indexing speed by a factor close to the number of shards (i.e., will a collection with 10 shards index roughly 10x faster than the same collection with one shard)?

Joel Bernstein wrote
> The tricky thing you have is a large join coupled with a reduce() group,
> which have different sorts.
>
> ...
>
>> Joel Bernstein wrote
>> > A few other things for you to consider:
>> >
>> > 1) How big are the joins?
>> > 2) How fast do they need to go?
>> > 3) How many queries need to run concurrently?
>> >
>> > #1 and #2 will dictate how many shards, replicas and parallel workers are
>> > needed to perform the join. #3 needs to be carefully considered because
>> > MapReduce distributed joins are not going to scale like traditional Solr
>> > queries.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288194.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288288.html
Sent from the Solr - User mailing list archive at Nabble.com.