The tricky thing you have is a large join coupled with a reduce() grouping, and the two operations require different sort orders.

If you have a big enough cluster, with enough workers, shards and replicas, you can make this work. For example, if you partition the large join across 30 workers, the hash join would fit in the available RAM across the workers. The same goes for the in-memory sort: partition it across 30 workers and you'd have enough RAM in total to do the sort.

The key to the performance is how fast you can export the data. For example, if you have a join that pulls 20 million documents from one collection and you want it to run in sub-second time, you have to be able to export faster than 20 million documents per second. You increase your export performance by adding shards, replicas and workers.
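To make the shape of this concrete, here's a rough, untested sketch of the kind of expression I mean (collection names, field names and the worker count are made up). Both sides of the join are partitioned on the join key, so each worker only has to hash and sort its own slice of the tuples:

  parallel(workerCollection,
           reduce(
             sort(
               hashJoin(
                 search(colA, q="*:*", fl="id,custId,amount",
                        sort="id asc", partitionKeys="id",
                        qt="/export"),
                 hashed=search(colB, q="*:*", fl="id,category",
                               sort="id asc", partitionKeys="id",
                               qt="/export"),
                 on="id"),
               by="custId asc"),
             by="custId",
             group(sort="amount desc", n="5")),
           workers="30",
           sort="custId asc")

The sort() wrapper is the in-memory sort I mentioned: hashJoin() emits tuples in the order of the left stream, so the joined tuples have to be re-sorted on the group-by field before reduce() sees them in the order it needs. One caveat: because the workers partition on the join key rather than the group-by field, each worker emits its own partial groups, so if the two fields differ you'd need to merge the partial groups on the client side or partition on the group-by field instead.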
There is a document in the works that describes how to scale the MapReduce architecture. I'll attempt to get this finished for the 6.2 release:
https://cwiki.apache.org/confluence/display/solr/MapReduce%2C+Shuffling+and+Worker+Collections

The 12 concurrent queries also need to be explored, because if you're doing 12 simultaneous large distributed joins, that will create significant network traffic.

The MapReduce functions of the Streaming API were originally designed for interactive data exploration on large data sets, so they serve the same use case as Hive, Pig or Impala. The non-MapReduce functions in the streaming expression library are designed for higher-QPS OLTP use cases; facet(), stats() and gatherNodes() are examples of functions that can operate at a higher QPS. But if you have a large enough cluster and the distributed joins are not too large, then you could probably run the distributed joins with a moderate level of QPS.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jul 21, 2016 at 11:08 AM, tedsolr <tsm...@sciquest.com> wrote:

> I can see I may need to rethink some things. I have two joins: one is 1 to 1
> (very large) and one is 1 to .03. A HashJoin may work on the smaller one.
> The large join looks like it may not be possible. I could get away with
> treating it as a filter somehow - I don't need the fields from the
> documents. Such as ... include col1 document (id=123) if col2 contains a
> document with id=123.
>
> This whole chain is a real-time user search. A 1-2 sec response would be
> ideal, but I'm sacrificing speed in order to get the reindexing to run much
> faster.
>
> Concurrency is low - like a dozen. Have you read any blogs on balancing #
> shards vs # replicas? Any guidelines on estimating the number of VMs this
> may require would be great.
>
>
> Joel Bernstein wrote
> > A few other things for you to consider:
> >
> > 1) How big are the joins?
> > 2) How fast do they need to go?
> > 3) How many queries need to run concurrently?
> >
> > #1 and #2 will dictate how many shards, replicas and parallel workers are
> > needed to perform the join. #3 needs to be carefully considered because
> > MapReduce distributed joins are not going to scale like traditional Solr
> > queries.