A few other things for you to consider:

1) How big are the joins?
2) How fast do they need to go?
3) How many queries need to run concurrently?
#1 and #2 will dictate how many shards, replicas and parallel workers are
needed to perform the join. #3 needs to be carefully considered because
MapReduce distributed joins are not going to scale like traditional Solr
queries.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jul 20, 2016 at 11:29 PM, Joel Bernstein <joels...@gmail.com> wrote:

> One of the things to consider is using a hashJoin on the first and second
> joins. If you have one large table and two smaller tables the hashJoin
> makes a lot of sense.
>
> One possible flow would be:
>
> parallel reduce to do the grouping
> hashJoin to the second table
> hashJoin to the third table
>
> The hashJoins can be done in parallel if the partitionKeys are the same as
> the partition keys for the reduce.
>
> To re-sort for the user-specified sort you can use the top() function to
> re-sort the top N results in a priority queue.
>
> The sort() should not be used if you don't have enough memory to sort the
> entire underlying stream.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jul 20, 2016 at 10:38 PM, tedsolr <tsm...@sciquest.com> wrote:
>
>> I'm hoping I'm just not using the streaming API correctly. I have about
>> 30M docs (~15 collections) in production right now that work well with
>> just 4GB of heap (no streaming). I can't believe streaming would choke
>> on my test data.
>>
>> I guess there are 2 primary requirements. Reindexing an entire collection
>> must be done overnight (so 8 hrs or so). And search must perform
>> reasonably well with grouped results, stats per group, and the grouping
>> based on a runtime dynamic set of fields. I have a plugin analytics query
>> that does the searching now that is passable (a couple of secs for 10M
>> results, 30 secs for 40M results), however it doesn't support sharded
>> collections. The reindexing (with atomic updates) takes ~5 hrs for 10M
>> docs, or 20 hrs for 40M docs. The problem is that I will soon have
>> customers with 100M docs, so my solution will not scale.
>>
>> To speed up reindexing I'm trying to normalize the data. The key fields
>> that get updated represent 1/30th the number of documents. Placing those
>> fields in a separate collection (unique values only) would allow
>> reindexing to finish much quicker. Search results must include data from
>> both collections (there could be a 3rd collection too, but I'm trying to
>> simplify here). Streaming requires like sorts for grouping and merging,
>> and these sorts are different. Then there's the user's sort preference to
>> consider. So what is the right way to stream (for example) grouped
>> results (on User, Date, Vendor, Manufacturer) with merged Vendor data,
>> then apply a final sort on Manufacturer?
>>
>>
>> Joel Bernstein wrote
>> > It's likely that the SortStream is the issue. With the sort function
>> > you need enough memory to sort all the tuples coming from the
>> > underlying stream. The sort stream can also be done in parallel so you
>> > can split the tuples from col1 across N worker nodes. This will give
>> > you faster sorting and apply more memory to the sort.
>> >
>> > Can you describe your exact use case? Perhaps we can think about a
>> > different Streaming flow that would work for you.
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288116.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
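
As a rough illustration only, the flow suggested above (parallel reduce for
the grouping, hashJoin to the smaller collection on each worker, top() for
the user's final sort) could look something like the streaming expression
below. The collection names (mainCollection, vendorCollection), the field
names (User, Date, Vendor, Manufacturer, amount), the worker count and the
result size are hypothetical placeholders rather than values from this
thread, and the expression is an untested sketch, not a verified query:

    top(n=500,
        sort="Manufacturer asc",
        parallel(mainCollection,
                 workers="4",
                 sort="User asc,Date asc,Vendor asc,Manufacturer asc",
                 hashJoin(
                   reduce(
                     search(mainCollection,
                            q="*:*",
                            qt="/export",
                            fl="User,Date,Vendor,Manufacturer,amount",
                            sort="User asc,Date asc,Vendor asc,Manufacturer asc",
                            partitionKeys="User,Date,Vendor,Manufacturer"),
                     by="User,Date,Vendor,Manufacturer",
                     group(sort="amount desc", n="1")),
                   hashed=search(vendorCollection,
                                 q="*:*",
                                 qt="/export",
                                 fl="Vendor,vendorName",
                                 sort="Vendor asc"),
                   on="Vendor")))

Here search() is sorted and partitioned on the same fields that reduce()
groups by, so both the grouping and the hashJoin run on the worker nodes.
The hashed vendorCollection stream is read into memory on each worker,
which is why hashJoin only suits the smaller collection (a third collection
could be joined by nesting another hashJoin the same way), and top()
re-sorts just the final N tuples by the user's chosen field instead of
sorting the whole stream.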