One thing to consider is using a hashJoin for both the first and second joins. If you have one large table and two smaller tables, a hashJoin makes a lot of sense.

One possible flow would be:

parallel reduce to do the grouping
hashJoin to the second table
hashJoin to the third table

The hashJoins can be done in parallel if the partitionKeys are the same as the partition keys for the reduce. To re-sort for the user-specified sort you can use the top() function, which re-sorts the top N results in a priority queue. The sort() function should not be used if you don't have enough memory to sort the entire underlying stream.
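To make that concrete, here is a rough sketch of what the expression could look like, assuming you send streaming expressions to the /stream handler (the same flow can also be built with the SolrJ streaming classes). Everything specific in it is a placeholder: col1 as the large collection, vendors and manufacturers as the two smaller collections, the field names (userId, date, vendorId, mfgId, amount, vendorName, mfgName), the worker count, the n value, and the zkHost would all need to be swapped for your own schema and cluster:

  top(n="500",
      sort="mfgName asc",
      parallel(col1,
        hashJoin(
          hashJoin(
            reduce(
              search(col1,
                     q="*:*",
                     fl="userId,date,vendorId,mfgId,amount",
                     sort="userId asc,date asc,vendorId asc,mfgId asc",
                     partitionKeys="userId,date,vendorId,mfgId",
                     qt="/export"),
              by="userId,date,vendorId,mfgId",
              group(sort="amount desc", n="1")),
            hashed=search(vendors,
                          q="*:*",
                          fl="vendorId,vendorName",
                          sort="vendorId asc",
                          qt="/export"),
            on="vendorId"),
          hashed=search(manufacturers,
                        q="*:*",
                        fl="mfgId,mfgName",
                        sort="mfgId asc",
                        qt="/export"),
          on="mfgId"),
        workers="4",
        sort="userId asc,date asc,vendorId asc,mfgId asc",
        zkHost="zk1:2181"))

Only the big col1 stream carries partitionKeys; the two hashed searches are read in full on each worker and held in hash tables, which is why they should be the smaller collections. The outer top() then applies the user's sort (Manufacturer in your example) while only keeping N tuples in its priority queue, instead of sorting the entire joined stream.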
Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jul 20, 2016 at 10:38 PM, tedsolr <tsm...@sciquest.com> wrote:
> I'm hoping I'm just not using the streaming API correctly. I have about 30M
> docs (~ 15 collections) in production right now that work well with just 4GB
> of heap (no streaming). I can't believe streaming would choke on my test
> data.
>
> I guess there are 2 primary requirements. Reindexing an entire collection
> must be done over night (so 8 hrs or so). And search must perform reasonably
> well with grouped results, stats per group, and the grouping based on a
> runtime dynamic set of fields. I have a plugin analytics query that does the
> searching now that is passable (couple secs for 10M results, 30 secs for 40M
> results), however it doesn't support sharded collections. The reindexing
> (with atomic updates) takes ~5 hrs for 10M docs, or 20 hrs for 40M docs. The
> problem is that I will soon have customers with 100M docs so my solution
> will not scale.
>
> To speed up reindexing I'm trying to normalize the data. The key fields that
> get updated represent 1/30th the number of documents. Placing those fields
> in a separate collection (unique values only) would allow reindexing to
> finish much quicker. Search results must include data from both collections
> (there could be a 3rd collection too but I'm trying to simplify here).
> Streaming requires like sorts for grouping and merging, and these sorts are
> different. Then there's the user's sort preference to consider. So what is
> the right way to stream (for example) grouped results (on User, Date,
> Vendor, Manufacturer) with merged Vendor data, then apply a final sort on
> Manufacturer?
>
>
> Joel Bernstein wrote
> > It's likely that the SortStream is the issue. With the sort function you
> > need enough memory to sort all the tuples coming from the underlying
> > stream. The sort stream can also be done in parallel so you can split the
> > tuples from col1 across N worker nodes. This will give you faster sorting
> > and apply more memory to the sort.
> >
> > Can you describe your exact use case? Perhaps we can think about a
> > different Streaming flow that would work for you.