A few other things for you to consider:

1) How big are the joins?
2) How fast do they need to go?
3) How many queries need to run concurrently?

#1 and #2 will dictate how many shards, replicas, and parallel workers are
needed to perform the join. #3 needs to be considered carefully, because
MapReduce-style distributed joins will not scale with query concurrency
the way traditional Solr queries do.
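
To make those knobs concrete, here is a rough sketch of a parallel join
(collection and field names are placeholders, not from your setup):

  parallel(workerCollection,
           innerJoin(
             search(people, q="*:*", fl="personId,name",
                    sort="personId asc", partitionKeys="personId",
                    qt="/export"),
             search(pets, q="*:*", fl="petName,personId",
                    sort="personId asc", partitionKeys="personId",
                    qt="/export"),
             on="personId"),
           workers="8",
           sort="personId asc")

The shard and replica counts come from the collection layout, the worker
count from the workers parameter, and each concurrent query opens its own
set of worker and /export streams, which is why #3 matters.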

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jul 20, 2016 at 11:29 PM, Joel Bernstein <joels...@gmail.com> wrote:

> One thing to consider is using a hashJoin for the first and second
> joins. If you have one large table and two smaller tables, the hashJoin
> makes a lot of sense.
>
> One possible flow would be:
>
> parallel reduce to do the grouping
> hashJoin to the second table
> hashJoin to the third table
>
> The hashJoins can be done in parallel if their partitionKeys are the same
> as the partitionKeys used for the reduce.
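>
> For example, a rough sketch of that flow (collection and field names are
> made up, not from a real schema):
>
>   parallel(workerCollection,
>     hashJoin(
>       hashJoin(
>         reduce(
>           search(bigCollection, q="*:*",
>                  fl="id,groupKey,vendorId,mfgId",
>                  sort="groupKey asc", partitionKeys="groupKey",
>                  qt="/export"),
>           by="groupKey",
>           group(sort="groupKey asc", n="100")),
>         hashed=search(vendors, q="*:*", fl="vendorId,vendorName",
>                       sort="vendorId asc", qt="/export"),
>         on="vendorId"),
>       hashed=search(manufacturers, q="*:*", fl="mfgId,mfgName",
>                     sort="mfgId asc", qt="/export"),
>       on="mfgId"),
>     workers="4",
>     sort="groupKey asc")
>
> The hashed streams are read fully into memory on each worker, which is
> why this only makes sense when the joined tables are the smaller ones.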
>
> To re-sort by the user-specified sort, you can use the top() function,
> which re-sorts the top N results in a priority queue.
>
> The sort() should not be used if you don't have enough memory to sort the
> entire underlying stream.
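>
> For example, wrapping the join sketch above for a user-specified sort on
> a made-up mfgName field (substitute the full expression for the
> placeholder):
>
>   top(n="500",
>       <the parallel/hashJoin expression above>,
>       sort="mfgName asc")
>
> top() only keeps the N highest-ranked tuples in its priority queue, so it
> needs far less memory than sort() on a large stream.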
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jul 20, 2016 at 10:38 PM, tedsolr <tsm...@sciquest.com> wrote:
>
>> I'm hoping I'm just not using the streaming API correctly. I have about
>> 30M docs (~15 collections) in production right now that work well with
>> just 4GB of heap (no streaming). I can't believe streaming would choke
>> on my test data.
>>
>> I guess there are two primary requirements. Reindexing an entire
>> collection must be done overnight (so 8 hrs or so). And search must
>> perform reasonably well with grouped results, stats per group, and the
>> grouping based on a runtime-dynamic set of fields. I have a plugin
>> analytics query that does the searching now that is passable (a couple
>> of seconds for 10M results, 30 seconds for 40M results); however, it
>> doesn't support sharded collections. The reindexing (with atomic
>> updates) takes ~5 hrs for 10M docs, or 20 hrs for 40M docs. The problem
>> is that I will soon have customers with 100M docs, so my solution will
>> not scale.
>>
>> To speed up reindexing I'm trying to normalize the data. The key fields
>> that get updated represent only about 1/30th the number of documents, so
>> placing those fields in a separate collection (unique values only) would
>> allow reindexing to finish much more quickly. Search results must
>> include data from both collections (there could be a third collection
>> too, but I'm trying to simplify here). Streaming requires matching sort
>> orders for grouping and merging, and these sorts are different. Then
>> there's the user's sort preference to consider. So what is the right way
>> to stream (for example) grouped results (on User, Date, Vendor,
>> Manufacturer) with merged Vendor data, and then apply a final sort on
>> Manufacturer?
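>>
>> For concreteness, a simplified sketch of the kind of join I mean (the
>> collection and field names are made up):
>>
>>   innerJoin(
>>     search(main, q="*:*", fl="id,User,Date,VendorId,Manufacturer",
>>            sort="VendorId asc", qt="/export"),
>>     search(vendors, q="*:*", fl="VendorId,VendorName",
>>            sort="VendorId asc", qt="/export"),
>>     on="VendorId")
>>
>> The join needs both streams sorted on VendorId, the grouping needs a
>> sort on User, Date, Vendor, Manufacturer, and the user may ask for yet
>> another sort, which is where I get stuck.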
>>
>>
>>
>>
>> Joel Bernstein wrote
>> > It's likely that the SortStream is the issue. With the sort function
>> > you need enough memory to sort all the tuples coming from the
>> > underlying stream. The sort stream can also be done in parallel so you
>> > can split the tuples from col1 across N worker nodes. This will give
>> > you faster sorting and apply more memory to the sort.
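>> >
>> > For example, a sketch of a parallel sort (col1 and a_s are placeholder
>> > names):
>> >
>> >   parallel(workerCollection,
>> >            sort(search(col1, q="*:*", fl="id,a_s",
>> >                        sort="id asc", partitionKeys="a_s",
>> >                        qt="/export"),
>> >                 by="a_s asc"),
>> >            workers="4",
>> >            sort="a_s asc")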
>> >
>> > Can you describe your exact use case? Perhaps we can think about a
>> > different Streaming flow that would work for you.
>>
>>
>
>
