The tricky thing you have is a large join coupled with a reduce() grouping, and the two operations require different sort orders.

If you have a big enough cluster, with enough workers, shards and replicas, you can make this work. For example, if you partition the large join across 30 workers, the hash join would fit in the available RAM across the workers. The same goes for the in-memory sort: partition it across 30 workers and you'd have enough RAM in total to do the sort.

The key to the performance is how fast you can export the data. For example, if you have a join that pulls 20 million documents from one collection and you want it to run in sub-second time, you have to be able to export faster than 20 million documents per second. You increase your export performance by adding shards, replicas and workers.
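To make the shape of this concrete, here's a rough, untested sketch of the kind of expression I mean (collection names, field names and the worker count are made up). Both sides of the join are partitioned on the join key, so each worker only has to hash and sort its own slice of the tuples:

  parallel(workerCollection,
           reduce(
             sort(
               hashJoin(
                 search(colA, q="*:*", fl="id,custId,amount",
                        sort="id asc", partitionKeys="id",
                        qt="/export"),
                 hashed=search(colB, q="*:*", fl="id,category",
                               sort="id asc", partitionKeys="id",
                               qt="/export"),
                 on="id"),
               by="custId asc"),
             by="custId",
             group(sort="amount desc", n="5")),
           workers="30",
           sort="custId asc")

The sort() wrapper is the in-memory sort I mentioned: hashJoin() emits tuples in the order of the left stream, so the joined tuples have to be re-sorted on the group-by field before reduce() sees them in the order it needs. One caveat: because the workers partition on the join key rather than the group-by field, each worker emits its own partial groups, so if the two fields differ you'd need to merge the partial groups on the client side or partition on the group-by field instead.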
There is a document in the works that describes how to scale the MapReduce architecture. I'll attempt to get this finished for the 6.2 release:
https://cwiki.apache.org/confluence/display/solr/MapReduce%2C+Shuffling+and+Worker+Collections

The 12 concurrent queries also need to be explored, because if you're doing 12 simultaneous large distributed joins, that will create significant network traffic.

The MapReduce functions of the Streaming API were originally designed for interactive data exploration on large data sets, so they serve the same use case as Hive, Pig or Impala. The non-MapReduce functions in the streaming expression library are designed for higher-QPS OLTP use cases; facet(), stats() and gatherNodes() are examples of functions that can operate at a higher QPS. But if you have a large enough cluster and the distributed joins are not too large, then you could probably run the distributed joins with a moderate level of QPS.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jul 21, 2016 at 11:08 AM, tedsolr <tsm...@sciquest.com> wrote:

> I can see I may need to rethink some things. I have two joins: one is 1 to 1
> (very large) and one is 1 to .03. A HashJoin may work on the smaller one.
> The large join looks like it may not be possible. I could get away with
> treating it as a filter somehow - I don't need the fields from the
> documents. Such as ... include col1 document (id=123) if col2 contains a
> document with id=123.
>
> This whole chain is a real-time user search. A 1-2 sec response would be
> ideal, but I'm sacrificing speed in order to get the reindexing to run much
> faster.
>
> Concurrency is low - like a dozen. Have you read any blogs on balancing #
> shards vs # replicas? Any guidelines on estimating the number of VMs this
> may require would be great.
>
>
> Joel Bernstein wrote
> > A few other things for you to consider:
> >
> > 1) How big are the joins?
> > 2) How fast do they need to go?
> > 3) How many queries need to run concurrently?
> >
> > #1 and #2 will dictate how many shards, replicas and parallel workers are
> > needed to perform the join. #3 needs to be carefully considered because
> > MapReduce distributed joins are not going to scale like traditional Solr
> > queries.