Jan: Hmmm, must Solr do the work? On some level it seems easier if your middle layer (behind your single IP) has 10 CloudSolrClient thread pools, one for each cluster, and just merges the docs when it gets them back. That would take care of all the goodness of internal LBs and all that.
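For illustration, that middle layer could look roughly like the sketch below (SolrJ 7.x-era API; the zkHosts list, the collection name "docs", and merging by raw score are all assumptions for the example, not anything from this thread):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrDocument;

public class FederatedSearcher {
  private final List<CloudSolrClient> clients = new ArrayList<>();
  private final ExecutorService pool = Executors.newFixedThreadPool(10);

  public FederatedSearcher(List<String> zkHosts) {
    // One client per cluster, each pointed at that cluster's own ZK ensemble.
    for (String zk : zkHosts) {
      clients.add(new CloudSolrClient.Builder(
          Collections.singletonList(zk), Optional.empty()).build());
    }
  }

  public List<SolrDocument> search(String q, int rows) throws Exception {
    SolrQuery query = new SolrQuery(q).setRows(rows).setFields("*", "score");
    // Fan the query out to all clusters in parallel.
    List<Future<List<SolrDocument>>> futures = new ArrayList<>();
    for (CloudSolrClient client : clients) {
      futures.add(pool.submit(() ->
          new ArrayList<>(client.query("docs", query).getResults())));
    }
    // Gather per-cluster results and re-rank the union by score
    // (assumes scores are comparable across clusters).
    List<SolrDocument> merged = new ArrayList<>();
    for (Future<List<SolrDocument>> f : futures) {
      merged.addAll(f.get());
    }
    merged.sort((a, b) -> Float.compare(
        (Float) b.getFieldValue("score"), (Float) a.getFieldValue("score")));
    return merged.subList(0, Math.min(rows, merged.size()));
  }
}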
Somewhere you have to know about all 10 ZK ensembles (or the IP address of one of your Solr instances in each cluster, or....). You're talking about building that into a SearchComponent, but would a simple client work as well? It wouldn't even have to be on a node hosting Solr.

Streaming doesn't really seem to fit the bill, I don't think. It's built to handle large result sets, and it doesn't sound like this is that. It _used_ to read to the end of the stream even when closed; that's been fixed, but check your version (I don't have the JIRA number offhand).

If you use a "shards" approach, don't you then have a single point of failure if it goes to just "some Solr node" in each of the other collections? I admit I just scanned your post and haven't thought about it very deeply....

Erick

On Tue, Jan 30, 2018 at 8:09 AM, Jan Høydahl <jan....@cominvent.com> wrote:
> Hi,
>
> A customer has 10 separate SolrCloud clusters with the same schema across
> all, but different content. Now they want users in each location to be
> able to federate a search across all locations. Each location is 100%
> independent, with separate ZK etc. Bandwidth and latency between the
> clusters are not an issue; they are actually in the same physical
> datacenter.
>
> My first thought was using a custom &shards parameter and letting the
> receiving node fan out to all shards of all clusters. We'd need to contact
> the ZK for each environment, find all shards and replicas participating in
> the collection, and then construct the shards=A1|A2,B1|B2… string, which
> would be quite big, but if we get it right, it should "just work".
>
> Now, my question is whether there are other, smarter ways that would leave
> it up to existing Solr logic to select shards and load balance, and that
> would also take into account any shard.keys/_route_ info etc. I thought of
> these:
> * &collection=collA,collB — but it only supports collections local to one
> cloud
> * Create a collection ALIAS to point to all 10 — but same here, only local
> to one cluster
> * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want
> it with the pure search API
> * Write a custom ShardHandler plugin that knows about all clusters — but
> this is complex stuff :)
> * Write a custom SearchComponent plugin that knows about all clusters and
> adds the &shards= param
>
> Another approach would be for the originating cluster to fan out just ONE
> request to each of the other clusters and then write some SearchComponent
> to merge those responses. That would let us query the other clusters using
> one LB IP address instead of requiring full visibility to all Solr nodes
> of all clusters, but if we don't need that isolation, that extra merge
> code seems fairly complex.
>
> So far I opt for the custom SearchComponent and &shards= param approach.
> Any useful input from someone who tried a similar approach would be
> priceless!
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
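For reference, the "contact the ZK for each environment and find all shards and replicas" step Jan describes might look roughly like this (SolrJ 7.x-era API; liveness/state filtering is omitted, and whether &shards= entries should be full core URLs or host:port/solr/core entries depends on your version, so treat getCoreUrl() here as an assumption):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ShardsParamBuilder {
  // Builds the big shards=A1|A2,B1|B2,... string across all clusters.
  public static String buildShardsParam(List<CloudSolrClient> clients,
                                        String collection) {
    List<String> shardGroups = new ArrayList<>();
    for (CloudSolrClient client : clients) {
      client.connect();
      DocCollection coll = client.getZkStateReader()
          .getClusterState().getCollection(collection);
      for (Slice slice : coll.getSlices()) {
        List<String> replicaUrls = new ArrayList<>();
        for (Replica replica : slice.getReplicas()) {
          // A real implementation would skip replicas that are not ACTIVE
          // or whose node is not in the live_nodes list.
          replicaUrls.add(replica.getCoreUrl());
        }
        // Replicas of one shard are alternatives: joined with "|".
        shardGroups.add(String.join("|", replicaUrls));
      }
    }
    // Distinct shards must all be queried: joined with ",".
    return String.join(",", shardGroups);
  }
}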
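And the SearchComponent Jan leans toward could, at its core, be as small as the sketch below; keeping shardsString fresh (ZK watches or a periodic refresh against the ten ensembles, e.g. via the builder above) is the real work and is omitted here:

import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.ShardParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class CrossClusterShardsComponent extends SearchComponent {
  // Placeholder; would be refreshed from the ten ZK ensembles.
  private volatile String shardsString = "";

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // Only inject if the request didn't already specify shards.
    if (rb.req.getParams().get(ShardParams.SHARDS) == null) {
      ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
      params.set(ShardParams.SHARDS, shardsString);
      rb.req.setParams(params); // the node now fans out to every cluster's shards
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Nothing per-shard; the standard QueryComponent does the merging.
  }

  @Override
  public String getDescription() {
    return "Injects a cross-cluster shards parameter";
  }
}

The component would be declared with <searchComponent> in solrconfig.xml and listed in the handler's first-components so its prepare() runs before QueryComponent.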