RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

Ephraim Ofir Wed, 06 Apr 2011 04:59:18 -0700

Hi all,
I'd love to share the diagram, I'm just not sure how to do that on the list
(it's a Word document that I tried to send as an attachment).


Jens, to answer your questions:
1. Correct, in our setup the source of the data is a DB from which we
pull the data using DIH (search the list for my previous post "DIH -
deleting documents, high performance (delta) imports, and passing
parameters" if you want info about that).  We were lucky enough to have
the data sharded at the DB level before we started using Solr, so using
the same shards was an easy extension.  Note that we're not (yet...)
using SolrCloud, it was just something I thought you should consider.
2. I got the idea for the "aggregator" from the Solr book (PACKT).  I
don't remember if that term was used in the book or if I made it up (if
Google doesn't know it, I probably made it up...), but I think it conveys
what this part of the puzzle does.  As you said, this is simply a Solr
instance which doesn't hold its own index, but shares the same schema as
the slaves and masters.  I actually defined the default query handler on
this instance to include the shards parameter (see below), so the client
doesn't have to know anything about the internal workings of the sharded
setup, it just hits the aggregator load balancer with a regular query
and everything is handled behind the scenes.  This simplifies the client
and allows me to change the architecture in the future (e.g. change the
number of shards or their DNS name) without requiring a client change.

Sharded query handler:

  <requestHandler name="sharded" class="solr.SearchHandler"
                  default="${aggregator:false}">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="shards">${slaveUrls:null}</str>
    </lst>
  </requestHandler>
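
To illustrate what this means for the client, here is a minimal Python sketch of the two query styles: one where the client lists the shards itself, and one where it relies on the handler defaults above. The host names, port and `/solr/select` path are assumptions made up for illustration and are not taken from the setup described in this thread; only the `shards` parameter comes from the configuration above.

```python
from urllib.parse import urlencode

# Hypothetical host names and paths, for illustration only.
AGGREGATOR = "http://aggregator-lb.example.com:8983/solr/select"
SHARDS = [
    "shard1-lb.example.com:8983/solr",
    "shard2-lb.example.com:8983/solr",
]


def explicit_shard_query(q):
    """Build a query URL where the client itself must list every shard."""
    params = {"q": q, "shards": ",".join(SHARDS)}
    return AGGREGATOR + "?" + urlencode(params)


def aggregator_query(q):
    """Build a query URL that relies on the handler's default `shards`."""
    return AGGREGATOR + "?" + urlencode({"q": q})


if __name__ == "__main__":
    print(explicit_shard_query("title:solr"))
    print(aggregator_query("title:solr"))
```

With the second style, changing the number of shards only touches the server-side property, which is the point of putting `shards` into the handler defaults.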

All of our Solr instances share the same configs (solrconfig.xml,
schema.xml, etc.) and different instances take different roles according
to properties defined in solr.xml which is generated by a script
specifically for each Solr instance (the script has a "map" of which
instances should be on which host, and has to be run once on each host).
In this case, this is how the generated solr.xml looks:

<solr sharedLib="../lib" persistent="true">
   <!-- Just a name that appears in Solr management, to make it easier
        to know which instance you're on. -->
   <property name="name" value="aggregator" />

   <!-- Tells the instance that it is an aggregator, so it should use the
        sharded request handler by default. Masters and slaves have
        master/slave accordingly to define replication, a regular default
        search handler for slaves, and DIH on masters. -->
   <property name="aggregator" value="true" />

   <!-- Used by instances which are shards in order to determine which DB
        they should import from (masters) and which master they should
        replicate from (slaves). -->
   <property name="shardID" value="" />

   <!-- Used by the sharded request handler. -->
   <property name="slaveUrls" value="long,list.of,shard.urls" />

   <!-- Used by the load balancer to know if this instance is alive. -->
   <property name="HealthCheckDir" value="/data/servers/xxxxx_solr/aggregator/core0/conf" />

   <cores adminPath="/admin/cores" defaultCoreName="prod">
      <!-- Just one core for this instance; indexers have 2 cores, one
           prod and one for full reindex. -->
      <core name="prod" instanceDir="core0/"/>
   </cores>
</solr>
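
The generator script itself is not shown in this thread, so the following is only a rough Python sketch of the idea: a map of hosts to instances, and a function that emits a solr.xml per instance. The host names, the `HOST_MAP` structure, the `master` property name and the function names are all hypothetical; the `HealthCheckDir` property is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical host-to-instances map; the real map is not shown in this thread.
HOST_MAP = {
    "host-a": [{"name": "aggregator", "aggregator": "true", "shardID": ""}],
    # "master" as a property name is an assumption for illustration.
    "host-b": [{"name": "master1", "master": "true", "shardID": "1"}],
}

# Placeholder value, copied from the solr.xml example above.
SLAVE_URLS = "long,list.of,shard.urls"


def build_solr_xml(instance):
    """Return the solr.xml content for one instance as a string."""
    solr = ET.Element("solr", sharedLib="../lib", persistent="true")
    # One <property> element per role setting of this instance.
    for key, value in instance.items():
        ET.SubElement(solr, "property", name=key, value=value)
    # Only aggregators need the shard list for the sharded handler.
    if instance.get("aggregator") == "true":
        ET.SubElement(solr, "property", name="slaveUrls", value=SLAVE_URLS)
    cores = ET.SubElement(solr, "cores", adminPath="/admin/cores",
                          defaultCoreName="prod")
    ET.SubElement(cores, "core", name="prod", instanceDir="core0/")
    return ET.tostring(solr, encoding="unicode")


def configs_for_host(host):
    """Map instance name to solr.xml text for every instance on `host`."""
    return {inst["name"]: build_solr_xml(inst) for inst in HOST_MAP[host]}
```

Running `configs_for_host("host-a")` once on each host would give the per-instance files to write out; because solrconfig.xml and schema.xml stay identical everywhere, only these properties differ between roles.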


Let me know if I can assist any further.
Ephraim Ofir


-----Original Message-----
From: Jonathan DeMello [mailto:demello....@googlemail.com] 
Sent: Wednesday, April 06, 2011 8:58 AM
To: solr-user@lucene.apache.org
Cc: Isan Fulia; Tirthankar Chatterjee
Subject: Re: FW: Very very large scale Solr Deployment = how to do
(Expert Question)?

I third that request.

Would greatly appreciate taking a look at that diagram!

Regards,

Jonathan

On Wed, Apr 6, 2011 at 9:12 AM, Isan Fulia <isan.fu...@germinait.com>
wrote:

> Hi Ephraim/Jens,
>
> Can you share that diagram with everyone? It may really help all of us.
> Thanks,
> Isan Fulia.
>
> On 6 April 2011 10:15, Tirthankar Chatterjee <tchatter...@commvault.com>
> wrote:
>
> > Hi Jens,
> > Can you please also forward the diagram attachment that Ephraim sent?
> > :-)
> > Thanks,
> > Tirthankar
> >
> > -----Original Message-----
> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > Sent: Tuesday, April 05, 2011 10:30 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: FW: Very very large scale Solr Deployment = how to do
> > (Expert Question)?
> >
> > Hello Ephraim,
> >
> > thank you so much for the great Document/Scaling-Concept!!
> >
> > First, I think you really should publish this on the Solr wiki. This
> > approach is not documented anywhere there and is not really obvious for
> > newbies, and your document is great and explains it very well!
> >
> > Please allow me two further questions regarding your document:
> > 1.) Is it correct that by "DB" you mean the original data source of the
> > data that is fed into the Solr "Cloud" for searching?
> >
> > 2.) Solr Aggregator: This term did not yield any Google results, but it
> > is a very important aspect of your design (and this was the missing
> > piece for me when thinking about Solr architectures): Is it correct that
> > the "aggregators" are simply Tomcat instances with the Solr webapp
> > deployed? These aggregators do not have their own index but only run the
> > Solr webapp, and I access them via the ?shards= parameter, giving the
> > shards I want to query? (So in the end they aggregate the data of the
> > shards but do not have their own data.) This is really an important
> > aspect that is not documented well enough in the Solr documentation.
> >
> > Thank you very much!
> > Jens
> >
> >
> > 2011/4/5 Ephraim Ofir <ephra...@icq.com>
> >
> > > > of course the attachment didn't get to the list, so here it is if you
> > > > want it...
> > >
> > > Ephraim Ofir
> > >
> > >
> > > -----Original Message-----
> > > From: Ephraim Ofir
> > > Sent: Tuesday, April 05, 2011 10:20 AM
> > > To: 'solr-user@lucene.apache.org'
> > > > Subject: RE: Very very large scale Solr Deployment = how to do
> > > > (Expert Question)?
> > >
> > > > I'm not sure about the scale you're aiming for, but you probably want
> > > > to do both sharding and replication.  There's no central server which
> > > > would be the bottleneck. The guidelines should probably be something
> > > > like:
> > > > 1. Split your index into enough shards so it can keep up with the
> > > > update rate.
> > > > 2. Have enough replicas of each shard master to keep up with the
> > > > rate of queries.
> > > > 3. Have enough aggregators in front of the shard replicas so the
> > > > aggregation doesn't become a bottleneck.
> > > > 4. Make sure you have good load balancing across your system.
> > > >
> > > > Attached is a diagram of the setup we have.  You might want to look
> > > > into SolrCloud as well.
> > >
> > > Ephraim Ofir
> > >
> > >
> > > -----Original Message-----
> > > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > > Sent: Tuesday, April 05, 2011 4:25 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Very very large scale Solr Deployment = how to do (Expert
> > > Question)?
> > >
> > > Hello Experts,
> > >
> > >
> > >
> > > > I am a Solr newbie but have read quite a lot of docs. I still do not
> > > > understand what would be the best way to set up very large scale
> > > > deployments:
> > >
> > >
> > >
> > > > Goal (theoretical):
> > > >
> > > >  A) Index size: 1 petabyte (one document is about 5 KB in size)
> > > >
> > > >  B) Queries: 100,000 queries per second
> > > >
> > > >  C) Updates: 100,000 updates per second
> > >
> > >
> > >
> > >
> > > Solr offers:
> > >
> > > 1.)    Replication => Scales Well for B)  BUT  A) and C) are not
> > > satisfied
> > >
> > >
> > > 2.)    Sharding => Scales well for A) BUT B) and C) are not
satisfied
> > > (=> As
> > > I understand the Sharding approach all goes through a central
server,
> > > that dispatches the updates and assembles the quries retrieved
from
> > > the different shards. But this central server has also some
capacity
> > > limits...)
> > >
> > >
> > >
> > >
> > > > What is the right approach to handle such large deployments? I would
> > > > be thankful for just a rough sketch of the concepts so I can
> > > > experiment/search further...
> > > >
> > > >
> > > > Maybe I am missing something very trivial, as I think some of the
> > > > "Solr Users/Use Cases" on the homepage are that kind of large
> > > > deployment. How are they implemented?
> > >
> > >
> > >
> > > > Thank you very much!!!
> > >
> > > Jens
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Isan Fulia.
>
