Just a quick comment re LinkedIn's stuff. You can look at Zoie (also covered
in Lucene in Action 2), but you may be more interested in Sensei.

And yes, big systems like that need sharding and replication: multiple
masters and lots of slaves.
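
For reference, a minimal sketch of the stock Solr master/slave replication
handler (solrconfig.xml). The host name, port and poll interval are
placeholders to adapt to your own topology; in a sharded setup each shard
would get its own master plus a pool of slaves configured like this:

  <!-- master solrconfig.xml: publish a new index version after every commit -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- slave solrconfig.xml: poll the master and pull new index versions -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:00:20</str>
    </lst>
  </requestHandler>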

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Jens Mueller <supidupi...@googlemail.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, April 7, 2011 1:29:40 AM
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert
> Question)?
> 
> Hello Ephraim, hello Lance, hello Walter,
> 
> thanks for your replies:
> 
> Ephraim, thanks very much for the further detailed explanation. I will try
> to set up a demo system in the next few days and use your advice.
> Load balancers are an important aspect of your design. Can you recommend one
> LB specifically? (I would be using haproxy.1wt.eu). I think the idea of
> uploading your document is very good. However, Google Docs did not seem to
> be working (at least for me, with the docx format?), but maybe you can simply
> export the document as a PDF, and then I think Google Docs will work, so all
> the others can also have a look at your concept. The best approach would be
> if you could upload your advice directly to the Solr wiki, as it is
> really helpful. I found some other documents in the meantime, but yours is much
> clearer and more complete, with the LBs and the aggregators (
> http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)
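
A minimal haproxy sketch for balancing queries across a pool of Solr
searchers; the server names, IPs and ports are placeholders, and a production
config would also set timeouts and logging:

  defaults
      mode http
      option httpchk GET /solr/admin/ping   # Solr's ping handler (if enabled) as health check

  frontend solr_front
      bind *:8080
      default_backend solr_searchers

  backend solr_searchers
      balance roundrobin
      server searcher1 10.0.0.11:8983 check
      server searcher2 10.0.0.12:8983 check
      server searcher3 10.0.0.13:8983 check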
> 
> Lance, thanks, I will have a look at what LinkedIn is doing.
> 
> Walter, thanks for the advice: Well, you are right to mention Google. My
> question was also about understanding how such large systems as Google/Facebook
> actually work. So my numbers are just theoretical and made up. My
> system will be smaller, but I would be very happy to understand how such
> large systems are built, and I think the approach Ephraim showed should
> work quite well at large scale. If you know a good document (besides the
> Bigtable research paper, which I already know) that technically describes how
> Google works in detail, that would be of great interest. You seem to be
> working for a company that handles large datasets. Does Google use this
> approach, sharding the index across N writers, with the produced index then
> replicated to N "read only searchers"?
> 
> thank you all.
> best regards
> jens
> 
> 
> 
> 2011/4/7 Walter Underwood <wun...@wunderwood.org>
> 
> > The bigger answer is that you cannot get to this size by just configuring
> > Solr. You may have to invent a lot of stuff. Like all of Google.
> >
> > Where did you get these numbers? The proposed query rate is twice as big as
> > Google's (Feb 2010 estimate, 34K qps).
> >
> > I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> > and query rates. If you want a real system that handles that, you might want
> > to look at our product.
> >
> > wunder
> >
> > On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
> >
> > > I would not use replication. LinkedIn consumer search is a flat system
> > > where one process indexes new entries and does queries simultaneously.
> > > It's a custom Lucene app called Zoie. Their stuff is on GitHub.
> > >
> > > I would get documents to indexers via a multicast IP-based queueing
> > > system. This scales very well and there's a lot of hardware support.
> > >
> > > The problem with distributed search is that it is a) inherently slower
> > > and b) has inherently more and longer jitter. The "airplane wing"
> > > distribution of query times becomes longer and flatter.
> > >
> > > This is going to have to be a "federated" system, where the front-end
> > > app aggregates results rather than Solr.
> >  >
> > > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller <supidupi...@googlemail.com>
> > > wrote:
> > >> Hello Experts,
> > >>
> > >> I am a Solr newbie but have read quite a lot of docs. I still do not
> > >> understand what would be the best way to set up very large scale
> > >> deployments:
> > >>
> > >> Goal (theoretical):
> > >>
> > >> A) Index size: 1 petabyte (1 document is about 5 KB in size)
> > >> B) Queries: 100,000 queries per second
> > >> C) Updates: 100,000 updates per second
> > >>
> > >> Solr offers:
> > >>
> > >> 1.) Replication => scales well for B), BUT A) and C) are not satisfied
> > >>
> > >> 2.) Sharding => scales well for A), BUT B) and C) are not satisfied
> > >> (=> As I understand the sharding approach, everything goes through a
> > >> central server that dispatches the updates and assembles the queries
> > >> retrieved from the different shards. But this central server also has
> > >> some capacity limits...)
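
For illustration, a distributed query in Solr is just a normal request with a
shards parameter listing the cores to fan out to (the host names below are
placeholders):

  http://any-solr-node:8983/solr/select?q=solr
      &shards=shard1-host:8983/solr,shard2-host:8983/solr,shard3-host:8983/solr

The node that receives the request scatters it to the listed shards and merges
the responses, but any node can play that role, so the aggregator itself can
sit behind a load balancer rather than being a single central server.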
> > >>
> > >> What is the right approach to handle such large deployments? I would be
> > >> thankful for just a rough sketch of the concepts so I can experiment/search
> > >> further…
> > >>
> > >> Maybe I am missing something very trivial, as I think some of the "Solr
> > >> Users/Use Cases" on the homepage are that kind of large deployment. How
> > >> are they implemented?
> > >>
> > >> Thank you very much!!!
> > >>
> > >> Jens
> > >>
> > >
> >
> >
> >
> >
> >
>
