From what I can tell from the overview on http://katta.wiki.sourceforge.net/,
it's a partial replication of Solr/Nutch functionality, plus some goodies.
It might have been better to work those goodies into some friendly contrib/,
be it Solr, Nutch, Hadoop, or Lucene.  Anyhow, let's see what happens there! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 5:37:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
> 
> Hi Otis,
> 
> >You can't believe how much it pains me to see such a nice piece of
> >work live so separately.  But I also think I know why it happened
> >:(.  Do you know if Stefan & Co. intend to bring it under some
> >contrib/ around here?  Would that not make sense?
> 
> I'm not working on the project, so I can't speak for Stefan & 
> friends...but my guess is that it's going to live separately as 
> something independent of Solr/Nutch. If you view it as search 
> plumbing that's usable in multiple environments, then that makes 
> sense. If you view it as replicating core Solr (or Nutch) 
> functionality, then it sucks. Not sure what the outcome will be.
> 
> -- Ken
> 
> 
> 
> >----- Original Message ----
> >>  From: Ken Krugler 
> >>  To: solr-user@lucene.apache.org
> >>  Sent: Friday, May 9, 2008 4:26:19 PM
> >>  Subject: Re: Solr feasibility with terabyte-scale data
> >>
> >>  Hi Marcus,
> >>
> >>  >It seems a lot of what you're describing is really similar to
> >>  >MapReduce, so I think Otis' suggestion to look at Hadoop is a good
> >>  >one: it might prevent a lot of headaches and they've already solved
> >>  >a lot of the tricky problems. There are a number of ridiculously
> >>  >sized projects using it to solve their scale problems, not least
> >>  >Yahoo...
> >>
> >>  You should also look at a new project called Katta:
> >>
> >>  http://katta.wiki.sourceforge.net/
> >>
> >>  First code check-in should be happening this weekend, so I'd wait
> >>  until Monday to take a look :)
> >>
> >>  -- Ken
> >>
> >>  >On 9 May 2008, at 01:17, Marcus Herou wrote:
> >>  >
> >>  >>Cool.
> >>  >>
> >>  >>Since you must certainly already have a good partitioning scheme,
> >>  >>could you elaborate at a high level on how you set it up?
> >>  >>
> >>  >>I'm certain that I will shoot myself in the foot once or twice
> >>  >>before getting it right, but that is what I'm good at: never stop
> >>  >>trying :) However, it is nice to start playing on the right side of
> >>  >>the football field at least, so a little push in the back would be
> >>  >>really helpful.
> >>  >>
> >>  >>Kindly
> >>  >>
> >>  >>//Marcus
> >>  >>
> >>  >>
> >>  >>
> >>  >>On Fri, May 9, 2008 at 9:36 AM, James Brady
> >>  >>wrote:
> >>  >>
> >>  >>>Hi, we have an index of ~300GB, which is at least approaching the
> >>  >>>ballpark you're in.
> >>  >>>
> >>  >>>Lucky for us, to coin a phrase, we have an 'embarrassingly
> >>  >>>partitionable' index, so we can just scale out horizontally across
> >>  >>>commodity hardware with no problems at all. We're also using the
> >>  >>>multicore features available in the development Solr version to
> >>  >>>reduce the granularity of core size by an order of magnitude: this
> >>  >>>makes for lots of small commits rather than a few long ones.
> >>  >>>
> >>  >>>There was mention somewhere in the thread of document collections:
> >>  >>>if you're going to be filtering by collection, I'd strongly
> >>  >>>recommend partitioning too. It makes scaling so much less painful!
> >>  >>>
> >>  >>>James
> >>  >>>
> >>  >>>
> >>  >>>On 8 May 2008, at 23:37, marcusherou wrote:
> >>  >>>
> >>  >>>>Hi.
> >>  >>>>
> >>  >>>>I will be heading down a path like yours within a few months.
> >>  >>>>Currently I have an index of ~10M docs and only store ids in the
> >>  >>>>index, for performance and distribution reasons. When we enter a
> >>  >>>>new market I'm assuming we will soon hit 100M and, quite soon
> >>  >>>>after that, 1G documents. Each document has on average about
> >>  >>>>3-5 KB of data.
> >>  >>>>
> >>  >>>>We will use a GlusterFS installation with RAID1 (or RAID10) SATA
> >>  >>>>enclosures as shared storage (think of it as a SAN, or shared
> >>  >>>>storage at least, with one mount point). Hope this will be the
> >>  >>>>right choice; only the future can tell.
> >>  >>>>
> >>  >>>>Since we are developing a search engine, I frankly don't think
> >>  >>>>even having 100's of SOLR instances serving the index will cut it
> >>  >>>>performance-wise if we have one big index. I totally agree with
> >>  >>>>the others claiming that you will most definitely go OOM or hit
> >>  >>>>some other constraint of SOLR if you must have the whole result in
> >>  >>>>memory, sort it, and create an XML response. I hit such
> >>  >>>>constraints when I couldn't afford to give the instances enough
> >>  >>>>memory, and I had only 1M docs back then. And think of it...
> >>  >>>>optimizing a TB index will take a long, long time, and you really
> >>  >>>>want an optimized index if you want to reduce search time.
> >>  >>>>
> >>  >>>>I am thinking of a sharding solution where I fragment the index
> >>  >>>>over the disk(s) and let each SOLR instance have only a little
> >>  >>>>piece of the total index. This will require a master database or
> >>  >>>>namenode (or, simpler, just a properties file in each index dir)
> >>  >>>>of some sort to know which docs are located on which machine, or
> >>  >>>>at least how many docs each shard has. This is to ensure that
> >>  >>>>whenever you introduce a new SOLR instance with a new shard, the
> >>  >>>>master indexer will know which shard to prioritize. This is
> >>  >>>>probably not enough either, since all new docs will go to the new
> >>  >>>>shard until it is filled (has the same size as the others); only
> >>  >>>>then will all shards receive docs in a load-balanced fashion. So
> >>  >>>>whenever you want to add a new indexer you probably need to
> >>  >>>>initiate a "stealing" process where it steals docs from the others
> >>  >>>>until it reaches some sort of threshold (10 servers = each shard
> >>  >>>>should have 1/10 of the docs, or such).
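> >>  >>>>
> >>  >>>>Roughly, the bookkeeping I have in mind looks like the sketch
> >>  >>>>below. It's just an illustration: the class and method names are
> >>  >>>>made up, and the real thing would have to live in the master
> >>  >>>>database / properties files so that all indexers see the same
> >>  >>>>counts.
> >>  >>>>
> >>  >>>>import java.util.HashMap;
> >>  >>>>import java.util.Map;
> >>  >>>>
> >>  >>>>// Sketch of the master bookkeeping: how many docs each shard has,
> >>  >>>>// which shard should get the next doc, and how much a newly added
> >>  >>>>// shard may "steal" from an old one.
> >>  >>>>public class ShardCatalog
> >>  >>>>{
> >>  >>>>    // shard id -> number of docs currently in that shard
> >>  >>>>    private final Map<Integer, Long> docCounts = new HashMap<Integer, Long>();
> >>  >>>>
> >>  >>>>    public synchronized void addShard(int shardId)
> >>  >>>>    {
> >>  >>>>        docCounts.put(shardId, 0L);
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    public synchronized void docAdded(int shardId)
> >>  >>>>    {
> >>  >>>>        docCounts.put(shardId, docCounts.get(shardId) + 1);
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    // The master indexer prioritizes the emptiest shard for new docs.
> >>  >>>>    public synchronized int shardForNextDoc()
> >>  >>>>    {
> >>  >>>>        int best = -1;
> >>  >>>>        long smallest = Long.MAX_VALUE;
> >>  >>>>        for (Map.Entry<Integer, Long> e : docCounts.entrySet())
> >>  >>>>        {
> >>  >>>>            if (e.getValue() < smallest)
> >>  >>>>            {
> >>  >>>>                smallest = e.getValue();
> >>  >>>>                best = e.getKey();
> >>  >>>>            }
> >>  >>>>        }
> >>  >>>>        return best;
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    // How many docs a new shard should "steal" from an old one so
> >>  >>>>    // that every shard ends up with roughly 1/N of the total.
> >>  >>>>    public synchronized long docsToSteal(int fromShard)
> >>  >>>>    {
> >>  >>>>        long total = 0;
> >>  >>>>        for (long count : docCounts.values())
> >>  >>>>        {
> >>  >>>>            total += count;
> >>  >>>>        }
> >>  >>>>        long target = total / docCounts.size();
> >>  >>>>        return Math.max(0, docCounts.get(fromShard) - target);
> >>  >>>>    }
> >>  >>>>}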
> >>  >>>>
> >>  >>>>I think this will cut it and enable us to grow with the data. I
> >>  >>>>think distributed reindexing will also be a good thing when it
> >>  >>>>comes to cutting both indexing and optimizing time. Probably each
> >>  >>>>indexer should buffer its shard locally on RAID1 SCSI disks,
> >>  >>>>optimize it, and then just copy it to the main index to minimize
> >>  >>>>the burden on the shared storage.
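> >>  >>>>
> >>  >>>>The publish step itself is nothing fancy; something like the
> >>  >>>>sketch below, assuming the shard was already built and optimized
> >>  >>>>on the local disk (e.g. via Lucene's IndexWriter.optimize()).
> >>  >>>>Lucene index directories are flat, so a plain file-by-file copy is
> >>  >>>>enough; the point is that the shared mount only ever sees one big
> >>  >>>>sequential write instead of the random I/O of indexing.
> >>  >>>>
> >>  >>>>import java.io.File;
> >>  >>>>import java.io.FileInputStream;
> >>  >>>>import java.io.FileOutputStream;
> >>  >>>>import java.io.IOException;
> >>  >>>>
> >>  >>>>// Copy an optimized shard from fast local disk to the shared
> >>  >>>>// (GlusterFS) mount in one sequential pass.
> >>  >>>>public static void publishShard(File localShardDir, File sharedShardDir)
> >>  >>>>    throws IOException
> >>  >>>>{
> >>  >>>>    if (!sharedShardDir.exists() && !sharedShardDir.mkdirs())
> >>  >>>>    {
> >>  >>>>        throw new IOException("Could not create " + sharedShardDir);
> >>  >>>>    }
> >>  >>>>    File[] files = localShardDir.listFiles();
> >>  >>>>    if (files == null)
> >>  >>>>    {
> >>  >>>>        throw new IOException("Not a directory: " + localShardDir);
> >>  >>>>    }
> >>  >>>>    byte[] buf = new byte[1024 * 1024];
> >>  >>>>    for (File src : files)
> >>  >>>>    {
> >>  >>>>        FileInputStream in = new FileInputStream(src);
> >>  >>>>        FileOutputStream out =
> >>  >>>>            new FileOutputStream(new File(sharedShardDir, src.getName()));
> >>  >>>>        try
> >>  >>>>        {
> >>  >>>>            int n;
> >>  >>>>            while ((n = in.read(buf)) != -1)
> >>  >>>>            {
> >>  >>>>                out.write(buf, 0, n);
> >>  >>>>            }
> >>  >>>>        }
> >>  >>>>        finally
> >>  >>>>        {
> >>  >>>>            in.close();
> >>  >>>>            out.close();
> >>  >>>>        }
> >>  >>>>    }
> >>  >>>>}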
> >>  >>>>
> >>  >>>>Let's say the indexing part is all fancy and working at TB scale;
> >>  >>>>now we come to searching. After talking to other guys who have
> >>  >>>>built big search engines, I personally believe you need to
> >>  >>>>introduce a controller-like searcher on the client side which
> >>  >>>>itself searches all of the shards and merges the responses.
> >>  >>>>Perhaps Distributed Solr solves this, and I will love to test it
> >>  >>>>whenever my new installation of servers and enclosures is
> >>  >>>>finished.
> >>  >>>>
> >>  >>>>Currently my idea is something like this:
> >>  >>>>public Page search(SearchDocumentCommand sdc)
> >>  >>>>{
> >>  >>>>    // One entry per shard: shard id -> the replicas holding that shard.
> >>  >>>>    Set<Integer> ids = documentIndexers.keySet();
> >>  >>>>    int nrOfSearchers = ids.size();
> >>  >>>>    int totalItems = 0;
> >>  >>>>    Page docs = new Page(sdc.getPage(), sdc.getPageSize());
> >>  >>>>    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> >>  >>>>    {
> >>  >>>>        Integer id = iterator.next();
> >>  >>>>        List<DocumentIndexer> indexers = documentIndexers.get(id);
> >>  >>>>        // Pick a random replica of this shard.
> >>  >>>>        DocumentIndexer indexer =
> >>  >>>>            indexers.get(random.nextInt(indexers.size()));
> >>  >>>>        // Query a copy with an adjusted page number (this naive
> >>  >>>>        // split is the paging part I suspect is broken).
> >>  >>>>        SearchDocumentCommand sdc2 = copy(sdc);
> >>  >>>>        sdc2.setPage(sdc.getPage() / nrOfSearchers);
> >>  >>>>        Page res = indexer.search(sdc2);
> >>  >>>>        totalItems += res.getTotalItems();
> >>  >>>>        docs.addAll(res);
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    if (sdc.getComparator() != null)
> >>  >>>>    {
> >>  >>>>        Collections.sort(docs, sdc.getComparator());
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    docs.setTotalItems(totalItems);
> >>  >>>>    return docs;
> >>  >>>>}
> >>  >>>>
> >>  >>>>This is my RaidedDocumentIndexer, which wraps a set of
> >>  >>>>DocumentIndexers. I switch back and forth between Solr and raw
> >>  >>>>Lucene, benchmarking and comparing stuff, so I have two
> >>  >>>>implementations of DocumentIndexer (SolrDocumentIndexer and
> >>  >>>>LuceneDocumentIndexer) to make the switch easy.
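> >>  >>>>
> >>  >>>>The interface is basically just the following (simplified sketch,
> >>  >>>>only the part the code above relies on):
> >>  >>>>
> >>  >>>>// RaidedDocumentIndexer only needs search(), so swapping Solr for
> >>  >>>>// raw Lucene is just a matter of plugging in the other
> >>  >>>>// implementation.
> >>  >>>>public interface DocumentIndexer
> >>  >>>>{
> >>  >>>>    Page search(SearchDocumentCommand sdc);
> >>  >>>>}
> >>  >>>>
> >>  >>>>// SolrDocumentIndexer translates a SearchDocumentCommand into a
> >>  >>>>// Solr query over HTTP; LuceneDocumentIndexer opens the shard's
> >>  >>>>// IndexSearcher directly. Both return the same Page type.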
> >>  >>>>
> >>  >>>>I think this approach is quite OK, but the paging stuff is
> >>  >>>>broken, I think. Also, the search time will at best be
> >>  >>>>proportional to the number of searchers, probably a lot worse. To
> >>  >>>>get more speed, each document indexer should be queried in a
> >>  >>>>separate thread, with something like
> >>  >>>>EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with
> >>  >>>>a thread pool. The future times out after, let's say, 750 msec,
> >>  >>>>and the client ignores all searchers that are slower. Probably
> >>  >>>>some performance metrics should be gathered about each searcher,
> >>  >>>>so the client knows which indexers to prefer over the others.
> >>  >>>>But of course, if you have 50 searchers, having each client
> >>  >>>>thread spawn yet another 50 threads isn't a good thing either, so
> >>  >>>>perhaps a combination of iterative and parallel search needs to
> >>  >>>>be done, with the ratio configurable.
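> >>  >>>>
> >>  >>>>Sketched with java.util.concurrent (the standard-library successor
> >>  >>>>of Doug Lea's EDU.oswego package), the fan-out part would look
> >>  >>>>roughly like this. searchPool, pickOneReplicaPerShard() and the
> >>  >>>>750 msec budget are made up for the example; Page,
> >>  >>>>SearchDocumentCommand and DocumentIndexer are the same types as in
> >>  >>>>the code above.
> >>  >>>>
> >>  >>>>import java.util.ArrayList;
> >>  >>>>import java.util.List;
> >>  >>>>import java.util.concurrent.Callable;
> >>  >>>>import java.util.concurrent.Future;
> >>  >>>>import java.util.concurrent.TimeUnit;
> >>  >>>>
> >>  >>>>// Fan the query out to one task per shard, wait at most 750 msec
> >>  >>>>// in total, and simply ignore shards that did not answer in time.
> >>  >>>>public Page parallelSearch(final SearchDocumentCommand sdc) throws Exception
> >>  >>>>{
> >>  >>>>    List<Callable<Page>> tasks = new ArrayList<Callable<Page>>();
> >>  >>>>    for (final DocumentIndexer indexer : pickOneReplicaPerShard())
> >>  >>>>    {
> >>  >>>>        tasks.add(new Callable<Page>()
> >>  >>>>        {
> >>  >>>>            public Page call() throws Exception
> >>  >>>>            {
> >>  >>>>                return indexer.search(sdc);
> >>  >>>>            }
> >>  >>>>        });
> >>  >>>>    }
> >>  >>>>
> >>  >>>>    // Tasks still running when the timeout expires come back cancelled.
> >>  >>>>    List<Future<Page>> results =
> >>  >>>>        searchPool.invokeAll(tasks, 750, TimeUnit.MILLISECONDS);
> >>  >>>>
> >>  >>>>    Page docs = new Page(sdc.getPage(), sdc.getPageSize());
> >>  >>>>    int totalItems = 0;
> >>  >>>>    for (Future<Page> f : results)
> >>  >>>>    {
> >>  >>>>        if (f.isCancelled())
> >>  >>>>        {
> >>  >>>>            continue; // this searcher was too slow, skip it
> >>  >>>>        }
> >>  >>>>        Page res = f.get();
> >>  >>>>        totalItems += res.getTotalItems();
> >>  >>>>        docs.addAll(res);
> >>  >>>>    }
> >>  >>>>    docs.setTotalItems(totalItems);
> >>  >>>>    return docs;
> >>  >>>>}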
> >>  >>>>
> >>  >>>>The controller pattern is used by Google, I think; Peter Zaitzev
> >>  >>>>(mysqlperformanceblog) once told me so.
> >>  >>>>
> >>  >>>>Hope I gave some insight into how I plan to scale to TB size, and
> >>  >>>>hopefully someone smacks me on the head and says "Hey dude, do it
> >>  >>>>like this instead".
> >>  >>>>
> >>  >>>>Kindly
> >>  >>>>
> >>  >>>>//Marcus
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>
> >>  >>>>Phillip Farber wrote:
> >>  >>>>
> >>  >>>>>
> >>  >>>>>Hello everyone,
> >>  >>>>>
> >>  >>>>>We are considering Solr 1.2 to index and search a terabyte-scale
> >>  >>>>>dataset of OCR.  Initially our requirements are simple: basic
> >>  >>>>>tokenizing, score sorting only, no faceting.  The schema is simple
> >>  >>>>>too.  A document consists of a numeric id, stored and indexed, and
> >>  >>>>>a large text field, indexed but not stored, containing the OCR,
> >>  >>>>>typically ~1.4 MB.  Some limited faceting or additional metadata
> >>  >>>>>fields may be added later.
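> >>  >>>>>
> >>  >>>>>For reference, the kind of indexing client we have in mind is no
> >>  >>>>>more than the sketch below: one HTTP POST per document to Solr's
> >>  >>>>>XML update handler. The URL and the field names ("id", "ocr") are
> >>  >>>>>placeholders for whatever the schema ends up being.
> >>  >>>>>
> >>  >>>>>import java.io.OutputStreamWriter;
> >>  >>>>>import java.io.Writer;
> >>  >>>>>import java.net.HttpURLConnection;
> >>  >>>>>import java.net.URL;
> >>  >>>>>
> >>  >>>>>// POST a single <add> command to the Solr update handler.
> >>  >>>>>public static void indexOne(String id, String ocrText) throws Exception
> >>  >>>>>{
> >>  >>>>>    URL url = new URL("http://localhost:8983/solr/update");
> >>  >>>>>    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
> >>  >>>>>    conn.setDoOutput(true);
> >>  >>>>>    conn.setRequestMethod("POST");
> >>  >>>>>    conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
> >>  >>>>>
> >>  >>>>>    // The OCR text has to be XML-escaped before it goes into the field.
> >>  >>>>>    String safe = ocrText.replace("&", "&amp;")
> >>  >>>>>                         .replace("<", "&lt;")
> >>  >>>>>                         .replace(">", "&gt;");
> >>  >>>>>
> >>  >>>>>    Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
> >>  >>>>>    w.write("<add><doc>");
> >>  >>>>>    w.write("<field name=\"id\">" + id + "</field>");
> >>  >>>>>    w.write("<field name=\"ocr\">" + safe + "</field>");
> >>  >>>>>    w.write("</doc></add>");
> >>  >>>>>    w.close();
> >>  >>>>>
> >>  >>>>>    if (conn.getResponseCode() != 200)
> >>  >>>>>    {
> >>  >>>>>        throw new Exception("Update failed: HTTP " + conn.getResponseCode());
> >>  >>>>>    }
> >>  >>>>>    conn.disconnect();
> >>  >>>>>}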
> >>  >>>>>
> >>  >>>>>The data in question currently amounts to about 1.1 TB of OCR
> >>  >>>>>(about 1M docs), which we expect to increase to 10 TB over time.
> >>  >>>>>Pilot tests on the desktop (2.6 GHz P4 with 2.5 GB memory, 1 GB
> >>  >>>>>Java heap) on ~180 MB of data via HTTP suggest we can index at a
> >>  >>>>>rate sufficient to keep up with the inputs (after getting over the
> >>  >>>>>1.1 TB hump).  We envision nightly commits/optimizes.
> >>  >>>>>
> >>  >>>>>We expect a low QPS rate (<10) and probably will not need
> >>  >>>>>millisecond query response times.
> >>  >>>>>
> >>  >>>>>Our environment provides Apache on blade servers (Dell 1955, dual
> >>  >>>>>dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*,
> >>  >>>>>high-performance NAS system over a dedicated (out-of-band) GbE
> >>  >>>>>switch (Dell PowerConnect 5324) using a 9K MTU (jumbo frames). We
> >>  >>>>>are starting with 2 blades and will add more as demand requires.
> >>  >>>>>
> >>  >>>>>While we have a lot of storage, the idea of master/slave Solr
> >>  >>>>>Collection Distribution to add more Solr instances clearly means
> >>  >>>>>duplicating an immense index.  Is it possible to use one instance
> >>  >>>>>to update the index on the NAS while other instances only read the
> >>  >>>>>index and commit to keep their caches warm instead?
> >>  >>>>>
> >>  >>>>>Should we expect Solr indexing time to slow significantly as we scale
> >>  >>>>>up?  What kind of query performance could we expect?  Is it totally
> >>  >>>>>naive even to consider Solr at this kind of scale?
> >>  >>>>>
> >>  >>>>>Given these parameters, is it realistic to think that Solr could
> >>  >>>>>handle the task?
> >>  >>>>>
> >>  >>>>>Any advice/wisdom greatly appreciated,
> >>  >>>>>
> >>  >>>>>Phil
> >>
> >>  --
> >>  Ken Krugler
> >>  Krugle, Inc.
> >>  +1 530-210-6378
> >>  "If you can't find it, you can't fix it"
> 
> 
> -- 
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
