You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. intend to bring it under some contrib/ around here? Would that not make sense?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 4:26:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
>
> Hi Marcus,
>
> > It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until Monday to take a look :)
>
> -- Ken
>
> > On 9 May 2008, at 01:17, Marcus Herou wrote:
> >
> >> Cool.
> >>
> >> Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up?
> >>
> >> I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.
> >>
> >> Kindly
> >>
> >> //Marcus
> >>
> >> On Fri, May 9, 2008 at 9:36 AM, James Brady wrote:
> >>
> >>> Hi, we have an index of ~300 GB, which is at least approaching the ballpark you're in.
> >>>
> >>> Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than a few long ones.
> >>>
> >>> There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful!
> >>>
> >>> James
> >>>
> >>> On 8 May 2008, at 23:37, marcusherou wrote:
> >>>
> >>>> Hi.
> >>>>
> >>>> I will head down a path like yours as well within some months from now. Currently I have an index of ~10M docs and only store ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5 KB of data.
> >>>>
> >>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.
> >>>>
> >>>> Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of SOLR if you must have the whole result in memory, sort it, and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then.
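[Lucene itself sidesteps exactly this trap internally: hits are collected into a bounded priority queue, so memory stays proportional to the requested page size rather than to the total hit count. A minimal sketch of that idea in plain Java follows; the Hit and TopNCollector names are invented for illustration and are not from any code in this thread.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** A hypothetical (id, score) pair standing in for a search hit. */
class Hit {
    final int id;
    final float score;
    Hit(int id, float score) { this.id = id; this.score = score; }
}

/** Keeps at most `limit` best-scoring hits; memory is O(limit) no matter
 *  how many hits stream past, so a huge result set cannot cause an OOM. */
class TopNCollector {
    private final int limit;
    // Min-heap ordered by score: the root is the worst hit kept so far,
    // which makes eviction cheap when a better hit arrives.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int limit) {
        this.limit = limit;
        this.heap = new PriorityQueue<Hit>(limit, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
        });
    }

    void collect(Hit hit) {
        if (heap.size() < limit) {
            heap.add(hit);
        } else if (hit.score > heap.peek().score) {
            heap.poll();   // evict the current worst
            heap.add(hit);
        }
    }

    /** The retained hits, best score first. */
    List<Hit> topDocs() {
        List<Hit> docs = new ArrayList<Hit>(heap);
        Collections.sort(docs, Collections.reverseOrder(heap.comparator()));
        return docs;
    }
}

Each shard searcher would feed its hits through such a collector and return only the top page, so a merge step never sees more than shards x limit entries.]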
> >>>> And think of it: optimizing a TB-sized index will take a long, long time, and you really want an optimized index if you want to reduce search time.
> >>>>
> >>>> I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance serve only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer knows which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer, you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such).
> >>>>
> >>>> I think this will cut it and enable us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the main index to minimize the burden on the shared storage.
> >>>>
> >>>> Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. After talking to other guys who have built big search engines, I personally believe you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this, and I will love to test it whenever my new installation of servers and enclosures is finished.
> >>>>
> >>>> Currently my idea is something like this:
> >>>>
> >>>> public Page<Document> search(SearchDocumentCommand sdc)
> >>>> {
> >>>>     Set<Integer> ids = documentIndexers.keySet();
> >>>>     int nrOfSearchers = ids.size();
> >>>>     int totalItems = 0;
> >>>>     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
> >>>>     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> >>>>     {
> >>>>         Integer id = iterator.next();
> >>>>         // Each shard has replicas; pick one at random to spread load.
> >>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
> >>>>         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
> >>>>         SearchDocumentCommand sdc2 = copy(sdc);
> >>>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
> >>>>         Page<Document> res = indexer.search(sdc2); // search with the per-shard copy
> >>>>         totalItems += res.getTotalItems();
> >>>>         docs.addAll(res);
> >>>>     }
> >>>>
> >>>>     if (sdc.getComparator() != null)
> >>>>     {
> >>>>         Collections.sort(docs, sdc.getComparator());
> >>>>     }
> >>>>
> >>>>     docs.setTotalItems(totalItems);
> >>>>
> >>>>     return docs;
> >>>> }
> >>>>
> >>>> This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch back and forth between Solr and raw Lucene, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
> >>>>
> >>>> I think this approach is quite OK, but the paging stuff is broken, I think. However, the search time will at best be proportional to the number of searchers, and probably a lot worse. To get even more speed, each document indexer should be put into a separate thread, with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The FutureResult times out after, let's say, 750 ms, and the client ignores all searchers that are slower. Probably some performance metrics should be gathered about each searcher so the client knows which indexers to prefer over the others. But of course, if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combo of iterative and parallel search needs to be done, with the ratio configurable.
> >>>>
> >>>> The controller pattern is used by Google, I think; Peter Zaitsev (mysqlperformanceblog) once told me so.
> >>>>
> >>>> Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
> >>>>
> >>>> Kindly
> >>>>
> >>>> //Marcus
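[Doug Lea's EDU.oswego.cs.dl.util.concurrent library mentioned above was the direct ancestor of java.util.concurrent in Java 5, so the timeout-and-skip fan-out Marcus describes maps almost one-to-one onto an ExecutorService. The sketch below is illustrative only: it borrows the hypothetical DocumentIndexer, Page, and SearchDocumentCommand types from the code above (Page used raw for brevity) and the suggested 750 ms budget, and assumes DocumentIndexer.search() is thread-safe.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Fans one query out to every shard searcher in parallel, waits at most
 *  budgetMs, and keeps only the shards that answered in time. */
class DeadlineSearcher {
    private final ExecutorService pool;
    private final long budgetMs;

    DeadlineSearcher(ExecutorService pool, long budgetMs) {
        this.pool = pool;          // one shared, bounded pool for all client threads
        this.budgetMs = budgetMs;  // e.g. 750 ms
    }

    List<Page> searchAll(List<DocumentIndexer> shards,
                         final SearchDocumentCommand sdc) throws InterruptedException {
        List<Callable<Page>> tasks = new ArrayList<Callable<Page>>();
        for (final DocumentIndexer indexer : shards) {
            tasks.add(new Callable<Page>() {
                public Page call() { return indexer.search(sdc); }
            });
        }
        // invokeAll blocks until all tasks finish or the deadline passes,
        // cancelling whatever is still running at that point.
        List<Page> pages = new ArrayList<Page>();
        for (Future<Page> f : pool.invokeAll(tasks, budgetMs, TimeUnit.MILLISECONDS)) {
            if (f.isCancelled()) {
                continue; // this shard was too slow: ignore it, as suggested
            }
            try {
                pages.add(f.get()); // merge these as in the search() method above
            } catch (ExecutionException e) {
                // a shard that failed outright is treated like a slow one
            }
        }
        return pages;
    }
}

Sharing one bounded pool across client threads avoids the 50-threads-per-request explosion mentioned above, at the cost of queries queuing when the pool is saturated.]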
> >>>> Phillip Farber wrote:
> >>>>
> >>>>> Hello everyone,
> >>>>>
> >>>>> We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4 MB. Some limited faceting or additional metadata fields may be added later.
> >>>>>
> >>>>> The data in question currently amounts to about 1.1 TB of OCR (about 1M docs), which we expect to increase to 10 TB over time. Pilot tests on the desktop (2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap) on ~180 MB of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 TB hump). We envision nightly commits/optimizes.
> >>>>>
> >>>>> We expect a low QPS rate (<10) and probably will not need millisecond query response.
> >>>>>
> >>>>> Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add more as demand requires.
> >>>>>
> >>>>> While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index, and just commit to keep their caches warm instead?
> >>>>>
> >>>>> Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?
> >>>>>
> >>>>> Given these parameters, is it realistic to think that Solr could handle the task?
> >>>>>
> >>>>> Any advice/wisdom greatly appreciated,
> >>>>>
> >>>>> Phil
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"