Re: Solr feasibility with terabyte-scale data

Marcus Herou Sat, 10 May 2008 09:04:25 -0700

Hi Otis.

Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to
see that your snippet is almost a copy of mine; that gives me the right gut
feeling about this :)


I'm quite familiar with Hadoop, as you can see if you check out the code of
my open-source project AbstractCache:
http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project
which aims to create storage solutions based on the Map and SortedMap
interfaces. I use it everywhere at Tailsweep.com and also used it at my
former employer Eniro.se (the largest yellow pages site in Sweden). It has
been in constant development for five years.

Since I'm a cluster freak of nature, I love a project named GlusterFS, where
they have managed to create a system without master/slave[s] and NameNode.
The advantage of this is that it is a lot more scalable; the drawback is
that you can get into split-brain situations, which people on the mailing
list are complaining about. I tend to try to solve this with JGroups
membership, where the coordinator can be any machine in the cluster, but in
the group joining process the first machine to join gets the privilege of
becoming coordinator. But even with JGroups you can run into trouble with
race conditions of all kinds (distributed locks, for example).
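
A minimal sketch of the first-to-join rule, written as a plain-Java model
rather than real JGroups code. MembershipView and its methods are made-up
names for illustration only; it ignores network partitions, which is exactly
where the split-brain trouble starts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not the JGroups API: the view is the list of members
// in join order, and the first surviving member acts as coordinator.
class MembershipView {
    private final List<String> members = new ArrayList<>();

    // A node joins; join order is preserved and duplicates are ignored.
    synchronized void join(String node) {
        if (!members.contains(node)) {
            members.add(node);
        }
    }

    // A node leaves or crashes; the next-oldest member takes over implicitly.
    synchronized void leave(String node) {
        members.remove(node);
    }

    // The coordinator is whoever joined first among the current members,
    // or null when the view is empty.
    synchronized String coordinator() {
        return members.isEmpty() ? null : members.get(0);
    }
}
```

Because the coordinator is derived from the view instead of being elected
separately, every node that sees the same view agrees on the same
coordinator without extra messages.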

I've created an alternative to the Hadoop file system (mostly for fun) where
you just add an object to the cluster and, based on which algorithm you
choose, it is RAIDed or striped across the cluster.
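
To make the two placement modes concrete, here is a rough sketch. It is not
the actual file system code; the Placement class, its method names and the
chunk-size parameter are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decides which bytes of an object each node receives.
class Placement {
    // Mirroring (RAID1-like): every node receives a full copy of the object.
    static List<byte[]> mirror(byte[] data, int nodes) {
        List<byte[]> out = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            out.add(data.clone());
        }
        return out;
    }

    // Striping (RAID0-like): fixed-size chunks are dealt round-robin to nodes.
    static List<byte[]> stripe(byte[] data, int nodes, int chunkSize) {
        List<ByteArrayOutputStream> buffers = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            buffers.add(new ByteArrayOutputStream());
        }
        int chunk = 0;
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int len = Math.min(chunkSize, data.length - offset);
            buffers.get(chunk % nodes).write(data, offset, len);
            chunk++;
        }
        List<byte[]> out = new ArrayList<>();
        for (ByteArrayOutputStream buffer : buffers) {
            out.add(buffer.toByteArray());
        }
        return out;
    }
}
```

Mirroring buys redundancy at the cost of storing the object once per node;
striping spreads the load but loses the object if any one node is lost.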

Anyway, this was off topic, but I think my experience in building
membership-aware clusters will help me in this particular case.

Kindly

//Marcus



On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Marcus,
>
> You are headed in the right direction.
>
> We've built a system like this at Technorati (Lucene, not Solr) and had
> components like the "namenode" or "controller" that you mention.  If you
> look at the Hadoop project, you will see something similar in concept
> (NameNode), though it deals with raw data blocks, their placement in the
> cluster, etc.  As a matter of fact, I am currently running its "re-balancer"
> in order to move some of the blocks around in the cluster.  That matches
> what you are describing for moving documents from one shard to the other.
> Of course, you can simplify things and just have this central piece be
> aware of any new servers and simply get it to place any new docs on the new
> servers and create a new shard there.  Or you can get fancy and take into
> consideration the hardware resources - the CPU, the disk space, the memory -
> and use that to figure out how much each machine in your cluster can handle
> and maximize its use based on this knowledge. :)
>
> I think Solr and Nutch are in desperate need of this central component
> (must not be a SPOF!) for shard management.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: marcusherou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 2:37:19 AM
> > Subject: Re: Solr feasibility with terabyte-scale data
> >
> >
> > Hi.
> >
> > I will head down a path like yours within a few months from now.
> > Currently I have an index of ~10M docs and only store IDs in the index
> > for performance and distribution reasons. When we enter a new market I'm
> > assuming we will soon hit 100M and quite soon after that 1G documents.
> > Each document has on average about 3-5k of data.
> >
> > We will use a GlusterFS installation with RAID1 (or RAID10) SATA
> > enclosures as shared storage (think of it as a SAN or shared storage at
> > least, one mount point). I hope this will be the right choice; only the
> > future can tell.
> >
> > Since we are developing a search engine, I frankly don't think even
> > having hundreds of Solr instances serving the index will cut it
> > performance-wise if we have one big index. I totally agree with the
> > others claiming that you most definitely will run out of memory (OOM) or
> > hit some other constraint of Solr if you must hold the whole result in
> > memory, sort it and create an XML response. I did hit such constraints
> > when I couldn't afford to give the instances enough memory, and I had
> > only 1M docs back then. And think of it: optimizing a TB index will take
> > a long, long time, and you really want to have an optimized index if you
> > want to reduce search time.
> >
> > I am thinking of a sharding solution where I fragment the index over the
> > disk(s) and let each Solr instance hold only a small piece of the total
> > index. This will require a master database or namenode (or, simpler, just
> > a properties file in each index dir) of some sort to know which docs are
> > located on which machine, or at least how many docs each shard has. This
> > is to ensure that whenever you introduce a new Solr instance with a new
> > shard, the master indexer will know which shard to prioritize. This is
> > probably not enough either, since all new docs will go to the new shard
> > until it is filled (has the same size as the others); only then will all
> > shards receive docs in a load-balanced fashion. So whenever you want to
> > add a new indexer you probably need to initiate a "stealing" process
> > where it steals docs from the others until it reaches some sort of
> > threshold (with 10 servers, each shard should have 1/10 of the docs or
> > such).
> >
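A rough sketch of how such a stealing quota could be computed from the
per-shard document counts. ShardRebalancer and stealPlan are hypothetical
names, and the plan only says how many docs to move, not which ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plans how many docs a new, empty shard should steal
// from each existing shard so that it reaches its fair share.
class ShardRebalancer {
    static Map<String, Long> stealPlan(Map<String, Long> docsPerShard) {
        long total = 0;
        for (long count : docsPerShard.values()) {
            total += count;
        }
        // Fair share once the new shard is counted, e.g. 1/10 for 10 servers.
        long target = total / (docsPerShard.size() + 1);
        long remaining = target; // what the new shard still needs

        Map<String, Long> plan = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : docsPerShard.entrySet()) {
            long surplus = entry.getValue() - target;
            long steal = Math.min(Math.max(surplus, 0), remaining);
            if (steal > 0) {
                plan.put(entry.getKey(), steal);
                remaining -= steal;
            }
        }
        return plan;
    }
}
```

With three shards of 300 docs each, the new fourth shard has a target of 225
and takes 75 docs from each of the others.
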
> > I think this will cut it and enable us to grow with the data. I think
> > doing distributed reindexing will also be a good thing when it comes to
> > cutting both indexing and optimizing time. Probably each indexer should
> > buffer its shard locally on RAID1 SCSI disks, optimize it and then just
> > copy it to the main index to minimize the burden on the shared storage.
> >
> > Let's say the indexing part is all fancy and working at TB scale; now we
> > come to searching. After talking to other people who have built big
> > search engines, I personally believe that you need to introduce a
> > controller-like searcher on the client side which itself searches all of
> > the shards and merges the responses. Perhaps Distributed Solr solves
> > this; I would love to test it whenever my new installation of servers
> > and enclosures is finished.
> >
> > Currently my idea is something like this.
> > public Page search(SearchDocumentCommand sdc)
> >     {
> >         // one entry per shard, each mapped to its list of replicas
> >         Set<Integer> ids = documentIndexers.keySet();
> >         int nrOfSearchers = ids.size();
> >         int totalItems = 0;
> >         Page docs = new Page(sdc.getPage(), sdc.getPageSize());
> >         for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> >         {
> >             Integer id = iterator.next();
> >             List<DocumentIndexer> indexers = documentIndexers.get(id);
> >             // pick one replica of this shard at random
> >             DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
> >             SearchDocumentCommand sdc2 = copy(sdc);
> >             sdc2.setPage(sdc.getPage()/nrOfSearchers);
> >             // search with the per-shard copy, not the original command
> >             Page res = indexer.search(sdc2);
> >             totalItems += res.getTotalItems();
> >             docs.addAll(res);
> >         }
> >
> >         // merge step: re-sort the combined hits from all shards
> >         if(sdc.getComparator() != null)
> >         {
> >             Collections.sort(docs, sdc.getComparator());
> >         }
> >
> >         docs.setTotalItems(totalItems);
> >
> >         return docs;
> >     }
> >
> > This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers.
> > I switch back and forth between Solr and raw Lucene, benchmarking and
> > comparing stuff, so I have two implementations of DocumentIndexer
> > (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
> >
> > I think this approach is quite OK, but I think the paging is broken.
> > Also, because the shards are searched one after another, the search time
> > will at best grow linearly with the number of searchers, probably a lot
> > worse. To get even more speed each document indexer should be put into a
> > separate thread with something like
> > EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a
> > thread pool. The FutureResult times out after, let's say, 750 msec and
> > the client ignores all searchers which are slower. Probably some
> > performance metrics should be gathered about each searcher so the client
> > knows which indexers to prefer over the others.
> > But of course, if you have 50 searchers, having each client thread spawn
> > yet another 50 threads isn't a good thing either. So perhaps a combo of
> > iterative and parallel search needs to be done, with the ratio
> > configurable.
> >
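A sketch of the parallel variant, using java.util.concurrent (the JDK
successor to the EDU.oswego package) instead of FutureResult. ShardSearcher
and FanOutSearcher are hypothetical names, pages are assumed to be 0-based,
and each shard is assumed to return its hits already sorted. Asking every
shard for its top (page + 1) * pageSize hits and slicing after the merge is
one common way to get paging right, and a shared bounded pool avoids
spawning 50 threads per client.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for one shard: returns its best `limit` hits.
interface ShardSearcher<T> {
    List<T> search(String query, int limit) throws Exception;
}

class FanOutSearcher<T> {
    private final List<ShardSearcher<T>> shards;
    private final ExecutorService pool; // shared, bounded pool
    private final long timeoutMillis;   // e.g. 750

    FanOutSearcher(List<ShardSearcher<T>> shards, ExecutorService pool, long timeoutMillis) {
        this.shards = shards;
        this.pool = pool;
        this.timeoutMillis = timeoutMillis;
    }

    List<T> search(String query, int page, int pageSize, Comparator<? super T> order) {
        // The globally best hits could all live on one shard, so every shard
        // must return enough hits to fill all pages up to the requested one.
        int limit = (page + 1) * pageSize;
        List<Callable<List<T>>> tasks = new ArrayList<>();
        for (ShardSearcher<T> shard : shards) {
            tasks.add(() -> shard.search(query, limit));
        }
        List<T> merged = new ArrayList<>();
        try {
            // invokeAll cancels every task still running when the timeout expires.
            for (Future<List<T>> future : pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS)) {
                try {
                    merged.addAll(future.get());
                } catch (CancellationException | ExecutionException slowOrFailed) {
                    // Shard was too slow or threw: ignore it for this query.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        }
        merged.sort(order);
        int from = Math.min(page * pageSize, merged.size());
        int to = Math.min(from + pageSize, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }
}
```

The cost of this paging scheme grows with the page number, since deep pages
force every shard to return more hits.
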
> > The controller pattern is used by Google, I think; Peter Zaitsev
> > (mysqlperformanceblog) once told me.
> >
> > Hope I gave some insight into how I plan to scale to TB size, and
> > hopefully someone smacks me on the head and says "Hey dude, do it like
> > this instead".
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> >
> >
> >
> > Phillip Farber wrote:
> > >
> > > Hello everyone,
> > >
> > > We are considering Solr 1.2 to index and search a terabyte-scale
> > > dataset of OCR.  Initially our requirements are simple: basic
> > > tokenizing, score sorting only, no faceting.  The schema is simple too.
> > > A document consists of a numeric id, stored and indexed, and a large
> > > text field, indexed not stored, containing the OCR, typically ~1.4 MB.
> > > Some limited faceting or additional metadata fields may be added later.
> > >
> > > The data in question currently amounts to about 1.1 TB of OCR (about 1M
> > > docs), which we expect to increase to 10 TB over time.  Pilot tests on
> > > the desktop w/ 2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap, on ~180 MB
> > > of data via HTTP suggest we can index at a rate sufficient to keep up
> > > with the inputs (after getting over the 1.1 TB hump).  We envision
> > > nightly commits/optimizes.
> > >
> > > We expect to have low QPS (<10) rate and probably will not need
> > > millisecond query response.
> > >
> > > Our environment makes available Apache on blade servers (Dell 1955 dual
> > > dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
> > > high-performance NAS system over a dedicated (out-of-band) GbE switch
> > > (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are
> > > starting with 2 blades and will add more as demand requires.
> > >
> > > While we have a lot of storage, the idea of master/slave Solr
> > > Collection Distribution to add more Solr instances clearly means
> > > duplicating an immense index.  Is it possible to use one instance to
> > > update the index on NAS while other instances only read the index and
> > > commit to keep their caches warm instead?
> > >
> > > Should we expect Solr indexing time to slow significantly as we scale
> > > up?  What kind of query performance could we expect?  Is it totally
> > > naive even to consider Solr at this kind of scale?
> > >
> > > Given these parameters is it realistic to think that Solr could handle
> > > the task?
> > >
> > > Any advice/wisdom greatly appreciated,
> > >
> > > Phil
> > >
> > >
> > >
> >
> > --
> > View this message in context:
> >
> http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
