Quick reply to Otis and Ken. Otis: After a night's sleep I think you are absolutely right that some HPC grid like Hadoop or perhaps GlusterHPC should be used, regardless of my last comment.
Ken: I looked at the architecture of Katta and it looks really nice. I really believe Katta could be something which lives as a subproject of Lucene, since it surely fills a gap that is not filled by Nutch. Nutch surely does similar stuff, but this could actually be something that Nutch uses as a component to distribute the crawled index. I will try to join the Stefan & friends team.

Kindly
//Marcus

On Sat, May 10, 2008 at 6:03 PM, Marcus Herou <[EMAIL PROTECTED]> wrote:
> Hi Otis.
>
> Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to see that your snippet is almost a copy of mine; that gives me the right gut feeling about this :)
>
> I'm quite familiar with Hadoop, as you can see if you check out the code of my OS project AbstractCache -> http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project which aims to create storage solutions based on the Map and SortedMap interfaces. I use it everywhere in Tailsweep.com and used it as well at my former employer Eniro.se (largest yellow pages site in Sweden). It has been in constant development for five years.
>
> Since I'm a cluster freak of nature I love a project named GlusterFS, where they have managed to create a system without master/slave[s] and NameNode. The advantage of this is that it is a lot more scalable; the drawback is that you can get into split-brain situations, which guys on the mailing list are complaining about. Anyway, I tend to solve this with JGroups membership, where the coordinator can be any machine in the cluster, but in the group-joining process the first machine to join gets the privilege of becoming coordinator. But even with JGroups you can run into trouble with race conditions of all kinds (distributed locks, for example).
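The "first machine to join becomes coordinator" rule can be modeled in a few lines of plain Java. This is only an illustrative sketch, not JGroups code: in JGroups itself the membership view is ordered by join time and the coordinator is the first Address in View.getMembers(); the class and method names below are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a JGroups-style membership view: members are
// ordered by join time, so the oldest surviving member (index 0) is
// the coordinator. No real group communication happens here.
class MembershipView {
    private final List<String> members = new ArrayList<>();

    // A joining node is appended to the view, preserving join order.
    void join(String node) { members.add(node); }

    // A node that leaves (or crashes) is removed; if it was the
    // coordinator, the next-oldest member takes over automatically.
    void leave(String node) { members.remove(node); }

    // The coordinator is always the first (longest-lived) member.
    String coordinator() {
        return members.isEmpty() ? null : members.get(0);
    }

    boolean isCoordinator(String node) {
        return node.equals(coordinator());
    }
}
```

The deterministic "oldest member wins" rule is what removes the need for a fixed master: any machine can end up coordinator, which is exactly the property Marcus wants from GlusterFS-style masterless designs.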
> I've created an alternative to the Hadoop file system (mostly for fun) where you just add an object to the cluster and, based on the algorithm you choose, it is RAIDed or striped across the cluster.
>
> Anyway, this was off topic, but I think my experience in building membership-aware clusters will help me in this particular case.
>
> Kindly
>
> //Marcus
>
> On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Marcus,
> >
> > You are headed in the right direction.
> >
> > We've built a system like this at Technorati (Lucene, not Solr) and had components like the "namenode" or "controller" that you mention. If you look at the Hadoop project, you will see something similar in concept (NameNode), though it deals with raw data blocks, their placement in the cluster, etc. As a matter of fact, I am currently running its "re-balancer" in order to move some of the blocks around in the cluster. That matches what you are describing for moving documents from one shard to the other. Of course, you can simplify things and just have this central piece be aware of any new servers and simply get it to place any new docs on the new servers and create a new shard there. Or you can get fancy and take into consideration the hardware resources - the CPU, the disk space, the memory - and use that to figure out how much each machine in your cluster can handle and maximize its use based on this knowledge. :)
> >
> > I think Solr and Nutch are in desperate need of this central component (it must not be a SPOF!) for shard management.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: marcusherou <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, May 9, 2008 2:37:19 AM
> > > Subject: Re: Solr feasibility with terabyte-scale data
> > >
> > > Hi.
> > > I will head down a path like yours as well within some months from now. Currently I have an index of ~10M docs and store only ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and quite soon after that 1G documents. Each document has on average about 3-5k of data.
> > >
> > > We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or at least shared storage: one mount point). Hope this will be the right choice; only the future can tell.
> > >
> > > Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of SOLR if you must have the whole result in memory, sort it and create an XML response. I did hit such constraints when I couldn't afford the instances enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want an optimized index if you want to reduce search time.
> > >
> > > I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance have only a little piece of the total index. This will require a master database or namenode of some sort (or, more simply, just a properties file in each index dir) to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize.
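A hypothetical sketch of the bookkeeping such a namenode/master could do (all class and method names here are invented for illustration, not from Solr or any real codebase): route each new doc to the smallest shard, so a freshly added empty shard gets filled first, and compute how far a shard is from its fair 1/N share when rebalancing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a shard-placement / rebalancing policy:
// new docs go to the smallest shard, and a newly added shard can ask
// how many docs it should "steal" to reach its fair 1/N share.
class ShardManager {
    private final Map<String, Integer> docCounts = new HashMap<>();

    void addShard(String shard) { docCounts.putIfAbsent(shard, 0); }

    // New documents always go to the shard with the fewest docs,
    // so a new (empty) shard is prioritized until it catches up.
    String shardForNewDoc() {
        String best = null;
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (best == null || e.getValue() < docCounts.get(best)) {
                best = e.getKey();
            }
        }
        docCounts.merge(best, 1, Integer::sum);
        return best;
    }

    // How many docs the given shard is short of the 1/N threshold
    // (10 shards -> each should hold roughly 1/10 of all docs).
    int docsToSteal(String shard) {
        int total = docCounts.values().stream().mapToInt(Integer::intValue).sum();
        int target = total / docCounts.size();
        return Math.max(0, target - docCounts.get(shard));
    }
}
```

Tracking only per-shard doc counts (rather than a full doc-to-shard map) is the "properties file in each index dir" variant: cheap, but it cannot answer which shard holds a given doc.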
> > > This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or so).
> > >
> > > I think this will cut it, enabling us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it and then just copy it to the main index, to minimize the burden on the shared storage.
> > >
> > > Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. I personally believe, after talking to other guys who have built big search engines, that you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this; I will love to test it whenever my new installation of servers and enclosures is finished.
> > >
> > > Currently my idea is something like this.
> > > public Page<Document> search(SearchDocumentCommand sdc)
> > > {
> > >     Set<Integer> ids = documentIndexers.keySet();
> > >     int nrOfSearchers = ids.size();
> > >     int totalItems = 0;
> > >     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
> > >     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> > >     {
> > >         Integer id = iterator.next();
> > >         List<DocumentIndexer> indexers = documentIndexers.get(id);
> > >         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
> > >         SearchDocumentCommand sdc2 = copy(sdc);
> > >         sdc2.setPage(sdc.getPage() / nrOfSearchers);
> > >         Page<Document> res = indexer.search(sdc2); // search with the adjusted copy, not the original sdc
> > >         totalItems += res.getTotalItems();
> > >         docs.addAll(res);
> > >     }
> > >
> > >     if (sdc.getComparator() != null)
> > >     {
> > >         Collections.sort(docs, sdc.getComparator());
> > >     }
> > >
> > >     docs.setTotalItems(totalItems);
> > >     return docs;
> > > }
> > >
> > > This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch back and forth between Solr and raw Lucene, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
> > >
> > > I think this approach is quite OK, but the paging stuff is broken, I think. However, the search time will at best be proportional to the number of searchers, probably a lot worse. To get even more speed, each document indexer should be put into a separate thread with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The future result times out after, let's say, 750 msec, and the client ignores all searchers which are slower. Probably some performance metrics should be gathered about each searcher so the client knows which indexers to prefer over the others.
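That timeout idea could be sketched today with java.util.concurrent, the JDK descendant of Doug Lea's EDU.oswego.cs.dl.util.concurrent package. The Searcher interface and the merge policy below are illustrative assumptions, not Solr API: fan the query out to every shard, wait until a shared deadline, and silently drop searchers that are too slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: query every shard in parallel on a shared thread pool,
// collect results until a deadline, and ignore shards that miss it.
class TimedFanOut {
    interface Searcher {            // illustrative stand-in for a shard searcher
        List<String> search(String query) throws Exception;
    }

    static List<String> search(List<Searcher> searchers, String query,
                               long timeoutMs, ExecutorService pool) {
        // Submit all shard queries up front so they run concurrently.
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Searcher s : searchers) {
            futures.add(pool.submit(() -> s.search(query)));
        }
        List<String> merged = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeoutMs;
        for (Future<List<String>> f : futures) {
            try {
                // One shared deadline for the whole fan-out, not per shard.
                long remaining = Math.max(1, deadline - System.currentTimeMillis());
                merged.addAll(f.get(remaining, TimeUnit.MILLISECONDS));
            } catch (Exception e) {
                f.cancel(true);     // too slow (or failed): ignore this shard
            }
        }
        return merged;
    }
}
```

Using one bounded, shared pool for all client requests also answers the thread-explosion worry below: 50 concurrent client searches over 50 shards queue onto the same fixed set of worker threads instead of spawning 2500 of them.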
> > > But of course if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combo of iterative and parallel search needs to be done, with the ratio configurable.
> > >
> > > The controller pattern is used by Google, I think; Peter Zaitsev (mysqlperformanceblog) once told me so.
> > >
> > > Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > > Phillip Farber wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4 MB. Some limited faceting or additional metadata fields may be added later.
> > > >
> > > > The data in question currently amounts to about 1.1 TB of OCR (about 1M docs), which we expect to increase to 10 TB over time. Pilot tests on the desktop w/ a 2.6 GHz P4 with 2.5 GB memory and a 1 GB Java heap, on ~180 MB of data via HTTP, suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 TB hump). We envision nightly commits/optimizes.
> > > >
> > > > We expect to have a low QPS rate (<10) and probably will not need millisecond query response.
> > > > Our environment makes available Apache on blade servers (Dell 1955, dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add more as demand requires.
> > > >
> > > > While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?
> > > >
> > > > Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?
> > > >
> > > > Given these parameters, is it realistic to think that Solr could handle the task?
> > > >
> > > > Any advice/wisdom greatly appreciated,
> > > >
> > > > Phil
> > >
> > > --
> > > View this message in context: http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [EMAIL PROTECTED]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/