Quick reply to Otis and Ken. Otis: After a night's sleep I think you are absolutely right that some HPC grid like Hadoop or perhaps GlusterHPC should be used, regardless of my last comment.
Ken: I looked at the architecture of Katta and it looks really nice. I really believe Katta could be something which lives as a subproject of Lucene, since it surely fills a gap that is not filled by Nutch. Nutch surely does similar stuff, but this could actually be something that Nutch uses as a component to distribute the crawled index. I will try to join the Stefan & friends team.

Kindly
//Marcus

On Sat, May 10, 2008 at 6:03 PM, Marcus Herou <[EMAIL PROTECTED]> wrote:
> Hi Otis.
>
> Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to see that your snippet is almost a copy of mine; that gives me the right gut feeling about this :)
>
> I'm quite familiar with Hadoop, as you can see if you check out the code of my OS project AbstractCache -> http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project which aims to create storage solutions based on the Map and SortedMap interfaces. I use it everywhere in Tailsweep.com and used it as well at my former employer Eniro.se (largest yellow pages site in Sweden). It has been in constant development for five years.
>
> Since I'm a cluster freak of nature I love a project named GlusterFS, where they have managed to create a system without master/slave[s] and NameNode. The advantage of this is that it is a lot more scalable; the drawback is that you can get into split-brain situations, which guys on the mailing list are complaining about. Anyway, I tend to solve this with JGroups membership, where the coordinator can be any machine in the cluster, but in the group-joining process the first machine to join gets the privilege of becoming coordinator. But even with JGroups you can run into trouble with race conditions of all kinds (distributed locks, for example).
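The "first machine to join becomes coordinator" rule can be modeled in a few lines of plain Java. This is only an illustrative sketch, not JGroups code: in JGroups itself the membership view is ordered by join time and the coordinator is the first Address in View.getMembers(); the class and method names below are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a JGroups-style membership view: members are
// ordered by join time, so the oldest surviving member (index 0) is
// the coordinator. No real group communication happens here.
class MembershipView {
    private final List<String> members = new ArrayList<>();

    // A joining node is appended to the view, preserving join order.
    void join(String node) { members.add(node); }

    // A node that leaves (or crashes) is removed; if it was the
    // coordinator, the next-oldest member takes over automatically.
    void leave(String node) { members.remove(node); }

    // The coordinator is always the first (longest-lived) member.
    String coordinator() {
        return members.isEmpty() ? null : members.get(0);
    }

    boolean isCoordinator(String node) {
        return node.equals(coordinator());
    }
}
```

The deterministic "oldest member wins" rule is what removes the need for a fixed master: any machine can end up coordinator, which is exactly the property Marcus wants from GlusterFS-style masterless designs.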
> I've created an alternative to the Hadoop file system (mostly for fun) where you just add an object to the cluster and, based on the algorithm you choose, it is RAIDed or striped across the cluster.
>
> Anyway, this was off topic, but I think my experience in building membership-aware clusters will help me in this particular case.
>
> Kindly
>
> //Marcus
>
> On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Marcus,
> >
> > You are headed in the right direction.
> >
> > We've built a system like this at Technorati (Lucene, not Solr) and had components like the "namenode" or "controller" that you mention. If you look at the Hadoop project, you will see something similar in concept (NameNode), though it deals with raw data blocks, their placement in the cluster, etc. As a matter of fact, I am currently running its "re-balancer" in order to move some of the blocks around in the cluster. That matches what you are describing for moving documents from one shard to the other. Of course, you can simplify things and just have this central piece be aware of any new servers and simply get it to place any new docs on the new servers and create a new shard there. Or you can get fancy and take into consideration the hardware resources - the CPU, the disk space, the memory - and use that to figure out how much each machine in your cluster can handle and maximize its use based on this knowledge. :)
> >
> > I think Solr and Nutch are in desperate need of this central component (it must not be a SPOF!) for shard management.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: marcusherou <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, May 9, 2008 2:37:19 AM
> > > Subject: Re: Solr feasibility with terabyte-scale data
> > >
> > > Hi.
> > > I will head down a path like yours as well within some months from now. Currently I have an index of ~10M docs and store only ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and quite soon after that 1G documents. Each document has on average about 3-5k of data.
> > >
> > > We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or at least shared storage: one mount point). Hope this will be the right choice; only the future can tell.
> > >
> > > Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of SOLR if you must have the whole result in memory, sort it and create an XML response. I did hit such constraints when I couldn't afford the instances enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want an optimized index if you want to reduce search time.
> > >
> > > I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance have only a little piece of the total index. This will require a master database or namenode of some sort (or, more simply, just a properties file in each index dir) to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize.
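A hypothetical sketch of the bookkeeping such a namenode/master could do (all class and method names here are invented for illustration, not from Solr or any real codebase): route each new doc to the smallest shard, so a freshly added empty shard gets filled first, and compute how far a shard is from its fair 1/N share when rebalancing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a shard-placement / rebalancing policy:
// new docs go to the smallest shard, and a newly added shard can ask
// how many docs it should "steal" to reach its fair 1/N share.
class ShardManager {
    private final Map<String, Integer> docCounts = new HashMap<>();

    void addShard(String shard) { docCounts.putIfAbsent(shard, 0); }

    // New documents always go to the shard with the fewest docs,
    // so a new (empty) shard is prioritized until it catches up.
    String shardForNewDoc() {
        String best = null;
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (best == null || e.getValue() < docCounts.get(best)) {
                best = e.getKey();
            }
        }
        docCounts.merge(best, 1, Integer::sum);
        return best;
    }

    // How many docs the given shard is short of the 1/N threshold
    // (10 shards -> each should hold roughly 1/10 of all docs).
    int docsToSteal(String shard) {
        int total = docCounts.values().stream().mapToInt(Integer::intValue).sum();
        int target = total / docCounts.size();
        return Math.max(0, target - docCounts.get(shard));
    }
}
```

Tracking only per-shard doc counts (rather than a full doc-to-shard map) is the "properties file in each index dir" variant: cheap, but it cannot answer which shard holds a given doc.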
> > > This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or so).
> > >
> > > I think this will cut it, enabling us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it and then just copy it to the main index, to minimize the burden on the shared storage.
> > >
> > > Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. I personally believe, after talking to other guys who have built big search engines, that you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this; I will love to test it whenever my new installation of servers and enclosures is finished.
> > >
> > > Currently my idea is something like this.
> > > public Page<Document> search(SearchDocumentCommand sdc)
> > > {
> > >     Set<Integer> ids = documentIndexers.keySet();
> > >     int nrOfSearchers = ids.size();
> > >     int totalItems = 0;
> > >     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
> > >     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> > >     {
> > >         Integer id = iterator.next();
> > >         List<DocumentIndexer> indexers = documentIndexers.get(id);
> > >         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
> > >         SearchDocumentCommand sdc2 = copy(sdc);
> > >         sdc2.setPage(sdc.getPage() / nrOfSearchers);
> > >         Page<Document> res = indexer.search(sdc2); // search with the adjusted copy, not the original sdc
> > >         totalItems += res.getTotalItems();
> > >         docs.addAll(res);
> > >     }
> > >
> > >     if (sdc.getComparator() != null)
> > >     {
> > >         Collections.sort(docs, sdc.getComparator());
> > >     }
> > >
> > >     docs.setTotalItems(totalItems);
> > >     return docs;
> > > }
> > >
> > > This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch back and forth between Solr and raw Lucene, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
> > >
> > > I think this approach is quite OK, but the paging stuff is broken, I think. However, the search time will at best be proportional to the number of searchers, probably a lot worse. To get even more speed, each document indexer should be put into a separate thread with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The future result times out after, let's say, 750 msec, and the client ignores all searchers which are slower. Probably some performance metrics should be gathered about each searcher so the client knows which indexers to prefer over the others.
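That timeout idea could be sketched today with java.util.concurrent, the JDK descendant of Doug Lea's EDU.oswego.cs.dl.util.concurrent package. The Searcher interface and the merge policy below are illustrative assumptions, not Solr API: fan the query out to every shard, wait until a shared deadline, and silently drop searchers that are too slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: query every shard in parallel on a shared thread pool,
// collect results until a deadline, and ignore shards that miss it.
class TimedFanOut {
    interface Searcher {            // illustrative stand-in for a shard searcher
        List<String> search(String query) throws Exception;
    }

    static List<String> search(List<Searcher> searchers, String query,
                               long timeoutMs, ExecutorService pool) {
        // Submit all shard queries up front so they run concurrently.
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Searcher s : searchers) {
            futures.add(pool.submit(() -> s.search(query)));
        }
        List<String> merged = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeoutMs;
        for (Future<List<String>> f : futures) {
            try {
                // One shared deadline for the whole fan-out, not per shard.
                long remaining = Math.max(1, deadline - System.currentTimeMillis());
                merged.addAll(f.get(remaining, TimeUnit.MILLISECONDS));
            } catch (Exception e) {
                f.cancel(true);     // too slow (or failed): ignore this shard
            }
        }
        return merged;
    }
}
```

Using one bounded, shared pool for all client requests also answers the thread-explosion worry below: 50 concurrent client searches over 50 shards queue onto the same fixed set of worker threads instead of spawning 2500 of them.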
> > > But of course if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combo of iterative and parallel search needs to be done, with the ratio configurable.
> > >
> > > The controller pattern is used by Google, I think; Peter Zaitsev (mysqlperformanceblog) once told me so.
> > >
> > > Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > > Phillip Farber wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4 MB. Some limited faceting or additional metadata fields may be added later.
> > > >
> > > > The data in question currently amounts to about 1.1 TB of OCR (about 1M docs), which we expect to increase to 10 TB over time. Pilot tests on the desktop w/ a 2.6 GHz P4 with 2.5 GB memory and a 1 GB Java heap, on ~180 MB of data via HTTP, suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 TB hump). We envision nightly commits/optimizes.
> > > >
> > > > We expect to have a low QPS rate (<10) and probably will not need millisecond query response.
> > > > Our environment makes available Apache on blade servers (Dell 1955, dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add more as demand requires.
> > > >
> > > > While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?
> > > >
> > > > Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?
> > > >
> > > > Given these parameters, is it realistic to think that Solr could handle the task?
> > > >
> > > > Any advice/wisdom greatly appreciated,
> > > >
> > > > Phil
> > >
> > > --
> > > View this message in context: http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [EMAIL PROTECTED]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/