From what I can tell from the overview on http://katta.wiki.sourceforge.net/, it's a partial replication of Solr/Nutch functionality, plus some goodies. It might have been better to work those goodies into some friendly contrib/, be it Solr, Nutch, Hadoop, or Lucene. Anyhow, let's see what happens there! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 5:37:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
>
> Hi Otis,
>
>> You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. have the intention to bring it under some contrib/ around here? Would that not make sense?
>
> I'm not working on the project, so I can't speak for Stefan & friends... but my guess is that it's going to live separately, as something independent of Solr/Nutch. If you view it as search plumbing that's usable in multiple environments, then that makes sense. If you view it as replicating core Solr (or Nutch) functionality, then it sucks. Not sure what the outcome will be.
>
> -- Ken
>
>> ----- Original Message ----
>>> From: Ken Krugler
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, May 9, 2008 4:26:19 PM
>>> Subject: Re: Solr feasibility with terabyte-scale data
>>>
>>> Hi Marcus,
>>>
>>>> It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...
>>>
>>> You should also look at a new project called Katta:
>>>
>>> http://katta.wiki.sourceforge.net/
>>>
>>> First code check-in should be happening this weekend, so I'd wait until Monday to take a look :)
>>>
>>> -- Ken
>>>
>>>> On 9 May 2008, at 01:17, Marcus Herou wrote:
>>>>
>>>>> Cool.
>>>>>
>>>>> Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up?
>>>>>
>>>>> I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.
>>>>>
>>>>> Kindly
>>>>>
>>>>> //Marcus
>>>>>
>>>>> On Fri, May 9, 2008 at 9:36 AM, James Brady wrote:
>>>>>
>>>>>> Hi, we have an index of ~300GB, which is at least approaching the ballpark you're in.
>>>>>>
>>>>>> Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than a few long ones.
>>>>>>
>>>>>> There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful!
>>>>>>
>>>>>> James
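
(A rough sketch of the kind of hash-based routing that makes an index 'embarrassingly partitionable' by collection, as James describes above. All names here are invented for illustration, and the code is untested.)

    import java.util.List;

    // Route every document to a fixed shard by hashing its collection id,
    // so that filtering by collection only ever touches one shard.
    public class CollectionRouter {
        private final List<String> shardUrls;  // hypothetical list of Solr core URLs

        public CollectionRouter(List<String> shardUrls) {
            this.shardUrls = shardUrls;
        }

        public String shardFor(String collectionId) {
            // Mask the sign bit rather than using Math.abs(), which
            // overflows on Integer.MIN_VALUE.
            int bucket = (collectionId.hashCode() & 0x7fffffff) % shardUrls.size();
            return shardUrls.get(bucket);
        }
    }
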
>>>>>> On 8 May 2008, at 23:37, marcusherou wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs, and I only store ids in the index, for performance and distribution reasons. When we enter a new market, I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5k of data.
>>>>>>>
>>>>>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.
>>>>>>>
>>>>>>> Since we are developing a search engine, I frankly don't think even having hundreds of Solr instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of Solr if you must have the whole result in memory, sort it, and create an XML response. I did hit such constraints when I couldn't afford the instances enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want to have an optimized index if you want to reduce search time.
>>>>>>>
>>>>>>> I am thinking of a sharding solution where I fragment the index over the disk(s) and let each Solr instance have only a little piece of the total index. This will require a master database or namenode of some sort (or, simpler, just a properties file in each index dir) to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new Solr instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer, you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such); see the sketch below.
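
(A back-of-the-envelope sketch of the "stealing" threshold Marcus describes: given per-shard doc counts, compute how many docs a newly added shard should pull over to approach the 1/N target. Names are invented; actually moving the documents is the hard part, and is not shown.)

    import java.util.Map;

    public class ShardBalancer {
        // How many docs the new shard should steal so that every shard
        // converges on roughly total/N (e.g. 10 shards -> 1/10 each).
        public static long docsToSteal(Map<String, Long> docsPerShard, String newShard) {
            if (docsPerShard.isEmpty()) {
                return 0;
            }
            long total = 0;
            for (long count : docsPerShard.values()) {
                total += count;
            }
            long target = total / docsPerShard.size();
            Long current = docsPerShard.get(newShard);
            long deficit = target - (current == null ? 0 : current);
            return Math.max(deficit, 0);
        }
    }
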
>>>>>>> I think this will cut it, enabling us to grow with the data. I also think doing a distributed reindexing will be a good thing when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the main index, to minimize the burden on the shared storage.
>>>>>>>
>>>>>>> Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. After talking to other guys who have built big search engines, I personally believe you need to introduce a controller-like searcher on the client side which itself searches in all of the shards and merges the response. Perhaps Distributed Solr solves this; I will love to test it whenever my new installation of servers and enclosures is finished.
>>>>>>>
>>>>>>> Currently my idea is something like this:
>>>>>>>
>>>>>>> public Page search(SearchDocumentCommand sdc)
>>>>>>> {
>>>>>>>     Set<Integer> ids = documentIndexers.keySet();
>>>>>>>     int nrOfSearchers = ids.size();
>>>>>>>     int totalItems = 0;
>>>>>>>     Page docs = new Page(sdc.getPage(), sdc.getPageSize());
>>>>>>>     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
>>>>>>>     {
>>>>>>>         Integer id = iterator.next();
>>>>>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>>>         // pick a random replica of this shard for load balancing
>>>>>>>         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
>>>>>>>         SearchDocumentCommand sdc2 = copy(sdc);
>>>>>>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>>>>>>         // search with the adjusted copy (sdc2), not the original sdc
>>>>>>>         Page res = indexer.search(sdc2);
>>>>>>>         totalItems += res.getTotalItems();
>>>>>>>         docs.addAll(res);
>>>>>>>     }
>>>>>>>
>>>>>>>     if (sdc.getComparator() != null)
>>>>>>>     {
>>>>>>>         Collections.sort(docs, sdc.getComparator());
>>>>>>>     }
>>>>>>>
>>>>>>>     docs.setTotalItems(totalItems);
>>>>>>>
>>>>>>>     return docs;
>>>>>>> }
>>>>>>>
>>>>>>> This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch back and forth between Solr and raw Lucene, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
>>>>>>>
>>>>>>> I think this approach is quite OK, but the paging stuff is broken, I think. And the search time will at best be proportional to the number of searchers, probably a lot worse. To get more speed, each document indexer should be run in a separate thread, with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The future result times out after, let's say, 750 msec, and the client ignores all searchers which are slower. Probably some performance metrics should be gathered about each searcher, so the client knows which indexers to prefer over the others. But of course, if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combination of iterative and parallel search needs to be done, with the ratio configurable.
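
(The timeout idea above, sketched with java.util.concurrent, the java.util successor to Doug Lea's EDU.oswego FutureResult library. DocumentIndexer, Page, and SearchDocumentCommand stand in for Marcus's classes from the code above; the rest is illustrative and untested.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class ParallelSearcher {
        // Fan the query out to all indexers at once, then collect whatever
        // comes back within the time budget; slower searchers are ignored.
        public List<Page> searchAll(List<DocumentIndexer> indexers,
                                    final SearchDocumentCommand sdc,
                                    ExecutorService pool, long budgetMillis) {
            List<Future<Page>> futures = new ArrayList<Future<Page>>();
            for (final DocumentIndexer indexer : indexers) {
                futures.add(pool.submit(new Callable<Page>() {
                    public Page call() throws Exception {
                        return indexer.search(sdc);
                    }
                }));
            }
            List<Page> results = new ArrayList<Page>();
            long deadline = System.currentTimeMillis() + budgetMillis;  // e.g. 750
            for (Future<Page> f : futures) {
                try {
                    long remaining = Math.max(deadline - System.currentTimeMillis(), 0);
                    results.add(f.get(remaining, TimeUnit.MILLISECONDS));
                } catch (TimeoutException e) {
                    f.cancel(true);  // ignore searchers slower than the budget
                } catch (Exception e) {
                    // failed searcher: skip it; a metric could be recorded here
                }
            }
            return results;
        }
    }
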
>>>>>>> The controller pattern is used by Google, I think; Peter Zaitsev (mysqlperformanceblog) once told me so.
>>>>>>>
>>>>>>> Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
>>>>>>>
>>>>>>> Kindly
>>>>>>>
>>>>>>> //Marcus
>>>>>>>
>>>>>>> Phillip Farber wrote:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4MB. Some limited faceting or additional metadata fields may be added later.
>>>>>>>>
>>>>>>>> The data in question currently amounts to about 1.1TB of OCR (about 1M docs), which we expect to increase to 10TB over time. Pilot tests on the desktop (a 2.6 GHz P4 with 2.5GB memory and a 1GB Java heap) on ~180MB of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1TB hump). We envision nightly commits/optimizes.
>>>>>>>>
>>>>>>>> We expect a low QPS rate (<10) and probably will not need millisecond query response.
>>>>>>>>
>>>>>>>> Our environment makes available Apache on blade servers (Dell 1955, dual dual-core 3.x GHz Xeons with 8GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add more as demand requires.
>>>>>>>>
>>>>>>>> While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible instead to use one instance to update the index on the NAS while other instances only read the index, committing just to keep their caches warm?
>>>>>>>>
>>>>>>>> Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?
>>>>>>>>
>>>>>>>> Given these parameters, is it realistic to think that Solr could handle the task?
>>>>>>>>
>>>>>>>> Any advice/wisdom greatly appreciated,
>>>>>>>>
>>>>>>>> Phil
>>>
>>> --
>>> Ken Krugler
>>> Krugle, Inc.
>>> +1 530-210-6378
>>> "If you can't find it, you can't fix it"

> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"