You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. intend to bring it under some contrib/ around here? Would that not make sense?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 4:26:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
>
> Hi Marcus,
>
> > It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until Monday to take a look :)
>
> -- Ken
>
> > On 9 May 2008, at 01:17, Marcus Herou wrote:
> >
> >> Cool.
> >>
> >> Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up?
> >>
> >> I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.
> >>
> >> Kindly
> >>
> >> //Marcus
> >>
> >> On Fri, May 9, 2008 at 9:36 AM, James Brady wrote:
> >>
> >>> Hi, we have an index of ~300 GB, which is at least approaching the ballpark you're in.
> >>>
> >>> Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than a few long ones.
> >>>
> >>> There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful!
> >>>
> >>> James
> >>>
> >>> On 8 May 2008, at 23:37, marcusherou wrote:
> >>>
> >>>> Hi.
> >>>>
> >>>> I will head down a path like yours as well within some months from now. Currently I have an index of ~10M docs and only store ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5 KB of data.
> >>>>
> >>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.
> >>>>
> >>>> Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of SOLR if you must have the whole result in memory, sort it, and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then.
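[Lucene itself sidesteps exactly this trap internally: hits are collected into a bounded priority queue, so memory stays proportional to the requested page size rather than to the total hit count. A minimal sketch of that idea in plain Java follows; the Hit and TopNCollector names are invented for illustration and are not from any code in this thread.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** A hypothetical (id, score) pair standing in for a search hit. */
class Hit {
    final int id;
    final float score;
    Hit(int id, float score) { this.id = id; this.score = score; }
}

/** Keeps at most `limit` best-scoring hits; memory is O(limit) no matter
 *  how many hits stream past, so a huge result set cannot cause an OOM. */
class TopNCollector {
    private final int limit;
    // Min-heap ordered by score: the root is the worst hit kept so far,
    // which makes eviction cheap when a better hit arrives.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int limit) {
        this.limit = limit;
        this.heap = new PriorityQueue<Hit>(limit, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
        });
    }

    void collect(Hit hit) {
        if (heap.size() < limit) {
            heap.add(hit);
        } else if (hit.score > heap.peek().score) {
            heap.poll();   // evict the current worst
            heap.add(hit);
        }
    }

    /** The retained hits, best score first. */
    List<Hit> topDocs() {
        List<Hit> docs = new ArrayList<Hit>(heap);
        Collections.sort(docs, Collections.reverseOrder(heap.comparator()));
        return docs;
    }
}

Each shard searcher would feed its hits through such a collector and return only the top page, so a merge step never sees more than shards x limit entries.]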
> >>>> And think of it: optimizing a TB-sized index will take a long, long time, and you really want an optimized index if you want to reduce search time.
> >>>>
> >>>> I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance serve only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer knows which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer, you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such).
> >>>>
> >>>> I think this will cut it and enable us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the main index to minimize the burden on the shared storage.
> >>>>
> >>>> Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. After talking to other guys who have built big search engines, I personally believe you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this, and I will love to test it whenever my new installation of servers and enclosures is finished.
> >>>>
> >>>> Currently my idea is something like this:
> >>>>
> >>>> public Page<Document> search(SearchDocumentCommand sdc)
> >>>> {
> >>>>     Set<Integer> ids = documentIndexers.keySet();
> >>>>     int nrOfSearchers = ids.size();
> >>>>     int totalItems = 0;
> >>>>     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
> >>>>     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
> >>>>     {
> >>>>         Integer id = iterator.next();
> >>>>         // Each shard has replicas; pick one at random to spread load.
> >>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
> >>>>         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
> >>>>         SearchDocumentCommand sdc2 = copy(sdc);
> >>>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
> >>>>         Page<Document> res = indexer.search(sdc2); // search with the per-shard copy
> >>>>         totalItems += res.getTotalItems();
> >>>>         docs.addAll(res);
> >>>>     }
> >>>>
> >>>>     if (sdc.getComparator() != null)
> >>>>     {
> >>>>         Collections.sort(docs, sdc.getComparator());
> >>>>     }
> >>>>
> >>>>     docs.setTotalItems(totalItems);
> >>>>
> >>>>     return docs;
> >>>> }
> >>>>
> >>>> This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch back and forth between Solr and raw Lucene, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
> >>>>
> >>>> I think this approach is quite OK, but the paging stuff is broken, I think. However, the search time will at best be proportional to the number of searchers, and probably a lot worse. To get even more speed, each document indexer should be put into a separate thread, with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The FutureResult times out after, let's say, 750 ms, and the client ignores all searchers that are slower. Probably some performance metrics should be gathered about each searcher so the client knows which indexers to prefer over the others. But of course, if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combo of iterative and parallel search needs to be done, with the ratio configurable.
> >>>>
> >>>> The controller pattern is used by Google, I think; Peter Zaitsev (mysqlperformanceblog) once told me so.
> >>>>
> >>>> Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
> >>>>
> >>>> Kindly
> >>>>
> >>>> //Marcus
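[Doug Lea's EDU.oswego.cs.dl.util.concurrent library mentioned above was the direct ancestor of java.util.concurrent in Java 5, so the timeout-and-skip fan-out Marcus describes maps almost one-to-one onto an ExecutorService. The sketch below is illustrative only: it borrows the hypothetical DocumentIndexer, Page, and SearchDocumentCommand types from the code above (Page used raw for brevity) and the suggested 750 ms budget, and assumes DocumentIndexer.search() is thread-safe.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Fans one query out to every shard searcher in parallel, waits at most
 *  budgetMs, and keeps only the shards that answered in time. */
class DeadlineSearcher {
    private final ExecutorService pool;
    private final long budgetMs;

    DeadlineSearcher(ExecutorService pool, long budgetMs) {
        this.pool = pool;          // one shared, bounded pool for all client threads
        this.budgetMs = budgetMs;  // e.g. 750 ms
    }

    List<Page> searchAll(List<DocumentIndexer> shards,
                         final SearchDocumentCommand sdc) throws InterruptedException {
        List<Callable<Page>> tasks = new ArrayList<Callable<Page>>();
        for (final DocumentIndexer indexer : shards) {
            tasks.add(new Callable<Page>() {
                public Page call() { return indexer.search(sdc); }
            });
        }
        // invokeAll blocks until all tasks finish or the deadline passes,
        // cancelling whatever is still running at that point.
        List<Page> pages = new ArrayList<Page>();
        for (Future<Page> f : pool.invokeAll(tasks, budgetMs, TimeUnit.MILLISECONDS)) {
            if (f.isCancelled()) {
                continue; // this shard was too slow: ignore it, as suggested
            }
            try {
                pages.add(f.get()); // merge these as in the search() method above
            } catch (ExecutionException e) {
                // a shard that failed outright is treated like a slow one
            }
        }
        return pages;
    }
}

Sharing one bounded pool across client threads avoids the 50-threads-per-request explosion mentioned above, at the cost of queries queuing when the pool is saturated.]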
> >>>> Phillip Farber wrote:
> >>>>
> >>>>> Hello everyone,
> >>>>>
> >>>>> We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4 MB. Some limited faceting or additional metadata fields may be added later.
> >>>>>
> >>>>> The data in question currently amounts to about 1.1 TB of OCR (about 1M docs), which we expect to increase to 10 TB over time. Pilot tests on the desktop (2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap) on ~180 MB of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 TB hump). We envision nightly commits/optimizes.
> >>>>>
> >>>>> We expect a low QPS rate (<10) and probably will not need millisecond query response.
> >>>>>
> >>>>> Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add more as demand requires.
> >>>>>
> >>>>> While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index, and just commit to keep their caches warm instead?
> >>>>>
> >>>>> Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?
> >>>>>
> >>>>> Given these parameters, is it realistic to think that Solr could handle the task?
> >>>>>
> >>>>> Any advice/wisdom greatly appreciated,
> >>>>>
> >>>>> Phil
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"