Thanks Ken. I will take a look, be sure of that :)
Kindly

//Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]> wrote:

> Hi Marcus,
>
>> It seems a lot of what you're describing is really similar to MapReduce,
>> so I think Otis' suggestion to look at Hadoop is a good one: it might
>> prevent a lot of headaches, and they've already solved a lot of the
>> tricky problems. There are a number of ridiculously sized projects using
>> it to solve their scale problems, not least Yahoo...
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until
> Monday to take a look :)
>
> -- Ken
>
> On 9 May 2008, at 01:17, Marcus Herou wrote:
>
>>> Cool.
>>>
>>> Since you must certainly already have a good partitioning scheme, could
>>> you elaborate on a high level how you set this up?
>>>
>>> I'm certain that I will shoot myself in the foot both once and twice
>>> before getting it right, but this is what I'm good at; to never stop
>>> trying :) However, it is nice to start playing at least on the right
>>> side of the football field, so a little push in the back would be
>>> really helpful.
>>>
>>> Kindly
>>>
>>> //Marcus
>>>
>>> On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi, we have an index of ~300 GB, which is at least approaching the
>>>> ballpark you're in.
>>>>
>>>> Lucky for us, to coin a phrase, we have an 'embarrassingly
>>>> partitionable' index, so we can just scale out horizontally across
>>>> commodity hardware with no problems at all. We're also using the
>>>> multicore features available in the development Solr version to reduce
>>>> the granularity of core size by an order of magnitude: this makes for
>>>> lots of small commits, rather than few long ones.
>>>>
>>>> There was mention somewhere in the thread of document collections: if
>>>> you're going to be filtering by collection, I'd strongly recommend
>>>> partitioning too. It makes scaling so much less painful!
>>>>
>>>> James
>>>>
>>>> On 8 May 2008, at 23:37, marcusherou wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> I will as well head down a path like yours within some months from
>>>>> now. Currently I have an index of ~10M docs and only store ids in the
>>>>> index, for performance and distribution reasons. When we enter a new
>>>>> market I'm assuming we will soon hit 100M, and quite soon after that
>>>>> 1G documents. Each document has on average about 3-5 kB of data.
>>>>>
>>>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA
>>>>> enclosures as shared storage (think of it as a SAN, or at least shared
>>>>> storage with one mount point). Hope this will be the right choice;
>>>>> only the future can tell.
>>>>>
>>>>> Since we are developing a search engine, I frankly don't think even
>>>>> having 100's of Solr instances serving the index will cut it
>>>>> performance-wise if we have one big index. I totally agree with the
>>>>> others claiming that you will most definitely go OOM or hit some other
>>>>> constraint of Solr if you must have the whole result in memory, sort
>>>>> it and create an XML response. I did hit such constraints when I
>>>>> couldn't afford to give the instances enough memory, and I had only 1M
>>>>> docs back then. And think of it... Optimizing a TB index will take a
>>>>> long, long time, and you really want to have an optimized index if you
>>>>> want to reduce search time.
>>>>>
>>>>> I am thinking of a sharding solution where I fragment the index over
>>>>> the disk(s) and let each Solr instance have only a little piece of the
>>>>> total index. This will require a master database or namenode (or,
>>>>> simpler, just a properties file in each index dir) of some sort to
>>>>> know which docs are located on which machine, or at least how many
>>>>> docs each shard has. This is to ensure that whenever you introduce a
>>>>> new Solr instance with a new shard, the master indexer will know which
>>>>> shard to prioritize. This is probably not enough either, since all new
>>>>> docs will go to the new shard until it is filled (has the same size as
>>>>> the others); only then will all shards receive docs in a load-balanced
>>>>> fashion. So whenever you want to add a new indexer, you probably need
>>>>> to initiate a "stealing" process where it steals docs from the others
>>>>> until it reaches some sort of threshold (10 servers = each shard
>>>>> should have 1/10 of the docs, or such).
>>>>>
>>>>> I think this will cut it and enable us to grow with the data. I also
>>>>> think doing a distributed reindexing will be a good thing when it
>>>>> comes to cutting both indexing and optimizing time. Probably each
>>>>> indexer should buffer its shard locally on RAID1 SCSI disks, optimize
>>>>> it and then just copy it to the main index to minimize the burden on
>>>>> the shared storage.
>>>>>
>>>>> Let's say the indexing part is all fancy and working at TB scale; now
>>>>> we come to searching. After talking to other guys who have built big
>>>>> search engines, I personally believe that you need to introduce a
>>>>> controller-like searcher on the client side which itself searches all
>>>>> of the shards and merges the responses. Perhaps Distributed Solr
>>>>> solves this; I will love to test it whenever my new installation of
>>>>> servers and enclosures is finished.
>>>>>
>>>>> Currently my idea is something like this:
>>>>>
>>>>> public Page<Document> search(SearchDocumentCommand sdc)
>>>>> {
>>>>>     Set<Integer> ids = documentIndexers.keySet();
>>>>>     int nrOfSearchers = ids.size();
>>>>>     int totalItems = 0;
>>>>>     Page<Document> docs = new Page(sdc.getPage(), sdc.getPageSize());
>>>>>     for (Integer id : ids)
>>>>>     {
>>>>>         // Pick a random replica of this shard to spread the load.
>>>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>         DocumentIndexer indexer =
>>>>>             indexers.get(random.nextInt(indexers.size()));
>>>>>         // Scale the requested page down by the number of shards, and
>>>>>         // search with the adjusted copy (not the original command).
>>>>>         SearchDocumentCommand sdc2 = copy(sdc);
>>>>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>>>>         Page<Document> res = indexer.search(sdc2);
>>>>>         totalItems += res.getTotalItems();
>>>>>         docs.addAll(res);
>>>>>     }
>>>>>
>>>>>     if (sdc.getComparator() != null)
>>>>>     {
>>>>>         Collections.sort(docs, sdc.getComparator());
>>>>>     }
>>>>>
>>>>>     docs.setTotalItems(totalItems);
>>>>>     return docs;
>>>>> }
>>>>>
>>>>> This is my RaidedDocumentIndexer, which wraps a set of
>>>>> DocumentIndexers. I switch from Solr to raw Lucene back and forth,
>>>>> benchmarking and comparing stuff, so I have two implementations of
>>>>> DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to
>>>>> make the switch easy.
>>>>>
>>>>> I think this approach is quite OK, but I think the paging stuff is
>>>>> broken.
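>>>>>
>>>>> A minimal sketch of how the paging could be fixed (untested, and it
>>>>> assumes SearchDocumentCommand also has a setPageSize()): ask every
>>>>> shard for everything up to and including the requested page, merge-sort
>>>>> the union, then slice out the global page. Wasteful for deep pages, but
>>>>> at least correct:
>>>>>
>>>>> // import java.util.*;
>>>>> public Page<Document> searchPaged(SearchDocumentCommand sdc)
>>>>> {
>>>>>     // Each shard must return the first (page + 1) * pageSize hits,
>>>>>     // since in the worst case the whole global page lives on one shard.
>>>>>     int wanted = (sdc.getPage() + 1) * sdc.getPageSize();
>>>>>     List<Document> merged = new ArrayList<Document>();
>>>>>     int totalItems = 0;
>>>>>     for (Integer id : documentIndexers.keySet())
>>>>>     {
>>>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>         DocumentIndexer indexer =
>>>>>             indexers.get(random.nextInt(indexers.size()));
>>>>>         SearchDocumentCommand perShard = copy(sdc);
>>>>>         perShard.setPage(0);
>>>>>         perShard.setPageSize(wanted); // assumed setter
>>>>>         Page<Document> res = indexer.search(perShard);
>>>>>         totalItems += res.getTotalItems();
>>>>>         merged.addAll(res);
>>>>>     }
>>>>>     if (sdc.getComparator() != null)
>>>>>     {
>>>>>         Collections.sort(merged, sdc.getComparator());
>>>>>     }
>>>>>     // Slice the requested page out of the merged, sorted union.
>>>>>     int from = Math.min(sdc.getPage() * sdc.getPageSize(), merged.size());
>>>>>     int to = Math.min(from + sdc.getPageSize(), merged.size());
>>>>>     Page<Document> docs = new Page(sdc.getPage(), sdc.getPageSize());
>>>>>     docs.addAll(merged.subList(from, to));
>>>>>     docs.setTotalItems(totalItems);
>>>>>     return docs;
>>>>> }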
>>>>> However, the search time will at best grow linearly with the number
>>>>> of searchers, probably a lot worse. To get even more speed, each
>>>>> document indexer should be put into a separate thread, with something
>>>>> like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with
>>>>> a thread pool (a rough sketch follows below the quoted message). The
>>>>> future result times out after, let's say, 750 msec, and the client
>>>>> ignores all searchers which are slower. Probably some performance
>>>>> metrics should be gathered about each searcher, so the client knows
>>>>> which indexers to prefer over the others. But of course, if you have
>>>>> 50 searchers, having each client thread spawn yet another 50 threads
>>>>> isn't a good thing either. So perhaps a combination of iterative and
>>>>> parallel search needs to be done, with the ratio configurable.
>>>>>
>>>>> The controller pattern is used by Google, I think; Peter Zaitsev
>>>>> (mysqlperformanceblog) once told me so.
>>>>>
>>>>> Hope I gave some insight into how I plan to scale to TB size, and
>>>>> hopefully someone smacks me on my head and says "Hey dude, do it like
>>>>> this instead".
>>>>>
>>>>> Kindly
>>>>>
>>>>> //Marcus
>>>>>
>>>>>
>>>>> Phillip Farber wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> We are considering Solr 1.2 to index and search a terabyte-scale
>>>>>> dataset of OCR. Initially our requirements are simple: basic
>>>>>> tokenizing, score sorting only, no faceting. The schema is simple
>>>>>> too. A document consists of a numeric id, stored and indexed, and a
>>>>>> large text field, indexed not stored, containing the OCR, typically
>>>>>> ~1.4 MB. Some limited faceting or additional metadata fields may be
>>>>>> added later.
>>>>>>
>>>>>> The data in question currently amounts to about 1.1 TB of OCR (about
>>>>>> 1M docs), which we expect to increase to 10 TB over time. Pilot tests
>>>>>> on the desktop (2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap) on
>>>>>> ~180 MB of data via HTTP suggest we can index at a rate sufficient to
>>>>>> keep up with the inputs (after getting over the 1.1 TB hump). We
>>>>>> envision nightly commits/optimizes.
>>>>>>
>>>>>> We expect a low QPS rate (<10) and probably will not need millisecond
>>>>>> query response.
>>>>>>
>>>>>> Our environment makes available Apache on blade servers (Dell 1955
>>>>>> dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*,
>>>>>> high-performance NAS system over a dedicated (out-of-band) GbE switch
>>>>>> (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are
>>>>>> starting with 2 blades and will add more as demands require.
>>>>>>
>>>>>> While we have a lot of storage, the idea of master/slave Solr
>>>>>> Collection Distribution to add more Solr instances clearly means
>>>>>> duplicating an immense index. Is it possible to use one instance to
>>>>>> update the index on the NAS while other instances only read the index
>>>>>> and commit to keep their caches warm instead?
>>>>>>
>>>>>> Should we expect Solr indexing time to slow significantly as we scale
>>>>>> up? What kind of query performance could we expect? Is it totally
>>>>>> naive even to consider Solr at this kind of scale?
>>>>>>
>>>>>> Given these parameters, is it realistic to think that Solr could
>>>>>> handle the task?
>>>>>>
>>>>>> Any advice/wisdom greatly appreciated,
>>>>>>
>>>>>> Phil
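>>>>>
>>>>> PS. The rough sketch of the thread pool + timeout idea I mentioned
>>>>> above, in java.util.concurrent terms (the successor of the EDU.oswego
>>>>> package). Untested; it reuses the documentIndexers map, random and
>>>>> DocumentIndexer from my earlier code, and simply drops every shard
>>>>> that doesn't answer within 750 msec:
>>>>>
>>>>> // import java.util.*;
>>>>> // import java.util.concurrent.*;
>>>>>
>>>>> private final ExecutorService pool = Executors.newFixedThreadPool(50);
>>>>>
>>>>> public List<Page<Document>> searchAll(final SearchDocumentCommand sdc)
>>>>>     throws InterruptedException
>>>>> {
>>>>>     // One task per shard, each searching a random replica.
>>>>>     List<Callable<Page<Document>>> tasks =
>>>>>         new ArrayList<Callable<Page<Document>>>();
>>>>>     for (final Integer id : documentIndexers.keySet())
>>>>>     {
>>>>>         tasks.add(new Callable<Page<Document>>()
>>>>>         {
>>>>>             public Page<Document> call() throws Exception
>>>>>             {
>>>>>                 List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>                 return indexers
>>>>>                     .get(random.nextInt(indexers.size()))
>>>>>                     .search(sdc);
>>>>>             }
>>>>>         });
>>>>>     }
>>>>>     // invokeAll cancels every task still running when the timeout
>>>>>     // expires; cancelled futures throw CancellationException on get().
>>>>>     List<Future<Page<Document>>> futures =
>>>>>         pool.invokeAll(tasks, 750, TimeUnit.MILLISECONDS);
>>>>>     List<Page<Document>> results = new ArrayList<Page<Document>>();
>>>>>     for (Future<Page<Document>> f : futures)
>>>>>     {
>>>>>         try
>>>>>         {
>>>>>             results.add(f.get());
>>>>>         }
>>>>>         catch (CancellationException tooSlow)
>>>>>         {
>>>>>             // Ignore searchers slower than the timeout.
>>>>>         }
>>>>>         catch (ExecutionException failed)
>>>>>         {
>>>>>             // Ignore searchers that threw; per-searcher metrics
>>>>>             // could be gathered here.
>>>>>         }
>>>>>     }
>>>>>     return results;
>>>>> }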
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/