----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, May 9, 2008 4:26:19 PM
Subject: Re: Solr feasibility with terabyte-scale data
Hi Marcus,
>It seems a lot of what you're describing is really similar to
>MapReduce, so I think Otis' suggestion to look at Hadoop is a good
>one: it might prevent a lot of headaches and they've already solved
>a lot of the tricky problems. There are a number of ridiculously sized
>projects using it to solve their scale problems, not least Yahoo...
You should also look at a new project called Katta:
http://katta.wiki.sourceforge.net/
First code check-in should be happening this weekend, so I'd wait
until Monday to take a look :)
-- Ken
>On 9 May 2008, at 01:17, Marcus Herou wrote:
>
>>Cool.
>>
>>Since you must certainly already have a good partitioning scheme, could you
>>elaborate at a high level on how you set this up?
>>
>>I'm certain that I will shoot myself in the foot once or twice before
>>getting it right, but that is what I'm good at: never stop trying :)
>>However, it is nice to at least start playing on the right side of the
>>football field, so a little push in the right direction would be really helpful.
>>
>>Kindly
>>
>>//Marcus
>>
>>
>>
>>On Fri, May 9, 2008 at 9:36 AM, James Brady
>>wrote:
>>
>>>Hi, we have an index of ~300GB, which is at least approaching the ballpark
>>>you're in.
>>>
>>>Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
>>>index, so we can just scale out horizontally across commodity hardware with
>>>no problems at all. We're also using the multicore features available in the
>>>development Solr version to reduce the granularity of core size by an order
>>>of magnitude: this makes for lots of small commits, rather than a few long ones.
>>>
>>>There was mention somewhere in the thread of document collections: if
>>>you're going to be filtering by collection, I'd strongly recommend
>>>partitioning too. It makes scaling so much less painful!
>>>
>>>James
>>>
>>>
>>>On 8 May 2008, at 23:37, marcusherou wrote:
>>>
>>>>Hi.
>>>>
>>>>I will be heading down a path like yours within a few months from now.
>>>>Currently I have an index of ~10M docs and only store ids in the index for
>>>>performance and distribution reasons. When we enter a new market I'm
>>>>assuming we will soon hit 100M and quite soon after that 1G documents. Each
>>>>document has on average about 3-5 KB of data.
>>>>
>>>>We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
>>>>as shared storage (think of it as a SAN, or shared storage at least: one
>>>>mount point). Hope this will be the right choice; only the future can tell.
>>>>
>>>>Since we are developing a search engine, I frankly don't think even having
>>>>100s of SOLR instances serving the index will cut it performance-wise if we
>>>>have one big index. I totally agree with the others claiming that you will
>>>>most definitely go OOM or hit some other constraint of SOLR if you must
>>>>have the whole result in memory, sort it and create an XML response. I did
>>>>hit such constraints when I couldn't afford the instances to have enough
>>>>memory, and I had only 1M docs back then. And think of it... optimizing a TB
>>>>index will take a long, long time, and you really want to have an optimized
>>>>index if you want to reduce search time.
>>>>
>>>>I am thinking of a sharding solution where I fragment the index over the
>>>>disk(s) and let each SOLR instance have only a little piece of the total
>>>>index. This will require a master database or namenode (or, simpler, just a
>>>>properties file in each index dir) of some sort to know which docs are
>>>>located on which machine, or at least how many docs each shard has. This is
>>>>to ensure that whenever you introduce a new SOLR instance with a new shard,
>>>>the master indexer will know which shard to prioritize. This is probably not
>>>>enough either, since all new docs will go to the new shard until it is
>>>>filled (has the same size as the others); only then will all shards receive
>>>>docs in a load-balanced fashion. So whenever you want to add a new indexer
>>>>you probably need to initiate a "stealing" process where it steals docs from
>>>>the others until it reaches some sort of threshold (10 servers = each shard
>>>>should have 1/10 of the docs or so).
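>>>>
>>>>As a rough sketch (all class and field names here are hypothetical, just to
>>>>illustrate the routing idea), the master indexer could simply send each new
>>>>doc to the shard that is furthest behind, so a freshly added empty shard is
>>>>prioritized until it catches up:
>>>>
>>>>import java.util.List;
>>>>
>>>>public class ShardRouter
>>>>{
>>>>    // Hypothetical view of one shard as tracked by the master indexer
>>>>    // (e.g. read from a properties file in each index dir).
>>>>    public static class ShardInfo
>>>>    {
>>>>        public final String host;
>>>>        public long docCount;
>>>>
>>>>        public ShardInfo(String host, long docCount)
>>>>        {
>>>>            this.host = host;
>>>>            this.docCount = docCount;
>>>>        }
>>>>    }
>>>>
>>>>    // Pick the shard with the fewest docs; a brand new (empty) shard will
>>>>    // therefore receive all new docs until it has the same size as the rest.
>>>>    public ShardInfo selectShardForIndexing(List<ShardInfo> shards)
>>>>    {
>>>>        ShardInfo smallest = shards.get(0);
>>>>        for (ShardInfo shard : shards)
>>>>        {
>>>>            if (shard.docCount < smallest.docCount)
>>>>            {
>>>>                smallest = shard;
>>>>            }
>>>>        }
>>>>        smallest.docCount++; // assume the next doc goes to this shard
>>>>        return smallest;
>>>>    }
>>>>}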
>>>>
>>>>I think this will cut it and enable us to grow with the data. I think doing
>>>>a distributed reindexing will be a good thing as well when it comes to
>>>>cutting both indexing and optimizing time. Probably each indexer should
>>>>buffer its shard locally on RAID1 SCSI disks, optimize it and then just
>>>>copy it to the main index to minimize the burden on the shared storage.
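>>>>
>>>>Something like this minimal sketch against the Lucene 2.x-era API (the
>>>>class name, paths and analyzer are only placeholders):
>>>>
>>>>import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>import org.apache.lucene.index.IndexWriter;
>>>>import org.apache.lucene.store.Directory;
>>>>import org.apache.lucene.store.FSDirectory;
>>>>
>>>>public class ShardPublisher
>>>>{
>>>>    // Optimize the shard on fast local disk, then copy the finished
>>>>    // segments to the shared (GlusterFS) mount in one sequential pass.
>>>>    public void publishShard(String localPath, String sharedPath) throws Exception
>>>>    {
>>>>        Directory local = FSDirectory.getDirectory(localPath);
>>>>        IndexWriter writer = new IndexWriter(local, new StandardAnalyzer(), false);
>>>>        writer.optimize();   // merge the local shard down to one segment
>>>>        writer.close();
>>>>
>>>>        Directory shared = FSDirectory.getDirectory(sharedPath);
>>>>        Directory.copy(local, shared, true);  // push the optimized shard out
>>>>        shared.close();
>>>>    }
>>>>}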
>>>>
>>>>Let's say the indexing part will be all fancy and working at TB scale; now
>>>>we come to searching. After talking to other guys who have built big search
>>>>engines, I personally believe that you need to introduce a controller-like
>>>>searcher on the client side which itself searches all of the shards and
>>>>merges the responses. Perhaps Distributed Solr solves this; I will love to
>>>>test it whenever my new installation of servers and enclosures is finished.
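>>>>
>>>>(For reference, the distributed search work in the development Solr, as I
>>>>understand it, lets any one instance fan a query out to the others via a
>>>>"shards" request parameter and merge the responses itself; the three hosts
>>>>below are only a hypothetical setup:)
>>>>
>>>>import java.io.InputStream;
>>>>import java.net.URL;
>>>>
>>>>public class DistributedQueryExample
>>>>{
>>>>    public static void main(String[] args) throws Exception
>>>>    {
>>>>        // The receiving instance coordinates and merges across the listed shards.
>>>>        String url = "http://solr1:8983/solr/select?q=foo"
>>>>            + "&shards=solr1:8983/solr,solr2:8983/solr,solr3:8983/solr";
>>>>        InputStream in = new URL(url).openStream();
>>>>        // ... read and parse the XML response here ...
>>>>        in.close();
>>>>    }
>>>>}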
>>>>
>>>>Currently my idea is something like this.
>>>>public Page<Document> search(SearchDocumentCommand sdc)
>>>>{
>>>>    Set<Integer> ids = documentIndexers.keySet();
>>>>    int nrOfSearchers = ids.size();
>>>>    int totalItems = 0;
>>>>    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
>>>>    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
>>>>    {
>>>>        Integer id = iterator.next();
>>>>        List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
>>>>        SearchDocumentCommand sdc2 = copy(sdc);
>>>>        sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>>>        Page<Document> res = indexer.search(sdc2);
>>>>        totalItems += res.getTotalItems();
>>>>        docs.addAll(res);
>>>>    }
>>>>
>>>>    if (sdc.getComparator() != null)
>>>>    {
>>>>        Collections.sort(docs, sdc.getComparator());
>>>>    }
>>>>
>>>>    docs.setTotalItems(totalItems);
>>>>
>>>>    return docs;
>>>>}
>>>>
>>>>This is from my RaidedDocumentIndexer, which wraps a set of DocumentIndexers.
>>>>I switch between Solr and raw Lucene back and forth, benchmarking and
>>>>comparing stuff, so I have two implementations of DocumentIndexer
>>>>(SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
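>>>>
>>>>(Roughly, the interface behind this looks like the following; the generic
>>>>types are simplified here:)
>>>>
>>>>public interface DocumentIndexer
>>>>{
>>>>    // One implementation talks to a Solr instance over HTTP, the other
>>>>    // hits a raw Lucene index directly; both search a single shard.
>>>>    Page<Document> search(SearchDocumentCommand sdc);
>>>>}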
>>>>
>>>>I think this approach is quite OK, but the paging stuff is broken, I think.
>>>>However, the search time will at best grow in proportion to the number of
>>>>searchers, probably a lot worse. To get even more speed, each document
>>>>indexer should be put into a separate thread with something like
>>>>EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread
>>>>pool. The future result times out after, let's say, 750 msec and the client
>>>>ignores all searchers which are slower. Probably some performance metrics
>>>>should be gathered about each searcher so the client knows which indexers
>>>>to prefer over the others.
>>>>But of course if you have 50 searchers, having each client thread spawn yet
>>>>another 50 threads isn't a good thing either. So perhaps a combination of
>>>>iterative and parallel search needs to be done, with the ratio configurable.
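>>>>
>>>>(A rough sketch of that fan-out with a timeout, using java.util.concurrent,
>>>>the standard-library successor of the oswego package; the pool size, the
>>>>750 msec budget and the Page/Document types are just carried over from the
>>>>sketch above:)
>>>>
>>>>import java.util.ArrayList;
>>>>import java.util.List;
>>>>import java.util.concurrent.*;
>>>>
>>>>public class ParallelSearcher
>>>>{
>>>>    private final ExecutorService pool = Executors.newFixedThreadPool(10);
>>>>
>>>>    public List<Page<Document>> searchAll(List<DocumentIndexer> indexers,
>>>>                                          final SearchDocumentCommand sdc)
>>>>        throws InterruptedException
>>>>    {
>>>>        List<Callable<Page<Document>>> tasks = new ArrayList<Callable<Page<Document>>>();
>>>>        for (final DocumentIndexer indexer : indexers)
>>>>        {
>>>>            tasks.add(new Callable<Page<Document>>()
>>>>            {
>>>>                public Page<Document> call() { return indexer.search(sdc); }
>>>>            });
>>>>        }
>>>>
>>>>        // Wait at most 750 msec; searchers that are slower are simply ignored.
>>>>        List<Future<Page<Document>>> futures =
>>>>            pool.invokeAll(tasks, 750, TimeUnit.MILLISECONDS);
>>>>
>>>>        List<Page<Document>> results = new ArrayList<Page<Document>>();
>>>>        for (Future<Page<Document>> f : futures)
>>>>        {
>>>>            if (f.isDone() && !f.isCancelled())
>>>>            {
>>>>                try { results.add(f.get()); }
>>>>                catch (ExecutionException ignored) { /* skip failed searcher */ }
>>>>            }
>>>>        }
>>>>        return results;
>>>>    }
>>>>}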
>>>>
>>>>The controller pattern is used by Google, I think; at least that is what
>>>>Peter Zaitsev (mysqlperformanceblog) once told me.
>>>>
>>>>Hope I gave some insight into how I plan to scale to TB size, and hopefully
>>>>someone smacks me on the head and says "Hey dude, do it like this instead".
>>>>
>>>>Kindly
>>>>
>>>>//Marcus
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>Phillip Farber wrote:
>>>>
>>>>>
>>>>>Hello everyone,
>>>>>
>>>>>We are considering Solr 1.2 to index and search a terabyte-scale dataset
>>>>>of OCR. Initially our requirements are simple: basic tokenizing, score
>>>>>sorting only, no faceting. The schema is simple too. A document
>>>>>consists of a numeric id, stored and indexed, and a large text field,
>>>>>indexed but not stored, containing the OCR, typically ~1.4 MB. Some limited
>>>>>faceting or additional metadata fields may be added later.
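>>>>>
>>>>>Something like this minimal schema.xml fragment (field and type names are
>>>>>only illustrative) should cover that:
>>>>>
>>>>><fields>
>>>>>  <!-- numeric id: indexed and stored so it can be returned in results -->
>>>>>  <field name="id" type="string" indexed="true" stored="true"/>
>>>>>  <!-- OCR text: indexed for search but not stored, to keep the index lean -->
>>>>>  <field name="ocr" type="text" indexed="true" stored="false"/>
>>>>></fields>
>>>>><uniqueKey>id</uniqueKey>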
>>>>>
>>>>>The data in question currently amounts to about 1.1 TB of OCR (about 1M
>>>>>docs) which we expect to increase to 10 TB over time. Pilot tests on the
>>>>>desktop w/ a 2.6 GHz P4 with 2.5 GB memory and a 1 GB Java heap, on ~180 MB
>>>>>of data via HTTP, suggest we can index at a rate sufficient to keep up with
>>>>>the inputs (after getting over the 1.1 TB hump). We envision nightly
>>>>>commits/optimizes.
>>>>>
>>>>>We expect to have low QPS (<10) rate and probably will not need
>>>>>millisecond query response.
>>>>>
>>>>>Our environment makes available Apache on blade servers (Dell 1955 dual
>>>>>dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
>>>>>high-performance NAS system over a dedicated (out-of-band) GbE switch
>>>>>(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
>>>>>with 2 blades and will add as demands require.
>>>>>
>>>>>While we have a lot of storage, the idea of master/slave Solr Collection
>>>>>Distribution to add more Solr instances clearly means duplicating an
>>>>>immense index. Is it possible to use one instance to update the index
>>>>>on NAS while other instances only read the index and commit to keep
>>>>>their caches warm instead?
>>>>>
>>>>>Should we expect Solr indexing time to slow significantly as we scale
>>>>>up? What kind of query performance could we expect? Is it totally
>>>>>naive even to consider Solr at this kind of scale?
>>>>>
>>>>>Given these parameters is it realistic to think that Solr could handle
>>>>>the task?
>>>>>
>>>>>Any advice/wisdom greatly appreciated,
>>>>>
>>>>>Phil
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"