Thanks Ken. I will take a look, be sure of that :)
Kindly

//Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]> wrote:

> Hi Marcus,
>
>> It seems a lot of what you're describing is really similar to MapReduce,
>> so I think Otis' suggestion to look at Hadoop is a good one: it might
>> prevent a lot of headaches, and they've already solved a lot of the
>> tricky problems. There are a number of ridiculously sized projects using
>> it to solve their scale problems, not least Yahoo...
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until
> Monday to take a look :)
>
> -- Ken
>
> On 9 May 2008, at 01:17, Marcus Herou wrote:
>
>>> Cool.
>>>
>>> Since you must certainly already have a good partitioning scheme, could
>>> you elaborate on a high level how you set this up?
>>>
>>> I'm certain that I will shoot myself in the foot both once and twice
>>> before getting it right, but this is what I'm good at; to never stop
>>> trying :) However, it is nice to start playing at least on the right
>>> side of the football field, so a little push in the back would be
>>> really helpful.
>>>
>>> Kindly
>>>
>>> //Marcus
>>>
>>> On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi, we have an index of ~300 GB, which is at least approaching the
>>>> ballpark you're in.
>>>>
>>>> Lucky for us, to coin a phrase, we have an 'embarrassingly
>>>> partitionable' index, so we can just scale out horizontally across
>>>> commodity hardware with no problems at all. We're also using the
>>>> multicore features available in the development Solr version to reduce
>>>> the granularity of core size by an order of magnitude: this makes for
>>>> lots of small commits, rather than few long ones.
>>>>
>>>> There was mention somewhere in the thread of document collections: if
>>>> you're going to be filtering by collection, I'd strongly recommend
>>>> partitioning too. It makes scaling so much less painful!
>>>>
>>>> James
>>>>
>>>> On 8 May 2008, at 23:37, marcusherou wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> I will as well head down a path like yours within some months from
>>>>> now. Currently I have an index of ~10M docs and only store ids in the
>>>>> index, for performance and distribution reasons. When we enter a new
>>>>> market I'm assuming we will soon hit 100M, and quite soon after that
>>>>> 1G documents. Each document has on average about 3-5 kB of data.
>>>>>
>>>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA
>>>>> enclosures as shared storage (think of it as a SAN, or at least shared
>>>>> storage with one mount point). Hope this will be the right choice;
>>>>> only the future can tell.
>>>>>
>>>>> Since we are developing a search engine, I frankly don't think even
>>>>> having 100's of Solr instances serving the index will cut it
>>>>> performance-wise if we have one big index. I totally agree with the
>>>>> others claiming that you will most definitely go OOM or hit some other
>>>>> constraint of Solr if you must have the whole result in memory, sort
>>>>> it and create an XML response. I did hit such constraints when I
>>>>> couldn't afford to give the instances enough memory, and I had only 1M
>>>>> docs back then. And think of it... Optimizing a TB index will take a
>>>>> long, long time, and you really want to have an optimized index if you
>>>>> want to reduce search time.
>>>>>
>>>>> I am thinking of a sharding solution where I fragment the index over
>>>>> the disk(s) and let each Solr instance have only a little piece of the
>>>>> total index. This will require a master database or namenode (or,
>>>>> simpler, just a properties file in each index dir) of some sort to
>>>>> know which docs are located on which machine, or at least how many
>>>>> docs each shard has. This is to ensure that whenever you introduce a
>>>>> new Solr instance with a new shard, the master indexer will know which
>>>>> shard to prioritize. This is probably not enough either, since all new
>>>>> docs will go to the new shard until it is filled (has the same size as
>>>>> the others); only then will all shards receive docs in a load-balanced
>>>>> fashion. So whenever you want to add a new indexer, you probably need
>>>>> to initiate a "stealing" process where it steals docs from the others
>>>>> until it reaches some sort of threshold (10 servers = each shard
>>>>> should have 1/10 of the docs, or such).
>>>>>
>>>>> I think this will cut it and enable us to grow with the data. I also
>>>>> think doing a distributed reindexing will be a good thing when it
>>>>> comes to cutting both indexing and optimizing time. Probably each
>>>>> indexer should buffer its shard locally on RAID1 SCSI disks, optimize
>>>>> it and then just copy it to the main index to minimize the burden on
>>>>> the shared storage.
>>>>>
>>>>> Let's say the indexing part is all fancy and working at TB scale; now
>>>>> we come to searching. After talking to other guys who have built big
>>>>> search engines, I personally believe that you need to introduce a
>>>>> controller-like searcher on the client side which itself searches all
>>>>> of the shards and merges the responses. Perhaps Distributed Solr
>>>>> solves this; I will love to test it whenever my new installation of
>>>>> servers and enclosures is finished.
>>>>>
>>>>> Currently my idea is something like this:
>>>>>
>>>>> public Page<Document> search(SearchDocumentCommand sdc)
>>>>> {
>>>>>     Set<Integer> ids = documentIndexers.keySet();
>>>>>     int nrOfSearchers = ids.size();
>>>>>     int totalItems = 0;
>>>>>     Page<Document> docs = new Page(sdc.getPage(), sdc.getPageSize());
>>>>>     for (Integer id : ids)
>>>>>     {
>>>>>         // Pick a random replica of this shard to spread the load.
>>>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>         DocumentIndexer indexer =
>>>>>             indexers.get(random.nextInt(indexers.size()));
>>>>>         // Scale the requested page down by the number of shards, and
>>>>>         // search with the adjusted copy (not the original command).
>>>>>         SearchDocumentCommand sdc2 = copy(sdc);
>>>>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>>>>         Page<Document> res = indexer.search(sdc2);
>>>>>         totalItems += res.getTotalItems();
>>>>>         docs.addAll(res);
>>>>>     }
>>>>>
>>>>>     if (sdc.getComparator() != null)
>>>>>     {
>>>>>         Collections.sort(docs, sdc.getComparator());
>>>>>     }
>>>>>
>>>>>     docs.setTotalItems(totalItems);
>>>>>     return docs;
>>>>> }
>>>>>
>>>>> This is my RaidedDocumentIndexer, which wraps a set of
>>>>> DocumentIndexers. I switch from Solr to raw Lucene back and forth,
>>>>> benchmarking and comparing stuff, so I have two implementations of
>>>>> DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to
>>>>> make the switch easy.
>>>>>
>>>>> I think this approach is quite OK, but I think the paging stuff is
>>>>> broken.
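>>>>>
>>>>> A minimal sketch of how the paging could be fixed (untested, and it
>>>>> assumes SearchDocumentCommand also has a setPageSize()): ask every
>>>>> shard for everything up to and including the requested page, merge-sort
>>>>> the union, then slice out the global page. Wasteful for deep pages, but
>>>>> at least correct:
>>>>>
>>>>> // import java.util.*;
>>>>> public Page<Document> searchPaged(SearchDocumentCommand sdc)
>>>>> {
>>>>>     // Each shard must return the first (page + 1) * pageSize hits,
>>>>>     // since in the worst case the whole global page lives on one shard.
>>>>>     int wanted = (sdc.getPage() + 1) * sdc.getPageSize();
>>>>>     List<Document> merged = new ArrayList<Document>();
>>>>>     int totalItems = 0;
>>>>>     for (Integer id : documentIndexers.keySet())
>>>>>     {
>>>>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>         DocumentIndexer indexer =
>>>>>             indexers.get(random.nextInt(indexers.size()));
>>>>>         SearchDocumentCommand perShard = copy(sdc);
>>>>>         perShard.setPage(0);
>>>>>         perShard.setPageSize(wanted); // assumed setter
>>>>>         Page<Document> res = indexer.search(perShard);
>>>>>         totalItems += res.getTotalItems();
>>>>>         merged.addAll(res);
>>>>>     }
>>>>>     if (sdc.getComparator() != null)
>>>>>     {
>>>>>         Collections.sort(merged, sdc.getComparator());
>>>>>     }
>>>>>     // Slice the requested page out of the merged, sorted union.
>>>>>     int from = Math.min(sdc.getPage() * sdc.getPageSize(), merged.size());
>>>>>     int to = Math.min(from + sdc.getPageSize(), merged.size());
>>>>>     Page<Document> docs = new Page(sdc.getPage(), sdc.getPageSize());
>>>>>     docs.addAll(merged.subList(from, to));
>>>>>     docs.setTotalItems(totalItems);
>>>>>     return docs;
>>>>> }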
>>>>> However, the search time will at best grow linearly with the number
>>>>> of searchers, probably a lot worse. To get even more speed, each
>>>>> document indexer should be put into a separate thread, with something
>>>>> like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with
>>>>> a thread pool (a rough sketch follows below the quoted message). The
>>>>> future result times out after, let's say, 750 msec, and the client
>>>>> ignores all searchers which are slower. Probably some performance
>>>>> metrics should be gathered about each searcher, so the client knows
>>>>> which indexers to prefer over the others. But of course, if you have
>>>>> 50 searchers, having each client thread spawn yet another 50 threads
>>>>> isn't a good thing either. So perhaps a combination of iterative and
>>>>> parallel search needs to be done, with the ratio configurable.
>>>>>
>>>>> The controller pattern is used by Google, I think; Peter Zaitsev
>>>>> (mysqlperformanceblog) once told me so.
>>>>>
>>>>> Hope I gave some insight into how I plan to scale to TB size, and
>>>>> hopefully someone smacks me on my head and says "Hey dude, do it like
>>>>> this instead".
>>>>>
>>>>> Kindly
>>>>>
>>>>> //Marcus
>>>>>
>>>>>
>>>>> Phillip Farber wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> We are considering Solr 1.2 to index and search a terabyte-scale
>>>>>> dataset of OCR. Initially our requirements are simple: basic
>>>>>> tokenizing, score sorting only, no faceting. The schema is simple
>>>>>> too. A document consists of a numeric id, stored and indexed, and a
>>>>>> large text field, indexed not stored, containing the OCR, typically
>>>>>> ~1.4 MB. Some limited faceting or additional metadata fields may be
>>>>>> added later.
>>>>>>
>>>>>> The data in question currently amounts to about 1.1 TB of OCR (about
>>>>>> 1M docs), which we expect to increase to 10 TB over time. Pilot tests
>>>>>> on the desktop (2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap) on
>>>>>> ~180 MB of data via HTTP suggest we can index at a rate sufficient to
>>>>>> keep up with the inputs (after getting over the 1.1 TB hump). We
>>>>>> envision nightly commits/optimizes.
>>>>>>
>>>>>> We expect a low QPS rate (<10) and probably will not need millisecond
>>>>>> query response.
>>>>>>
>>>>>> Our environment makes available Apache on blade servers (Dell 1955
>>>>>> dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*,
>>>>>> high-performance NAS system over a dedicated (out-of-band) GbE switch
>>>>>> (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are
>>>>>> starting with 2 blades and will add more as demands require.
>>>>>>
>>>>>> While we have a lot of storage, the idea of master/slave Solr
>>>>>> Collection Distribution to add more Solr instances clearly means
>>>>>> duplicating an immense index. Is it possible to use one instance to
>>>>>> update the index on the NAS while other instances only read the index
>>>>>> and commit to keep their caches warm instead?
>>>>>>
>>>>>> Should we expect Solr indexing time to slow significantly as we scale
>>>>>> up? What kind of query performance could we expect? Is it totally
>>>>>> naive even to consider Solr at this kind of scale?
>>>>>>
>>>>>> Given these parameters, is it realistic to think that Solr could
>>>>>> handle the task?
>>>>>>
>>>>>> Any advice/wisdom greatly appreciated,
>>>>>>
>>>>>> Phil
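>>>>>
>>>>> PS. The rough sketch of the thread pool + timeout idea I mentioned
>>>>> above, in java.util.concurrent terms (the successor of the EDU.oswego
>>>>> package). Untested; it reuses the documentIndexers map, random and
>>>>> DocumentIndexer from my earlier code, and simply drops every shard
>>>>> that doesn't answer within 750 msec:
>>>>>
>>>>> // import java.util.*;
>>>>> // import java.util.concurrent.*;
>>>>>
>>>>> private final ExecutorService pool = Executors.newFixedThreadPool(50);
>>>>>
>>>>> public List<Page<Document>> searchAll(final SearchDocumentCommand sdc)
>>>>>     throws InterruptedException
>>>>> {
>>>>>     // One task per shard, each searching a random replica.
>>>>>     List<Callable<Page<Document>>> tasks =
>>>>>         new ArrayList<Callable<Page<Document>>>();
>>>>>     for (final Integer id : documentIndexers.keySet())
>>>>>     {
>>>>>         tasks.add(new Callable<Page<Document>>()
>>>>>         {
>>>>>             public Page<Document> call() throws Exception
>>>>>             {
>>>>>                 List<DocumentIndexer> indexers = documentIndexers.get(id);
>>>>>                 return indexers
>>>>>                     .get(random.nextInt(indexers.size()))
>>>>>                     .search(sdc);
>>>>>             }
>>>>>         });
>>>>>     }
>>>>>     // invokeAll cancels every task still running when the timeout
>>>>>     // expires; cancelled futures throw CancellationException on get().
>>>>>     List<Future<Page<Document>>> futures =
>>>>>         pool.invokeAll(tasks, 750, TimeUnit.MILLISECONDS);
>>>>>     List<Page<Document>> results = new ArrayList<Page<Document>>();
>>>>>     for (Future<Page<Document>> f : futures)
>>>>>     {
>>>>>         try
>>>>>         {
>>>>>             results.add(f.get());
>>>>>         }
>>>>>         catch (CancellationException tooSlow)
>>>>>         {
>>>>>             // Ignore searchers slower than the timeout.
>>>>>         }
>>>>>         catch (ExecutionException failed)
>>>>>         {
>>>>>             // Ignore searchers that threw; per-searcher metrics
>>>>>             // could be gathered here.
>>>>>         }
>>>>>     }
>>>>>     return results;
>>>>> }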
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/