Dmitry,

If you're planning on using HBase, you can take a look at
https://issues.apache.org/jira/browse/HBASE-3529. I think we may even have a
reasonable solution for reading the index [randomly] out of HDFS. Benchmarking
will be implemented next. It's not production ready; suggestions are welcome.
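To make the idea concrete, here is a minimal sketch -- not the HBASE-3529 code
itself -- of the HDFS positional-read primitive that any "read the index
randomly out of HDFS" approach would sit on top of. The segment path is made
up, and it assumes Hadoop 0.20.x-era APIs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRandomRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical location of a Lucene segment file stored in HDFS.
        Path segmentFile = new Path("hdfs:///indexes/shard1/_0.frq");
        FileStatus status = fs.getFileStatus(segmentFile);

        FSDataInputStream in = fs.open(segmentFile);
        byte[] buf = new byte[4096];
        // Positional read: fetch a block from an arbitrary offset without
        // streaming the whole file -- the kind of access a Lucene IndexInput
        // backed by HDFS needs for term dictionary / postings lookups.
        long offset = Math.max(0L, status.getLen() - buf.length);
        int read = in.read(offset, buf, 0, buf.length);
        System.out.println("read " + read + " bytes at offset " + offset);
        in.close();
        fs.close();
    }
}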
Jason

On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan <dmitry....@gmail.com> wrote:
> Hi Otis,
>
> Thanks for elaborating on this and the link (funny!).
>
> I have quite a big dataset growing all the time. The problems that I'm
> starting to face are pretty predictable:
> 1. Scalability: this includes indexing time (now some days! better hours
> or even minutes, if that's possible) along with handling the rapid growth.
> 2. Robustness: the entire system (distributed, single server, or anything
> else) should be fault-tolerant, e.g. if one shard goes down, another
> catches up (master-slave scheme).
> 3. Some apps that we run on SOLR are pretty computationally demanding,
> like faceting over uni+bi+trigrams of hundreds of millions of documents
> (index size of half a TB) ---> a single server with one shard of data
> does not seem to be enough for realtime search.
>
> This is just for a bit of background. I agree with you that Hadoop and
> the cloud probably best suit massive batch processes rather than realtime
> search. I'm not sure whether anyone out there has made SOLR shine through
> the cloud for realtime search over large datasets.
>
> By "SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)" I
> mean what you've done for your customers using EC2. Any chance the
> guidelines/articles on setting up indices on HDFS are available in some
> open / paid area?
>
> To sum this up, I didn't mean to create a buzz about cloud solutions in
> this thread; I was just wondering what is practically available / going
> on in SOLR development in this regard.
>
> Thanks,
>
> Dmitry
>
>
> On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>> Hi Dan,
>>
>> This feels a bit like a buzzword soup.... with mushrooms. :)
>>
>> MR jobs, at least the ones in Hadoopland, are very batch oriented, so
>> they wouldn't be very suitable for most search applications. There are
>> some technologies like Riak that combine MR and search. Let me use this
>> funny little link: http://lmgtfy.com/?q=riak%20mapreduce%20search
>>
>> Sure, you can put indices on HDFS (but don't expect searches to be
>> fast). Sure, you can create indices using MapReduce; we've done that
>> successfully for customers, bringing long indexing jobs from many hours
>> down to minutes by using, yes, a cluster of machines (actually EC2
>> instances).
>> But when you say "more into SOLR on the cloud (e.g. HDFS + MR + cloud of
>> commodity machines)", I can't actually picture what precisely you mean...
>>
>> Otis
>> ---
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>> ----- Original Message ----
>> > From: Dmitry Kan <dmitry....@gmail.com>
>> > To: solr-user@lucene.apache.org
>> > Cc: Upayavira <u...@odoko.co.uk>
>> > Sent: Fri, March 25, 2011 8:26:33 AM
>> > Subject: Re: solr on the cloud
>> >
>> > Hi, Upayavira
>> >
>> > Probably I'm confusing the terms here. When I say "distributed
>> > faceting" I'm more into SOLR on the cloud (e.g. HDFS + MR + cloud of
>> > commodity machines) rather than traditional multicore/sharded SOLR on
>> > single or multiple servers with non-distributed file systems (is that
>> > what you mean when you refer to "distribution of facet requests across
>> > hosts"?)
>> >
>> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <u...@odoko.co.uk> wrote:
>> >
>> > >
>> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dmitry....@gmail.com>
>> > > wrote:
>> > > > Hi Yonik,
>> > > >
>> > > > Oh, this is great. Is distributed faceting available in the trunk?
>> > > > What is the basic server setup needed for trying this out -- is it
>> > > > a cloud with HDFS and SOLR with ZooKeeper?
>> > > > Any chance to see the related documentation? :)
>> > >
>> > > Distributed faceting has been available for a long time, and is
>> > > available in the 1.4.1 release.
>> > >
>> > > The distribution of facet requests across hosts happens in the
>> > > background. There's no real difference (in query syntax) between a
>> > > standard facet query and a distributed one.
>> > >
>> > > i.e. you don't need SolrCloud or ZooKeeper for it. (They may provide
>> > > other benefits, but you don't need them for distributed faceting.)
>> > >
>> > > Upayavira
>> > >
>> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
>> > > > <yo...@lucidimagination.com> wrote:
>> > > >
>> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan
>> > > > > <dmitry....@gmail.com> wrote:
>> > > > > > Basically, of high interest is checking out Map-Reduce for
>> > > > > > distributed faceting -- is it even possible with the trunk?
>> > > > >
>> > > > > Solr already has distributed faceting, and it's much more
>> > > > > performant than a map-reduce implementation would be.
>> > > > >
>> > > > > I've also seen a product use the term "map reduce" incorrectly...
>> > > > > as in, we "map" the request to each shard, and then "reduce" the
>> > > > > results to a single list (of course, that's not actually
>> > > > > map-reduce at all ;-)
>> > > > >
>> > > > :) This sounds pretty strange to me as well. It was only my guess
>> > > > that if you have MR as the computational model and a cloud beneath
>> > > > it, you could naturally map facet fields to their counts inside
>> > > > single documents (no matter where they are, be it shards or a
>> > > > "single" index) and pass them on to reducers.
>> > > >
>> > > > > -Yonik
>> > > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference,
>> > > > > May 25-26, San Francisco
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards,
>> > > >
>> > > > Dmitry Kan
>> > > >
>> > > ---
>> > > Enterprise Search Consultant at Sourcesense UK,
>> > > Making Sense of Open Source
>> > >
>> >
>> >
>> > --
>> > Regards,
>> >
>> > Dmitry Kan
>> >
>>
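For concreteness, here is a minimal SolrJ sketch of the kind of distributed
facet query Upayavira describes above: a normal facet query plus a "shards"
parameter listing the cores to fan out to. The host names, core URLs, and the
facet field ("trigram") are hypothetical, and it assumes the SolrJ client that
ships with Solr 1.4.x:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedFacetExample {
    public static void main(String[] args) throws Exception {
        // Any node can act as the aggregator; it merges per-shard counts.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://host1:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("trigram");   // hypothetical n-gram field
        q.setFacetLimit(20);
        q.setRows(0);
        // The only "distributed" part: list the shards to query.
        q.set("shards", "host1:8983/solr,host2:8983/solr,host3:8983/solr");

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetField("trigram").getValues());
    }
}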