This was done in SOLR-1301 going on several years ago now.
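
For readers new to the idea discussed in the quoted thread (having the final map/reduce phase produce the index files): it comes down to each reducer building one index shard, so documents must be assigned to shards deterministically. Below is a minimal sketch of that partitioning step only. It is not code from SOLR-1301, and `shard_for`, `partition`, and the sample URLs are hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard for a document id using a stable hash.

    crc32 is deterministic across runs and machines, unlike Python's
    built-in hash() for strings, which is randomized per process.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(docs, num_shards):
    """Group documents by shard, as a shuffle phase would group them by reducer."""
    shards = defaultdict(list)
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


# Hypothetical sample data: ten crawled pages spread over three shards.
docs = [{"id": f"http://example.com/page{i}", "text": "..."} for i in range(10)]
for shard, members in sorted(partition(docs, 3).items()):
    print(shard, len(members))
```

In a real job, each group would be handed to an indexer that writes one shard. That step is left out here because it depends on the Solr and Hadoop versions in use.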

On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog <goks...@gmail.com> wrote:
> It sounds like you really want the final map/reduce phase to put Solr
> index files into HDFS. Solr has a feature to do this called 'Embedded
> Solr'. This packages Solr as a library instead of an HTTP servlet. The
> Solr committers mostly hate it and want it to go away, but it is
> useful for exactly this problem.
>
> There is some integration work here, both to bolt ES to the Hadoop
> output libraries and also some trickery to write out the HDFS files.
> HDFS only appends and most of the codecs (Lucene segment formats) like
> to seek a lot. Then at the end it needs a way to tell SolrCloud about
> the files.
>
> If someone wants a great Summer Of Code project, Hadoop->Lucene
> indexes->SolrCloud would be a lot of fun and make you widely loved by
> people with money. I'm not kidding. Do a good job of this and write
> clean code, and you'll get offers for very cool jobs.
>
> On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
>> Hello,
>>
>> Unfortunately I don't know when exactly SolrCloud release will be ready, but
>> we've used trunk versions in the past and didn't have major issues.
>>
>> Otis
>> ----
>> Performance Monitoring SaaS for Solr -
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>>________________________________
>>> From: Ali S Kureishy <safdar.kurei...@gmail.com>
>>>To: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>>Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>>Sent: Friday, April 13, 2012 7:16 PM
>>>Subject: Re: Options for automagically Scaling Solr (without needing
>>>distributed index/replication) in a Hadoop environment
>>>
>>>Thanks Otis.
>>>
>>>I really appreciate the details offered here. This was very helpful
>>>information.
>>>
>>>I'm going to go through Solandra and Elastic Search and see if those make
>>>sense.
>>>I was also given a suggestion to use SolrCloud on FuseDFS (that's two
>>>recommendations for SolrCloud so far), so I will give that a shot when it is
>>>available. However, do you know when SolrCloud IS expected to be available?
>>>
>>>Thanks again!
>>>
>>>Warm regards,
>>>Safdar
>>>
>>>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic
>>><otis_gospodne...@yahoo.com> wrote:
>>>
>>>>Hello Ali,
>>>>
>>>>> I'm trying to set up a large scale *Crawl + Index + Search* infrastructure
>>>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>>>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>>>>> seconds.
>>>>
>>>>That's fine. Whether it's doable with any tech will depend on how much
>>>>hardware you give it, among other things.
>>>>
>>>>> Needless to mention, the search index needs to scale to 5 billion pages. It
>>>>> is also possible that I might need to store multiple indexes -- one for
>>>>> crawled content, and one for ancillary data that is also very large. Each
>>>>> of these indices would likely require a logically distributed and
>>>>> replicated index.
>>>>
>>>>Yup, OK.
>>>>
>>>>> However, I would like for such a system to be homogenous with the Hadoop
>>>>> infrastructure that is already installed on the cluster (for the crawl). In
>>>>> other words, I would much prefer if the replication and distribution of the
>>>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>>>> using another scalability framework (such as SolrCloud). In addition, it
>>>>> would be ideal if this environment was flexible enough to be dynamically
>>>>> scaled based on the size requirements of the index and the search traffic
>>>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>>>> enough to automatically provision additional processing power into the
>>>>> cluster without requiring server re-starts).
>>>>
>>>>There is no such thing just yet.
>>>>There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to
>>>>automatically index HBase content, but that was either not completed or not
>>>>committed into HBase.
>>>>
>>>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>>>> mature enough and would be the right architectural choice to go along with
>>>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>>>> above.
>>>>
>>>>Here is a summary on all of them:
>>>>* Search on HBase - I assume you are referring to the same thing I
>>>>mentioned above. Not ready.
>>>>* Solandra - uses Cassandra+Solr, plus DataStax now has a different
>>>>(commercial) offering that combines search and Cassandra. Looks good.
>>>>* Lily - data stored in HBase cluster gets indexed to a separate Solr
>>>>instance(s) on the side. Not really integrated the way you want it to be.
>>>>* ElasticSearch - solid at this point, the most dynamic solution today, can
>>>>scale well (we are working on a maaaany-B documents index and hundreds of
>>>>nodes with ElasticSearch right now), etc. But again, not integrated with
>>>>Hadoop the way you want it.
>>>>* IndexTank - has some technical weaknesses, not integrated with Hadoop,
>>>>not sure about its future considering LinkedIn uses Zoie and Sensei already.
>>>>* And there is SolrCloud, which is coming soon and will be solid, but is
>>>>again not integrated.
>>>>
>>>>If I were you and I had to pick today - I'd pick ElasticSearch if I were
>>>>completely open. If I had Solr bias I'd give SolrCloud a try first.
>>>>
>>>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>>>> this scale?
>>>>
>>>>I don't know off the top of my head, but I'm guessing several hundred for
>>>>serving search requests.
>>>>
>>>>HTH,
>>>>
>>>>Otis
>>>>--
>>>>Search Analytics - http://sematext.com/search-analytics/index.html
>>>>
>>>>Scalable Performance Monitoring - http://sematext.com/spm/index.html
>>>>
>>>>> Any architectural guidance would be greatly appreciated. The more details
>>>>> provided, the wider my grin :).
>>>>>
>>>>> Many many thanks in advance.
>>>>>
>>>>> Thanks,
>>>>> Safdar
>>>>>
>>>>
>>>
>
> --
> Lance Norskog
> goks...@gmail.com