Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen Sun, 15 Apr 2012 08:19:48 -0700

This was done in SOLR-1301 going on several years ago now.
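
For anyone reading along, here is a minimal sketch of the general pattern the quoted message below is pointing at: build the index on local scratch disk, where seeks are cheap, and only copy the finished files out sequentially at the end. This is an illustration in Python, not the SOLR-1301 code. The function names (`build_index_locally`, `publish`, `run_reducer`) and the toy "segment" format are hypothetical, and a plain directory stands in for HDFS. A real job would go through the Hadoop filesystem API and a real Lucene/Solr writer.

```python
import os
import shutil
import tempfile


def build_index_locally(docs, local_dir):
    """Write a toy 'segment' file that needs a seek, which local disk allows."""
    path = os.path.join(local_dir, "segment_0.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 4)  # placeholder for the document count
        for doc in docs:
            f.write(doc.encode("utf-8") + b"\n")
        f.seek(0)  # seek back to the header: not possible on append-only storage
        f.write(len(docs).to_bytes(4, "big"))
    return path


def publish(local_dir, final_dir):
    """Copy finished files one by one, front to back (stand-in for an HDFS upload)."""
    os.makedirs(final_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(local_dir)):
        shutil.copyfile(os.path.join(local_dir, name), os.path.join(final_dir, name))
        copied.append(name)
    return copied


def run_reducer(docs, final_dir):
    """Stage the index in a scratch directory, then publish it in one pass."""
    with tempfile.TemporaryDirectory() as scratch:
        build_index_locally(docs, scratch)
        return publish(scratch, final_dir)
```

The point of the split is that nothing seeks on the destination: every seek happens in the scratch directory, and the destination only ever sees whole files written in order. Telling the search cluster about the new files is a separate step that this sketch does not attempt.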


On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog <goks...@gmail.com> wrote:
> It sounds like you really want the final map/reduce phase to put Solr
> index files into HDFS. Solr has a feature to do this called 'Embedded
> Solr'. This packages Solr as a library instead of an HTTP servlet. The
> Solr committers mostly hate it and want it to go away, but it is
> useful for exactly this problem.
>
> There is some integration work here, both to bolt Embedded Solr to the
> Hadoop output libraries and also some trickery to write out the HDFS
> files. HDFS is append-only, and most of the codecs (Lucene segment
> formats) like to seek a lot. Then at the end it needs a way to tell
> SolrCloud about the files.
>
> If someone wants a great Summer Of Code project, Hadoop->Lucene
> indexes->SolrCloud would be a lot of fun and make you widely loved by
> people with money. I'm not kidding. Do a good job of this and write
> clean code, and you'll get offers for very cool jobs.
>
> On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
>> Hello,
>>
>> Unfortunately I don't know exactly when the SolrCloud release will be
>> ready, but we've used trunk versions in the past and didn't have major issues.
>>
>> Otis
>> ----
>> Performance Monitoring SaaS for Solr - 
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>
>>
>>>________________________________
>>> From: Ali S Kureishy <safdar.kurei...@gmail.com>
>>>To: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>>Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>>Sent: Friday, April 13, 2012 7:16 PM
>>>Subject: Re: Options for automagically Scaling Solr (without needing 
>>>distributed index/replication) in a Hadoop environment
>>>
>>>
>>>Thanks Otis.
>>>
>>>
>>>I really appreciate the details offered here. This was very helpful 
>>>information.
>>>
>>>
>>>I'm going to go through Solandra and Elastic Search and see if those make 
>>>sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
>>>recommendations for SolrCloud so far), so I will give that a shot when it is 
>>>available. However, do you know when SolrCloud IS expected to be available?
>>>
>>>
>>>Thanks again!
>>>
>>>
>>>Warm regards,
>>>Safdar
>>>
>>>
>>>
>>>
>>>
>>>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
>>><otis_gospodne...@yahoo.com> wrote:
>>>
>>>Hello Ali,
>>>>
>>>>
>>>>> I'm trying to set up a large scale *Crawl + Index + Search* infrastructure
>>>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>>>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>>>>> seconds.
>>>>
>>>>
>>>>That's fine.  Whether it's doable with any tech will depend on how much 
>>>>hardware you give it, among other things.
>>>>
>>>>
>>>>> Needless to mention, the search index needs to scale to 5Billion pages. It
>>>>> is also possible that I might need to store multiple indexes -- one for
>>>>> crawled content, and one for ancillary data that is also very large. Each
>>>>> of these indices would likely require a logically distributed and
>>>>> replicated index.
>>>>
>>>>
>>>>Yup, OK.
>>>>
>>>>
>>>>> However, I would like for such a system to be homogeneous with the Hadoop
>>>>> infrastructure that is already installed on the cluster (for the crawl). In
>>>>> other words, I would much prefer if the replication and distribution of the
>>>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>>>> using another scalability framework (such as SolrCloud). In addition, it
>>>>> would be ideal if this environment was flexible enough to be dynamically
>>>>> scaled based on the size requirements of the index and the search traffic
>>>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>>>> enough to automatically provision additional processing power into the
>>>>> cluster without requiring server re-starts).
>>>>
>>>>
>>>>There is no such thing just yet.
>>>>There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
>>>>automatically index HBase content, but that was either not completed or not 
>>>>committed into HBase.
>>>>
>>>>
>>>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>>>> mature enough and would be the right architectural choice to go along with
>>>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>>>> above.
>>>>
>>>>
>>>>Here is a summary on all of them:
>>>>* Search on HBase - I assume you are referring to the same thing I 
>>>>mentioned above.  Not ready.
>>>>* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>>>>(commercial) offering that combines search and Cassandra.  Looks good.
>>>>* Lily - data stored in HBase cluster gets indexed to a separate Solr 
>>>>instance(s)  on the side.  Not really integrated the way you want it to be.
>>>>* ElasticSearch - solid at this point, the most dynamic solution today, can 
>>>>scale well (we are working on a maaaany-B documents index and hundreds of 
>>>>nodes with ElasticSearch right now), etc.  But again, not integrated with 
>>>>Hadoop the way you want it.
>>>>* IndexTank - has some technical weaknesses, not integrated with Hadoop, 
>>>>not sure about its future considering LinkedIn uses Zoie and Sensei already.
>>>>* And there is SolrCloud, which is coming soon and will be solid, but is 
>>>>again not integrated.
>>>>
>>>>If I were you and I had to pick today - I'd pick ElasticSearch if I were 
>>>>completely open.  If I had Solr bias I'd give SolrCloud a try first.
>>>>
>>>>
>>>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>>>> this scale?
>>>>
>>>>I don't know off the top of my head, but I'm guessing several hundred for 
>>>>serving search requests.
>>>>
>>>>HTH,
>>>>
>>>>Otis
>>>>--
>>>>Search Analytics - http://sematext.com/search-analytics/index.html
>>>>
>>>>Scalable Performance Monitoring - http://sematext.com/spm/index.html
>>>>
>>>>
>>>>
>>>>> Any architectural guidance would be greatly appreciated. The more details
>>>>> provided, the wider my grin :).
>>>>>
>>>>> Many many thanks in advance.
>>>>>
>>>>> Thanks,
>>>>> Safdar
>>>>>
>>>>
>>>
>>>
>>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
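
On the sizing question in the original post, the only numbers the thread actually gives are 5 billion pages and a 4-week cycle, so the one thing that can be computed directly is the sustained rate the crawl and indexing pipeline would have to hold. The sketch below does that arithmetic in Python. The `shards_needed` helper is hypothetical, and its `docs_per_shard` argument is an assumption the reader has to supply from their own measurements; nothing in this thread gives a figure for it.

```python
import math

PAGES = 5_000_000_000  # "5 Billion web pages", from the original question
CYCLE_DAYS = 28        # "every 4 weeks", from the original question


def pages_per_second(pages=PAGES, days=CYCLE_DAYS):
    """Average rate needed to process every page once per cycle."""
    return pages / (days * 24 * 60 * 60)


def shards_needed(pages, docs_per_shard):
    """Shard count for a given per-shard capacity (capacity is an assumption)."""
    return math.ceil(pages / docs_per_shard)
```

`pages_per_second()` comes out at roughly 2,067 pages per second, sustained for the whole four weeks. For the shard count, a purely illustrative capacity of 20 million documents per shard would give `shards_needed(PAGES, 20_000_000) == 250`, which happens to sit in the "several hundred" range guessed above, but the real capacity depends on document size, query mix, and hardware, and should be measured rather than taken from this example.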
