Hello Ali,

> I'm trying to set up a large scale *Crawl + Index + Search* infrastructure
> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.
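
To put a number on the crawl side, here is a back-of-the-envelope sketch of the sustained fetch rate implied by 5 billion pages every 4 weeks. The page count and cycle length come from your message; the per-machine fetch rate in the usage line is purely a hypothetical placeholder, not a measured Nutch figure.

```python
import math

PAGES = 5_000_000_000   # target corpus size, from the question
CYCLE_DAYS = 28         # "every 4 weeks", from the question


def required_rate(pages: int, cycle_days: int) -> float:
    """Sustained pages per second needed to cover `pages` in `cycle_days`."""
    return pages / (cycle_days * 24 * 60 * 60)


def machines_needed(rate: float, per_machine_rate: float) -> int:
    """Machines needed if each one sustains `per_machine_rate` pages/sec.

    `per_machine_rate` is an assumption you must measure on your own cluster.
    """
    return math.ceil(rate / per_machine_rate)


rate = required_rate(PAGES, CYCLE_DAYS)
print(f"{rate:.0f} pages/sec sustained")

# Hypothetical: 50 pages/sec per fetcher machine (assumed, not measured).
print(machines_needed(rate, per_machine_rate=50.0), "fetcher machines")
```

That is roughly two thousand pages per second around the clock, before accounting for politeness delays, retries, or re-parsing.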

> Needless to mention, the search index needs to scale to 5Billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indices would likely require a logically distributed and
> replicated index.


Yup, OK.

> However, I would like for such a system to be homogenous with the Hadoop
> infrastructure that is already installed on the cluster (for the crawl). In
> other words, I would much prefer if the replication and distribution of the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment was flexible enough to be dynamically
> scaled based on the size requirements of the index and the search traffic
> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> enough to automatically provision additional processing power into the
> cluster without requiring server re-starts).


There is no such thing just yet: no Search+Hadoop/HDFS in a box.  There was an
attempt to automatically index HBase content, but that was either not completed
or not committed into HBase.

> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> mature enough and would be the right architectural choice to go along with
> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> above.


Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned 
above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr
instance(s) on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can
scale well (we are working on a many-billion-document index spanning hundreds
of nodes with ElasticSearch right now), etc.  But again, not integrated with
Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again 
not integrated.

If I were you and I had to pick today, I'd pick ElasticSearch if I were
completely open.  If I had a Solr bias I'd give SolrCloud a try first.

> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> estimate my needing with this setup, for regular web-data (HTML text) at
> this scale?

I don't know off the top of my head, but I'm guessing several hundred instances
for serving search requests.
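
A rough way to sanity-check a guess like that is to divide the corpus into shards and multiply by the replica count. The sketch below assumes one search node per shard copy; the documents-per-shard and replica values in the usage line are hypothetical placeholders, and the real numbers depend on document size, query mix, and instance memory, so benchmark a single node first.

```python
import math


def search_nodes(total_docs: int, docs_per_shard: int, replicas: int) -> int:
    """Estimate search nodes, assuming one node hosts one shard copy.

    `docs_per_shard` and `replicas` are assumptions to be replaced with
    figures measured on your own hardware and traffic.
    """
    shards = math.ceil(total_docs / docs_per_shard)
    return shards * replicas


# Hypothetical inputs: 25M docs per shard, 2 copies of every shard.
print(search_nodes(5_000_000_000, docs_per_shard=25_000_000, replicas=2))
```

With those placeholder inputs the estimate lands in the several-hundred range, but halving or doubling the per-shard capacity moves it just as much, which is why the single-node benchmark matters.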

HTH,

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html


> Any architectural guidance would be greatly appreciated. The more details
> provided, the wider my grin :).
> 
> Many many thanks in advance.
> 
> Thanks,
> Safdar
>
