Hi Rui,

If you're going to shard and/or replicate your index, then be sure to take a look at CloudSolrServer in the SolrJ client library. CloudSolrServer is an extension of SolrServer that works with ZooKeeper to discover the shards and replicas in a Solr cluster. Using CloudSolrServer, there is no single point of failure during distributed indexing.
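As a rough sketch (assuming SolrJ 4.x; the ZooKeeper host string, collection name, and field names below are placeholders, not anything from a real deployment), indexing through CloudSolrServer looks roughly like this:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexSketch {
    public static void main(String[] args) throws Exception {
        // Point at the ZooKeeper ensemble, not an individual Solr node.
        // CloudSolrServer reads cluster state from ZooKeeper, so it always
        // knows the current shard leaders and replicas -- no single point
        // of failure on the indexing path.
        CloudSolrServer server =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1"); // placeholder name

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "hello solrcloud");

        server.add(doc);   // routed to the appropriate shard
        server.commit();
        server.shutdown();
    }
}
```

If a node hosting a shard leader goes down, the client picks up the new leader from ZooKeeper and indexing continues, which is the point Tim makes above.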
At my company, we use Pig (on top of Hadoop) to "enrich" documents before they are indexed, so we developed a Pig StoreFunc that uses CloudSolrServer under the covers. We achieve very high throughput rates with this configuration. Also, you mentioned you are new to Hadoop, so definitely take a look at Pig vs. writing lower-level MapReduce tasks.

Cheers,
Tim

On Fri, Oct 12, 2012 at 1:41 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> Hello Rui,
>
> If your data to be indexed is in HDFS, using MapReduce to parallelize
> indexing is still a good idea.
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
>
> On Fri, Oct 12, 2012 at 2:35 PM, Rui Vaz <rui....@gmail.com> wrote:
> > Hello,
> >
> > Solr Cloud and Hadoop are new to me, and I am figuring out an
> > architecture for a distributed indexing/searching system in a cluster.
> > Integrating them is an option.
> >
> > I would like to know if Hadoop + Solr is still a good option for
> > building a big index in a cluster, using HDFS and MapReduce, or if the
> > new functionality in Solr Cloud makes Hadoop unnecessary.
> >
> > I know I have provided little insight about the number of shards, or
> > whether I have more network throughput or memory constraints. I want
> > to launch the discussion and see different points of view.
> >
> > Thank you very much,
> > --
> > Rui Vaz
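For anyone curious about the Pig StoreFunc approach Tim describes, here is a minimal, hypothetical skeleton. This is not Tim's actual implementation -- the class name, field mapping, and ZooKeeper address are all illustrative -- but it shows where CloudSolrServer plugs into Pig's StoreFunc lifecycle (assuming Pig 0.10+ and SolrJ 4.x on the classpath):

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical StoreFunc that pushes each Pig tuple to SolrCloud.
public class SolrCloudStoreFunc extends StoreFunc {

    private final String zkHost;
    private CloudSolrServer server;

    public SolrCloudStoreFunc(String zkHost) {
        this.zkHost = zkHost;
    }

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Documents go straight to Solr in putNext(), so no HDFS output
        // is written by this job.
        return new NullOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // 'location' (the STORE ... INTO argument) could carry the
        // target collection name; left unused in this sketch.
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        // One ZooKeeper-aware client per task attempt.
        server = new CloudSolrServer(zkHost);
        server.setDefaultCollection("collection1"); // placeholder name
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        // Illustrative mapping: field 0 -> id, field 1 -> text.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", t.get(0).toString());
        doc.addField("text", t.get(1).toString());
        try {
            server.add(doc);
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }
}
```

In a Pig script this would be used as, e.g., `STORE enriched INTO 'ignored' USING SolrCloudStoreFunc('zk1:2181,zk2:2181');` -- each map or reduce task then indexes in parallel against the cluster, which is where the high aggregate throughput comes from.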