Hi Rui,

If you're going to shard and/or replicate your index, then be sure to take
a look at CloudSolrServer in the SolrJ client library. CloudSolrServer is
an extension of SolrServer that works with ZooKeeper to track the shards
and replicas in a Solr cluster. Because it routes updates itself, there is
no single point of failure during distributed indexing.
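
A minimal sketch of what that looks like in SolrJ (the ZooKeeper connect
string, collection name, and field names below are placeholders, not values
from this thread; it needs a running SolrCloud cluster to actually execute):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Connect via ZooKeeper rather than a single Solr node, so the
        // client discovers the shard/replica topology and routes updates
        // to whichever leaders are currently alive.
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title_t", "Distributed indexing example");
        server.add(doc);
        server.commit();

        server.shutdown();
    }
}
```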

At my company, we use Pig (on top of Hadoop) to "enrich" documents before
they are indexed, so we developed a Pig StoreFunc that uses CloudSolrServer
under the covers. We achieve very high throughput rates with this
configuration. Also, since you mentioned you are new to Hadoop, definitely
take a look at Pig before writing lower-level MapReduce jobs by hand.
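
Our StoreFunc isn't public, but a skeleton of the idea looks roughly like
this (class name, ZooKeeper address, collection, and field names are all
hypothetical; a real implementation would pass the connect string through
setStoreLocation/UDFContext rather than hard-coding it):

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudStorer extends StoreFunc {
    private CloudSolrServer server;

    @Override
    public OutputFormat getOutputFormat() {
        // Tuples go straight to Solr, so no HDFS output is written.
        return new NullOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) {
        // Simplified: a real StoreFunc would stash the location
        // (e.g. the ZooKeeper connect string) for the backend tasks.
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        server = new CloudSolrServer("zkhost1:2181");
        server.setDefaultCollection("collection1");
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", t.get(0).toString());
            doc.addField("body_t", t.get(1).toString());
            server.add(doc);
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}
```

In the Pig script it then reads like any other store:
`STORE enriched INTO 'solr' USING SolrCloudStorer();` — each map or reduce
task opens its own CloudSolrServer, which is what makes the indexing
parallel across the cluster.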

Cheers,
Tim

On Fri, Oct 12, 2012 at 1:41 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hello Rui,
>
> If your data to be indexed is in HDFS, using MapReduce to parallelize
> indexing is still a good idea.
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
>
> On Fri, Oct 12, 2012 at 2:35 PM, Rui Vaz <rui....@gmail.com> wrote:
> > Hello,
> >
> > Solr Cloud and Hadoop are new to me, and I am figuring out an
> > architecture for a distributed indexing/searching system in a cluster.
> > Integrating them is an option.
> >
> > I would like to know if Hadoop + Solr is still a good option for
> > building a big index in a cluster, using HDFS and MapReduce, or if the
> > new functionality in Solr Cloud makes Hadoop unnecessary.
> >
> > I know I have provided little insight into the number of shards, or
> > whether I have network throughput or memory constraints. I just want to
> > launch the discussion and see different points of view.
> >
> > Thank you very much,
> > --
> > Rui Vaz
>
