1-2) Your aim in using Hadoop is probably Map/Reduce jobs: you split your workload, process the pieces in parallel, and then a reduce step combines the results. Let me explain the new SolrCloud architecture. You start your SolrCloud with a numShards parameter. Let's assume you have 5 shards; you will then have 5 leaders in your SolrCloud, and these leaders are responsible for indexing your data. Your indexing workload is divided 5 ways, so indexing is parallelized much like a Map/Reduce job.
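To illustrate the "workload divided into 5" idea, here is a toy sketch of hash-based document routing. Note this is purely illustrative: Solr's real routing hashes the document id with its own algorithm, and the simple hash and modulo below are my own stand-ins, not Solr's code.

```python
# Toy sketch of shard routing: documents are spread across 5 shard
# leaders by a hash of their id. NOT Solr's actual routing algorithm.
NUM_SHARDS = 5

def hash_id(doc_id: str) -> int:
    # Stable toy hash (Python's built-in hash() is salted per process,
    # so we roll a simple deterministic one for the example).
    h = 0
    for ch in doc_id:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def route(doc_id: str) -> int:
    """Pick a shard leader for a document (toy hash routing)."""
    return hash_id(doc_id) % NUM_SHARDS

# Distribute a small batch of documents across the shard leaders.
docs = [f"doc-{i}" for i in range(20)]
shards = {s: [] for s in range(NUM_SHARDS)}
for d in docs:
    shards[route(d)].append(d)

for s, batch in shards.items():
    print(f"shard {s}: {len(batch)} docs")
```

Each leader only has to index its own slice of the batch, which is the "map" half of the analogy above.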
Now let's assume you add 10 new Solr nodes to your SolrCloud. They will be added as replicas of the shards: you will then have 5 shards, each with a leader and 2 replicas. When you send a query to the SolrCloud, every replica helps with searching, and adding more replicas improves your search performance further.

2013/5/6 David Parks <davidpark...@yahoo.com>

> I've had trouble figuring out what options exist if I want to perform all
> indexing off of the production servers (I'd like to keep them only for user
> queries).
>
> We index data in batches roughly daily. Ideally I'd index all Solr cloud
> shards offline, then move the final index files to the Solr cloud instance
> that needs it, flip a switch, and have it use the new index.
>
> Is this possible via either:
>
> 1. Doing the indexing in Hadoop? (this would be ideal as we have a
> significant investment in a Hadoop cluster already), or
>
> 2. Maintaining a separate "master" server that handles indexing, with the
> nodes that receive user queries updating their index from there (I seem to
> recall reading about this configuration in 3.x, but now we're using Solr
> cloud)
>
> Is there some ideal solution I can use to "protect" the production Solr
> instances from degraded performance during large index processing periods?
>
> Thanks!
> David
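P.S. The way replicas spread query load can be sketched like this. Again, this is only an illustration of the idea: a query touches one copy (leader or replica) of every shard, and successive queries can rotate across the copies. The round-robin picker is my own simplification, not Solr's actual load balancer.

```python
# Toy sketch: 5 shards x (1 leader + 2 replicas). A query fans out to
# one copy of every shard; rotating which copy answers spreads the load.
# NOT Solr's real load-balancing code, just the concept.
import itertools

SHARDS = 5
COPIES = 3  # leader + 2 replicas per shard

# One round-robin picker per shard.
pickers = [itertools.cycle(range(COPIES)) for _ in range(SHARDS)]

def plan_query() -> list:
    """Return (shard, copy) pairs: one copy of each shard serves the query."""
    return [(s, next(pickers[s])) for s in range(SHARDS)]

q1 = plan_query()  # first query: copy 0 of every shard
q2 = plan_query()  # next query shifts to copy 1, spreading the load
print(q1)
print(q2)
```

Adding replicas raises COPIES, so more concurrent queries can be served without any single node handling them all, which is why read throughput grows as you add replicas.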