The only problem with using Hadoop (or whatever) is that you need to be sure each document ends up on the shard SolrCloud would have sent it to, which means you have to use the same routing mechanism that SolrCloud uses. Custom doc routing may help here.
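To make the routing point concrete, here is a minimal sketch of hash-range partitioning as an offline job might do it. Everything in it is an assumption for illustration: the names shard_ranges, route and partition are hypothetical, the even split of a 32-bit hash space is a simplification, and zlib.crc32 is only a stand-in hash, not the function SolrCloud uses. A real job would have to reproduce the cluster's actual hash and shard ranges, otherwise documents land on the wrong shard.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of the 32-bit hash space being partitioned


def shard_ranges(num_shards):
    """Split the hash space into num_shards contiguous (start, end) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else start + step - 1
        ranges.append((start, end))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    # Stand-in hash (assumption): NOT the hash SolrCloud uses.
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    for shard, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return shard
    raise ValueError("hash fell outside every shard range")


def partition(doc_ids, num_shards):
    """Group document ids by target shard, e.g. before building per-shard indexes."""
    ranges = shard_ranges(num_shards)
    buckets = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        buckets[route(doc_id, ranges)].append(doc_id)
    return buckets


if __name__ == "__main__":
    ids = ["doc-%d" % n for n in range(20)]
    for shard, members in partition(ids, 5).items():
        print(shard, members)
```

The thing to notice is that route is deterministic: the same id always maps to the same shard, so the offline indexer and the live cluster agree only if they share both the hash and the ranges.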
My very first question, though, would be whether this is necessary. It might be sufficient to just throttle the rate of indexing, or to do the indexing during off hours. Have you measured a performance degradation during your heavy indexing? Indexing has costs, no question, but it's worth asking whether the costs are heavy enough to be worth the bother.

Best
Erick

On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:

> 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you use
> Map/Reduce jobs you split your workload, process it, and then the reduce
> step takes the results into account. Let me explain the new SolrCloud
> architecture. You start your SolrCloud with a numShards parameter. Let's
> assume that you have 5 shards. Then you will have 5 leaders in your
> SolrCloud. These leaders will be responsible for indexing your data. This
> means your indexing workload will be divided into 5, so you have
> parallelized your data much like Map/Reduce jobs do.
>
> Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> They will be added as replicas for the shards. Then you will have 5
> shards, a leader for each of them, and every shard has 2 replicas. When
> you send a query to SolrCloud, every replica will help with searching,
> and if you add more replicas to your SolrCloud your search performance
> will improve.
>
>
> 2013/5/6 David Parks <davidpark...@yahoo.com>
>
>> I've had trouble figuring out what options exist if I want to perform all
>> indexing off of the production servers (I'd like to keep them only for
>> user queries).
>>
>> We index data in batches roughly daily; ideally I'd index all solr cloud
>> shards offline, then move the final index files to the solr cloud
>> instance that needs them, flip a switch, and have it use the new index.
>>
>> Is this possible via either:
>>
>> 1. Doing the indexing in Hadoop? (This would be ideal, as we have a
>>    significant investment in a hadoop cluster already), or
>>
>> 2. Maintaining a separate "master" server that handles indexing, with
>>    the nodes that receive user queries updating their index from there
>>    (I seem to recall reading about this configuration in 3.x, but now
>>    we're using solr cloud)
>>
>> Is there some ideal solution I can use to "protect" the production solr
>> instances from degraded performance during large index processing
>> periods?
>>
>> Thanks!
>>
>> David